compared with
Current by Sven Schlarb
on Jun 23, 2014 14:01.








h2. Platform



h2. Purpose of this experiment

The purpose of this experiment is to evaluate the performance of a scalable workflow for migrating TIFF images to images in the JPEG2000 format compared to an equivalent Taverna version of the workflow processing the data sequentially.

h2. Evaluation method

A Taverna workflow for sequential processing serves as a reference point for the large-scale execution. Subsets of increasing size are randomly selected from the full Austrian National Library Tresor Music Collection data set.

The following bash statement is used to create a random sample from the full data set:

{code}
find . -type f -exec ls -l -sd {} + | grep ".tif" | \
awk 'BEGIN {srand()} {printf "%05.0f %s \n",rand()*99999, $0; }' | \
sort -n |  awk '{print $10 "\t" $7}' | head -$NUM > ~/tresormusicfilepaths${NUM}_withsize.csv
{code}
The statement prepends a random number to each line of the file listing, sorts the listing by that number, and keeps the first NUM lines. The variable NUM is the desired size of the data set. The resulting file contains the local file paths together with the file sizes and can be used as input for the Taverna workflow presented in the next section.

Additionally the files are uploaded to HDFS as input for the large-scale workflow execution.

In this way the sequential execution time can be compared to the large-scale processing time.
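
As an alternative to the awk-based statement above, a random sample can be sketched with GNU {{find -printf}} and {{shuf}}. This is only a sketch under assumptions: GNU findutils and coreutils are available, file names contain no tabs or newlines, and the function name {{sample_tiffs}} and its arguments are hypothetical. It writes the same two columns (path and size in bytes), separated by a tab.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print NUM randomly chosen *.tif files below DIR
# as "path<TAB>size in bytes". Assumes GNU find (-printf) and GNU shuf.
sample_tiffs() {
  local dir=$1 num=$2
  find "$dir" -type f -name '*.tif' -printf '%p\t%s\n' | shuf -n "$num"
}

# Example call (output file name follows the pattern used above):
# NUM=100; sample_tiffs . "$NUM" > ~/tresormusicfilepaths${NUM}_withsize.csv
```

Because {{shuf}} draws the sample directly, no random sort key has to be prepended and removed again.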

h2. Taverna workflow - sequential processing
!TavernaWorkflow4276.png|border=1,width=235,height=786!

_Figure 1 (above): Taverna workflow_


Diagram of the TIFF to JPEG2000 image migration workflow. The workflow is available on MyExperiment at [http://www.myexperiment.org/workflows/4276.html]
The Taverna workflow reads a text file containing absolute paths to TIF image files and converts them to JP2 image files using OpenJPEG ([https://code.google.com/p/openjpeg|https://code.google.com/p/openjpeg]).
The sequential execution of this workflow is used as a reference point for measuring the parallelisation efficiency of the scalable version and it allows measuring how the processing times of the different components compare to each other.

The following diagram shows the average execution time in seconds of each component of the workflow. It was created from a sample of 1000 images of the [Austrian National Library Tresor Music Collection|../../../../../../../../../../display/SP/Austrian+National+Library+Tresor+Music+Collection]:

 
!distribution_execution_times.PNG|border=1,width=473,height=264!

_Figure 2 (above): execution times of each of the workflow's steps_

In the design phase, this analysis is used to examine the average execution times of the individual tools. From this experiment we might conclude that the FITS-based TIF image validation, at over 4 seconds per image, takes too much time and needs to be improved, while the Jpylyzer validation is acceptable at only slightly more than 1 second per image file on average.


h2. SCAPE Platform workflow - distributed processing

[Apache Pig|http://pig.apache.org/] was used to create a scalable version of this workflow. The different processing steps of the Taverna workflow for sequential processing are represented by Pig Latin statements.

Where a processing step corresponds to a component of the Taverna workflow, the comment of that step in the script below names the component.

{code}
/* file: tiff2jp2_migrate.pig */

/* Built from https://github.com/openplanets/tomar */
REGISTER /home/onbfue/ToMaR/target/tomar-1.5.2-SNAPSHOT.jar;

DEFINE ToMarService eu.scape_project.pt.udf.ControlLineUDF();
DEFINE XPathService eu.scape_project.pt.udf.XPathFunction();

SET job.name 'Tomar-Pig-Taverna-OpenJpeg';

/* make sure that one task per input file is created */
SET pig.noSplitCombination true;

SET mapred.task.timeout 420000;

%DECLARE toolspecs_path '/hdfs/path/to/toolspecs';
%DECLARE xpath_exp1 '/fits/filestatus/valid';
%DECLARE xpath_exp2 '/fits/identification/identity/@mimetype';
%DECLARE xpath_exp3 '/jpylyzer/isValidJP2';

/* STEP 1: load image paths - Taverna: image_paths_from_dir */
image_paths = LOAD '$image_paths' USING PigStorage() AS (image_path: chararray);

/* STEP 2: validation of tiff image files using fits - Taverna: fits */
fits = FOREACH image_paths GENERATE image_path as image_path, ToMarService('$toolspecs_path', CONCAT(CONCAT('fits stdxml --input="hdfs://', image_path), '"')) as xml_text;

/* STEP 3: extract fits validity and mime-type using xpath */
fits_validation_list = FOREACH fits GENERATE image_path, XPathService('$xpath_exp1', xml_text) AS node_list1, XPathService('$xpath_exp2', xml_text) AS node_list2;
fits_validation = FOREACH fits_validation_list GENERATE image_path, FLATTEN(node_list1) as node1, FLATTEN(node_list2) as node2;
store fits into 'output/fits';
store fits_validation into 'output/fits_validation';

/* STEP 4: migration of tiff image files to jpeg2000 - Taverna: opj_compress */
openjpeg = FOREACH fits_validation GENERATE image_path as image_path, ToMarService('$toolspecs_path',CONCAT( CONCAT(  CONCAT('openjpeg image-to-j2k --input="hdfs://', image_path), '" --output="'),  CONCAT(  CONCAT(   CONCAT('hdfs://', image_path), '.jp2'),'"'))) as ret_str;
STORE openjpeg INTO 'output/openjpeg';

/* STEP 5: validation of migrated jpeg2000 files using jpylyzer - Taverna: jpylyzerValidation */
jpylyzer = FOREACH fits_validation GENERATE image_path as image_path, ToMarService('$toolspecs_path',CONCAT(CONCAT(CONCAT('jpylyzer validate --input="hdfs://', CONCAT(image_path,'.jp2')), '" --output="'),CONCAT(CONCAT( CONCAT('hdfs://', image_path), '.jp2.xml'),'"'))) as jpy_xml;
STORE jpylyzer INTO 'output/jpylyzer';

/* STEP 6: extract jpylyzer validity using xpath */
jpylyzer_validation_list = FOREACH jpylyzer GENERATE image_path, XPathService('$xpath_exp3', jpy_xml) AS jpy_node_list;
jpylyzer_validation = FOREACH jpylyzer_validation_list GENERATE image_path, FLATTEN(jpy_node_list) as node1;
store jpylyzer_validation into 'output/jpylyzer_validation';

/* STEP 7: migrate jpeg2000 image file back to tiff - Taverna: opj_decompress */
j2k_to_img = FOREACH fits_validation GENERATE image_path as image_path, ToMarService('$toolspecs_path',CONCAT( CONCAT(  CONCAT('openjpeg j2k-to-image --input="hdfs://', CONCAT(image_path,'.jp2')), '" --output="'),  CONCAT(  CONCAT(   CONCAT('hdfs://', image_path), '.jp2.tif'),'"'))) as j2k_to_img_ret_str;
STORE j2k_to_img INTO 'output/j2k_to_img';

/* STEP 8: compare original to restored image file */
imgcompare = FOREACH fits_validation GENERATE image_path as image_path, ToMarService('$toolspecs_path',CONCAT( CONCAT(CONCAT('imagemagick compare-pixelwise --inputfirst="hdfs://', image_path), CONCAT(CONCAT('" --inputsecond="hdfs://',CONCAT(image_path,'.jp2.tif')),'" --diffoutput="hdfs://')),CONCAT(image_path,'.cmp.txt"'))) as imgcompare_ret_str;
STORE imgcompare INTO 'output/imgcompare';
{code}

The following ToMaR tool specification files were used in this experiment:
* [^openjpeg.xml]
* [^imagemagick.xml] ([^compare.sh])
* [^fits.xml]
* [^jpylyzer.xml]

Note that these XML-based tool descriptions must be stored in the directory _/hdfs/path/to/toolspecs_ which is declared as the _toolspecs_path_ variable in the pig script above.
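
The nested CONCAT calls in the Pig script build the command line that is passed to ToMaR for each image path. The following shell sketch reproduces the string assembled for the {{openjpeg image-to-j2k}} step, so that the quoting and the derived output path are easier to see. The function name {{openjpeg_cmd}} is hypothetical and the example path is a placeholder; the sketch only prints the string and does not invoke ToMaR.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the nested CONCAT of the migration step:
#   'openjpeg image-to-j2k --input="hdfs://' + image_path + '" --output="'
#   + 'hdfs://' + image_path + '.jp2' + '"'
openjpeg_cmd() {
  local image_path=$1
  printf 'openjpeg image-to-j2k --input="hdfs://%s" --output="hdfs://%s.jp2"\n' \
    "$image_path" "$image_path"
}

openjpeg_cmd /hdfs/path/to/imagefiles/imagefile.tif
# prints: openjpeg image-to-j2k --input="hdfs:///hdfs/path/to/imagefiles/imagefile.tif" --output="hdfs:///hdfs/path/to/imagefiles/imagefile.tif.jp2"
```

The other ToMaR calls in the script follow the same pattern with different tool names, operations and output suffixes.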


The script is then executed as follows:

{code}
pig -param image_paths=/hdfs/path/to/imagefiles/ tiff2jp2_migrate.pig
{code}
The script produces the result files in the same directory as the input image files. For example, for the input image path /hdfs/path/to/imagefiles/imagefile.tif:
# /hdfs/path/to/imagefiles/imagefile.tif.jp2 (result of the conversion to JP2)
# /hdfs/path/to/imagefiles/imagefile.tif.jp2.tif (result of the re-conversion to TIF)
# /hdfs/path/to/imagefiles/imagefile.tif.cmp.txt (result of the pixel-wise comparison between original and re-converted TIF files)
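
The output paths follow directly from the suffixes appended in the Pig script. A small shell sketch, with the hypothetical function name {{expected_outputs}}, lists the files that the script writes next to one input image. The {{.jp2.xml}} entry is the Jpylyzer report written by the validation step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the result paths that the Pig script derives
# from one input image path by appending its output suffixes.
expected_outputs() {
  local image_path=$1 suffix
  for suffix in .jp2 .jp2.xml .jp2.tif .cmp.txt; do
    printf '%s%s\n' "$image_path" "$suffix"
  done
}

expected_outputs /hdfs/path/to/imagefiles/imagefile.tif
```

Each printed path could then be checked on the cluster, for example with {{hdfs dfs -test -e <path>}}.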

h2. Evaluation summary

Files := Size of random sample
Total GB := Total size in Gigabytes
Secs := Processing time in seconds
Mins := Processing time in minutes
Hrs := Processing time in hours
Avg.p.f. := Average processing time per file in seconds
Obj/h := Number of objects processed per hour
GB/min := Throughput in Gigabytes per minute
GB/h := Throughput in Gigabytes per hour
Err := Number of processing errors

h3. Taverna Workflow - Sequential execution

| *Files* | *Total GB* | *Secs* | *Mins* | *Hrs* | *Avg.p.f.* | *Obj/h* | *GB/min* | *GB/h* | *Err* |
| 5 | 0,31 GB | 179 | 2,98 | 0,05 | 35,80 | 101 | 0,10 | 6,22 | 0 |
| 7 | 0,89 GB | 438 | 7,30 | 0,12 | 62,57 | 58 | 0,12 | 7,29 | 0 |
| 10 | 0,90 GB | 478 | 7,97 | 0,13 | 47,80 | 75 | 0,11 | 6,8 | 0 |
| 20 | 2,23 GB | 1150 | 19,17 | 0,32 | 57,50 | 63 | 0,12 | 6,98 | 0 |
| 30 | 2,99 GB | 1541 | 25,68 | 0,43 | 51,37 | 70 | 0,12 | 6,98 | 0 |
| 40 | 3,60 GB | 1900 | 31,67 | 0,53 | 47,50 | 76 | 0,11 | 6,81 | 0 |
| 50 | 3,46 GB | 2039 | 33,98 | 0,57 | 40,78 | 88 | 0,10 | 6,1 | 0 |
| 75 | 6,05 GB | 3425 | 57,08 | 0,95 | 45,67 | 79 | 0,11 | 6,36 | 0 |
| 100 | 8,30 GB | 4693 | 78,22 | 1,30 | 46,93 | 77 | 0,11 | 6,37 | 0 |
| 200 | 15,19 GB | 9246 | 154,10 | 2,57 | 46,23 | 78 | 0,10 | 5,91 | 0 |
| 300 | 19,07 GB | 11773 | 196,22 | 3,27 | 39,24 | 92 | 0,10 | 5,83 | 0 |
| 400 | 24,78 GB | 15644 | 260,73 | 4,35 | 39,11 | 92 | 0,10 | 5,70 | 0 |
| 500 | 34,55 GB | 21345 | 355,75 | 5,93 | 42,69 | 84 | 0,10 | 5,82 | 0 |
| 750 | 63,07 GB | 37397 | 623,28 | 10,39 | 49,86 | 72 | 0,10 | 6,07 | 0 |
| 1000 | 71,82 GB | 42376 | 706,27 | 11,77 | 42,38 | 85 | 0,10 | 6,10 | 0 |
| 2000 | 139,00 GB | 84938 | 1415,63 | 23,59 | 42,47 | 85 | 0,10 | 5,89 | 0 |
| 3000 | 211,85 GB | 128959 | 2149,32 | 35,82 | 42,99 | 84 | 0,10 | 5,91 | 0 |
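
The derived columns of the table follow from the number of files, the total size and the wall clock time. As a sketch (the function name {{derive_metrics}} is hypothetical, and the output uses a decimal point rather than the decimal comma of the tables), the following shell function recomputes them for one row:

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the table columns from files, total GB and seconds.
# Output: Mins Hrs Avg.p.f. Obj/h GB/min GB/h (space-separated, decimal point).
derive_metrics() {
  LC_ALL=C awk -v files="$1" -v gb="$2" -v secs="$3" 'BEGIN {
    printf "%.2f %.2f %.2f %.0f %.2f %.2f\n",
      secs / 60, secs / 3600, secs / files,
      files * 3600 / secs, gb * 60 / secs, gb * 3600 / secs
  }'
}

# Last row of the sequential table: 3000 files, 211.85 GB, 128959 s
derive_metrics 3000 211.85 128959
# prints: 2149.32 35.82 42.99 84 0.10 5.91
```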


h3. Pig Workflow - Distributed Execution

| *Files* | *Total GB* | *Secs* | *Mins* | *Hrs* | *Avg.p.f.* | *Obj/h* | *GB/min* | *GB/h* | *Err* |
| 5 | 0,31 GB | 96 | 1,60 | 0,03 | 19,20 | 188 | 0,19 | 11,60 | 0 |
| 7 | 0,89 GB | 101 | 1,68 | 0,03 | 14,43 | 250 | 0,53 | 31,64 | 0 |
| 10 | 0,90 GB | 103 | 1,72 | 0,03 | 10,30 | 350 | 0,53 | 31,56 | 0 |
| 20 | 2,23 GB | 114 | 1,90 | 0,03 | 5,70 | 632 | 1,17 | 70,45 | 0 |
| 30 | 2,99 GB | 138 | 2,30 | 0,04 | 4,60 | 783 | 1,30 | 77,99 | 0 |
| 40 | 3,60 GB | 161 | 2,68 | 0,04 | 4,03 | 894 | 1,34 | 80,41 | 0 |
| 50 | 3,46 GB | 183 | 3,05 | 0,05 | 3,66 | 984 | 1,13 | 68,01 | 0 |
| 75 | 6,05 GB | 272 | 4,53 | 0,08 | 3,63 | 993 | 1,34 | 80,11 | 0 |
| 100 | 8,30 GB | 373 | 6,22 | 0,10 | 3,73 | 965 | 1,34 | 80,15 | 0 |
| 200 | 15,19 GB | 669 | 11,15 | 0,19 | 3,35 | 1076 | 1,36 | 81,73 | 0 |
| 300 | 19,07 GB | 808 | 13,47 | 0,22 | 2,69 | 1337 | 1,42 | 84,95 | 0 |
| 400 | 24,78 GB | 1091 | 18,18 | 0,30 | 2,73 | 1320 | 1,36 | 81,77 | 0 |
| 500 | 34,55 GB | 1397 | 23,28 | 0,39 | 2,79 | 1288 | 1,48 | 89,03 | 0 |
| 750 | 63,07 GB | 2399 | 39,98 | 0,67 | 3,20 | 1125 | 1,58 | 94,64 | 0 |
| 1000 | 71,82 GB | 2746 | 45,77 | 0,76 | 2,75 | 1311 | 1,57 | 94,16 | 0 |
| 2000 | 139,00 GB | 5450 | 90,83 | 1,51 | 2,73 | 1321 | 1,53 | 91,82 | 0 |
| 3000 | 211,85 GB | 8328 | 138,80 | 2,31 | 2,78 | 1297 | 1,53 | 91,58 | 0 |
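
Dividing the sequential wall clock time by the distributed wall clock time gives the speed-up for a given sample size. A minimal shell sketch, with the hypothetical function name {{speedup}}, applied to three rows of the two tables above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: speed-up = sequential seconds / distributed seconds.
speedup() {
  LC_ALL=C awk -v seq="$1" -v dist="$2" 'BEGIN { printf "%.1f\n", seq / dist }'
}

speedup 179 96        # 5 files    -> 1.9
speedup 4693 373      # 100 files  -> 12.6
speedup 128959 8328   # 3000 files -> 15.5
```
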
The following diagram shows the comparison of wall clock times in seconds (y-axis) of the Taverna workflow and the Pig workflow using an increasing number of files (x-axis).
!wallclocktime_concept_vs_scalable.PNG|border=1,width=648,height=425!

_Figure 3 (above): Wallclock times of concept workflow and scalable workflow_

However, the throughput that can be reached using [this|SP:ONB Hadoop Platform] cluster and the chosen Pig/Hadoop job configuration is limited. As figure 4 shows, the throughput (measured in Gigabytes per hour, GB/h) grows rapidly as the number of files being processed increases, and then stabilises at slightly more than 90 GB/h when more than 750 image files are processed.

!throughput_gb_per_h.png|border=1,width=654,height=363!
_Figure 4 (above): Throughput of the distributed execution measured in Gigabytes per hour (GB/h) against the number of files processed_