h2. SCAPE Platform workflow - distributed processing
[Apache Pig|http://pig.apache.org/] was used to create a scalable version of this workflow: each processing step of the sequential Taverna workflow is represented by a corresponding Pig Latin statement.
The comment above each processing step in the script below indicates the corresponding processing component in the Taverna workflow.
{code}
/* file: tiff2jp2_migrate.pig */
/* Built from https://github.com/openplanets/tomar */
REGISTER /home/onbfue/ToMaR/target/tomar-1.5.2-SNAPSHOT.jar;
DEFINE ToMarService eu.scape_project.pt.udf.ControlLineUDF();
DEFINE XPathService eu.scape_project.pt.udf.XPathFunction();
SET job.name 'Tomar-Pig-Taverna-OpenJpeg';
/* make sure that one task per input file is created */
SET pig.noSplitCombination true;
SET mapred.task.timeout 420000;
%DECLARE toolspecs_path '/hdfs/path/to/toolspecs';
%DECLARE xpath_exp1 '/fits/filestatus/valid';
%DECLARE xpath_exp2 '/fits/identification/identity/@mimetype';
%DECLARE xpath_exp3 '/jpylyzer/isValidJP2';
/* STEP 1: load image paths */
image_paths = LOAD '$image_paths' USING PigStorage() AS (image_path: chararray);
/* STEP 2: validation of tiff image files using fits */
fits = FOREACH image_paths GENERATE image_path as image_path, ToMarService('$toolspecs_path', CONCAT(CONCAT('fits stdxml --input="hdfs://', image_path), '"')) as xml_text;
/* STEP 3: extract fits validity and mime-type using xpath */
fits_validation_list = FOREACH fits GENERATE image_path, XPathService('$xpath_exp1', xml_text) AS node_list1, XPathService('$xpath_exp2', xml_text) AS node_list2;
fits_validation = FOREACH fits_validation_list GENERATE image_path, FLATTEN(node_list1) as node1, FLATTEN(node_list2) as node2;
/* STEP 4: store intermediate fits results */
STORE fits INTO 'output/fits';
STORE fits_validation INTO 'output/fits_validation';
/* STEP 5: migration of tiff image files to jpeg2000 */
openjpeg = FOREACH fits_validation GENERATE image_path as image_path, ToMarService('$toolspecs_path',CONCAT( CONCAT( CONCAT('openjpeg image-to-j2k --input="hdfs://', image_path), '" --output="'), CONCAT( CONCAT( CONCAT('hdfs://', image_path), '.jp2'),'"'))) as ret_str;
STORE openjpeg INTO 'output/openjpeg';
/* STEP 6: validation of migrated jpeg2000 files using jpylyzer */
jpylyzer = FOREACH fits_validation GENERATE image_path as image_path, ToMarService('$toolspecs_path',CONCAT(CONCAT(CONCAT('jpylyzer validate --input="hdfs://', CONCAT(image_path,'.jp2')), '" --output="'),CONCAT(CONCAT( CONCAT('hdfs://', image_path), '.jp2.xml'),'"'))) as jpy_xml;
STORE jpylyzer INTO 'output/jpylyzer';
/* STEP 7: extract jpylyzer validity using xpath */
jpylyzer_validation_list = FOREACH jpylyzer GENERATE image_path, XPathService('$xpath_exp3', jpy_xml) AS jpy_node_list;
jpylyzer_validation = FOREACH jpylyzer_validation_list GENERATE image_path, FLATTEN(jpy_node_list) as node1;
STORE jpylyzer_validation INTO 'output/jpylyzer_validation';
/* STEP 8: migrate jpeg2000 image file back to tiff */
j2k_to_img = FOREACH fits_validation GENERATE image_path as image_path, ToMarService('$toolspecs_path',CONCAT( CONCAT( CONCAT('openjpeg j2k-to-image --input="hdfs://', CONCAT(image_path,'.jp2')), '" --output="'), CONCAT( CONCAT( CONCAT('hdfs://', image_path), '.jp2.tif'),'"'))) as j2k_to_img_ret_str;
STORE j2k_to_img INTO 'output/j2k_to_img';
/* STEP 9: compare original to restored image file */
imgcompare = FOREACH fits_validation GENERATE image_path as image_path, ToMarService('$toolspecs_path',CONCAT( CONCAT(CONCAT('imagemagick compare-pixelwise --inputfirst="hdfs://', image_path), CONCAT(CONCAT('" --inputsecond="hdfs://',CONCAT(image_path,'.jp2.tif')),'" --diffoutput="hdfs://')),CONCAT(image_path,'.cmp.txt"'))) as imgcompare_ret_str;
STORE imgcompare INTO 'output/imgcompare';
{code}
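The nested {{CONCAT}} calls in the script are hard to read. As an illustration only, the following Python sketch reproduces the command string that STEP 2 assembles before handing it to the ToMaR UDF (the function name {{fits_command}} is ours, not part of ToMaR):

```python
def fits_command(image_path: str) -> str:
    """Rebuild the string produced by
    CONCAT(CONCAT('fits stdxml --input="hdfs://', image_path), '"')."""
    return 'fits stdxml --input="hdfs://' + image_path + '"'

# Example for one input image on HDFS
print(fits_command("/hdfs/path/to/imagefiles/imagefile.tif"))
# fits stdxml --input="hdfs:///hdfs/path/to/imagefiles/imagefile.tif"
```

The other steps (openjpeg, jpylyzer, imagemagick) build their command lines with the same pattern, only with additional {{\-\-output}} or {{\-\-diffoutput}} arguments appended.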
The following ToMaR tool specification files were used in this experiment:
* [^openjpeg.xml]
* [^imagemagick.xml] ([^compare.sh])
* [^fits.xml]
* [^jpylyzer.xml]
Note that these XML-based tool descriptions must be stored in the HDFS directory _/hdfs/path/to/toolspecs_, which is declared as the _toolspecs_path_ variable in the Pig script above.
The script is then executed as follows:
{code}
pig -param image_paths=/hdfs/path/to/imagefiles/ tiff2jp2_migrate.pig
{code}
and produces the result files in the same directory as the input image files. For an input image path /hdfs/path/to/imagefiles/imagefile.tif, the following files are created:
# /hdfs/path/to/imagefiles/imagefile.tif.jp2 (result of the migration to JP2)
# /hdfs/path/to/imagefiles/imagefile.tif.jp2.xml (jpylyzer validation report for the migrated JP2)
# /hdfs/path/to/imagefiles/imagefile.tif.jp2.tif (result of the re-conversion back to TIF)
# /hdfs/path/to/imagefiles/imagefile.tif.cmp.txt (result of the pixel-wise comparison between the original and the re-converted TIF)
h2. Evaluation summary
Files := Size of random sample
Total GB := Total size in Gigabytes
Secs := Processing time in seconds
Mins := Processing time in minutes
Hrs := Processing time in hours
Avg.p.f. := Average processing time per file in seconds
Obj/h := Number of objects processed per hour
GB/min := Throughput in Gigabytes per minute
GB/h := Throughput in Gigabytes per hour
Err := Number of processing errors
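All derived columns follow from the three measured ones (Files, Total GB, Secs). A small Python sketch, using the 3000-file row of the sequential Taverna table below as the example:

```python
def derived_metrics(files: int, total_gb: float, secs: int) -> dict:
    """Compute the derived table columns from the measured values."""
    return {
        "Mins": secs / 60.0,
        "Hrs": secs / 3600.0,
        "Avg.p.f.": secs / files,            # average seconds per file
        "Obj/h": files / (secs / 3600.0),    # objects processed per hour
        "GB/min": total_gb / (secs / 60.0),  # throughput per minute
        "GB/h": total_gb / (secs / 3600.0),  # throughput per hour
    }

# 3000-file / 211.85 GB sample of the sequential Taverna run
m = derived_metrics(3000, 211.85, 128959)
print(round(m["Avg.p.f."], 2))  # 42.99
print(round(m["Obj/h"]))        # 84
print(round(m["GB/h"], 2))      # 5.91
```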
h3. Taverna Workflow - Sequential execution
| *Files* | *Total GB* | *Secs* | *Mins* | *Hrs* | *Avg.p.f.* | *Obj/h* | *GB/min* | *GB/h* | *Err* |
| 5 | 0,31 GB | 179 | 2,98 | 0,05 | 35,80 | 101 | 0,10 | 6,22 | 0 |
| 7 | 0,89 GB | 438 | 7,30 | 0,12 | 62,57 | 58 | 0,12 | 7,29 | 0 |
| 10 | 0,90 GB | 478 | 7,97 | 0,13 | 47,80 | 75 | 0,11 | 6,8 | 0 |
| 20 | 2,23 GB | 1150 | 19,17 | 0,32 | 57,50 | 63 | 0,12 | 6,98 | 0 |
| 30 | 2,99 GB | 1541 | 25,68 | 0,43 | 51,37 | 70 | 0,12 | 6,98 | 0 |
| 40 | 3,60 GB | 1900 | 31,67 | 0,53 | 47,50 | 76 | 0,11 | 6,81 | 0 |
| 50 | 3,46 GB | 2039 | 33,98 | 0,57 | 40,78 | 88 | 0,10 | 6,1 | 0 |
| 75 | 6,05 GB | 3425 | 57,08 | 0,95 | 45,67 | 79 | 0,11 | 6,36 | 0 |
| 100 | 8,30 GB | 4693 | 78,22 | 1,30 | 46,93 | 77 | 0,11 | 6,37 | 0 |
| 200 | 15,19 GB | 9246 | 154,10 | 2,57 | 46,23 | 78 | 0,10 | 5,91 | 0 |
| 300 | 19,07 GB | 11773 | 196,22 | 3,27 | 39,24 | 92 | 0,10 | 5,83 | 0 |
| 400 | 24,78 GB | 15644 | 260,73 | 4,35 | 39,11 | 92 | 0,10 | 5,70 | 0 |
| 500 | 34,55 GB | 21345 | 355,75 | 5,93 | 42,69 | 84 | 0,10 | 5,82 | 0 |
| 750 | 63,07 GB | 37397 | 623,28 | 10,39 | 49,86 | 72 | 0,10 | 6,07 | 0 |
| 1000 | 71,82 GB | 42376 | 706,27 | 11,77 | 42,38 | 85 | 0,10 | 6,10 | 0 |
| 2000 | 139,00 GB | 84938 | 1415,63 | 23,59 | 42,47 | 85 | 0,10 | 5,89 | 0 |
| 3000 | 211,85 GB | 128959 | 2149,32 | 35,82 | 42,99 | 84 | 0,10 | 5,91 | 0 |
h3. Pig Workflow - Distributed Execution
| *Files* | *Total GB* | *Secs* | *Mins* | *Hrs* | *Avg.p.f.* | *Obj/h* | *GB/min* | *GB/h* | *Err* |
| 5 | 0,31 GB | 96 | 1,60 | 0,03 | 19,20 | 188 | 0,19 | 11,60 | 0 |
| 7 | 0,89 GB | 101 | 1,68 | 0,03 | 14,43 | 250 | 0,53 | 31,64 | 0 |
| 10 | 0,90 GB | 103 | 1,72 | 0,03 | 10,30 | 350 | 0,53 | 31,56 | 0 |
| 20 | 2,23 GB | 114 | 1,90 | 0,03 | 5,70 | 632 | 1,17 | 70,45 | 0 |
| 30 | 2,99 GB | 138 | 2,30 | 0,04 | 4,60 | 783 | 1,30 | 77,99 | 0 |
| 40 | 3,60 GB | 161 | 2,68 | 0,04 | 4,03 | 894 | 1,34 | 80,41 | 0 |
| 50 | 3,46 GB | 183 | 3,05 | 0,05 | 3,66 | 984 | 1,13 | 68,01 | 0 |
| 75 | 6,05 GB | 272 | 4,53 | 0,08 | 3,63 | 993 | 1,34 | 80,11 | 0 |
| 100 | 8,30 GB | 373 | 6,22 | 0,10 | 3,73 | 965 | 1,34 | 80,15 | 0 |
| 200 | 15,19 GB | 669 | 11,15 | 0,19 | 3,35 | 1076 | 1,36 | 81,73 | 0 |
| 300 | 19,07 GB | 808 | 13,47 | 0,22 | 2,69 | 1337 | 1,42 | 84,95 | 0 |
| 400 | 24,78 GB | 1091 | 18,18 | 0,30 | 2,73 | 1320 | 1,36 | 81,77 | 0 |
| 500 | 34,55 GB | 1397 | 23,28 | 0,39 | 2,79 | 1288 | 1,48 | 89,03 | 0 |
| 750 | 63,07 GB | 2399 | 39,98 | 0,67 | 3,20 | 1125 | 1,58 | 94,64 | 0 |
| 1000 | 71,82 GB | 2746 | 45,77 | 0,76 | 2,75 | 1311 | 1,57 | 94,16 | 0 |
| 2000 | 139,00 GB | 5450 | 90,83 | 1,51 | 2,73 | 1321 | 1,53 | 91,82 | 0 |
| 3000 | 211,85 GB | 8328 | 138,80 | 2,31 | 2,78 | 1297 | 1,53 | 91,58 | 0 |
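As a rough comparison of the two runs, the following Python sketch computes the speedup of the distributed Pig execution over the sequential Taverna execution for the largest (3000-file) sample, using the wall-clock times from the tables above:

```python
# Wall-clock times for the 3000-file / 211.85 GB sample (from the tables above)
taverna_secs = 128959  # sequential Taverna workflow
pig_secs = 8328        # distributed Pig workflow

speedup = taverna_secs / pig_secs
print(round(speedup, 1))  # 15.5
```

The distributed execution is thus roughly 15x faster than the sequential one on this sample, with no processing errors in either run.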