
h2. Investigator(s)

Sven Schlarb

h2. Dataset

[SP:Austrian National Library Tresor Music Collection]








h2. Platform

[SP:ONB Hadoop Platform]







h2. Purpose of this experiment

The purpose of this experiment is to evaluate the performance of a scalable workflow for migrating TIFF images to the JPEG2000 format, compared with an equivalent Taverna workflow that processes the data sequentially.



h2. Taverna workflow - sequential processing

The proof-of-concept version of the TIFF to JPEG2000 image migration workflow with quality assurance was created as a Taverna workflow illustrated by the following workflow diagram:

!TavernaWorkflow4276.png|border=1,width=235,height=786!


Diagram of the TIFF to JPEG2000 image migration workflow. The workflow is available on MyExperiment at [http://www.myexperiment.org/workflows/4276.html].

The Taverna workflow reads a text file containing absolute paths to TIF image files and converts them to JP2 image files using OpenJPEG ([https://code.google.com/p/openjpeg|https://code.google.com/p/openjpeg]).

Based on the input text file, the workflow creates a Taverna list to be processed file by file. A temporary directory is created (createtmpdir) where the migrated image files and some temporary tool outputs are stored.

Before the actual migration starts, the workflow checks whether the TIF input images are valid file format instances using FITS ([https://code.google.com/p/fits|https://code.google.com/p/fits], JHove2 under the hood, [http://www.jhove2.org|http://www.jhove2.org]). An XPath service then extracts the validity information from the XML-based FITS validation report.
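The XPath extraction step can be illustrated with a minimal Python sketch. It looks up the same two locations that the Pig script further down declares as {{xpath_exp1}} and {{xpath_exp2}}. The sample report is a hypothetical, namespace-free stand-in; real FITS reports contain more elements and usually an XML namespace, so this is not the actual XPath service of the workflow.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a FITS validation report.
SAMPLE_FITS_REPORT = """<fits>
  <identification>
    <identity format="Tagged Image File Format" mimetype="image/tiff"/>
  </identification>
  <filestatus>
    <well-formed>true</well-formed>
    <valid>true</valid>
  </filestatus>
</fits>"""


def extract_validity(report_xml):
    """Return (valid, mimetype) taken from a FITS-like XML report."""
    root = ET.fromstring(report_xml)  # root is the <fits> element
    # Corresponds to /fits/filestatus/valid
    valid_node = root.find("filestatus/valid")
    # Corresponds to /fits/identification/identity/@mimetype
    identity_node = root.find("identification/identity")
    valid = valid_node is not None and (valid_node.text or "").strip() == "true"
    mimetype = identity_node.get("mimetype") if identity_node is not None else None
    return valid, mimetype


print(extract_validity(SAMPLE_FITS_REPORT))
```

A report without a {{filestatus/valid}} element is treated as not valid here, which is a design choice of this sketch rather than documented behaviour of the workflow.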

If the images are valid TIF images, they are migrated to the JPEG2000 (JP2) image file format using OpenJPEG 2.0 (opj_compress).

Subsequently, the workflow checks whether the migrated images are valid JP2 images using the SCAPE tool Jpylyzer ([http://www.openplanetsfoundation.org/software/jpylyzer|http://www.openplanetsfoundation.org/software/jpylyzer]). An XPath service (XPathJpylyzer) extracts the validity information from the XML-based Jpylyzer validation report.

Finally, the workflow verifies that the migrated JP2 images are valid surrogates of the original TIF images by restoring a TIF image from each converted JP2 image and checking whether the original and the restored image are identical.
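The order of the checks described above can be sketched for a single image as follows. All five callables are hypothetical wrappers around the external tools (FITS, opj_compress, Jpylyzer, opj_decompress and the image comparison); stopping after a failed JP2 validation is an assumption of this sketch, since the text only states that invalid TIF input is not migrated.

```python
def migrate_with_qa(tiff_path, validate_tiff, to_jp2, validate_jp2, to_tiff, same_pixels):
    """Run migration and quality assurance for one image.

    The callables are hypothetical tool wrappers supplied by the caller.
    Returns a dict recording which checks passed.
    """
    result = {"tiff_valid": False, "jp2_valid": False, "round_trip_ok": False}

    # Step 1: validate the TIF input; invalid input is not migrated.
    result["tiff_valid"] = validate_tiff(tiff_path)
    if not result["tiff_valid"]:
        return result

    # Step 2: migrate to JP2 and validate the migrated file.
    jp2_path = to_jp2(tiff_path)
    result["jp2_valid"] = validate_jp2(jp2_path)
    if not result["jp2_valid"]:
        return result  # assumption: skip the round trip for invalid JP2 files

    # Step 3: restore a TIF from the JP2 and compare it with the original.
    restored_path = to_tiff(jp2_path)
    result["round_trip_ok"] = same_pixels(tiff_path, restored_path)
    return result


# Usage with stub tools that always succeed (for illustration only).
outcome = migrate_with_qa(
    "/tmp/example.tif",
    validate_tiff=lambda path: True,
    to_jp2=lambda path: path + ".jp2",
    validate_jp2=lambda path: True,
    to_tiff=lambda path: path + ".tif",
    same_pixels=lambda original, restored: True,
)
print(outcome)
```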

The sequential execution of this workflow serves as a reference point for measuring the parallelisation efficiency of the scalable version, and it also shows how the processing times of the individual components compare to each other.
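This page does not define how parallelisation efficiency is computed. A common convention, assumed here, is speedup = sequential time / parallel time and efficiency = speedup / number of parallel workers. The sketch below uses made-up placeholder timings, not measurements from this experiment.

```python
def speedup(sequential_seconds, parallel_seconds):
    """Ratio of sequential to parallel wall-clock time."""
    return sequential_seconds / parallel_seconds


def parallel_efficiency(sequential_seconds, parallel_seconds, workers):
    """Speedup divided by the number of parallel workers (1.0 is ideal)."""
    return speedup(sequential_seconds, parallel_seconds) / workers


# Placeholder values for illustration only; not results of this experiment.
example_speedup = speedup(1000.0, 125.0)
example_efficiency = parallel_efficiency(1000.0, 125.0, workers=10)
print(example_speedup, example_efficiency)
```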

The following diagram, created from a sample of 1000 images of the [SP:Austrian National Library Tresor Music Collection], shows the average execution time of each workflow component in seconds:

!distribution_execution_times.PNG|border=1,width=473,height=264!

h2. SCAPE Platform workflow - distributed processing

Apache Pig was used to create a scalable version of this workflow. The different processing steps of the Taverna workflow for sequential processing are represented by Pig Latin statements.

The comment above each processing step in the script below indicates the corresponding processing component in the Taverna workflow.


{code}
REGISTER tomar-1.5.2-SNAPSHOT.jar;

DEFINE ToMarService eu.scape_project.pt.udf.ControlLineUDF();
DEFINE XPathService eu.scape_project.pt.udf.XPathFunction();

SET job.name 'Tomar-Pig-Taverna-OpenJpeg';
SET pig.noSplitCombination true;

%DECLARE toolspecs_path '/user/onbfue/alan/toolspecs';
%DECLARE xpath_exp1 '/fits/filestatus/valid';
%DECLARE xpath_exp2 '/fits/identification/identity/@mimetype';
%DECLARE xpath_exp3 '/jpylyzer/isValidJP2';

/* STEP 1: load image paths - Taverna: image_paths_from_dir */
image_pathes = LOAD '$image_pathes' USING PigStorage() AS (image_path: chararray);

/* STEP 2: validation of tiff image files using fits - Taverna: fitsValidation */
fits = FOREACH image_pathes GENERATE image_path as image_path, ToMarService('$toolspecs_path', CONCAT(CONCAT('fits stdxml --input="hdfs://', image_path), '"')) as xml_text;

/* STEP 3: extract tiff validity using xpath - Taverna: XPathJhove2 */
fits_validation_list = FOREACH fits GENERATE image_path, XPathService('$xpath_exp1', xml_text) AS node_list1, XPathService('$xpath_exp2', xml_text) AS node_list2;
fits_validation = FOREACH fits_validation_list GENERATE image_path, FLATTEN(node_list1) as node1, FLATTEN(node_list2) as node2;
STORE fits INTO 'output/fits';
STORE fits_validation INTO 'output/fits_validation';

/* STEP 4: migration of tiff image files to jpeg2000 - Taverna: opj_compress */
openjpeg = FOREACH fits_validation GENERATE image_path as image_path, ToMarService('$toolspecs_path',CONCAT( CONCAT(  CONCAT('openjpeg image-to-j2k --input="hdfs://', image_path), '" --output="'),  CONCAT(  CONCAT(   CONCAT('hdfs://', image_path), '.jp2'),'"'))) as ret_str;
STORE openjpeg INTO 'output/openjpeg';

/* STEP 5: validation of migrated jpeg2000 files using jpylyzer - Taverna: jpylyzerValidation */
jpylyzer = FOREACH fits_validation GENERATE image_path as image_path, ToMarService('$toolspecs_path',CONCAT(CONCAT(CONCAT('jpylyzer validate --input="hdfs://', CONCAT(image_path,'.jp2')), '" --output="'),CONCAT(CONCAT( CONCAT('hdfs://', image_path), '.jp2.xml'),'"'))) as jpy_xml;
STORE jpylyzer INTO 'output/jpylyzer';

/* STEP 6: extract jpylyzer validity using xpath - Taverna: XPathJpylyzer */
jpylyzer_validation_list = FOREACH jpylyzer GENERATE image_path, XPathService('$xpath_exp3', jpy_xml) AS jpy_node_list;
jpylyzer_validation = FOREACH jpylyzer_validation_list GENERATE image_path, FLATTEN(jpy_node_list) as node1;
STORE jpylyzer_validation INTO 'output/jpylyzer_validation';

/* STEP 7: migrate jpeg2000 image file back to tiff - Taverna: opj_decompress */
j2k_to_img = FOREACH fits_validation GENERATE image_path as image_path, ToMarService('$toolspecs_path',CONCAT( CONCAT(  CONCAT('openjpeg j2k-to-image --input="hdfs://', CONCAT(image_path,'.jp2')), '" --output="'),  CONCAT(  CONCAT(   CONCAT('hdfs://', image_path), '.jp2.tif'),'"'))) as j2k_to_img_ret_str;
STORE j2k_to_img INTO 'output/j2k_to_img';

/* STEP 8: compare original to restored image file - Taverna: compare */
imgcompare = FOREACH fits_validation GENERATE image_path as image_path, ToMarService('$toolspecs_path',CONCAT( CONCAT(CONCAT('imagemagick compare-pixelwise --inputfirst="hdfs://', image_path), CONCAT(CONCAT('" --inputsecond="hdfs://',CONCAT(image_path,'.jp2.tif')),'" --diffoutput="hdfs://')),CONCAT(image_path,'.cmp.txt"'))) as imgcompare_ret_str;
STORE imgcompare INTO 'output/imgcompare';
{code}
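The nested CONCAT calls in the script above are hard to read. The following Python sketch reproduces the same string concatenation to show which command line each step passes to the ToMar UDF for one image. The tool and action names are copied from the script; the example path is hypothetical.

```python
def tomar_commands(image_path):
    """Return the command strings the Pig script builds for one image path."""
    src = "hdfs://" + image_path  # original TIF
    jp2 = src + ".jp2"            # migrated JP2 written next to the original
    return {
        "fits": 'fits stdxml --input="' + src + '"',
        "openjpeg": 'openjpeg image-to-j2k --input="' + src
                    + '" --output="' + jp2 + '"',
        "jpylyzer": 'jpylyzer validate --input="' + jp2
                    + '" --output="' + jp2 + '.xml"',
        "j2k_to_img": 'openjpeg j2k-to-image --input="' + jp2
                      + '" --output="' + jp2 + '.tif"',
        "imgcompare": 'imagemagick compare-pixelwise --inputfirst="' + src
                      + '" --inputsecond="' + jp2 + '.tif"'
                      + ' --diffoutput="' + src + '.cmp.txt"',
    }


# Hypothetical absolute HDFS path used only for illustration.
commands = tomar_commands("/user/example/images/0001.tif")
for step, command in commands.items():
    print(step + ": " + command)
```

Because the input file lists absolute paths, prefixing {{hdfs://}} yields URIs of the form {{hdfs:///user/...}}, and every derived file (.jp2, .jp2.xml, .jp2.tif, .cmp.txt) is written next to the original image.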


h2. Evaluation summary
