Skip to end of metadata
Go to start of metadata

Investigator(s)

Sven Schlarb

Dataset

Austrian National Library Tresor Music Collection

Platform

ONB Hadoop Platform

Purpose of this experiment

The purpose of this experiment is to evaluate the performance of a scalable workflow for migrating TIFF images to images in the JPEG2000 format compared to an equivalent Taverna version of the workflow processing the data sequentially.

Evaluation method

A Taverna workflow for sequential processing serves as a reference point for the large-scale execution. Out of the full Austrian National Library Tresor Music Collection data set subsets of increasing size are selected by a random process.

The following bash statement is used to create a random sample from the full data set:

The statement prepends a random number to the file paths list and orders the list subsequently. Variable NUM is the desired size of the data set. The resulting file contains the local file paths and can be used as input for the Taverna workflow presented in the next section.

Additionally the files are uploaded to HDFS as input for the large-scale workflow execution.

By that way it is possible to compare the sequential execution time to the large-scale processing time.

Taverna workflow - sequential processing

The proof-of-concept version of the TIFF to JPEG2000 image migration workflow with quality assurance was created as a Taverna workflow illustrated by the following workflow diagram:

Figure 1 (above): Taverna workflow

Diagram of the TIFF to JPEG2000 image migration workflow, Workflow available on MyExperiment at http://www.myexperiment.org/workflows/4276.html
The Taverna workflow reads a textfile containing absolute paths to TIF image files and converts them to JP2 image files using OpenJPEG (https://code.google.com/p/openjpeg).

Based on the input text file, the workflow creates a Taverna list to be processed file by file. A temporary directory is created (createtmpdir) where the migrated image files and some temporary tool outputs are stored.

Before starting the actual migration, it is checked if the TIF input images are valid file format instances using Fits (https://code.google.com/p/fits, JHove2 under the hood, http://www.jhove2.org). An XPath service is used to extract the validity information from the XML-based Fits validation report.

If the images are valid TIF images, they are migrated to the JPEG2000 (JP2) image file format using OpenJPEG 2.0 (opj_compress).

Subsequently, it is again checked if the migrated images are valid JP2 images using SCAPE tool Jpylyzer (http://www.openplanetsfoundation.org/software/jpylyzer). An XPath service (XPathJpylyzer) is used to extract the validity information from the XML-based Jpylyzer validation report.

Finally, we verify if the migrated JP2 images are valid surrogates of the original TIF images by restoring the original TIF image from the converted JP2 image and comparing whether original and restored images are identical.

The sequential execution of this workflow is used as a reference point for measuring the parallelisation efficiency of the scalable version and it allows measuring how the processing times of the different components compare to each other.

The following diagram shows the average execution time of each component of the workflow in seconds and was created from a 1000 images sample of the Austrian National Library Tresor Music Collection:

 

Figure 2 (above): execution times of each of the workflows’ steps

In the design phase this analysis is used to examine the average execution times for the individual tools. As a consequence of this experiment we might conclude, that over 4 seconds for the the FITS-based TIF image validation takes too much time and that this processing step needs to be improved, while the Jpylyzer validation is acceptable taking only slightly more than 1 second per image file in average.

SCAPE Platform workflow - distributed processing

Apache Pig was used to create a scalable version of this workflow. The different processing steps of the Taverna workflow for sequential processing are represented by Pig Latin statements.

The comments of each processing step In the script below indicate which is the corresponding processing component in the Taverna workflow.

The following ToMaR tool specification files were used in this experiment:

Note that these XML-based tool descriptions must be stored in the directory /hdfs/path/to/toolspecs which is declared as the toolspecs_path variable in the pig script above.

The script is then executed as follows:

and produces the result files in the same directory where the input image files are located, for example, input image path /hdfs/path/to/imagefiles/imagefile.tif:

  1. /hdfs/path/to/imagefiles/imagefile.tif.jp2 (result of the conversion to JP2)
  2. /hdfs/path/to/imagefiles/imagefile.tif.jp2.tif (result of the re-conversion to TIF)
  3. /hdfs/path/to/imagefiles/imagefile.tif.txt (result of the pixel-wise comparison between original and re-converted TIF files)

Evaluation summary

Files := Size of random sample

Total GB := Total size in Gigabytes

Secs := Processing time in seconds

Mins := Processing time in minutes

Hrs := Processing time in hours

Afg.p.f. := Average processing time per file in seconds

Obj/h := Number of objects processed per hour

GB/min := Throughput in Gigabytes per minute

GB/min := Throughput in Gigabytes per hour

Err := Number of processing errors

Taverna Workflow - Sequential execution

Files Total GB Secs Mins Hrs Avg.p.f. Obj/h GB/min GB/h Err
5 0,31 GB 179 2,98 0,05 35,80 101 0,10 6,22 0
7 0,89 GB 438 7,30 0,12 62,57 58 0,12 7,29 0
10 0,90 GB 478 7,97 0,13 47,80 75 0,11 6,8 0
20 2,23 GB 1150 19,17 0,32 57,50 63 0,12 6,98 0
30 2,99 GB 1541 25,68 0,43 51,37 70 0,12 6,98 0
40 3,60 GB 1900 31,67 0,53 47,50 76 0,11 6,81 0
50 3,46 GB 2039 33,98 0,57 40,78 88 0,10 6,1 0
75 6,05 GB 3425 57,08 0,95 45,67 79 0,11 6,36 0
100 8,30 GB 4693 78,22 1,30 46,93 77 0,11 6,37 0
200 15,19 GB 9246 154,10 2,57 46,23 78 0,10 5,91 0
300 19,07 GB 11773 196,22 3,27 39,24 92 0,10 5,83 0
400 24,78 GB 15644 260,73 4,35 39,11 92 0,10 5,70 0
500 34,55 GB 21345 355,75 5,93 42,69 84 0,10 5,82 0
750 63,07 GB 37397 623,28 10,39 49,86 72 0,10 6,07 0
1000 71,82 GB 42376 706,27 11,77 42,38 85 0,10 6,10 0
2000 139,00 GB 84938 1415,63 23,59 42,47 85 0,10 5,89 0
3000 211,85 GB 128959 2149,32 35,82 42,99 84 0,10 5,91 0

Pig Workflow - Distributed Execution

Files Total GB Secs Mins Hrs Avg.p.f. Obj/h GB/min GB/h Err
5 0,31 GB 96 1,60 0,03 19,20 188 0,19 11,60 0
7 0,89 GB 101 1,68 0,03 14,43 250 0,53 31,64 0
10 0,90 GB 103 1,72 0,03 10,30 350 0,53 31,56 0
20 2,23 GB 114 1,90 0,03 5,70 632 1,17 70,45 0
30 2,99 GB 138 2,30 0,04 4,60 783 1,30 77,99 0
40 3,60 GB 161 2,68 0,04 4,03 894 1,34 80,41 0
50 3,46 GB 183 3,05 0,05 3,66 984 1,13 68,01 0
75 6,05 GB 272 4,53 0,08 3,63 993 1,34 80,11 0
100 8,30 GB 373 6,22 0,10 3,73 965 1,34 80,15 0
200 15,19 GB 669 11,15 0,19 3,35 1076 1,36 81,73 0
300 19,07 GB 808 13,47 0,22 2,69 1337 1,42 84,95 0
400 24,78 GB 1091 18,18 0,30 2,73 1320 1,36 81,77 0
500 34,55 GB 1397 23,28 0,39 2,79 1288 1,48 89,03 0
750 63,07 GB 2399 39,98 0,67 3,20 1125 1,58 94,64 0
1000 71,82 GB 2746 45,77 0,76 2,75 1311 1,57 94,16 0
2000 139,00 GB 5450 90,83 1,51 2,73 1321 1,53 91,82 0
3000 211,85 GB 8328 138,80 2,31 2,78 1297 1,53 91,58 0

The following diagram shows the comparison of wall clock times in seconds (y-axis) of the Taverna workflow and the Pig workflow using an increasing number of files (x-axis).

Figure 3 (above): Wallclock times of concept workflow and scalable workflow

However, the throughput we can reach using this cluster and the chosen pig/hadoop job configuration is limited; as figure 4 shows, the throughput (measured in Gigabytes per hour – GB/h) is rapidly growing when the number of files being processed is increased, and then stabilises at a value around slightly more than 90 Gigabytes per hour (GB/h) when processing more than 750 image files.
Figure 4 (above): Throughput of the distributed execution measured in Gigabytes per hour (GB/h) against the number of files processed

Labels:
None
Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.