


h2. Platform

The experiment was run on the [ONB Hadoop Platform|SP:ONB Hadoop Platform] cluster.

h2. Purpose of this experiment

!TavernaWorkflow4276.png|border=1,width=235,height=786!

_Figure 1 (above): Taverna workflow_


The diagram above shows the TIFF to JPEG2000 image migration workflow. The workflow is available on MyExperiment at [http://www.myexperiment.org/workflows/4276.html].
The Taverna workflow reads a text file containing absolute paths to TIFF image files and converts them to JP2 image files using OpenJPEG ([https://code.google.com/p/openjpeg]).
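
A minimal sketch of this per-file migration step in Python is shown below. It is an illustration only, not the actual workflow implementation; it assumes OpenJPEG's {{opj_compress}} encoder is on the PATH, and the input list file name is hypothetical:

{code:language=python}
import subprocess
from pathlib import Path

def migrate_tiffs(list_file):
    """Convert each TIFF listed in list_file to a JP2 file next to it."""
    for line in Path(list_file).read_text().splitlines():
        tif = line.strip()
        if not tif:
            continue
        jp2 = str(Path(tif).with_suffix(".jp2"))
        # opj_compress is OpenJPEG's encoder CLI; real encoding parameters are omitted here
        subprocess.run(["opj_compress", "-i", tif, "-o", jp2], check=True)

migrate_tiffs("tiff_paths.txt")  # hypothetical text file of absolute TIFF paths
{code}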
The following diagram shows the average execution time in seconds of each component of the workflow. It was created from a sample of 1,000 images from the Austrian National Library's Tresor Music Collection:

 
!distribution_execution_times.PNG|border=1,width=473,height=264!

_Figure 2 (above): Execution times of each of the workflow's steps_

In the design phase this analysis is used to examine the average execution times of the individual tools. From this experiment we might conclude that the FITS-based TIFF image validation, taking over 4 seconds per file, is too slow and needs to be improved, while the Jpylyzer validation is acceptable, taking only slightly more than 1 second per image file on average.
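
For illustration, per-step averages like those in Figure 2 could be collected with a timing harness along the following lines. This is a sketch only; the actual measurements were produced by the workflow itself, and the file pairs here are hypothetical:

{code:language=python}
import subprocess
import time
from statistics import mean

def timed_run(cmd):
    """Run one tool invocation and return its wall clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

# Hypothetical (tif, jp2) pairs; in the experiment these came from the 1,000-image sample
file_pairs = [("example.tif", "example.jp2")]

fits_times = [timed_run(["fits.sh", "-i", tif]) for tif, _ in file_pairs]  # FITS-based TIFF validation
jpylyzer_times = [timed_run(["jpylyzer", jp2]) for _, jp2 in file_pairs]   # Jpylyzer JP2 validation

print("average FITS time (s):", mean(fits_times))
print("average Jpylyzer time (s):", mean(jpylyzer_times))
{code}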

The following diagram compares the wall clock times in seconds (y-axis) of the Taverna workflow and the Pig workflow for an increasing number of files (x-axis).
!wallclocktime_concept_vs_scalable.PNG|border=1,width=648,height=425!

_Figure 3 (above): Wall clock times of the concept workflow and the scalable workflow_

However, the throughput we can reach using [this|SP:ONB Hadoop Platform] cluster and the chosen Pig/Hadoop job configuration is limited. As Figure 4 shows, the throughput (measured in gigabytes per hour, GB/h) grows rapidly as the number of files being processed increases, and then stabilises at slightly more than 90 GB/h when more than 750 image files are processed.

!throughput_gb_per_h.png|border=1,width=654,height=363!
_Figure 4 (above): Throughput of the distributed execution measured in Gigabytes per hour (GB/h) against the number of files processed_
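
The throughput in Figure 4 is simply the total data volume divided by the elapsed wall clock time. A minimal sketch of that calculation, with hypothetical inputs, is:

{code:language=python}
import os

def throughput_gb_per_h(paths, wallclock_seconds):
    """Throughput in GB/h: total input size divided by the elapsed wall clock time."""
    total_gb = sum(os.path.getsize(p) for p in paths) / 1e9
    return total_gb / (wallclock_seconds / 3600.0)

# Hypothetical example: ~100 GB of TIFF input processed in one hour gives ~100 GB/h
# print(throughput_gb_per_h(tiff_paths, 3600))
{code}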