h2. Dataset

[SP:Austrian National Library Tresor Music Collection]



h2. Platform

[SP:ONB Hadoop Platform]



h2. Purpose of this experiment

!TavernaWorkflow4276.png|border=1,width=235,height=786!

_Figure 1 (above): Taverna workflow_


Figure 1 shows the TIFF to JPEG2000 image migration workflow; it is available on myExperiment at [http://www.myexperiment.org/workflows/4276.html].
The Taverna workflow reads a text file containing absolute paths to TIF image files and converts them to JP2 image files using OpenJPEG ([https://code.google.com/p/openjpeg]).
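The per-file migration itself is a single call to the OpenJPEG command-line encoder. As a rough illustration only (not the Taverna workflow itself), the following Python sketch reads such a list of absolute TIF paths and invokes the OpenJPEG 2.x encoder opj_compress for each file; the script name, the list-file argument and writing the JP2 next to the source file are assumptions made for the example.

{code:language=python}
# Minimal sketch of the per-file TIF -> JP2 migration step.
# Assumes an OpenJPEG 2.x build whose opj_compress encoder accepts TIF input.
import subprocess
import sys
from pathlib import Path

def migrate_tif_to_jp2(listfile):
    """Read absolute TIF paths (one per line) and convert each to JP2."""
    for line in Path(listfile).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        tif = Path(line)
        jp2 = tif.with_suffix(".jp2")  # output next to the input (assumption)
        # opj_compress -i <input image> -o <output JP2>
        subprocess.run(["opj_compress", "-i", str(tif), "-o", str(jp2)],
                       check=True)

if __name__ == "__main__":
    migrate_tif_to_jp2(sys.argv[1])  # e.g.: python migrate.py tif_paths.txt
{code}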
The following diagram shows the average execution time in seconds of each component of the workflow; it was created from a 1000-image sample of the Austrian National Library Tresor Music Collection:

 
!distribution_execution_times.PNG|border=1,width=473,height=264!

_Figure 2 (above): Execution times of each of the workflow's steps_

In the design phase, this analysis is used to examine the average execution times of the individual tools. From this experiment we might conclude that more than 4 seconds for the FITS-based TIF image validation is too long and that this processing step needs to be improved, while the Jpylyzer validation, taking only slightly more than 1 second per image file on average, is acceptable.
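Such a breakdown can be produced from the per-file timing records collected during the sample run. The sketch below is illustrative only and assumes a hypothetical timings.csv with one row per file and workflow step (columns: file, step, seconds); it simply averages the recorded seconds per step.

{code:language=python}
# Illustrative only: average per-step execution times from a hypothetical
# timings.csv with columns file,step,seconds (one row per file and step).
import csv
from collections import defaultdict

def average_step_times(csv_path):
    totals = defaultdict(float)
    counts = defaultdict(int)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["step"]] += float(row["seconds"])
            counts[row["step"]] += 1
    return {step: totals[step] / counts[step] for step in totals}

# average_step_times("timings.csv") yields one average per workflow step,
# i.e. the kind of values plotted in Figure 2.
{code}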

| 100 | 8,30 GB | 4693 | 78,22 | 1,30 | 46,93 | 77 | 0,11 | 6,37 | 0 |
| 200 | 15,19 GB | 9246 | 154,10 | 2,57 | 46,23 | 78 | 0,10 | 5,91 | 0 |
| 300 | 19,07 GB | 11773 | 196,22 | 3,27 | 39,24 | 92 | 0,10 | 5,83 | 0 |
| 400 | 24,78 GB | 15644 | 260,73 | 4,35 | 39,11 | 92 | 0,10 | 5,70 | 0 |
| 500 | 34,55 GB | 21345 | 355,75 | 5,93 | 42,69 | 84 | 0,10 | 5,83 | 0 |
| 750 | 63,07 GB | 37397 | 623,28 | 10,39 | 49,86 | 72 | 0,10 | 6,07 | 0 |
| 1000 | 71,82 GB | 42376 | 706,27 | 11,77 | 42,38 | 85 | 0,10 | 6,10 | 0 |
| 2000 | 139,00 GB | 84938 | 1415,63 | 23,59 | 42,47 | 85 | 0,10 | 5,89 | 0 |
| 3000 | 211,85 GB | 128959 | 2149,32 | 35,82 | 42,99 | 84 | 0,10 | 5,91 | 0 |


h3. Pig Workflow - Distributed Execution

| *Files* | *Total GB* | *Secs* | *Mins* | *Hrs* | *Avg.p.f.* | *Obj/h* | *GB/min* | *GB/h* | *Err* |
| 5 | 0,31 GB | 96 | 1,60 | 0,03 | 19,20 | 188 | 0,19 | 11,60 | 0 |
| 7 | 0,89 GB | 101 | 1,68 | 0,03 | 14,43 | 250 | 0,53 | 31,64 | 0 |
| 10 | 0,90 GB | 103 | 1,72 | 0,03 | 10,30 | 350 | 0,53 | 31,56 | 0 |
| 20 | 2,23 GB | 114 | 1,90 | 0,03 | 5,70 | 632 | 1,17 | 70,45 | 0 |
| 30 | 2,99 GB | 138 | 2,30 | 0,04 | 4,60 | 783 | 1,30 | 77,99 | 0 |
| 40 | 3,60 GB | 161 | 2,68 | 0,04 | 4,03 | 894 | 1,34 | 80,41 | 0 |
| 50 | 3,46 GB | 183 | 3,05 | 0,05 | 3,66 | 984 | 1,13 | 68,01 | 0 |
| 75 | 6,05 GB | 272 | 4,53 | 0,08 | 3,63 | 993 | 1,34 | 80,11 | 0 |
| 100 | 8,30 GB | 373 | 6,22 | 0,10 | 3,73 | 965 | 1,34 | 80,15 | 0 |
| 200 | 15,19 GB | 669 | 11,15 | 0,19 | 3,35 | 1076 | 1,36 | 81,73 | 0 |
| 300 | 19,07 GB | 808 | 13,47 | 0,22 | 2,69 | 1337 | 1,42 | 84,95 | 0 |
| 400 | 24,78 GB | 1091 | 18,18 | 0,30 | 2,73 | 1320 | 1,36 | 81,77 | 0 |
| 500 | 34,55 GB | 1397 | 23,28 | 0,39 | 2,79 | 1288 | 1,48 | 89,03 | 0 |
| 750 | 63,07 GB | 2399 | 39,98 | 0,67 | 3,20 | 1125 | 1,58 | 94,64 | 0 |
| 1000 | 71,82 GB | 2746 | 45,77 | 0,76 | 2,75 | 1311 | 1,57 | 94,16 | 0 |
| 2000 | 139,00 GB | 5450 | 90,83 | 1,51 | 2,73 | 1321 | 1,53 | 91,82 | 0 |
| 3000 | 211,85 GB | 8328 | 138,80 | 2,31 | 2,78 | 1297 | 1,53 | 91,58 | 0 |
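The per-file and throughput columns in the tables above are derived directly from the measured totals (*Files*, *Total GB*, *Secs*). A small sketch of that arithmetic, checked against the 100-file row of the distributed run:

{code:language=python}
# Derived columns from the measured totals; the printed example uses the
# 100-file row of the distributed run (100 files, 8.30 GB, 373 s).
def derive(files, total_gb, secs):
    mins = secs / 60.0             # Mins
    hrs = secs / 3600.0            # Hrs
    avg_per_file = secs / files    # Avg.p.f. (average seconds per file)
    obj_per_hour = files / hrs     # Obj/h
    gb_per_min = total_gb / mins   # GB/min
    gb_per_hour = total_gb / hrs   # GB/h
    return mins, hrs, avg_per_file, obj_per_hour, gb_per_min, gb_per_hour

print(derive(100, 8.30, 373))
# Rounded: 6.22 min, 0.10 h, 3.73 s/file, 965 obj/h, 1.34 GB/min, 80.1 GB/h
{code}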
The following diagram compares the wall clock times in seconds (y-axis) of the Taverna workflow and the Pig workflow for an increasing number of files (x-axis).
!wallclocktime_concept_vs_scalable.PNG|border=1,width=648,height=425!

_Figure 3 (above): Wall clock times of the concept workflow and the scalable workflow_

However, the throughput we can reach using [this|SP:ONB Hadoop Platform] cluster and the chosen Pig/Hadoop job configuration is limited. As Figure 4 shows, the throughput (measured in Gigabytes per hour, GB/h) grows rapidly as the number of files being processed increases, and then stabilises at slightly more than 90 GB/h when processing more than 750 image files.

!throughput_gb_per_h.png|border=1,width=654,height=363!
_Figure 4 (above): Throughput of the distributed execution measured in Gigabytes per hour (GB/h) against the number of files processed_