h2. Platform

The experiment was executed on the [ONB Hadoop Platform|SP:ONB Hadoop Platform] cluster.

h2. Purpose of this experiment

The purpose of this experiment is to execute the TIFF to JPEG2000 image migration workflow sequentially using Taverna on a sample of image files. Additionally, the files are uploaded to HDFS as input for the large-scale workflow execution.

In this way it is possible to compare the sequential execution time with the large-scale processing time. The results depend on the size and configuration of the Hadoop cluster; the relation between sequential and large-scale processing is therefore expressed as the "parallelisation efficiency", which is the sequential execution time divided by the distributed execution time, divided again by the number of nodes available for parallel processing:

{code}
e := parallelisation efficiency
d := distributed execution time in seconds
s := sequential execution time in seconds
n := number of nodes (cores) available for parallel processing

e = s/d/n;
{code}
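
The efficiency can be checked directly against the measurements reported in the tables below. The following minimal Python sketch assumes 25 cores, a value inferred from the reported Par.Eff. figures rather than stated explicitly:

{code}
def parallelisation_efficiency(s, d, n):
    """Sequential time s and distributed time d in seconds, n nodes (cores)."""
    return s / d / n

# Example using the 1000-file measurements from the tables below
# (sequential: 42376 s, distributed: 2746 s); n=25 is an assumption
# derived from the reported Par.Eff. values.
print(parallelisation_efficiency(42376, 2746, 25))  # ~0.62, i.e. ~62%
{code}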




h2. Taverna workflow - sequential processing

!TavernaWorkflow4276.png|border=1,width=235,height=786!

_Figure 1 (above): Taverna workflow_


Figure 1 shows the TIFF to JPEG2000 image migration workflow; it is available on MyExperiment at [http://www.myexperiment.org/workflows/4276.html].
The Taverna workflow reads a text file containing absolute paths to TIF image files and converts them to JP2 image files using OpenJPEG ([https://code.google.com/p/openjpeg]).
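
The core conversion step can be illustrated with a minimal Python sketch. It assumes OpenJPEG's command line tool opj_compress is available on the PATH, and the input list file name image_paths.txt is a placeholder; the actual Taverna workflow wraps the tools as workflow components:

{code}
import subprocess
from pathlib import Path

def tiff2jp2(tif_path):
    """Convert one TIF image to JP2 using OpenJPEG's opj_compress tool (assumed to be installed)."""
    jp2_path = Path(tif_path).with_suffix(".jp2")
    subprocess.run(["opj_compress", "-i", str(tif_path), "-o", str(jp2_path)], check=True)
    return jp2_path

# Read the text file with absolute paths to the TIF images and migrate each one sequentially.
with open("image_paths.txt") as paths:
    for line in paths:
        tiff2jp2(line.strip())
{code}
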
The following diagram shows the average execution time in seconds of each component of the workflow, measured on a sample of 1,000 images from the Austrian National Library's Tresor Music Collection:

!distribution_execution_times.PNG|border=1,width=473,height=264!

_Figure 2 (above): Execution times of each of the workflow's steps_

In the design phase, this analysis is used to examine the average execution times of the individual tools. From this experiment we might conclude that more than 4 seconds for the FITS-based TIF image validation is too much, and that this processing step needs to be improved, while the Jpylyzer validation is acceptable, taking only slightly more than 1 second per image file on average.
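
To illustrate the Jpylyzer validation step, here is a minimal sketch, assuming jpylyzer is installed and invoked on a single migrated JP2 file (the path is a placeholder); jpylyzer writes its XML validation report to standard output:

{code}
import subprocess

# Validate one migrated JP2 file with jpylyzer (placeholder path);
# the XML report printed to stdout contains the validity verdict.
result = subprocess.run(["jpylyzer", "/path/to/imagefile.jp2"],
                        capture_output=True, text=True, check=True)
print(result.stdout)
{code}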


h2. Pig workflow - distributed processing

For large-scale execution, the same TIFF to JPEG2000 migration is implemented as an Apache Pig workflow. It is started on the Hadoop cluster as follows:

{code}
pig -param image_paths=/hdfs/path/to/imagefiles/ tiff2jp2_migrate.pig
{code}

The workflow produces the result files in the same directory where the input image files are located (for example, next to the input image path /hdfs/path/to/imagefiles/imagefile.tif).
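
Before the distributed run, the image files are uploaded to HDFS (as mentioned in the purpose section). The following sketch, with placeholder local and HDFS paths, uses the standard hdfs dfs -put command and shows where the migrated JP2 for one input file would be expected:

{code}
import subprocess
from pathlib import PurePosixPath

# Upload the local image directory to HDFS (placeholder paths).
subprocess.run(["hdfs", "dfs", "-put", "/local/imagefiles", "/hdfs/path/to/imagefiles"],
               check=True)

# Result files are written next to the inputs, e.g. the JP2 expected for one input file:
tif = PurePosixPath("/hdfs/path/to/imagefiles/imagefile.tif")
print(tif.with_suffix(".jp2"))  # /hdfs/path/to/imagefiles/imagefile.jp2
{code}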


The following tables summarise the results of the sequential (Taverna) and the distributed (Pig) execution. Column legend:

Avg.p.f. := average processing time per file in seconds
Obj/h := number of objects (files) processed per hour
GB/min, GB/h := data throughput in Gigabytes per minute / Gigabytes per hour
Err := number of processing errors
Par.Eff. := parallelisation efficiency (sequential execution time divided by the distributed execution time, divided again by the number of nodes available for parallel processing)



h3. Taverna Workflow - Sequential Execution

| *Files* | *Total GB* | *Secs* | *Mins* | *Hrs* | *Avg.p.f.* | *Obj/h* | *GB/min* | *GB/h* | *Err* |
| 100 | 8,30 GB | 4693 | 78,22 | 1,30 | 46,93 | 77 | 0,11 | 6,37 | 0 |
| 200 | 15,19 GB | 9246 | 154,10 | 2,57 | 46,23 | 78 | 0,10 | 5,91 | 0 |
| 300 | 19,07 GB | 11773 | 196,22 | 3,27 | 39,24 | 92 | 0,10 | 5,83 | 0 |
| 400 | 24,78 GB | 15644 | 260,73 | 4,35 | 39,11 | 92 | 0,10 | 5,70 | 0 |
| 500 | 34,55 GB | 21345 | 355,75 | 5,93 | 42,69 | 84 | 0,10 | 5,83 | 0 |
| 750 | 63,07 GB | 37397 | 623,28 | 10,39 | 49,86 | 72 | 0,10 | 6,07 | 0 |
| 1000 | 71,82 GB | 42376 | 706,27 | 11,77 | 42,38 | 85 | 0,10 | 6,10 | 0 |
| 2000 | 139,00 GB | 84938 | 1415,63 | 23,59 | 42,47 | 85 | 0,10 | 5,89 | 0 |
| 3000 | 211,85 GB | 128959 | 2149,32 | 35,82 | 42,99 | 84 | 0,10 | 5,91 | 0 |


h3. Pig Workflow - Distributed Execution

| *Files* | *Total GB* | *Secs* | *Mins* | *Hrs* | *Avg.p.f.* | *Obj/h* | *GB/min* | *GB/h* | *Err* | *Par.Eff.* |
| 5 | 0,31 GB | 96 | 1,60 | 0,03 | 19,20 | 188 | 0,19 | 11,60 | 0 | 7,46% |
| 7 | 0,89 GB | 101 | 1,68 | 0,03 | 14,43 | 250 | 0,53 | 31,64 | 0 | 17,35% |
| 10 | 0,90 GB | 103 | 1,72 | 0,03 | 10,30 | 350 | 0,53 | 31,56 | 0 | 18,56% |
| 20 | 2,23 GB | 114 | 1,90 | 0,03 | 5,70 | 632 | 1,17 | 70,45 | 0 | 40,35% |
| 30 | 2,99 GB | 138 | 2,30 | 0,04 | 4,60 | 783 | 1,30 | 77,99 | 0 | 44,67% |
| 40 | 3,60 GB | 161 | 2,68 | 0,04 | 4,03 | 894 | 1,34 | 80,41 | 0 | 47,20% |
| 50 | 3,46 GB | 183 | 3,05 | 0,05 | 3,66 | 984 | 1,13 | 68,01 | 0 | 44,57% |
| 75 | 6,05 GB | 272 | 4,53 | 0,08 | 3,63 | 993 | 1,34 | 80,11 | 0 | 50,37% |
| 100 | 8,30 GB | 373 | 6,22 | 0,10 | 3,73 | 965 | 1,34 | 80,15 | 0 | 50,33% |
| 200 | 15,19 GB | 669 | 11,15 | 0,19 | 3,35 | 1076 | 1,36 | 81,73 | 0 | 55,28% |
| 300 | 19,07 GB | 808 | 13,47 | 0,22 | 2,69 | 1337 | 1,42 | 84,95 | 0 | 58,28% |
| 400 | 24,78 GB | 1091 | 18,18 | 0,30 | 2,73 | 1320 | 1,36 | 81,77 | 0 | 57,36% |
| 500 | 34,55 GB | 1397 | 23,28 | 0,39 | 2,79 | 1288 | 1,48 | 89,03 | 0 | 61,12% |
| 750 | 63,07 GB | 2399 | 39,98 | 0,67 | 3,20 | 1125 | 1,58 | 94,64 | 0 | 62,35% |
| 1000 | 71,82 GB | 2746 | 45,77 | 0,76 | 2,75 | 1311 | 1,57 | 94,16 | 0 | 61,73% |
| 2000 | 139,00 GB | 5450 | 90,83 | 1,51 | 2,73 | 1321 | 1,53 | 91,82 | 0 | 62,34% |
| 3000 | 211,85 GB | 8328 | 138,80 | 2,31 | 2,78 | 1297 | 1,53 | 91,58 | 0 | 61,94% |
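
The derived columns in both tables are straightforward arithmetic on the measured wall clock times. The following sketch recomputes them for the 3000-file run, again assuming 25 cores for the parallelisation efficiency (a value inferred from the reported Par.Eff. figures):

{code}
# Recompute the derived columns for the 3000-file distributed run.
files, total_gb, secs = 3000, 211.85, 8328
seq_secs = 128959               # sequential wall clock time for the same 3000 files
cores = 25                      # assumption inferred from the reported Par.Eff. values

print(secs / files)              # Avg.p.f.: ~2.78 seconds per file
print(files / (secs / 3600))     # Obj/h:    ~1297 objects per hour
print(total_gb / (secs / 60))    # GB/min:   ~1.53
print(total_gb / (secs / 3600))  # GB/h:     ~91.6
print(seq_secs / secs / cores)   # Par.Eff.: ~0.62, i.e. 61.94%
{code}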

The following diagram shows a comparison of the wall clock times in seconds (y-axis) of the Taverna workflow and the Pig workflow for an increasing number of files (x-axis).
!wallclocktime_concept_vs_scalable.PNG|border=1,width=648,height=425!

_Figure 3 (above): Wall clock times of the concept workflow (Taverna) and the scalable workflow (Pig)_

However, the throughput that can be reached using [this|SP:ONB Hadoop Platform] cluster and the chosen Pig/Hadoop job configuration is limited. As Figure 4 shows, the throughput (measured in Gigabytes per hour, GB/h) grows rapidly as the number of files being processed increases, and then stabilises at slightly more than 90 GB/h when more than 750 image files are processed.

!throughput_gb_per_h.png|border=1,width=654,height=363!
_Figure 4 (above): Throughput of the distributed execution measured in Gigabytes per hour (GB/h) against the number of files processed_