
Evaluator(s)

William Palmer, British Library

Evaluation points

Assessment of measurable points

Several runs of the migration workflow are recorded here. The title of each run has two parts.

The first part of the title gives the storage type:

  • HDFS: in HDFS on the Hadoop cluster
  • Webdav: stored on a NAS local to the Hadoop cluster
  • Fedora: stored in/via a Fedora Commons Repository, with object storage on the same NAS as for the Webdav runs

The second part of the title describes the execution method of the workflow (all using OpenJPEG unless explicitly labelled):

  • CommandLineJob: a Java-controlled workflow, i.e. native MapReduce calling out to external programs as required
  • CommandLineJob-Kakadu: same as CommandLineJob, but replacing OpenJPEG with Kakadu
  • Taverna: a Taverna workflow, called via the Taverna command line application, calling out to external programs as required

| Metric | Metric goal | Evaluation date | Metric baseline - Batch on one processing node | Fedora-CommandLineJob | Webdav-CommandLineJob | HDFS-CommandLineJob | Fedora-Taverna | HDFS-Taverna | HDFS-CommandLineJob-Kakadu |
|---|---|---|---|---|---|---|---|---|---|
| TotalRuntime (hh:mm:ss) | 40 hours | Jan 14 | 38:08:30 | 57:50:00 | 57:58:00 | 57:02:00 | 68:05:00 | 67:11:00 | 17:25:00 |
| NumberOfObjectsPerHour | 1600 | Jan 14 | 26.2 | 725.6 | 723.9 | 735.8 | 616.3 | 624.6 | 2409.4 |
| ThroughputGbytesPerHour | 25 | Jan 14 | 0.8 | 16.6 | 16.6 | 16.9 | 14.1 | 14.3 | 55.3 |
| ReliableAndStableAssessment | TRUE | Jan 14 | - | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
| NumberOfFailedFiles | 0 | Jan 14 | - | 3* | 3* | 3* | 3* | 3* | 3* |
| NumberOfFailedFilesAcceptable | - | Jan 14 | - | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
| Files processed | | | 1000 files only | 41963 files | 41963 files | 41963 files | 41963 files | 41963 files | 41963 files |
| Notes | | | | See notes 0 & 2 | See notes 0 & 4 | See note 1 | See notes 0 & 2 | See note 1 | See notes 1 & 3 |

Notes    
Note 0:    For the Fedora and Webdav runs, the runtime includes retrieving each file across the network and posting the migrated file back across the network.
Note 1:    Copying data from the NAS to HDFS took 08:03 (hh:mm). Copying processed data from HDFS back to the NAS will also take time, but this was not measured. None of the copying time is included in TotalRuntime.
Note 2:    Fedora Commons was hosted on a VM, retrieving files from the NAS and serving them to the Hadoop job.
Note 3:    When using Kakadu, the migrated JP2 files have slightly lower PSNR values, so the QA threshold was lowered from 50 (used for OpenJPEG) to 48 so that files would pass.
Note 4:    Creating a directory in a Webdav folder is expensive, so all output files are put into one directory (the same as for HDFS).
Note 5:    One of the files failed after going through tifftopnm; Kakadu rejected the input with "Image file for component 0 terminated prematurely!"
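
The PSNR check mentioned in Note 3 can be illustrated with a minimal sketch. It assumes 8-bit samples, that the thresholds of 50 and 48 are in dB, and that a file passes when its PSNR is at or above the threshold; none of these details are stated in this report. The function names and the flat pixel lists are hypothetical and are not taken from the workflow source.

```python
import math

# Thresholds from Note 3; treating them as dB values is an assumption.
PSNR_THRESHOLDS = {"openjpeg": 50.0, "kakadu": 48.0}


def psnr(original, migrated, max_value=255):
    """Peak signal-to-noise ratio between two equal-length sample sequences."""
    if len(original) != len(migrated) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((a - b) ** 2 for a, b in zip(original, migrated)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def passes_qa(original, migrated, encoder):
    """Hypothetical QA rule: pass when PSNR is at or above the encoder's threshold."""
    return psnr(original, migrated) >= PSNR_THRESHOLDS[encoder]


if __name__ == "__main__":
    source = [100] * 64
    decoded = [101] * 64  # every sample off by one, so MSE = 1
    print(round(psnr(source, decoded), 2))
    print(passes_qa(source, decoded, "openjpeg"), passes_qa(source, decoded, "kakadu"))
```

With every sample off by one, the PSNR is about 48.13, which falls between the two thresholds: such a file would fail at 50 but pass at 48.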

We can meet, and exceed, our 40 hour target by using CommandLineJob-Kakadu as the workflow.  At 17:25:00 that workflow is more than twice as fast as our metric goal, meaning we could process our entire collection in less than a month on our Hadoop instance.
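
The derived figures in the table can be checked from the runtimes and file counts. The sketch below assumes that NumberOfObjectsPerHour is simply the file count divided by TotalRuntime in hours, which is consistent with the published values; the function names are illustrative only.

```python
def hours(hms):
    """Convert an hh:mm:ss runtime string to fractional hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600


def objects_per_hour(file_count, runtime):
    """Files processed per hour of runtime."""
    return file_count / hours(runtime)


def speedup_vs_goal(goal_hours, runtime):
    """How many times faster than the goal a run finished."""
    return goal_hours / hours(runtime)


if __name__ == "__main__":
    # (file count, TotalRuntime) taken from the table above.
    runs = {
        "Baseline": (1000, "38:08:30"),
        "Fedora-CommandLineJob": (41963, "57:50:00"),
        "HDFS-CommandLineJob-Kakadu": (41963, "17:25:00"),
    }
    for name, (count, runtime) in runs.items():
        print(name, round(objects_per_hour(count, runtime), 1))
    print("Kakadu vs 40 hour goal:", round(speedup_vs_goal(40, "17:25:00"), 1))
```

Rounded to one decimal place, these reproduce the table values (26.2, 725.6 and 2409.4 objects per hour), and the Kakadu run comes out at roughly 2.3 times faster than the 40 hour goal.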

* The three failed files either failed to migrate or failed the QA step, and this was correctly reported by each workflow.  The failures did not stop the execution of the workflow, and the files can now be investigated to find out why they did not migrate successfully.

Note: Metrics must be registered in the metrics catalogue

Assessment of non-measurable points

For some evaluation points it makes most sense to give a textual description/explanation

Please include a note about goals-objectives omitted, and why.

Technical details

Remember to include relevant information, links, versions about workflow, tools, APIs (e.g. Taverna, command line, Hadoop, links to MyExperiment, link to tools or SCAPE name, links to distinct versions of specific components/tools in the component registry)

Taverna workflow: http://www.myexperiment.org/workflows/3401.html

Source code: https://github.com/bl-dpt/chutney-hadoopwrapper/commit/73378803e9838ff7a17fc49b5407231a48ac99a7

Platform: http://wiki.opf-labs.org/display/SP/BL+Hadoop+Platform

Fedora version used: 3.6(?)

Evaluation notes

Could be such things as identified issues, workarounds, data preparation, if not already included above

Conclusion

The various ways in which Taverna and Hadoop could be used together were investigated within this experiment.  It is interesting to note that, when files were retrieved from and stored to storage remote from the cluster, the workflow took less than an hour longer than when the files were stored in HDFS (57:50:00 and 57:58:00 against 57:02:00 for CommandLineJob, and 68:05:00 against 67:11:00 for Taverna).  Since the files took 8 hours to be copied in, it may not make sense to cache files in HDFS before processing in this instance, unless the files are already stored there.

Given the execution speed of the various migration workflows, we can deduce that we could migrate files from a remote repository, using CommandLineJob-Kakadu, within roughly half the total time goal we were aiming for.  In that case we would not need to copy files to HDFS in advance of executing the workflow.  Should the files be smaller, or on a slower network, then the results will differ and consideration should be given to caching the files in HDFS, or in storage that is more local.
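
The trade-off between staging files in HDFS and processing them remotely can be made concrete with the measured figures. The sketch below adds the 08:03 copy-in time from Note 1 to the HDFS-CommandLineJob runtime and compares the total with the Webdav-CommandLineJob runtime. The copy back from HDFS to the NAS was not measured, so it is left out and the staged total is a lower bound. The helper names are illustrative.

```python
def to_seconds(hms):
    """Parse an hh:mm:ss string into seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def to_hms(seconds):
    """Format a number of seconds as hh:mm:ss."""
    h, rest = divmod(seconds, 3600)
    m, s = divmod(rest, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def total_with_staging(runtime, copy_in):
    """Runtime plus the time to copy the input into HDFS beforehand."""
    return to_hms(to_seconds(runtime) + to_seconds(copy_in))


if __name__ == "__main__":
    staged = total_with_staging("57:02:00", "08:03:00")  # HDFS-CommandLineJob + copy-in
    remote = "57:58:00"                                   # Webdav-CommandLineJob
    print("HDFS with copy-in:", staged)
    print("Webdav (remote):  ", remote)
    print("Remote is faster: ", to_seconds(remote) < to_seconds(staged))
```

On these figures the staged HDFS run totals 65:05:00 before any copy back to the NAS, about seven hours more than the remote Webdav run.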
