William Palmer, British Library
Assessment of measurable points
Several runs of the migration workflow are recorded here. Each run's title has two parts.
The first part of the title gives the storage type:
- HDFS: in HDFS on the Hadoop cluster
- Webdav: stored on a NAS local to the Hadoop cluster
- Fedora: stored in/via a Fedora Commons Repository, with object storage on the same NAS as used for the Webdav runs
The second part of the title describes the execution method of the workflow (all using OpenJPEG unless explicitly labelled):
- CommandLineJob: a Java-controlled workflow, i.e. native MapReduce, calling out to external programs as required
- CommandLineJob-Kakadu: same as CommandLineJob, but replacing OpenJPEG with Kakadu
- Taverna: a Taverna workflow, called via the Taverna command line application, calling out to external programs as required
||Metric||Metric goal||Evaluation Date||Metric baseline - Batch on one processing node||Fedora-CommandLineJob||Webdav-CommandLineJob||HDFS-CommandLineJob||Fedora-Taverna||HDFS-Taverna||HDFS-CommandLineJob-Kakadu||
| | |Jan 14|1000 files only|41963 files|41963 files|41963 files|41963 files|41963 files|41963 files|
| | | | |See notes 0 & 2|See notes 0 & 4|See note 1|See notes 0 & 2|See note 1|See notes 1 & 3|
Note 0: For the Fedora and Webdav runs, the runtime includes retrieving each file across the network and posting the migrated file back across the network.
Note 1: Copying data from the NAS to HDFS took 08:03 (hh:mm). Copying processed data from HDFS back to the NAS will also take time, but this was not measured. None of the copying time is included in TotalRuntime.
Note 2: Fedora Commons was hosted on a VM, retrieving files from the NAS and serving them to the Hadoop job.
Note 3: When using Kakadu, the migrated JP2 files have slightly lower PSNR values, so the threshold was lowered from 50 (used for OpenJPEG) to 48 so that the files would pass.
Note 4: Creating a directory in a Webdav folder is expensive, so all output files are put into one directory (the same as for HDFS).
Note 5: One of the files failed after going through tifftopnm because Kakadu rejected the input with the error "Image file for component 0 terminated prematurely!".
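The PSNR threshold described in Note 3 can be illustrated with a minimal sketch. It assumes 8-bit samples held in flat Python sequences and the standard PSNR definition, 10 * log10(MAX^2 / MSE); the function names and the `THRESHOLDS` table are hypothetical and are not taken from the workflow source code, which may compute PSNR with a different tool.

```python
import math
from typing import Sequence

# Thresholds taken from Note 3; the dictionary itself is an illustrative assumption.
THRESHOLDS = {"openjpeg": 50.0, "kakadu": 48.0}


def psnr(original: Sequence[int], migrated: Sequence[int], max_value: int = 255) -> float:
    """Peak signal-to-noise ratio between two equally sized sample sequences."""
    if len(original) != len(migrated) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((a - b) ** 2 for a, b in zip(original, migrated)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def passes_qa(original: Sequence[int], migrated: Sequence[int], encoder: str) -> bool:
    """Return True when the PSNR meets the threshold for the given encoder."""
    return psnr(original, migrated) >= THRESHOLDS[encoder]
```

With these assumptions, a file whose PSNR falls between 48 and 50 would pass under the Kakadu threshold but fail under the OpenJPEG one.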
We can meet, and exceed, our 40-hour target by using CommandLineJob-Kakadu as the workflow. That workflow completes in about half the time allowed by our metric goal, meaning we could process our entire collection in less than a month on our Hadoop instance.
* The three failed files either failed to migrate or failed the QA step, and each workflow reported this correctly. The failures did not stop the execution of the workflow, and the files can now be investigated to find out why they did not migrate successfully.
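The behaviour described above, where a failing file is reported but does not halt the run, can be sketched as follows. This is an illustration only, not the code from the linked repository: `migrate_batch`, `build_command` and `demo_command` are hypothetical names, and the demo command is a stand-in Python one-liner rather than a real OpenJPEG or Kakadu invocation.

```python
import subprocess
import sys
from typing import Callable, Iterable, List, Sequence, Tuple


def migrate_batch(
    inputs: Iterable[str],
    build_command: Callable[[str], Sequence[str]],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Run one external command per input and collect failures instead of raising."""
    succeeded: List[str] = []
    failed: List[Tuple[str, str]] = []
    for path in inputs:
        try:
            result = subprocess.run(
                list(build_command(path)), capture_output=True, text=True
            )
        except OSError as exc:  # for example, the executable could not be started
            failed.append((path, str(exc)))
            continue
        if result.returncode == 0:
            succeeded.append(path)
        else:
            # Keep the tool's own message so the file can be investigated later.
            message = result.stderr.strip() or f"exit code {result.returncode}"
            failed.append((path, message))
    return succeeded, failed


def demo_command(path: str) -> Sequence[str]:
    """Stand-in for a real migration tool: fails when the name contains 'bad'."""
    script = "import sys; sys.exit(1 if 'bad' in sys.argv[1] else 0)"
    return [sys.executable, "-c", script, path]
```

Calling `migrate_batch(["a.tif", "bad.tif", "c.tif"], demo_command)` processes all three inputs and returns the failing one in the second list, together with the reason it failed.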
Note: Metrics must be registered in the metrics catalogue
Assessment of non-measurable points
For some evaluation points it makes most sense to give a textual description/explanation
Please include a note about goals-objectives omitted, and why.
Remember to include relevant information, links, versions about workflow, tools, APIs (e.g. Taverna, command line, Hadoop, links to MyExperiment, link to tools or SCAPE name, links to distinct versions of specific components/tools in the component registry)
Taverna workflow: http://www.myexperiment.org/workflows/3401.html
Source code: https://github.com/bl-dpt/chutney-hadoopwrapper/commit/73378803e9838ff7a17fc49b5407231a48ac99a7
Fedora version used: 3.6(?)
Could be such things as identified issues, workarounds, data preparation, if not already included above
This experiment investigated the various ways in which Taverna and Hadoop could be used together. Notably, the workflow took only slightly longer when files were retrieved from, and stored back to, storage remote from the cluster than when the files were held in HDFS. Since copying the files into HDFS took about 8 hours (see Note 1), it may not make sense to cache files in HDFS before processing in this instance, unless the files are already stored there. From the execution speed of the various migration workflows we can deduce that we could migrate files from a remote repository, using CommandLineJob with Kakadu, within roughly half the total time goal we were aiming for. In that case we would not need to copy files to HDFS in advance of executing the workflow. Should the files be smaller, or the network slower, the results will differ, and consideration should be given to caching the files in HDFS or in storage that is more local to the cluster.
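The trade-off above can be made concrete with a small sketch that compares end-to-end times. The copy-in time of 08:03 comes from Note 1; the two runtimes in the usage note are hypothetical placeholders rather than measured results, and the function names are illustrative.

```python
from datetime import timedelta

# Measured copy-in time from Note 1 (NAS to HDFS), hh:mm = 08:03.
COPY_IN = timedelta(hours=8, minutes=3)


def end_to_end(
    runtime: timedelta,
    copy_in: timedelta = timedelta(0),
    copy_out: timedelta = timedelta(0),
) -> timedelta:
    """Total elapsed time for a run, including any copying to and from HDFS."""
    return copy_in + runtime + copy_out


def caching_pays_off(
    remote_runtime: timedelta,
    hdfs_runtime: timedelta,
    copy_in: timedelta = COPY_IN,
    copy_out: timedelta = timedelta(0),
) -> bool:
    """True when copying into HDFS first is faster overall than processing remotely."""
    return end_to_end(hdfs_runtime, copy_in, copy_out) < remote_runtime
```

For example, with a hypothetical remote runtime of 21 hours and a hypothetical HDFS runtime of 20 hours, `caching_pays_off(timedelta(hours=21), timedelta(hours=20))` returns `False`, because the 8-hour copy outweighs the 1-hour saving. Copy-out time was not measured, so it defaults to zero here and would only strengthen that conclusion.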