Investigator(s)
William Palmer, British Library
Dataset
BL 19th Century Digitized Newspapers
Platform
BL Hadoop Platform
Workflow
The workflow has been implemented in three forms: as a Taverna workflow, as Java code, and as a batch file.
The latest Taverna workflow is here: http://www.myexperiment.org/workflows/3401.html
The latest Java code, workflow, and batch files are here: https://github.com/bl-dpt/chutney-hadoopwrapper/
The latest (Taverna/Java) workflow contains the following steps (a code sketch of the sequence follows the list):
- *Retrieve the TIFF file from storage (HDFS/Fedora/WebDAV)
- Run Exiftool to extract metadata from the TIFF
- Migrate the TIFF to JP2 (using OpenJPEG or Kakadu)
- Run Exiftool to extract metadata from the JP2
- Run Jpylyzer over the JP2
- *Run a Schematron validator over the Jpylyzer output to validate that the migrated image conforms to the specified profile
- Use ImageMagick to compare the TIFF and the JP2
- *Create a report
- Create the output package (JP2, results, etc.)
- Post the files back to the relevant storage (see above)
Note that steps marked * are not performed in the batch workflow.
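As an illustration of how these steps chain together, here is a minimal Java sketch of the per-file sequence, assuming each step shells out to the command-line tools named above. It is not the actual chutney-hadoopwrapper code; the file naming, the output layout, and the use of OpenJPEG's opj_compress are assumptions:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    /** Hypothetical sketch of the per-file migration sequence; not the actual chutney-hadoopwrapper code. */
    public class MigrationSketch {

        /** Runs an external tool, sending stdout and stderr to a log file; returns the exit code. */
        static int run(Path log, String... cmd) throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(cmd);
            pb.redirectErrorStream(true);       // merge stderr into stdout...
            pb.redirectOutput(log.toFile());    // ...and capture both in the log file
            return pb.start().waitFor();
        }

        public static void main(String[] args) throws Exception {
            Path tiff = Paths.get(args[0]);     // input TIFF, already retrieved from storage
            Path jp2  = Paths.get(tiff.toString().replaceAll("\\.tiff?$", ".jp2"));

            // Extract metadata from the source TIFF (exiftool -X emits RDF/XML)
            run(Paths.get(tiff + ".exiftool.xml"), "exiftool", "-X", tiff.toString());

            // Migrate TIFF -> JP2 with OpenJPEG; Kakadu's kdu_compress takes similar -i/-o options
            if (run(Paths.get(jp2 + ".opj.log"),
                    "opj_compress", "-i", tiff.toString(), "-o", jp2.toString()) != 0)
                throw new IOException("migration failed for " + tiff);

            // Extract metadata from the migrated JP2
            run(Paths.get(jp2 + ".exiftool.xml"), "exiftool", "-X", jp2.toString());

            // Validate the JP2 with jpylyzer (XML report on stdout); in the full workflow a
            // Schematron validator is then applied to this report to check the JP2 profile
            run(Paths.get(jp2 + ".jpylyzer.xml"), "jpylyzer", jp2.toString());

            // Compare source and migrated images with ImageMagick (the PSNR value goes to stderr)
            run(Paths.get(jp2 + ".compare.log"),
                "compare", "-metric", "PSNR", tiff.toString(), jp2.toString(), "null:");

            // Create the output package directory and copy in the JP2
            // (the metadata, jpylyzer, and compare outputs would go here too)
            Path outDir = Files.createDirectories(Paths.get("output", tiff.getFileName().toString()));
            Files.copy(jp2, outDir.resolve(jp2.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        }
    }

Since every step is an invocation of an external tool, the same sequence maps naturally onto a Taverna workflow or a batch file; the batch variant simply omits the steps marked * above.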
Requirements and Policies
NumberOfObjectsPerHour >= 1600 (This assumes we want to process the entire collection within 2 months).
ThroughputGbytesPerHour >= 25 (This assumes we want to process the entire collection within 2 months).
OrganisationalFit = "Can this workflow/solution and its components be applied and used at the BL? Do the components use supported technology? etc."
NumberOfFailedFiles = 0 (We can afford to lose some speed, but we cannot, under any circumstances, lose files.)
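As a rough cross-check of the two throughput targets (assuming round-the-clock processing and taking 2 months as roughly 61 days):

    61 days × 24 h            = 1,464 h
    1,600 objects/h × 1,464 h ≈ 2.3 million objects
    25 GB/h × 1,464 h         ≈ 36 TB

so the two rates are mutually consistent, corresponding to a collection on the order of a couple of million images averaging roughly 15 MB each.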
Evaluations
http://wiki.opf-labs.org/display/SP/EVAL-LSDR3-1
Upcoming evaluations will compare several 1 TB migrations with different storage back-ends and JP2 codecs.