
Investigator(s)

William Palmer, British Library

Dataset

BL 19th Century Digitized Newspapers

Platform

BL Hadoop Platform

Workflow

The workflow has been implemented in three forms: as a Taverna workflow, as Java code, and as a batch file.

The latest Taverna workflow is here: http://www.myexperiment.org/workflows/3401.html


The latest Java code, workflow and batch files are here: https://github.com/bl-dpt/chutney-hadoopwrapper/

The latest (Taverna/Java) workflow contains the following steps:

  • *Recover TIFF file from storage (HDFS/Fedora/Webdav)
  • Run Exiftool to extract metadata from TIFF
  • Migrate TIFF->JP2 (using OpenJPEG/Kakadu)
  • Run Exiftool to extract metadata from JP2
  • Run Jpylyzer over the JP2
  • *Run Schematron validator over Jpylyzer outputs to validate conformance of migrated image to the specified profile
  • Use ImageMagick to compare TIFF and JP2
  • *Create report
  • Create output package (JP2, results, etc.)
  • Post files back to relevant storage (see above)

Note that steps marked * are not performed in the batch workflow.
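The per-file batch steps above can be sketched as a single shell function. This is a minimal sketch, not the project's actual script: the tool invocations (exiftool, opj_compress from OpenJPEG, jpylyzer, ImageMagick's compare) and the output file naming are assumptions, and a real run would pass a specific JP2 encoding profile to the encoder.

```shell
# Minimal sketch of the non-* batch steps for a single file.
# Assumes exiftool, opj_compress (OpenJPEG), jpylyzer and ImageMagick's
# `compare` are on PATH; filenames and flags are illustrative only.
migrate_one() {
  tif="$1"
  base="${tif%.tif}"

  exiftool -X "$tif" > "${base}_tiff.xml"          # metadata from source TIFF
  opj_compress -i "$tif" -o "${base}.jp2"          # TIFF -> JP2 (profile flags omitted)
  exiftool -X "${base}.jp2" > "${base}_jp2.xml"    # metadata from migrated JP2
  jpylyzer "${base}.jp2" > "${base}_jpylyzer.xml"  # JP2 validation report

  # ImageMagick writes the requested metric to stderr
  compare -metric PSNR "$tif" "${base}.jp2" null: 2> "${base}_psnr.txt"

  # output package: the JP2 plus all result files
  tar -cf "${base}_package.tar" "${base}.jp2" \
      "${base}_tiff.xml" "${base}_jp2.xml" \
      "${base}_jpylyzer.xml" "${base}_psnr.txt"
}

# usage (not run here): migrate_one page0001.tif
```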

Requirements and Policies

NumberOfObjectsPerHour >= 1600 (this assumes the entire collection is to be processed within two months).
ThroughputGbytesPerHour >= 25 (this assumes the entire collection is to be processed within two months).
OrganisationalFit = "Can this workflow/solution/components be applied and used at the BL? Are the components using supported technology? etc."
NumberOfFailedFiles = 0 (we can afford to lose speed, but we absolutely cannot lose files).
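Taken together, and assuming round-the-clock processing over a ~61-day window (an assumption; the page does not state the collection size), the two throughput targets imply the scale of the collection:

```shell
# Implied collection scale if processing runs 24/7 for ~2 months (61 days).
# These are back-of-envelope figures derived from the targets above,
# not measured numbers.
hours=$((61 * 24))                 # 1464 wall-clock hours
echo "objects: $((1600 * hours))"  # 2342400 objects
echo "volume:  $((25 * hours)) GB" # 36600 GB, i.e. ~36.6 TB
```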

Evaluations

http://wiki.opf-labs.org/display/SP/EVAL-LSDR3-1

Upcoming evaluations will compare several 1 TB migrations, using different storage backends and JP2 codecs.
