
h2. Investigator(s)

William Palmer, British Library




h2. Dataset

BL 19th Century Newspapers: [SP:BL 19th Century Digitized Newspapers]



h2. Platform

BL Hadoop Platform: [SP:BL Hadoop Platform]



h2. Workflow

The workflow has been implemented in three forms: as a Taverna workflow, as standalone Java code, and as a batch file.

The latest Taverna workflow is here: [http://www.myexperiment.org/workflows/3401.html]

!http://www.myexperiment.org/workflows/3401/versions/2/previews/full|border=1,width=514,height=400!

The latest Java code, workflow, and batch files are here: [https://github.com/bl-dpt/chutney-hadoopwrapper/]

The latest (Taverna/Java) workflow contains the following steps (illustrative sketches of two of these steps follow the list):


* \*Retrieve the TIFF file from storage (HDFS/Fedora/WebDAV)
* Run Exiftool to extract metadata from the TIFF
* Migrate TIFF -> JP2 (using OpenJPEG or Kakadu)
* Run Exiftool to extract metadata from the JP2
* Run Jpylyzer over the JP2
* \*Run a Schematron validator over the Jpylyzer output to validate conformance of the migrated image to the specified profile
* Use ImageMagick to compare the TIFF and JP2
* \*Create a report
* Create an output package (JP2, results, etc.)
* Post the files back to the relevant storage (see above)

Note that the steps marked with a \* are not performed in the batch workflow.
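
Most of these steps shell out to external command-line tools. The following is a minimal sketch of how the Java implementation might drive the migration and validation tools via ProcessBuilder; the file names, tool locations, and flags are illustrative assumptions, not the actual chutney-hadoopwrapper code.

{code:java}
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;

public class MigrateStep {

    // Run an external tool and return its exit code; output is echoed for logging.
    static int run(File workingDir, String... command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.directory(workingDir);
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process p = pb.start();
        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line);
            }
        }
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        File dir = new File(".");

        // TIFF -> JP2 migration; opj_compress is the OpenJPEG equivalent of kdu_compress
        if (run(dir, "kdu_compress", "-i", "page.tif", "-o", "page.jp2") != 0) {
            throw new IOException("JP2 migration failed");
        }

        // Extract metadata from the migrated file with Exiftool
        if (run(dir, "exiftool", "page.jp2") != 0) {
            throw new IOException("exiftool failed");
        }

        // Validate the JP2 with Jpylyzer (prints an XML report to stdout)
        if (run(dir, "jpylyzer", "page.jp2") != 0) {
            throw new IOException("jpylyzer failed");
        }
    }
}
{code}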
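
The Schematron step checks the Jpylyzer XML against the institutional JP2 profile. As a minimal stand-in for that check, the sketch below simply reads Jpylyzer's own validity verdict with XPath; the isValid element and the report file name are assumptions about the Jpylyzer report format, and a real run would apply the full set of Schematron rules instead.

{code:java}
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class CheckJp2Valid {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(false); // treat element names as plain, keeping the XPath simple
        Document report = dbf.newDocumentBuilder().parse(new File("page-jpylyzer.xml"));

        XPath xpath = XPathFactory.newInstance().newXPath();
        String verdict = xpath.evaluate("//isValid", report); // Jpylyzer's own validity verdict
        System.out.println("isValid = " + verdict);

        // Fail hard on any invalid file: the NumberOfFailedFiles = 0 policy below means an
        // invalid migration must be flagged, not silently passed through.
        if (!"True".equals(verdict.trim())) {
            System.exit(1);
        }
    }
}
{code}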




h2. Requirements and Policies

* NumberOfObjectsPerHour >= 1600 (assuming the entire collection is to be processed within two months)
* ThroughputGbytesPerHour >= 25 (assuming the entire collection is to be processed within two months)
* OrganisationalFit = "Can this workflow/solution/components be applied and used at the BL? Are the components using supported technology? etc."
* NumberOfFailedFiles = 0 (losing some speed is acceptable; losing files is not)
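
For context, assuming continuous round-the-clock processing over two months (roughly 61 days, or about 1,464 hours), these targets imply a collection of roughly 1600 x 1464 ≈ 2.3 million objects and 25 GB x 1464 ≈ 36 TB of data.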

h2. Evaluations

[http://wiki.opf-labs.org/display/SP/EVAL-LSDR3-1]

Upcoming evaluations will compare several 1 TB migrations using different storage backends and JP2 codecs.
