h2. Investigator(s)
William Palmer, British Library
h2. Dataset
BL 19th Century Newspapers: [SP:BL 19th Century Digitized Newspapers]
h2. Platform
BL Hadoop Platform: [SP:BL Hadoop Platform]
h2. Workflow
The workflow has been implemented in three forms: as a Taverna workflow, as Java code, and as a batch file.
The latest Taverna workflow is here: [http://www.myexperiment.org/workflows/3401.html]
!http://www.myexperiment.org/workflows/3401/versions/2/previews/full|border=1,width=514,height=400!
The latest Java code, workflow, and batch files are here: [https://github.com/bl-dpt/chutney-hadoopwrapper/]
The latest (Taverna/Java) workflow contains the following steps:
* \*Recover TIFF file from storage (HDFS/Fedora/Webdav)
* Run Exiftool to extract metadata from TIFF
* Migrate TIFF->JP2 (using OpenJPEG/Kakadu)
* Run Exiftool to extract metadata from JP2
* Run Jpylyzer over the JP2
* \*Run Schematron validator over Jpylyzer outputs to validate conformance of migrated image to the specified profile
* Use ImageMagick to compare TIFF and JP2
* \*Create report
* Create output package (JP2, results, etc.)
* Post files back to relevant storage (see above)
Note that steps marked \* are not performed in the batch workflow. Sketches of the per-file tool chain and of the Schematron-style profile check follow below.
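As a rough illustration of the per-file chain above (extract, migrate, validate, compare), here is a minimal Java sketch using {{ProcessBuilder}}. It is not the project's actual code: it assumes exiftool, Kakadu's kdu_compress, jpylyzer and ImageMagick's compare are on the PATH, and it omits the storage recovery and repackaging steps; the wrapper's real invocations live in the chutney-hadoopwrapper repository linked above.
{code:java}
import java.io.File;
import java.io.IOException;

/**
 * Minimal sketch of the per-file chain (not the project's actual code).
 * Assumes exiftool, kdu_compress (Kakadu), jpylyzer and ImageMagick's
 * compare are all on the PATH.
 */
public class MigrationChainSketch {

    /** Run one external tool, logging stdout+stderr to a file. */
    static int run(File log, String... command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);
        pb.redirectOutput(log);
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        File tiff = new File(args[0]);
        File jp2 = new File(tiff.getName().replaceAll("\\.tiff?$", ".jp2"));

        // Extract metadata from the source TIFF (exiftool -X emits XML)
        run(new File(tiff.getName() + ".exiftool.xml"), "exiftool", "-X", tiff.getPath());

        // Migrate TIFF -> JP2 with Kakadu (OpenJPEG could be substituted here)
        int rc = run(new File(jp2.getName() + ".kdu.log"),
                "kdu_compress", "-i", tiff.getPath(), "-o", jp2.getPath());
        if (rc != 0) throw new IOException("Migration failed for " + tiff);

        // Extract metadata from the migrated JP2
        run(new File(jp2.getName() + ".exiftool.xml"), "exiftool", "-X", jp2.getPath());

        // Validate the JP2 structure with Jpylyzer (XML report on stdout)
        run(new File(jp2.getName() + ".jpylyzer.xml"), "jpylyzer", jp2.getPath());

        // Compare source and migrated image (PSNR goes to the log; no output image)
        run(new File(jp2.getName() + ".compare.log"),
                "compare", "-metric", "PSNR", tiff.getPath(), jp2.getPath(), "null:");
    }
}
{code}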
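The Schematron step checks the Jpylyzer report against an institutional JP2 profile. The sketch below shows the same kind of check done directly with XPath rather than Schematron: the {{isValidJP2}} element is genuine Jpylyzer output, while the decomposition-levels path and its expected value are assumptions standing in for the actual profile.
{code:java}
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

/**
 * Sketch of a profile check over a Jpylyzer XML report, using XPath in
 * place of the Schematron validator used by the real workflow.
 */
public class ProfileCheckSketch {
    public static void main(String[] args) throws Exception {
        Document report = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(args[0]));
        XPath xp = XPathFactory.newInstance().newXPath();

        // Jpylyzer reports "True" here when the file is well-formed JP2
        String valid = xp.evaluate("//isValidJP2", report);

        // Hypothetical profile assertion: five decomposition levels expected
        String levels = xp.evaluate("//cod/levels", report);

        boolean conforms = "True".equals(valid) && "5".equals(levels);
        System.out.println(conforms ? "Conforms to profile" : "FAILED profile check");
        if (!conforms) System.exit(1);
    }
}
{code}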
h2. Requirements and Policies
NumberOfObjectsPerHour >= 1600 (assuming the entire collection is to be processed within 2 months).
ThroughputGbytesPerHour >= 25 (assuming the entire collection is to be processed within 2 months).
OrganisationalFit = "Can this workflow/solution/its components be applied and used at the BL? Are the components using supported technology? etc."
NumberOfFailedFiles = 0 (we can afford to lose speed, but under no circumstances can we lose files).
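For context, a back-of-the-envelope check of what these targets imply, assuming a ~60-day, 24/7 processing window (the two-month window is stated above; the resulting collection-size and volume totals are derived, not stated elsewhere on this page):
{code:java}
/** Back-of-the-envelope check of the throughput targets, assuming a
 *  ~60-day, 24/7 processing window. */
public class ThroughputCheck {
    public static void main(String[] args) {
        double hours = 60 * 24;                  // ~1440 hours in two months
        double objects = 1600 * hours;           // => ~2.3 million objects
        double terabytes = 25 * hours / 1024.0;  // => ~35 TB at 25 GB/hour
        System.out.printf("%.1f million objects, ~%.0f TB%n",
                objects / 1e6, terabytes);
    }
}
{code}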
h2. Evaluations
[http://wiki.opf-labs.org/display/SP/EVAL-LSDR3-1]
Upcoming evaluations will compare several 1 TB migrations using different storage back-ends and JP2 codecs.
{pagetree:root=@self}