Evaluation specs platform/system level

Field | Datatype | Value | Description
---|---|---|---
Evaluation seq. num. | int | 1 | Evaluation of LSDRT Scenario 2 - TIFF to JP2 migration and validation of the resultant JP2 files.
Evaluator-ID | string | [email protected] |
Evaluation description | text | The migration of TIFF files to JP2, followed by validation of the new JP2 files using Jpylyzer. The evaluation tests the processing speed, reliability and correctness of such a migration and of the tools used. |
Evaluation-Date | DD/MM/YYYY | 06/11/2012 |
Platform-ID | string | Platform BL-0 |
Dataset(s) | string | 30 master TIFF files from the JISC1 19th Century Digitised Newspapers collection (465 MB total) |
Workflow method | string | Hadoop calling command-line tools and Java code, one workflow per file. The code consists of two parts: a Java wrapper for Hadoop and a "workflow"-style Java class that is executed once per map/file. A text file containing the locations of the input files is given as input to the wrapper. Once per input file/map, the wrapper: copies the file from HDFS to local temporary storage for processing; calls the "workflow" class; stores the outputs from the workflow class in HDFS; and queries the workflow class for success/failure, reporting this in the final overall output from the wrapper (a CSV file: original name, success boolean, output filename). The "workflow" class: checksums the input file (Java code); extracts metadata from the input file (Exiftool); migrates the input file (OpenJPEG); extracts Jpylyzer info from the output file (Jpylyzer); extracts metadata from the output file (Exiftool); checks the Jpylyzer output against the JPEG 2000 profile used to encode the file (Java code); generates a short report containing Jpylyzer's isValidJP2 result and whether the JPEG 2000 profiles match (Java code); checksums all files (Java code); and zips all files in a BagIt-style structure (Java code). The output includes a log of all command lines run, with stdout/stderr from each tool. |
Workflow(s) involved | URL(s) | |
Tool(s) involved | URL(s) | Debian "testing", fairly up to date at time of test; OpenJPEG - n.b. the 1.3 version in the Debian "testing" repositories does not work with TIFF input files, so the 1.5.1 binaries must be built from source; Hadoop 1.0.4 (Apache-compiled .deb); Jpylyzer 1.6.3 (from GitHub, compiled using PyInstaller 2.0); Exiftool (from Debian testing); OpenJDK 6 (from Debian testing) |
Link(s) to Scenario(s) | URL(s) | LSDRT2+Validating+files+migrated+from+TIFF+to+JPEG2000 |
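The validation and reporting steps of the workflow class can be sketched as follows. This is an illustrative Python sketch, not the project's actual Java code; the XML sample and the property names checked are hypothetical stand-ins for real Jpylyzer output:

```python
import hashlib
import xml.etree.ElementTree as ET

def checksum(data: bytes) -> str:
    """MD5 checksum of a file's contents (the 'checksums the input file' step)."""
    return hashlib.md5(data).hexdigest()

def is_valid_jp2(jpylyzer_xml: str) -> bool:
    """Read Jpylyzer's isValidJP2 element from its XML output."""
    node = ET.fromstring(jpylyzer_xml).find(".//isValidJP2")
    return node is not None and node.text.strip() == "True"

def profiles_match(jpylyzer_xml: str, expected: dict) -> bool:
    """Check selected Jpylyzer properties against the encoding profile."""
    root = ET.fromstring(jpylyzer_xml)
    for tag, want in expected.items():
        node = root.find(".//" + tag)
        if node is None or node.text.strip() != want:
            return False
    return True

def report_line(original: str, success: bool, output: str) -> str:
    """One line of the wrapper's final CSV: original name, success boolean, output filename."""
    return f"{original},{str(success).lower()},{output}"

# Hypothetical Jpylyzer-style output for one migrated file:
sample = """<jpylyzer>
  <isValidJP2>True</isValidJP2>
  <properties><transformation>5-3 reversible</transformation></properties>
</jpylyzer>"""

ok = is_valid_jp2(sample) and profiles_match(sample, {"transformation": "5-3 reversible"})
print(report_line("page001.tif", ok, "page001.jp2"))
```

As the evaluation below notes, a pass from such a check is necessary but not sufficient: one file whose headers Jpylyzer assessed as valid was nonetheless corrupt.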
Platform BL 0

Field | Datatype | Value | Description
---|---|---|---
Platform-ID | String | Platform BL 0 |
Platform description | String | This is a pseudo-distributed, single-node Hadoop instance running on a virtual machine on our work laptops, used for our development. Initial evaluation will be performed on this platform, with the long-term goal of running against both the experimental DPT platform and the BL cluster. |
Number of nodes | integer | 1 |
Total number of physical CPUs | integer | 1 |
CPU specs | string | 1 Intel Core i5-2540M CPU @ 2.6GHz |
Total number of CPU-cores | integer | 1 |
Total amount of RAM in Gbytes | integer | 2 |
Average CPU-cores for nodes | integer | 1 |
Average RAM in Gbytes for nodes | integer | 2 |
Operating System on nodes | String | Debian "testing", fairly current as of test date |
Storage system/layer | String | HDFS on virtual disk |
Network layer between nodes | String | n/a |
Evaluation points

Metrics must come from / be registered in the metrics catalogue.

Metric | Baseline definition | Baseline value | Goal | Evaluation 1 (06-11-2012) | Evaluation 2 (date) | Evaluation 3 (date)
---|---|---|---|---|---|---
NumberOfObjectsPerHour | Processing speed with shell script | 50 | 1600** | 87.4 | |
ThroughputGbytesPerHour | Processing speed with shell script | 0.766 | 25** | 1.355 | |
ReliableAndStableAssessment | Reliability and correctness | | true | false - the workflow completed successfully and no failures were encountered at runtime. However, there is an incompatibility between OpenJPEG and the BL j2k profile: when coder bypass is enabled, the output files show compression artefacts. Also, one converted file failed to open and was corrupt, despite Jpylyzer assessing its headers as valid. This shows that Jpylyzer validation alone should not be relied on to check the success or otherwise of the migration. | |
OrganisationalFit | | | true | | |
NumberOfFailedFiles | Reliability | | 0 | 0* - no files failed during the workflow. However, on visually reviewing the files, one file was found that would not open in various programs, despite Jpylyzer assessing its headers as valid. | |
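As a consistency check, the two Evaluation 1 speed metrics can be derived from one another using the dataset description (30 files, 465 MB total). Decimal units (1 GB = 1000 MB) are assumed here, since they reproduce the reported throughput:

```python
# Cross-checking the Evaluation 1 figures against the dataset description.
# All input numbers are taken from the tables above.

files = 30
total_mb = 465.0
objects_per_hour = 87.4                     # NumberOfObjectsPerHour, Evaluation 1

mb_per_file = total_mb / files              # 15.5 MB per file
seconds_per_file = 3600 / objects_per_hour  # average wall-clock time per file
gb_per_hour = objects_per_hour * mb_per_file / 1000

print(f"{seconds_per_file:.1f} s/file, {gb_per_hour:.3f} GB/hour")
```

The derived throughput (~1.355 GB/hour) matches the ThroughputGbytesPerHour value reported for Evaluation 1, and the average of ~41.2 s/file gives a feel for the per-file cost on this single-node platform.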
* Previous tests were run on the same platform, but with different data, to compare the relative times taken by the following methods of executing a single command-line migration from TIFF to JP2 using OpenJPEG. Looking at the average runtime per file gave an indication of the average overhead per file for each method:

- Batch file: N/A (baseline)
- Hadoop - Java class calling the migration command line: 0.69s
- Hadoop - Java class executing the migration command line in a Taverna workflow via the Taverna command line tool: 10.17s
- Hadoop - Java class executing the migration command line in a Taverna workflow via a Taverna Server instance in Tomcat: 25.84s
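To put these per-file overheads in perspective, they can be scaled to the 2.2-million-image collection mentioned in the goal note. This is illustrative arithmetic only, assuming the overheads stay constant at scale:

```python
# Scaling the measured per-file overheads to the full JISC Newspapers
# collection (2.2 million images). Assumes constant per-file overhead.

images = 2_200_000
overheads_s = {
    "Java class calling command line": 0.69,
    "Taverna command line tool": 10.17,
    "Taverna Server in Tomcat": 25.84,
}
for method, secs in overheads_s.items():
    hours = images * secs / 3600
    print(f"{method}: {hours:,.0f} hours of overhead")
```

Even the smallest overhead adds hundreds of machine-hours at collection scale, which is why the direct command-line invocation matters for the throughput goals.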
** The goal values assume that we want to complete the migration of the JISC Newspapers collection (2.2 million images) over two months (60 days) and that the sample data used here is representative of the collection as a whole. These values are subject to change.
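The underlying arithmetic for the goal values can be reproduced from these assumptions. In this sketch (decimal units assumed, per-file size taken from the 30-file / 465 MB sample), the published goals of 1600 and 25 appear to round the exact figures up for headroom:

```python
# Reproducing the ** goal values: migrate 2.2 million images in 60 days,
# with the 30-file / 465 MB sample taken as representative.

images = 2_200_000
days = 60
mb_per_file = 465.0 / 30            # 15.5 MB, from the dataset description

objects_per_hour = images / (days * 24)              # ~1528, goal stated as 1600
gb_per_hour = objects_per_hour * mb_per_file / 1000  # ~23.7, goal stated as 25

print(f"{objects_per_hour:.0f} objects/hour, {gb_per_hour:.1f} GB/hour")
```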