Functional Evaluation
A functional evaluation was performed at the SCAPE developers workshop, held 23-25 April at the KB in The Hague.
Version 45cc6bc of the hawarp tool was used to migrate a batch of 406 ARC files from the KB's web archive to the WARC format, using the KB SCAPE Platform. The following steps were executed:
1. Build the main project.
Some adaptations were made to the pom.xml to account for the fact that a different version of CDH is used at KB (CDH4) than at ONB (CDH3). Apart from pointing the Hadoop library dependencies at the correct version, no further changes were needed (a build sketch covering this step and the module builds of steps 4 and 5 follows the list).
2. Build the module arc2warc-migration-cli, which performs ARC to WARC migration in local mode.
In this step, a minor test failure was detected, which was quickly fixed by the developer of the tool. Once the module was built successfully, it was executed against the 406 ARC files of KB (see the local invocation sketch below). This revealed that the dataset contains invalid ARC files (with incorrect payload sizes), which cause the tool to exit in local mode. The invalid ARC files will need to be investigated further.
3. Build the module arc2warc-migration-hdp, which performs ARC to WARC migration using Hadoop with files in HDFS.
With this module it was possible to migrate the full batch of 406 ARC files to WARC in a single run (see the Hadoop invocation sketch below). Some issues in the documentation regarding the expected input format for the tool were clarified with the tool developer, and the documentation was updated accordingly.
4. Build the module droid-identify, to characterise the sample set of ARC files.
5. Build the module tomar-prepare-inputdata, to prepare the ARC files stored on HDFS for processing with the SCAPE ToMaR tool.
6. As a final step, the results of FITS identification are to be ingested into the c3po tool for visualization and presentation purposes (not done yet).
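
The following is a minimal sketch of the builds in steps 1, 4, and 5. The repository URL is an assumption, and `-pl`/`-am` are standard Maven options for building individual modules of a multi-module project:

```bash
# Build the main project (step 1); the repository URL is assumed.
git clone https://github.com/openplanets/hawarp.git
cd hawarp
git checkout 45cc6bc   # the version used in this evaluation
# pom.xml was edited at this point so that the Hadoop dependencies match
# the CDH4 libraries used at KB (ONB uses CDH3).
mvn clean install

# Build individual modules (steps 4 and 5):
mvn clean install -pl droid-identify -am
mvn clean install -pl tomar-prepare-inputdata -am
```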
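
For step 2, a hypothetical local-mode invocation; the jar name follows Maven's jar-with-dependencies convention, and the `-i`/`-o` flags are assumptions rather than the tool's verified interface:

```bash
# Run the local ARC-to-WARC migration (step 2) against a directory of
# ARC files; flag names and paths are illustrative assumptions.
java -jar arc2warc-migration-cli/target/arc2warc-migration-cli-*-jar-with-dependencies.jar \
     -i /data/kb-web-archive/arc \
     -o /data/kb-web-archive/warc
```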
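
For step 3, a similar sketch of the Hadoop run, assuming the ARC files are first copied into HDFS; HDFS paths, the file extension, and the flags are again illustrative assumptions:

```bash
# Load the 406 ARC files into HDFS, then run the migration job (step 3).
hadoop fs -mkdir /user/scape/arc-input
hadoop fs -put /data/kb-web-archive/arc/*.arc.gz /user/scape/arc-input/
hadoop jar arc2warc-migration-hdp/target/arc2warc-migration-hdp-*-jar-with-dependencies.jar \
     -i hdfs:///user/scape/arc-input \
     -o hdfs:///user/scape/warc-output
```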
Evaluations
Metric | PW catalogue URI | Datatype | Description | Example | Comments
---|---|---|---|---|---
NumberOfObjectsPerHour | | integer | Number of objects that can be processed per hour | 250 | Can be used both for component evaluations on a single machine and for entire platform setups
IdentificationCorrectnessInPercent | | integer | Defines a statistical measure for binary evaluations; see detailed specification below | 85 % | Between 0 and 100
ThroughputGbytesPerMinute | | integer | The throughput of data measured in Gbytes per minute | 5 | Specify in Gbytes per minute
ThroughputGbytesPerHour | | integer | The throughput of data measured in Gbytes per hour | 25 | Specify in Gbytes per hour
ReliableAndStableAssessment | | boolean | Manual assessment of whether the experiment performed reliably and stably | true |
NumberOfFailedFiles | | integer | Number of files that failed in the workflow | 0 |
NumberOfFailedFilesAcceptable | | boolean | Manual assessment of whether the number of files that fail in the workflow is acceptable | true |
QAFalseDifferentPercent | | integer | Percentage of content comparisons that find original and migrated files different, even though human spot checking finds them similar | 5 % | Between 0 and 100
AverageRuntimePerItemInHours | | float | The average processing time in hours per item | 15 | Positive floating point number
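
To make the throughput metrics concrete, a small worked example follows; apart from the 406-file batch size, all numbers below are placeholders, not measured results.

```bash
# Worked example for NumberOfObjectsPerHour and ThroughputGbytesPerHour.
# Data volume and runtime are hypothetical placeholders.
FILES=406     # objects in the evaluation batch
GBYTES=40     # total data volume in Gbytes (placeholder)
HOURS=1.5     # wall-clock runtime in hours (placeholder)
echo "NumberOfObjectsPerHour:  $(echo "$FILES / $HOURS" | bc)"                # ~270
echo "ThroughputGbytesPerHour: $(echo "scale=1; $GBYTES / $HOURS" | bc)"      # ~26.6
echo "AverageRuntimePerItemInHours: $(echo "scale=4; $HOURS / $FILES" | bc)"  # ~0.0036
```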