Functional Evaluation

A functional evaluation was performed at the SCAPE developers workshop held on 23-25 April at the KB in The Hague.

Version 45cc6bc of the hawarp tool was used to migrate a batch of 406 ARC files from the KB's web archive to the WARC format, using the KB SCAPE Platform. The following steps were executed:

1. Build the main project.

The pom.xml was adapted because the KB uses a different version of CDH (CDH4) than the ONB (CDH3). Apart from specifying the correct version of the Hadoop libraries in the dependencies, no further changes were needed.

2. Build the module arc2warc-migration-cli, which performs ARC to WARC migration in local mode. 

In this step, a minor test failure was detected and quickly fixed by the developer of the tool. Once the module built successfully, it was run against the KB's 406 ARC files. This revealed that the dataset contains invalid ARC files (incorrect payload size), which cause the tool to exit in local mode. The invalid ARC files need further investigation.

3. Build the module arc2warc-migration-hdp, which performs ARC to WARC migration using Hadoop with files in HDFS.

With this module, the full batch of 406 ARC files was migrated to WARC in a single run. Some issues in the documentation regarding the tool's expected input format were clarified with the tool developer, and the documentation was updated accordingly.

4. Build the module droid-identify, to characterise the sample set of ARC files.

5. Build the module tomar-prepare-inputdata, to prepare the ARC files stored on HDFS for processing with the SCAPE ToMaR tool.

6. As a final step, the results from FITS identification are to be ingested into the c3po tool for visualization and presentation purposes (not done yet).
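
The invalid ARC files found in step 2 (incorrect payload size) could be located ahead of a migration run with a small pre-check. The following is only a minimal sketch and not part of hawarp: it assumes the uncompressed ARC layout in which each record starts with a space-separated header line whose last field is the payload length in bytes, followed by the payload and a newline. The function name find_bad_arc_records is hypothetical.

```python
import io


def find_bad_arc_records(stream):
    """Scan an uncompressed ARC byte stream and report suspicious records.

    Assumption: every record is a header line ending in the payload
    length, then that many payload bytes, then a newline separator.
    Returns a list of (offset, reason) tuples; scanning stops at the
    first problem because later offsets can no longer be trusted.
    """
    problems = []
    while True:
        offset = stream.tell()
        header = stream.readline()
        if not header:
            break  # clean end of file
        if header.strip() == b"":
            continue  # blank separator line between records
        fields = header.rstrip(b"\r\n").split(b" ")
        try:
            length = int(fields[-1])
        except ValueError:
            problems.append((offset, "header does not end in a length"))
            break
        payload = stream.read(length)
        if len(payload) < length:
            problems.append((offset, "payload shorter than declared"))
            break
        separator = stream.read(1)
        if separator not in (b"\n", b""):
            problems.append((offset, "no newline after declared payload"))
            break
    return problems


if __name__ == "__main__":
    # Made-up record whose declared length (3) does not match its payload.
    sample = b"http://example.org/ 127.0.0.1 20140423000000 text/plain 3\nhello\n"
    for offset, reason in find_bad_arc_records(io.BytesIO(sample)):
        print(offset, reason)
```

For gzip-compressed ARC files, a file object from Python's gzip.open(path, "rb") can be passed in place of the BytesIO stream. The check is heuristic: a wrong length that happens to land on a newline would go unnoticed.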

Evaluations

| Metric | PW catalogue URI | Datatype | Description | Example | Comments |
|---|---|---|---|---|---|
| NumberOfObjectsPerHour | | integer | Number of objects that can be processed per hour | 250 | Could be used both for component evaluations on a single machine and on entire platform setups |
| IdentificationCorrectnessInPercent | | integer | Defining a statistical measure for binary evaluations (see detailed specification below) | 85 % | Between 0 and 100 |
| ThroughputGbytesPerMinute | | integer | The throughput of data measured in Gbytes per minute | 5 | Specify in Gbytes per minute |
| ThroughputGbytesPerHour | | integer | The throughput of data measured in Gbytes per hour | 25 | Specify in Gbytes per hour |
| ReliableAndStableAssessment | | boolean | Manual assessment of whether the experiment performed reliably and stably | true | |
| NumberOfFailedFiles | | integer | Number of files that failed in the workflow | 0 | |
| NumberOfFailedFilesAcceptable | | boolean | Manual assessment of whether the number of files that fail in the workflow is acceptable | true | |
| QAFalseDifferentPercent | | integer | Number of content comparisons that report original and migrated as different, even though human spot checking finds them similar | 5 % | Between 0 and 100 |
| AverageRuntimePerItemInHours | | float | The average processing time in hours per item | 15 | Positive floating point number |
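
The rate-based metrics in the table can be derived from a handful of raw measurements of a run. The sketch below shows the arithmetic only. The function name evaluation_metrics and its parameters are hypothetical, and the numbers in the demo are made up for illustration, not results from this evaluation.

```python
def evaluation_metrics(objects, gbytes, failed_files, runtime_hours):
    """Derive the rate-based metrics from raw measurements of one run.

    objects        number of files processed
    gbytes         total input size in Gbytes
    failed_files   number of files that failed in the workflow
    runtime_hours  wall-clock runtime of the run in hours

    The throughput values are left unrounded here; the table declares
    them as integers, so round them when reporting if required.
    """
    if objects <= 0 or runtime_hours <= 0:
        raise ValueError("objects and runtime_hours must be positive")
    return {
        "NumberOfObjectsPerHour": int(objects / runtime_hours),
        "ThroughputGbytesPerMinute": gbytes / (runtime_hours * 60),
        "ThroughputGbytesPerHour": gbytes / runtime_hours,
        "NumberOfFailedFiles": failed_files,
        "AverageRuntimePerItemInHours": runtime_hours / objects,
    }


if __name__ == "__main__":
    # Illustrative inputs only: 100 files, 50 Gbytes, 2 hours, no failures.
    for name, value in evaluation_metrics(100, 50.0, 0, 2.0).items():
        print(name, value)
```

The manually assessed metrics (ReliableAndStableAssessment, NumberOfFailedFilesAcceptable) and the percentage metrics that depend on ground truth or spot checking are left out, since they cannot be computed from runtime figures alone.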