Evaluation specs component level

Evaluation seq. num.
Use only if a subsequent run of the same evaluation is done in a different setup than a previous one.
In that case copy the Evaluation specs table and fill out a new one with a new sequence number.
For the first evaluation leave this field at "1"
Evaluator-ID email [email protected] Unique ID of the evaluator that carried out this specific evaluation.
Evaluation description text The evaluation of the mp3 to wav migration and QA workflow has three overall goals:
  • Scalability The workflow must be able to process a large collection within reasonable time. That is, we want to be able to migrate and QA a large collection of radio broadcast mp3 files (20 Tbytes, 175.000 files) within weeks rather than years.
  • Reliability The workflow must run reliably without failing on a large number of files, and it must be possible to restart the workflow without losing work.
  • Correctness We must have some confidence that the QA is correct. When a migrated file passes the QA, we should be able to say that we are y% certain that the migration was correct. This depends on the individual tools in the workflow.
Textual description of the evaluation and the overall goals
Evaluation-Date DD/MM/YY 13/11/12 Date of evaluation
Dataset(s) string
mp3 (128kbit) with Danish Radio broadcasts Link to dataset page(s) on WIKI
For each dataset that is a part of an evaluation
make sure that the dataset is described here: Datasets
Workflow method string
Taverna Taverna / Commandline / Direct hadoop etc...
Workflow(s) involved
myexperiment Workflow Entry: Mp3 To Wav Migrate QA CLI List Test Link(s) to MyExperiment if applicable
Tool(s) involved
URL(s) The workflow uses the following tools Link(s) to distinct versions of specific components/tools in the component registry if applicable
Link(s) to Scenario(s) URL(s)
LSDRT6 Large scale migration from mp3 to wav
Link(s) to scenario(s) if applicable

Technical setup

Description String Human readable description of the "platform" - e.g. Bjarne's Linux PC
Total number of physical CPUs integer 2 Number of CPUs involved
CPU specs string Intel® Xeon® Processor X5670 
(12M Cache, 2.93 GHz, 6.40 GT/s Intel® QPI)
Specification of CPUs
Total number of CPU-cores integer 12 Number of CPU-cores involved
Total amount of RAM in Gbytes
integer 96 Total amount of RAM on all nodes
Operating System
String Linux Linux (specific distribution), Windows (specific distribution), other?
Storage system/layer String NFS mounted files NFS, HDFS, local files, ?

Evaluation points

metrics must come from / be registered in the metrics catalogue

Metric / Baseline definition / Baseline value / Goal / Evaluation 1 (date) / Evaluation 2 (date) / Evaluation 3 (date)
NumberOfObjectsPerHour Performance efficiency - Capacity / Time behaviour
Number of mp3 files migrated and QA'ed (no manual spot checks). The QA performed as part of the workflow at the time of the baseline test is FFProbe Property Comparison, JHove2 File Format Validation and XCorrSound migrationQA content comparison. The mp3 files are 118 Mb on average, and the two wav files produced as part of the workflow are 1.4 Gb on average. Thus a baseline value of 10 objects per hour means that we read 1.18 Gb per hour and produce 28 Gb per hour (plus some property and log files). The collection that we are targeting is 20 Tbytes or 175.000 files. At the baseline value we would be able to process this collection in a little over 2 years. The goal value is set so that we would be able to process the collection in a week.
Evaluation 1 (9th-13th November 2012). Simple parallelisation: started two parallel workflows using separate JHove2 installations, both on the same machine. Processed 879 + 877 = 1756 files in 4 days, 1 hour and 12 minutes.
Baseline value: 10 (test 2nd-16th October 2012). Goal: 1000. Evaluation 1: 18 (9th-13th November 2012)
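The baseline and goal figures above can be cross-checked with a short calculation. The following is a sketch; all file counts and average sizes are taken from the text on this page, nothing is measured anew:

```python
# Sketch: cross-check the NumberOfObjectsPerHour baseline and goal figures.
# Inputs are the averages quoted on this page.
AVG_MP3_GB = 0.118          # average mp3 size: 118 Mb
AVG_WAV_GB = 1.4            # average wav size; the workflow writes two wavs per mp3
COLLECTION_FILES = 175_000  # target collection: ~20 Tbytes

def projection(objects_per_hour: float) -> dict:
    """Data rates and total wall-clock time for a given throughput."""
    hours = COLLECTION_FILES / objects_per_hour
    return {
        "read_gb_per_hour": objects_per_hour * AVG_MP3_GB,
        "write_gb_per_hour": objects_per_hour * 2 * AVG_WAV_GB,
        "days_for_collection": hours / 24,
    }

baseline = projection(10)   # ~1.18 Gb/h read, 28 Gb/h written, ~729 days
goal = projection(1000)     # finishes the collection in roughly a week
```

At the baseline of 10 objects/hour the projection gives a little over two years for the whole collection, matching the text; at the goal of 1000 objects/hour it drops to about 7.3 days.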
ReliableAndStableAssessment Reliability - Runtime stability
Manual assessment: the experiment performed reliably and stably for 13 days, but then Taverna failed with java.lang.OutOfMemoryError: Java heap space due to /tmp/ being filled up. All results were, however, saved, and the workflow could simply be restarted with a new starting point in the input list.
Baseline value: true (assessment October 16th 2012)
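The restart behaviour described above works because the workflow is driven by a plain input list: resuming after a crash amounts to skipping entries whose results were already saved. A minimal sketch of that idea (the file names are hypothetical, for illustration only):

```python
# Sketch: resume a list-driven workflow by skipping already-completed entries.
# The input list and the "completed" set are hypothetical examples.
def remaining_work(input_list: list[str], completed: set[str]) -> list[str]:
    """Return the input entries that still need processing, in original order."""
    return [path for path in input_list if path not in completed]

inputs = ["a.mp3", "b.mp3", "c.mp3", "d.mp3"]
done = {"a.mp3", "b.mp3"}            # results saved before the crash
print(remaining_work(inputs, done))  # → ['c.mp3', 'd.mp3']
```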
NumberOfFailedFiles Reliability - Runtime stability
Files that fail are currently not handled consistently by the workflow, but we have so far not experienced any failed files.
Baseline value: 0 (test 2nd-16th October 2012). Goal: 0
QAFalseDifferentPercent Functional suitability - Correctness
This is a measure of how many content comparisons report the original and migrated files as different, even though the two files sound the same to the human ear. The parallel measure QAFalseSimilarPercent is how many content comparisons report the files as similar, even though they sound different to the human ear. We have not experienced this, and we do not expect it to happen. We note that this measure is not improved by testbed improvements, but rather by improvements to the XCorrSound migrationQA content comparison tool in the PC.QA work package. The goal value is set to make manual checking feasible. The collection that we are targeting is 20 Tbytes or 175.000 files. With QAFalseDifferentPercent at .1%, we would still need to check 175 2-hour files manually...
Evaluation 1 (5th-9th November 2012). Processed 728 files in 3 days, 21 hours and 17 minutes = 5597 minutes, which is 5597/728 = 7.7 minutes per file on average. The number of files which returned Failure (original and migrated different) is 3 in 728, or 0.412 % of the files. We still need to check the failed files to see why they failed.
Baseline value: 161 in 3190 ~= 5% (test 2nd-16th October 2012). Goal: .1%. Evaluation 1: 0.412 % (5th-9th November 2012)
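The false-different rates and the resulting manual-checking burden can be recomputed directly from the raw counts quoted above. A sketch using only the numbers from this page:

```python
# Sketch: recompute QAFalseDifferentPercent and the manual-check burden.
COLLECTION_FILES = 175_000  # target collection size from this page

def false_different_percent(failures: int, total: int) -> float:
    """Share of content comparisons flagged 'different', in percent."""
    return 100.0 * failures / total

def manual_checks(rate_percent: float) -> int:
    """Number of 2-hour files a human would have to listen to."""
    return round(COLLECTION_FILES * rate_percent / 100.0)

baseline_rate = false_different_percent(161, 3190)  # ~5.05 % (October test)
eval1_rate = false_different_percent(3, 728)        # ~0.412 % (November run)
checks_at_goal = manual_checks(0.1)                 # 175 files at the .1% goal
```

This confirms the figures in the table: the October baseline is about 5%, evaluation 1 is 0.412 %, and even at the goal rate of .1% there remain 175 files to check by ear.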

We note that we would like to measure QAConfidenceInPercent - how sure are we of the QA? (Functional suitability - Correctness) This evaluation requires a ground truth that is not currently established.
