|Evaluation seq. num.|| Used only if subsequent evaluations of the same evaluation are done in another setup than a previous one.
In that case, copy the Evaluation specs table and fill out a new one with a new sequence number.
For the first evaluation, leave this field at "1".
|Evaluator-ID||[email protected]|| Unique ID of the evaluator that carried out this specific evaluator.
|Evaluation description||text||The evaluation of the mp3 to wav migration and QA workflow has three overall goals:|| Textual description of the evaluation and its overall goals
|Evaluation-Date||DD/MM/YY||13/11/12|| Date of evaluation
|Dataset(s)||URL(s)||mp3 (128kbit) with Danish Radio broadcasts|| Link to dataset page(s) on the WIKI.
For each dataset that is part of an evaluation,
make sure that the dataset is described here: Datasets
|Workflow method||string||Taverna|| Taverna / Commandline / Direct hadoop etc.
|Workflow(s) involved||URL(s)||myExperiment Workflow Entry: Mp3 To Wav Migrate QA CLI List Test|| Link(s) to myExperiment if applicable
|Tool(s) involved||URL(s)||The workflow uses the following tools|| Link(s) to distinct versions of specific components/tools in the component registry if applicable
|Link(s) to Scenario(s)||URL(s)||LSDRT6 Large scale migration from mp3 to wav|| Link(s) to scenario(s) if applicable
|Description||String||iapetus.statsbiblioteket.dk|| Human readable description of the "platform" - e.g. Bjarne's Linux PC|
|Total number of physical CPUs||integer||2|| Number of CPUs involved
|CPU specs||string||Intel® Xeon® Processor X5670 (12M Cache, 2.93 GHz, 6.40 GT/s Intel® QPI)|| Specification of CPUs
|Total number of CPU-cores||integer||12|| Number of CPU-cores involved
|Total amount of RAM in Gbytes||integer||96|| Total amount of RAM on all nodes
|Operating System||String||Linux||Linux (specific distribution), Windows (specific distribution), other?|
|Storage system/layer||String||NFS mounted files||NFS, HDFS, local files, ?|
Metrics must come from / be registered in the metrics catalogue.
|Metric||Baseline definition||Baseline value||Goal||Evaluation 1 (date)||Evaluation 2 (date)||Evaluation 3 (date)|
|NumberOfObjectsPerHour|| Performance efficiency - Capacity / Time behaviour
Number of mp3 files migrated and QA'ed (no manual spot checks). The QA performed as part of the workflow at the time of the baseline test is FFProbe Property Comparison, JHove2 File Format Validation and XCorrSound migrationQA content comparison. The mp3 files are 118 MB on average, and the two wav files produced as part of the workflow are 1.4 GB each on average. Thus a baseline value of 10 objects per hour means that we read 1.18 GB per hour and produce 28 GB per hour (plus some property and log files). The collection that we are targeting is 20 TB, or 175,000 files. At the baseline value we would be able to process this collection in a little over 2 years. The goal value is set so that we would be able to process the collection in a week.
Evaluation 1 (9th-13th November 2012): simple parallelisation. Started two parallel workflows using separate JHove2 installations, both on the same machine. Processed 879 + 877 = 1756 files in 4 days, 1 hour and 12 minutes.
|10 (test 2nd-16th October 2012)||1000||18 (9th-13th November 2012)|
|ReliableAndStableAssessment|| Reliability - Runtime stability
Manual assessment: the experiment performed reliably and stably for 13 days, but then Taverna failed with java.lang.OutOfMemoryError: Java heap space due to /tmp/ filling up. All results were however saved, and the workflow could simply be restarted with a new starting point in the input list.
| true (assessment October 16th 2012)
|NumberOfFailedFiles|| Reliability - Runtime stability
Files that fail are currently not handled consistently by the workflow, but we have so far not experienced any failed files.
|0 (test 2nd-16th October 2012)||0|
|QAFalseDifferentPercent|| Functional suitability - Correctness
This is a measure of how many content comparisons report the original and migrated files as different, even though the two files sound the same to the human ear. The parallel measure QAFalseSimilarPercent is how many content comparisons report the original and migrated files as similar, even though the two files sound different to the human ear. We have not experienced this - and we do not expect it to happen. We note that this measure is not improved by testbed improvements, but rather by improvements to the XCorrSound migrationQA content comparison tool in the PC.QA work package. The goal value is set to make manual checking feasible. The collection that we are targeting is 20 TB, or 175,000 files. With QAFalseDifferentPercent at .1%, we would still need to check 175 2-hour files manually...
Evaluation 1 (5th-9th November 2012): processed 728 files in 3 days, 21 hours and 17 minutes = 5597 minutes, which is 5597/728 = 7.7 minutes per file on average. The number of files that returned Failure (original and migrated different) is 3 in 728, or 0.412% of the files. We still need to check the failed files to see why they failed.
|161 in 3190 ~= 5% (test 2nd-16th October 2012)||.1%||0.412% (5th-9th November 2012)|
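As a sanity check, the QAFalseDifferentPercent figures above can be reproduced with a few lines of arithmetic. This is only an illustrative sketch; all numbers come from the evaluation table, and the variable names are our own.

```python
# Sanity-check the QAFalseDifferentPercent figures from the metric above.
# All numbers are taken from the evaluation table; names are illustrative.

failures, total = 3, 728                        # evaluation 1 (5th-9th November 2012)
false_different_pct = 100 * failures / total    # ~ 0.412 %

collection_files = 175_000
goal_pct = 0.1                                  # goal value from the table
manual_checks_at_goal = collection_files * goal_pct / 100  # 2-hour files to audit

print(f"false different: {false_different_pct:.3f} %")
print(f"manual checks at {goal_pct} % goal: {manual_checks_at_goal:.0f} files")
```

This confirms the 0.412% evaluation figure and that even at the .1% goal, 175 two-hour files would still need manual spot checks.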
We note that we would like to measure QAConfidenceInPercent - how sure are we of the QA? (Functional suitability - Correctness). This evaluation requires a ground truth that is not currently established.
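The NumberOfObjectsPerHour projections can likewise be verified with a short calculation. Again a sketch only, assuming the file sizes, file counts and durations stated in the table above; the variable names are our own.

```python
# Verify the NumberOfObjectsPerHour arithmetic from the evaluation table.
# All figures (file sizes, counts, durations) are taken from the table.

MP3_GB = 0.118            # average mp3 size: 118 MB
WAV_GB = 1.4              # average size of each of the two wav files produced
COLLECTION_FILES = 175_000

# Baseline: 10 objects/hour
baseline_rate = 10
read_gb_per_h = baseline_rate * MP3_GB            # GB of mp3 read per hour
written_gb_per_h = baseline_rate * 2 * WAV_GB     # GB of wav written per hour
years_at_baseline = COLLECTION_FILES / baseline_rate / (24 * 365)

# Evaluation 1: 879 + 877 files in 4 days, 1 hour and 12 minutes
files = 879 + 877
hours = 4 * 24 + 1 + 12 / 60
eval1_rate = files / hours                        # objects per hour

# Goal: 1000 objects/hour -> process the collection within a week
days_at_goal = COLLECTION_FILES / 1000 / 24

print(f"evaluation 1 rate: {eval1_rate:.1f} objects/hour")
print(f"baseline projection: {years_at_baseline:.2f} years")
print(f"goal projection: {days_at_goal:.1f} days")
```

The computed rate of roughly 18 objects/hour matches the Evaluation 1 cell, the baseline projects to a little over 2 years, and the goal of 1000 objects/hour projects to about a week, as stated in the baseline definition.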