h1. Evaluation specs component level

|| Field \\ || Datatype \\ || Value \\ || Description \\ ||
| Evaluation seq. num. \\ | int \\ | 1 \\ | Use only if a subsequent run of the same evaluation is done in another setup than a previous one. \\
In that case copy the Evaluation specs table and fill out a new one with a new sequence number. \\
For the first evaluation leave this field at "1" \\ |
| Evaluator-ID | email | [email protected] | Unique ID of the evaluator that carried out this specific evaluation. \\ |
| Evaluation description | text | The evaluation of the _mp3 to wav migration and QA workflow_ has three overall goals (a command-level sketch of the workflow is given below this table): \\
* *Scalability* The workflow must be able to process a large collection within reasonable time. That is, we want to be able to migrate and QA a large collection of radio broadcast mp3-files (20 Tbytes - 175.000 files) within weeks rather than years.
* *Reliability* The workflow must run reliably without failing on a large number of files, and it must be possible to restart the workflow without losing work.
* *Correctness* We must have some degree of confidence that the QA is correct. When a migrated file passes the QA, we should be able to say that we are y% certain that the migration was correct. This depends on the individual tools in the workflow. | Textual description of the evaluation and the overall goals \\ |
| Evaluation-Date | DD/MM/YY | 13/11/12 | Date of evaluation \\ |
| Dataset(s) | string \\ | [mp3 (128kbit) with Danish Radio broadcasts|Danish Radio broadcasts, mp3] | Link to dataset page(s) on WIKI. \\
For each dataset that is part of an evaluation, \\
make sure that the dataset is described here: [SP:Datasets] \\ |
| Workflow method | string \\ | Taverna | Taverna / Command line / Direct Hadoop etc. \\ |
| Workflow(s) involved \\ | URL(s) \\ | [myexperiment Workflow Entry: Mp3 To Wav Migrate QA CLI List Test |http://www.myexperiment.org/workflows/3292.html] | Link(s) to MyExperiment *if applicable* \\ |
| Tool(s) involved \\ | URL(s) | The workflow uses the following tools:
* [FFmpeg|http://wiki.opf-labs.org/display/TR/FFmpeg]
* [Ffprobe|http://wiki.opf-labs.org/display/TR/Ffprobe]
* [JHOVE2|http://wiki.opf-labs.org/display/TR/JHOVE2]
* [MPG321|http://wiki.opf-labs.org/display/TR/MPG321]
* [xcorrSound|http://wiki.opf-labs.org/display/TR/xcorrSound] | Link(s) to distinct versions of specific components/tools in the component registry *if applicable* \\ |
| Link(s) to Scenario(s) | URL(s) \\ | [LSDRT6 Large scale migration from mp3 to wav|http://wiki.opf-labs.org/display/SP/LSDRT6+Large+scale+migration+from+mp3+to+wav] \\ | Link(s) to scenario(s) *if applicable* \\ |
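
The workflow is orchestrated in Taverna, but each step is an ordinary command-line tool invocation. As an illustration only, the following is a minimal Python sketch of the per-file pipeline; the tool options, the JHOVE2 executable name and the xcorrSound migrationQA calling convention are assumptions made for the sketch, not the exact parameters of the evaluated workflow.

{code:python}
import subprocess
from pathlib import Path


def run(cmd):
    """Run a command-line tool, raising CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True, capture_output=True)


def probe(path: Path) -> str:
    """Return ffprobe's stream property listing for an audio file."""
    return subprocess.run(["ffprobe", "-show_streams", str(path)],
                          check=True, capture_output=True, text=True).stdout


def migrate_and_qa(mp3: Path, workdir: Path) -> bool:
    """Migrate one mp3 to wav and QA the result.

    Mirrors the overall workflow steps: migration, property comparison,
    format validation and content comparison.
    """
    migrated = workdir / (mp3.stem + ".wav")
    control = workdir / (mp3.stem + ".control.wav")

    # 1. Migration: decode the mp3 to wav with FFmpeg.
    run(["ffmpeg", "-y", "-i", str(mp3), str(migrated)])

    # 2. Property comparison: extract stream properties of the original
    #    and the migrated file with ffprobe; the real workflow compares a
    #    selected subset (e.g. duration, channels, sample rate), elided here.
    original_props, migrated_props = probe(mp3), probe(migrated)

    # 3. File format validation of the migrated wav with JHOVE2
    #    (the executable name/path of the installation is an assumption).
    run(["jhove2", str(migrated)])

    # 4. Independent second decoding with mpg321, producing the control
    #    wav for the content comparison (-w writes wav output).
    run(["mpg321", "-w", str(control), str(mp3)])

    # 5. Content comparison of the migrated and control wav files with
    #    xcorrSound migrationQA; a non-zero exit code is taken here as
    #    "original and migrated different".
    result = subprocess.run(["migrationQA", str(migrated), str(control)])
    return result.returncode == 0
{code}

Note that the two wav files mentioned under _NumberOfObjectsPerHour_ below are the FFmpeg-migrated file and the mpg321 control file: each mp3 is decoded twice.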

h1. Technical setup

|| Field \\ || Datatype \\ || Value \\ || Description \\ ||
| Description | String | iapetus.statsbiblioteket.dk | Human-readable description of the "platform", e.g. Bjarne's Linux PC |
| Total number of physical CPUs | integer | 2 | Number of CPUs involved \\ |
| CPU specs | string | Intel® Xeon® Processor X5670  \\
(12M Cache, 2.93 GHz, 6.40 GT/s Intel® QPI) | Specification of CPUs \\ |
| Total number of CPU-cores | integer | 12 | Number of CPU-cores involved \\ |
| Total amount of RAM in Gbytes \\ | integer | 96 | Total amount of RAM on all nodes \\ |
| Operating System \\ | String | Linux | Linux (specific distribution), Windows (specific distribution), other? |
| Storage system/layer | String | NFS mounted files | NFS, HDFS, local files, ? |

h1. Evaluation points

Metrics must come from / be registered in the [metrics catalogue|Metrics Catalogue].

|| Metric || Baseline definition || Baseline value || Goal || Evaluation 1 (date) \\ || Evaluation 2 (date) \\ || Evaluation 3 (date) \\ ||
| NumberOfObjectsPerHour | *Performance efficiency - Capacity / Time behaviour* \\
Number of mp3 files migrated and QA'ed (no manual spot checks). The QA performed as part of the workflow at the time of the baseline test is FFprobe property comparison, JHOVE2 file format validation and xcorrSound migrationQA content comparison. The mp3 files are 118 MB on average, and the two wav files produced as part of the workflow are 1.4 GB on average. Thus a baseline value of 10 objects per hour means that we process 1.18 GB per hour and produce 28 GB per hour (\+ some property and log files). The collection that we are targeting is 20 Tbytes or 175.000 files. At the baseline value we would be able to process this collection in a little over 2 years. The goal value is set so that we would be able to process the collection in a week. \\
Evaluation 1 (9th-13th November 2012). Simple parallelisation: started two parallel workflows using separate JHOVE2 installations, both on the same machine. Processed 879+877 = 1756 files in 4 days, 1 hour and 12 minutes. \\ | 10 (test 2nd-16th October 2012) \\ | 1000 | 18 (9th-13th November 2012) | | |
| ReliableAndStableAssessment | *Reliability - Runtime stability* \\
Manual assessment: the experiment performed reliably and stably for 13 days, but then Taverna failed with java.lang.OutOfMemoryError: Java heap space due to /tmp/ being filled up. All results were however saved, and the workflow could simply be restarted with a new starting point in the input list. \\ | true (assessment October 16th 2012) \\ | true | | | |
| NumberOfFailedFiles | *Reliability - Runtime stability* \\
Files that fail are currently not handled consistently by the workflow, but we have so far not experienced any failed files. \\ | 0 (test 2nd-16th October 2012) | 0 | | | |
| QAFalseDifferentPercent \\ | *Functional suitability - Correctness* \\
This is a measure of how many content comparisons result in _original and migrated different_, even though the two files sound the same to the human ear. The parallel measure _QAFalseSimilarPercent_ is how many content comparisons result in _original and migrated similar_, even though the two files sound different to the human ear. We have *not* experienced this - and we do not expect it to happen. We note that this measure is not improved by testbed improvements, but rather by improvements to the xcorrSound migrationQA content comparison tool in the PC.QA work package. The goal value is set to make manual checking feasible. The collection that we are targeting is 20 Tbytes or 175.000 files. Even with _QAFalseDifferentPercent_ at .1%, we would still need to check 175 2-hour files manually... \\
Evaluation 1 (5th-9th November 2012). Processed 728 files in 3 days, 21 hours and 17 minutes = 5597 minutes, which is 5597/728 = 7.7 minutes per file on average. The number of files which returned Failure (original and migrated different) is 3 in 728, or 0.412 % of the files. We still need to check the failed files to see why they failed. | 161 in 3190 \~= 5% (test 2nd-16th October 2012) | .1% \\ | 0.412 % (5th-9th November 2012) | | |
We note that we would also like to measure _QAConfidenceInPercent_ \- how sure are we of the QA? (Functional suitability - Correctness). This evaluation requires a _ground truth_ that is not currently established.
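
For reference, the throughput and manual-checking figures quoted in the table follow from simple arithmetic over the target collection; the sketch below (plain Python, not part of the workflow) reproduces the quoted numbers.

{code:python}
# Back-of-the-envelope arithmetic behind the evaluation points above.
FILES = 175_000  # target collection: ~20 Tbytes of mp3

# Processing time at different throughputs (objects per hour):
for label, per_hour in [("baseline", 10), ("evaluation 1", 18), ("goal", 1000)]:
    days = FILES / per_hour / 24
    print(f"{label}: {per_hour} objects/hour -> {days:.0f} days")
# baseline: 10 objects/hour -> 729 days (a little over 2 years)
# evaluation 1: 18 objects/hour -> 405 days
# goal: 1000 objects/hour -> 7 days (about a week)

# Manual checking effort at a given QAFalseDifferentPercent:
for label, pct in [("baseline", 5.0), ("evaluation 1", 0.412), ("goal", 0.1)]:
    print(f"{label}: {pct}% false different -> {FILES * pct / 100:.0f} files to check")
# baseline: 8750 files; evaluation 1: 721 files; goal: 175 files
{code}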