Develop a QA tool that can detect these out-of-sync errors. Note: to evaluate the QA test during development, manual assessment of passed files is necessary, but this should not be necessary in the final solution.
Is there another tool that can migrate these files? Or can we fix the out-of-sync files?
* For format migration the following tools are available:
The workflow that migrates the files must also perform QA that convinces us the migrated files have no out-of-sync errors, and it must do so with reasonable performance:
* process 20 files per hour per node
* 99% of files pass automatic QA test
* any 'failed' files should be checked manually; they should not be 'false positives' (successful migrations with no out-of-sync errors that the QA wrongly flagged)
Links to actual evaluations of this Issue/Scenario
SO5 Video Migration and QA
In earlier video migrations at SB, we experienced out-of-sync errors. These errors may be due to the sound being shifted. As part of the QA in this workflow, it is therefore explicitly checked whether the sound has been 'shifted' and whether we can find an overlap match. The workflow in short:
wmv files are migrated to video-format X (mpeg-2 for example) using FFmpeg
validation of migrated file format
extraction and comparison of properties of original and migrated file
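The migration step above can be sketched as an FFmpeg invocation. This is a minimal illustration, assuming FFmpeg is used as described; the file names and codec options are assumptions, not the exact parameters of the SCAPE workflow:

```python
# Sketch of the wmv -> MPEG-2 migration step. File names and codec
# choices below are illustrative assumptions, not the workflow's
# actual configuration.

def build_migration_cmd(src: str, dst: str) -> list[str]:
    """Build an FFmpeg command line migrating a wmv file to MPEG-2."""
    return [
        "ffmpeg",
        "-i", src,             # input wmv file
        "-c:v", "mpeg2video",  # target video codec ("video-format X" = MPEG-2)
        "-c:a", "mp2",         # MPEG audio layer II, common for MPEG-2 containers
        dst,
    ]

cmd = build_migration_cmd("broadcast.wmv", "broadcast.mpg")
print(" ".join(cmd))
```

In the real workflow this command would be run (e.g. via `subprocess.run(cmd, check=True)`) before the validation and property-comparison steps.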
The xcorrSound QA audio comparison tool compares two sound waves, for instance the audio waves of an original audio file and the audio waves of a migrated file. The tool uses the cross-correlation function to find the best overlap match. This gives us a match score (between 0 and 1) and, if the audio has been shifted in the migration (we have examples of this happening), an offset into the second file where the match occurs. This is not a full solution in itself, but a tool used as part of the workflows in full solutions to a number of issues.
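This is not the xcorrSound implementation, but the core idea of finding a shift via normalised cross correlation can be shown with a toy sketch (pure Python, short integer lists standing in for audio sample streams):

```python
def best_offset(a, b, max_shift):
    """Find the shift of b (in samples) that best aligns it with a,
    using normalised cross correlation.
    Returns (offset, score), where score is in [0, 1] for a match."""
    def ncc(x, y):
        # Normalised cross correlation over the overlapping prefix.
        n = min(len(x), len(y))
        x, y = x[:n], y[:n]
        dot = sum(p * q for p, q in zip(x, y))
        nx = sum(p * p for p in x) ** 0.5
        ny = sum(q * q for q in y) ** 0.5
        return dot / (nx * ny) if nx and ny else 0.0

    best = (0, ncc(a, b))
    for shift in range(1, max_shift + 1):
        score = ncc(a, b[shift:])   # b delayed by `shift` samples
        if score > best[1]:
            best = (shift, score)
    return best

# Toy signals: b is a copy of a with 3 junk samples prepended,
# mimicking audio shifted during migration.
a = [0, 1, 2, 3, 2, 1, 0, -1, -2, -1, 0, 1]
b = [9, 9, 9] + a
offset, score = best_offset(a, b, max_shift=5)
print(offset, round(score, 3))   # best match at shift 3
```

A production tool like xcorrSound works on real audio chunks and is far more efficient (e.g. FFT-based correlation), but the match score and offset it reports have this same meaning.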
Performance efficiency - Capacity / Time behaviour
NumberOfObjectsPerHour: Number of mp3 files migrated and QA'ed.
The QA performed as part of the issue IS21 Migration of mp3 to wav workflow at the time of the baseline test comprises FFProbe Property Comparison, JHove2 File Format Validation and XCorrSound migrationQA content comparison. The mp3 files are 118Mb on average, and the two wav files produced as part of the workflow are 1.4Gb each on average. Thus a baseline value of 10 objects per hour means that we read 1.18Gb per hour and produce 28Gb per hour (plus some property and log files). The collection that we are targeting is 20 Tbytes, or 175,000 files. With the baseline value of 10 we would be able to process this collection in a little over 2 years. The goal value is set at 1000, so that we would be able to process the collection in about a week.
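The throughput arithmetic above can be reproduced in a few lines (all numbers taken from the text):

```python
# Back-of-the-envelope check of the throughput figures (sizes in Gb as quoted).
files = 175_000            # target collection size
mp3_gb, wav_gb = 0.118, 1.4

def plan(rate_per_hour):
    """Processing time and I/O volume at a given objects-per-hour rate."""
    hours = files / rate_per_hour
    return {
        "days": hours / 24,
        "read_gb_per_h": rate_per_hour * mp3_gb,
        "write_gb_per_h": rate_per_hour * 2 * wav_gb,  # two wav files per object
    }

baseline, goal = plan(10), plan(1000)
print(f"baseline: {baseline['days']:.0f} days, goal: {goal['days']:.1f} days")
```

At the baseline rate this gives roughly 729 days (a little over 2 years) and 28Gb written per hour; at the goal rate, about a week.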
Evaluation 1 (9th-13th November 2012). Simple parallelisation: two parallel workflows were started using separate JHove2 installations, both on the same machine. They processed 879+877 = 1756 files in 4 days, 1 hour and 12 minutes, giving NumberOfObjectsPerHour = 18.
Functional suitability - Correctness
QAFalseDifferentPercent: This is a measure of how many content comparisons report original and migrated as different, even though the two files sound the same to the human ear.
The parallel measure QAFalseSimilarPercent is how many content comparisons report original and migrated as similar, even though the two files sound different to the human ear. We have not experienced this, and we do not expect it to happen. We note that this measure is not improved by testbed improvements, but rather by improvements to the XCorrSound migrationQA content comparison tool in the PC.QA work package. The baseline value is 161 in 3190 ~= 5% (test 2nd-16th October 2012). The goal value of 0.1% is set to make manual checking feasible. The collection that we are targeting is 20 Tbytes, or 175,000 files. Even with QAFalseDifferentPercent at 0.1%, we would still need to check 175 two-hour files manually (with the help of the XCorrSound migrationQA tool with the --verbose flag set).
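The baseline and goal figures can be reproduced directly:

```python
# Reproduce the QAFalseDifferentPercent figures from the text.
files = 175_000                    # target collection

baseline_rate = 161 / 3190         # test 2nd-16th October 2012, ~5%
goal_rate = 0.001                  # goal value: 0.1%

baseline_checks = baseline_rate * files   # manual checks at the baseline rate
goal_checks = goal_rate * files           # manual checks at the goal rate
print(f"{baseline_rate:.1%} -> {baseline_checks:.0f} checks, "
      f"{goal_rate:.1%} -> {goal_checks:.0f} checks")
```

At the ~5% baseline rate, nearly 9000 two-hour files would need manual checking; the 0.1% goal brings this down to 175.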
Evaluation 1 (5th-9th November 2012). Processed 728 files in 3 days, 21 hours and 17 minutes = 5597 minutes, which is 5597/728 = 7.7 minutes per file on average. The number of files that returned Failure (original and migrated different) is 3 in 728, or 0.412% of the files. It turns out that these "false negatives" may not be false negatives after all, but actual migration errors that were caught: the migration was done with FFmpeg version 0.10 and the errors can be re-created using this version, but they do not occur with FFmpeg versions >= 1.0.
The testbed dataset Danish Radio broadcasts, mp3, which is the basis of issue IS21 Migration of mp3 to wav, consists of two-hour mp3 files (average file size: 118Mb). When these two-hour files are migrated to wav, we get an average file size of around 1.4Gb. In February 2012 the migrationQA workflow did not scale nicely to these two-hour sound files, but we have run a test on a file cut to 12Mb (about a tenth of the original size) using dd.

The Mp3 to Wav Migrate Validate Compare Workflow (see SO4 Audio mp3 to wav Migration and QA Workflow) used only 34 seconds on the cut file. The migration and the file format validation were successful, but the property comparison reported that the files were not 'close enough'. The reason is that cutting the file does not change the header information, so the duration of the original cut file is supposedly 2 hours, 2 minutes and 5.23 seconds, while the duration of the migrated file is 13 minutes and 6.38 seconds.

Playing the original cut mp3 using the MPG321 Play mp3 to Wav SCAPE Web Service Workflow (see SO4 Audio mp3 to wav Migration and QA Workflow) used 11.7 seconds. The migrationQA SCAPE Web Service Wav File Comparison Workflow took 1.4 minutes. The result was also negative, but an inspection of the output showed that only the last chunk differed, which probably means that FFmpeg and MPG321 handled the cut-off differently.
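The duration mismatch that made the property comparison fail can be illustrated with a toy check. The function name and tolerance below are illustrative assumptions, not FFProbe's actual comparison logic:

```python
# Toy version of the duration check in the property-comparison step.
# The 1-second tolerance is an assumed threshold for illustration only.

def durations_close(orig_s: float, migrated_s: float, tol_s: float = 1.0) -> bool:
    """Accept the migration only if the durations agree within tol_s seconds."""
    return abs(orig_s - migrated_s) <= tol_s

# The stale header of the cut mp3 still claims the full two hours...
orig = 2 * 3600 + 2 * 60 + 5.23    # 2 h 2 min 5.23 s (from the header)
migrated = 13 * 60 + 6.38          # 13 min 6.38 s (actual migrated duration)
print(durations_close(orig, migrated))   # the comparison rightly rejects it
```

This also shows why the failure is not a tool bug: the artificially cut input genuinely disagrees with its own header, so a correct property comparison must flag it.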