Current by Bolette Ammitzbøll Jurik
on Jul 28, 2014 10:17.


{toc}

h2. Evaluator(s)

h2. Evaluation points

h5. Assessment of measurable points
In this testbed experiment we focus on performance. The earlier experiment [EVAL-LSDR6-1|SP:EVAL-LSDR6-1] on mp3 to wav migration and QA using xcorrSound also focused on correctness. Moving the workflow to Hadoop to prove scalability should not affect the correctness of the tool.
* *Scalability* The workflow must be able to process a large collection within reasonable time. That is, we want to be able to migrate and QA a large collection of radio broadcast mp3 files (20TB, 175,000 files) within weeks rather than years. The goal of 1000 for _Number Of Objects Per Hour_ (or 0.28 for _number of objects per second_) would mean that we can migrate the 20TB radio broadcast mp3 collection in about a week.
* *Reliability* The workflow must run reliably without failing on a large number of files, and it must be possible to restart the workflow without losing work.
* *Correctness/Scalability* We must believe to some extent that the automatic QA correctly identifies the "questionable" migrations, such that these can be checked in a manual QA process. We must however also insist that the number of migrations to check manually is minimal, as this is a very resource-demanding process. The goal for _QAFalseDifferentPercent_ has been changed to 2%. This means that we would have to check 3500 migrated 2-hour wav files manually. This is already too resource demanding. However, the poor quality of the original files is a great challenge for the content comparison tool, and it turns out that even this is too much to ask\!
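
The manual QA workload implied by a given _QAFalseDifferentPercent_ can be checked with a few lines of Python. This is only an illustrative sketch: the function name is ours, and the collection size and the 2 hours of audio per file are taken from the text above.

```python
COLLECTION_FILES = 175_000  # files in the 20TB collection (from the text)
HOURS_PER_FILE = 2          # each migrated wav file holds about 2 hours of audio


def manual_checks(total_files: int, false_different_percent: float) -> int:
    """Number of migrations flagged for manual QA at a given percentage."""
    return round(total_files * false_different_percent / 100)


flagged = manual_checks(COLLECTION_FILES, 2.0)
print(flagged, "files to check manually")
print(flagged * HOURS_PER_FILE, "hours of audio to listen through")
```

At the 2% goal this gives the 3500 files mentioned above, that is about 7000 hours of audio.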

|| Metric || Description || Metric baseline || Metric goal || 2014 April 8th\* ||
| NumberOfObjectsPerHour | *Performance efficiency - Capacity / Time behaviour* | 18 (9th-13th November 2012) | 1000 | 204 \\ |
| NumberOfFailedFiles | *Reliability - Runtime stability* | 0 | 0 | 0 |
| QAFalseDifferentPercent | *Functional suitability - Correctness* | 0.412 % (5th-9th November 2012) | 0.412 % | 82.76 % \\ |
| | | | 2Gb | |
This table shows the results of the evaluation of the mp3 to wav migration and QA performed on the SB Hadoop Cluster.


\*Based on the small experiment with max split size 128 below. See explanation for the abysmal correctness score.

h3. Assessment of measurable points

h6. Small Experiments
|| Metric || Description || Metric baseline || Metric goal || Evaluation 2014 April 8th\* || Evaluation 2014 June 17th-23rd*\* \\ ||
| [number of objects per second|http://www.purl.org/DP/quality/measures#418] | *Performance efficiency - Capacity / Time behaviour* \\
Number of objects that can be processed per second | 0.005 | 0.28 \\ | 0.0567 | 0.0619 |
| Number Of Objects Per Hour**\* \\ | *Performance efficiency - Capacity / Time behaviour* \\
Number of objects that can be processed per hour | 18 (9th-13th November 2012) | 1000 | 204 \\ | 223 |
| [QAFalseDifferentPercent|http://ifs.tuwien.ac.at/dp/vocabulary/quality/measures#416] | *Functional suitability - Correctness* \\
Ratio of 'QA decided different'/'human judged same', \\
that is the ratio of content comparisons resulting in original and migrated different, \\
even though human evaluation defines original and migrated similar | 0.412 % (5th-9th November 2012) | 2% | \\ | \~8.7% |


\*Based on the small experiment from April with max split size 128 below.

\**Based on the large scale experiment from June below.

\***This measure is not defined in the [Metrics Catalogue|http://ifs.tuwien.ac.at/dp/vocabulary/quality/measures], but we have kept it as a more readable extra supplement to _number of objects per second_.
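
The two throughput rows in the table are the same measurement in different units. A minimal sketch of the conversion follows. The helper names are ours; the 58 files and 1023.176 seconds come from the April small experiment with max split size 128, and 0.0619 is the June value from the table.

```python
SECONDS_PER_HOUR = 3600


def objects_per_second(objects: int, seconds: float) -> float:
    """Throughput of one run, in objects per second."""
    return objects / seconds


def per_hour(per_second: float) -> float:
    """Convert objects per second to objects per hour."""
    return per_second * SECONDS_PER_HOUR


april = objects_per_second(58, 1023.176)
print(round(april, 4))            # 0.0567
print(round(per_hour(april)))     # 204
print(round(per_hour(0.0619)))    # 223
```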



h4. Discussion and Conclusion

The conclusion is that the workflow does scale\! We will not be able to migrate the collection in 1 week, but we will be able to do it in one month on the SB Hadoop cluster, which is considerably better than the one year needed without Hadoop (last evaluation).
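
The "one week", "one month" and "one year" figures follow directly from the throughput numbers. The sketch below assumes the workflow runs around the clock at a constant rate; the function name is ours.

```python
COLLECTION_FILES = 175_000  # files in the 20TB collection


def days_to_migrate(total_files: int, objects_per_hour: float) -> float:
    """Wall-clock days needed at a constant throughput, running 24 hours a day."""
    return total_files / objects_per_hour / 24


for label, rate in [("baseline 2012", 18), ("June 2014", 223), ("goal", 1000)]:
    print(f"{label}: {days_to_migrate(COLLECTION_FILES, rate):.1f} days")
```

This gives roughly 405 days for the baseline, 33 days for the June 2014 result and 7 days for the goal.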

The conclusion on correctness is really more of a discussion. The measure _QAFalseDifferentPercent_ is defined as "Ratio of 'QA decided different'/'human judged same', that is ratio of content comparisons resulting in original and migrated different, even though human evaluation define original and migrated similar". The large scale evaluations were performed on 0.5TB of input, comprising 4998 2-hour mp3 files or 416.5 days, that is over a year of audio, so we did not annotate the input. This means that \~8.7% is the share of files for which our audio content comparison tool reported that the content of the original and the migrated files was *not* similar. Some of these may be actual migration errors. We believe, however, that most of them are due to the poor quality of the original material. Some of the original mp3 files have long periods of "nothing recorded" or silence. Silence or "almost silence" is very difficult to compare, and the tool will report not similar on these files. A better output would probably be "too much silence to perform content comparison". For the correctness of the tool itself, I would instead like to refer to the correctness evaluation of the xcorrSound waveform-compare tool done last year, described in section _2.4 Correctness based Benchmarks and Validation Tests_ of [deliverable D11.2 Quality Assurance Workflow, Release 2 + Release Report|http://www.scape-project.eu/deliverable/d11-2-quality-assurance-workflow-release-2-release-report].


h4. Small Experiments April 2014

All experiments were run on a file list of *58 files (7.2GB in total)*.

|| max-split-size || duration || launched map tasks on the three Hadoop jobs || success || failure ||
| 1024 | 37m, 58.593s = 2278.593s | 3,3,7 | 18 | 40 |
| 512 | 24m, 1.9s = 1441.9s | 6,6,14 | 0 | 58 |
| 256 | 18m, 17.917s = 1097.917s | 12,12,28 | 0 | 58 |
| 128 | 17m, 3.176s = 1023.176s | 24,24,57 | 10 | 48 |
| 64 | 16m, 54.703s = 1014.703s | 47,47,113 | 0 | 58 |
| 32 | 17m, 29.96s = 1049.96s | 93,93,225 | 4 | 54 |
The big question is why we get so many failures. The answer is of course that the list of pairs of files to compare is wrong\! This list is created by Taverna beanshells, and we were missing a correct sort of the two output lists from the FFmpeg and mpg321 Hadoop jobs before combining them into a list of pairs as input to the waveform-compare Hadoop job. This has now been fixed.
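
The kind of fix described above can be illustrated in Python. This is not the actual Taverna beanshell code. It is a sketch under the assumption (ours) that both jobs write a wav file with the same base name into different directories, so that sorting both lists on the file name lines up the two migrations of each mp3. The paths in the usage example are hypothetical.

```python
from pathlib import PurePosixPath


def pair_outputs(ffmpeg_wavs, mpg321_wavs):
    """Pair the two migrations of each mp3 by sorting both lists on file name."""
    def name(path):
        return PurePosixPath(path).name

    first = sorted(ffmpeg_wavs, key=name)
    second = sorted(mpg321_wavs, key=name)
    # Refuse to build pairs if the two jobs did not produce the same set of files.
    if [name(p) for p in first] != [name(p) for p in second]:
        raise ValueError("the two output lists do not cover the same files")
    return list(zip(first, second))


# Hypothetical paths; the job outputs arrive in arbitrary order.
ffmpeg_out = ["/out/ffmpeg/b.wav", "/out/ffmpeg/a.wav"]
mpg321_out = ["/out/mpg321/a.wav", "/out/mpg321/b.wav"]
for pair in pair_outputs(ffmpeg_out, mpg321_out):
    print(pair)
```

Without the sort, a plain pairing by position would compare a.wav against b.wav, and the content comparison would report nearly every pair as different.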

The small experiments were mainly run to decide an optimal max-split-size (or an optimal number of map-reduce map tasks). A split is the part of an input file that one map task works on. Picking the appropriate size for the tasks for your job can radically change the performance of Hadoop. When working on text files, the default of letting the number of map tasks depend on the number of DFS blocks in the input files works well. Our case is different: the input to our map-reduce jobs is text files, but very small ones that only contain lists of paths to the audio files we actually want to work on. We thus want much smaller splits of only a few lines each. The max-split-size (*mapred.max.split.size*) is the maximum size of such a split in bytes.

The exact number of MR map tasks does not seem to have a big influence on performance, as long as we have at least 2*12. That is, as long as max split size is at most 256 on an input file list of 58 files. We note that we get approximately twice as many launched maps for the waveform-compare Hadoop job, simply because the input list is approximately twice as big, as it is a list of pairs. We could of course adjust this to get approximately the same number of map tasks, but as the two first jobs _FfmpegMigrate_ and _Mpg321Convert_ run simultaneously, and the _WaveformCompare_ job runs alone, we actually have approximately the same number of map tasks throughout the workflow.

The first line of tests was to decide on the expected optimal max split size.
The large scale experiments were then held up for a while, due to too few connections to storage. Remember we are using the [SP:SB Hadoop Platform]. As this job writes a great deal of data, the Isilon disk I/O and CPU use were being maxed out, even though we were trying to "play nice" and only run 28 maps concurrently. There were 2 connections to the 16-node Isilon storage solution at SB when the small scale experiments were run. This was increased to five connections before we ran the large scale experiments.



h4. Large Scale Experiments June 2014

This line of tests will focus on scalability. If max-split-size=256 (bytes) gives us 2*12 maps on an input-txt-file with 58 files, this means 256 bytes is approx. 58/12=4.8333 files, so one file entry (one line of the list) is approx. 256/4.8333=52.9655 bytes. Then if we want approx. 2*12 maps on an input-txt-file with 1000 files, we want max-split-size to be approx. 1000/12*52.9655 = 4413.7931 bytes.

The jobs were run on file-lists of approximately 1000 files (129GB); the max.split.size was set to 4414; and each job writes approximately 3.1TB of intermediate and output wav files (\+ some small log files).
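
The scaling of max.split.size from the small to the large experiments can be reproduced as follows. This is a sketch of the estimate described above, not of how Hadoop computes splits; the function names are ours.

```python
import math


def bytes_per_entry(max_split_size: int, files: int, maps: int) -> float:
    """Estimated size of one line in the input list, from an observed run."""
    entries_per_split = files / maps
    return max_split_size / entries_per_split


def split_size_for(files: int, maps: int, entry_bytes: float) -> int:
    """max.split.size (bytes) that should give about `maps` map tasks per job."""
    return math.ceil(files / maps * entry_bytes)


entry = bytes_per_entry(256, 58, 12)     # observed: 256 bytes gave 12 maps on 58 files
print(round(entry, 4))                   # 52.9655
print(split_size_for(1000, 12, entry))   # 4414
```

Rounding up rather than down is a choice made here so that the last split does not add an extra, nearly empty map task.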

|| date || #mp3s || total #mp3s (total size) || duration || total duration || NumberOfObjectsPerHour (cumulative) || failures || total failures || QAFalseDifferentPercent (cumulative) ||
| 2014 Jun 17 | 1000 | 1000 (129GB) | 4h, 33m | 4h, 33m | 220 \\ | 63 | 63 | 6.3 \\ |
| 2014 Jun 18 | 1000 | 2000 (258GB) | 4h, 23m | 8h, 56m | 224 \\ | 111 | 174 | 8.7 \\ |
| 2014 Jun 19 | 999 \\ | 2999 (387GB) | 4h, 20m | 13h, 29m | 222 \\ | 52 | 226 | \~7.5 \\ |
| 2014 Jun 20 | 1000 | 3999 (516GB) \\ | 4h, 27m | 17h, 56m | 223 \\ | 142 | 368 | \~9.2 \\ |
| 2014 Jun 23 | 999 | 4998 (645GB) | 4h, 28m | 22h, 24m | 223 \\ | 67 | 435 | \~8.7 \\ |
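
The throughput and failure columns in the table are cumulative, that is they are computed from the totals up to and including each day. A short sketch that recomputes them from the total columns (the data is copied from the table, the function name is ours):

```python
# (total files, total hours, total minutes, total failures), one tuple per table row
ROWS = [
    (1000, 4, 33, 63),
    (2000, 8, 56, 174),
    (2999, 13, 29, 226),
    (3999, 17, 56, 368),
    (4998, 22, 24, 435),
]


def summarise(files: int, hours: int, minutes: int, failures: int):
    """Return (objects per hour, percent reported as not similar) for the totals."""
    total_hours = hours + minutes / 60
    return round(files / total_hours), round(100 * failures / files, 1)


for row in ROWS:
    print(summarise(*row))
```

The last row gives the 223 objects per hour and about 8.7% used in the summary table.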


h3. Assessment of non-measurable points

In the last evaluation, we included _ReliableAndStableAssessment_ *{_}Reliability - Runtime stability{_}* in the evaluation points, and we wrote *true* both in goal and in baseline value (Manual assessment: the experiment performed reliably and stably for 13 days, but then Taverna failed with java.lang.OutOfMemoryError: Java heap space due to /tmp/ being filled up. All results were however saved, and the workflow could simply be restarted with a new starting point in the input list). This measure is not a part of the SCAPE metrics catalogue, but [stability judgement|http://purl.org/DP/quality/measures#108] is, and an evaluation follows here.


The experiment performed reliably and stably for around 4 hours. I will however note that this experiment was not focused on reliability, and all intermediate results are potentially lost if the workflow is killed. I will also note that we partitioned the input to the workflow, so it worked on only 1000 files at a time. This was done because the test environment had an upper limit on available storage, and the workflow produces approximately 3.1TB of output files for each 1000 input files. The workflow will fail if it does not have enough output storage. Working on only 1000 files at a time of course has the benefit that only 1000 results can be lost at a time, and as the workflow seems to run stably for this input size, it is reliable and stable in this configuration. Using this configuration however means that for a 20TB, 175000-file collection, I need 175 input files and a script that starts the workflow 175 times sequentially (and roughly .5 Petabyte of available storage).
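
The batch count and storage estimate can be recomputed with a small sketch. It assumes, as in the text, batches of 1000 files that each produce about 3.1TB of wav files, and that nothing is deleted between batches; the function name is ours.

```python
import math


def batch_plan(total_files: int, batch_size: int = 1000, tb_per_batch: float = 3.1):
    """Return (number of workflow runs, total output storage in TB)."""
    batches = math.ceil(total_files / batch_size)
    return batches, batches * tb_per_batch


runs, storage_tb = batch_plan(175_000)
print(runs, "runs,", round(storage_tb, 1), "TB of output")
```

This gives 175 runs and about 542.5TB, which is the "roughly .5 Petabyte" mentioned above.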



h3. A note about goals-objectives omitted, and why

This evaluation covers _performance_, _reliability_ and _functional suitability_ to some extent. We did not look at the metrics _[max object size handled in bytes|http://purl.org/DP/quality/measures#404]_ and _[min object size handled in bytes|http://purl.org/DP/quality/measures#405]_. These measures would certainly contribute to the evaluation. Our collection ([SP:Danish Radio broadcasts, mp3]) has mp3 files varying very little in size (approx. 2 hours, average file size 118MB, largest file: 135MB), and the workflow thus produces wav files varying very little in size (2 wav files of around 1.4GB for one 118MB mp3 file). The test mp3 files used during development were of course considerably smaller (around 7MB) and produced smaller output (around 50MB*2 per mp3 file). We think that the workflow can handle larger files as well, but this was not tested. We can report that for input, _min object size handled in bytes_ is around 7MB (7000000 bytes) and _max object size handled in bytes_ is around 135MB (135000000 bytes). For output, _min object size handled in bytes_ is around 50MB (50000000 bytes) and _max object size handled in bytes_ is around 1.4GB (1400000000 bytes). This would be an interesting measure to experiment further with.


We also did not look at the metric _[throughput in bytes per second|http://purl.org/DP/quality/measures#406]_. This measure can be computed from _number of objects per second_ or _Number Of Objects Per Hour_. The evaluation 2014 June 17th-23rd gave us _Number Of Objects Per Hour_=223. To compute throughput in bytes per second, we need the throughput size. Our question here is what throughput means. We wrote that the 1000 files in input were only approximately 129GB, but they produced 3.1TB of intermediate and output wav files. Half of this (1.55TB) is output, and we will use this as the throughput size. Then for _Number Of Objects Per Hour_=223, we get 1.55/1000*223 = 0.34565TB or 0.34565×1024 = 353.9456GB of throughput per hour, that is 353945600000 / 60 / 60 = 98318222 bytes or 98 MB of throughput per second.
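
The throughput calculation above can be written out step by step. The sketch follows the arithmetic in the text exactly, including its unit conventions (a factor of 1024 from TB to GB, then 10^9 bytes per GB); the function name is ours.

```python
def throughput_bytes_per_second(output_tb_per_1000_files: float,
                                objects_per_hour: float) -> float:
    """Output throughput in bytes per second, following the steps in the text."""
    tb_per_hour = output_tb_per_1000_files / 1000 * objects_per_hour
    gb_per_hour = tb_per_hour * 1024             # TB to GB, as in the text
    bytes_per_hour = gb_per_hour * 1_000_000_000 # GB to bytes, as in the text
    return bytes_per_hour / 60 / 60


rate = throughput_bytes_per_second(1.55, 223)
print(int(rate), "bytes per second")
print(round(rate / 1_000_000), "MB per second")
```

Counting all 3.1TB of intermediate and output files instead of only the output half would simply double the result.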


The evaluation does not cover
* Organisational maturity
* Commercial readiness

This experiment was focused on tools and platform and performance, and we will keep the evaluations to the specific experiment.


h2. Technical details


The workflow that was used is version 4 of the _Slim Migrate And QA mp3 to Wav Using Hadoop Jobs_ workflow available from [http://www.myexperiment.org/workflows/4080.html]

The Hadoop jobs that were used are from commit e1ec47d of the [https://github.com/statsbiblioteket/scape-audio-qa-experiments] project.

The waveform-compare tool that was used was from xcorrSound release v2.0.2 [https://github.com/openplanets/scape-xcorrsound/releases/tag/v2.0.2].

The ffmpeg used was version 0.10 Copyright (c) 2000-2012 the FFmpeg developers built on Mar  9 2012 09:32:12 with gcc 4.4.6 20110731 (Red Hat 4.4.6-3).

The mpg321 used was version 0.2.10. Copyright (C) 2001, 2002 Joe Drew.

The cluster set up that was used was the June 2014 version of the [SP:SB Hadoop Platform].

h3. WebDAV

The Taverna logs and outputs of the June experiment are stored on [http://fue.onb.ac.at/scape-tb-evaluation/sb/LargeScaleAudioMigration/Mp3ToWavMigrationOnHadoop/] along with the SB scape Hadoop Cluster map-reduce client configuration.



h2. Evaluation notes


* We have (stubbornly) kept the old measure _Number Of Objects Per Hour_ in our evaluation, as it is simply easier to read when the processing time is as long as in this experiment.
* QAFalseDifferentPercent was introduced as a measure when we were working on smaller annotated datasets. When we are working on large scale real-life datasets, it is problematic. A better idea would probably be to have a _Dissimilar in Percent_ measure along with a _Correctness judgement_ based on the _Dissimilar in Percent_ measure and on prior correctness evaluations on annotated data. We would then also need a discussion of the adequacy of the solution, taking into account the level of automation and the human resources still needed.
h2. Conclusion

The conclusion is that we are able to migrate our 20TB mp3 collection to wav including quality assurance in one month on the SB Hadoop Platform. We however need roughly .5 Petabyte available storage, which is not feasible, and we will not do this migration. The xcorrSound waveform-compare tool has proven robust and easy to integrate in a larger workflow, and we will continue maintenance and maybe further development on xcorrSound.