
h2. Evaluator(s)

[Bolette Jurik|https://portal.ait.ac.at/sites/Scape/_layouts/userdisp.aspx?ID=59] (SB)

h2. Evaluation points

In this testbed experiment we focus on performance. The earlier experiment [EVAL-LSDR6-1|SP:EVAL-LSDR6-1] on mp3 to wav migration and QA using xcorrSound also focused on correctness. Moving the workflow to Hadoop to prove scalability should not affect the correctness of the tool.
* *Scalability* The workflow must be able to process a large collection within reasonable time. That is, we want to be able to migrate and QA a large collection of radio broadcast mp3 files (20TB, 175,000 files) within weeks rather than years. The goal of 1000 for _NumberOfObjectsPerHour_ would mean that we can migrate the 20TB radio broadcast mp3 collection in about a week.
* *Reliability* The workflow must run reliably without failing on a large number of files, and it must be possible to restart the workflow without losing work.
* *Correctness/Scalability* We must trust to some extent that the automatic QA correctly identifies the "questionable" migrations, so that these can be checked in a manual QA process. We must, however, also insist that the number of migrations to check manually is minimal, as manual QA is a very resource-demanding process. The goal for _QAFalseDifferentPercent_ has been changed to 2%. This means that we would have to check 3,500 migrated 2-hour wav files manually (see the sketch below), which is already too resource demanding. However, the poor quality of the original files is a great challenge for the content comparison tool, and it turns out that even this is too much to ask\!
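
A minimal sketch of the arithmetic behind these two goals, using the collection figures above (variable names are ours):

{code}
# Sketch: arithmetic behind the scalability and correctness goals.
# Collection figures (175,000 files, goals) are from the text above.

collection_files = 175_000        # mp3 files in the radio broadcast collection
goal_objects_per_hour = 1000      # target NumberOfObjectsPerHour
qa_false_different_goal = 0.02    # target QAFalseDifferentPercent (2%)

hours = collection_files / goal_objects_per_hour
print(f"{hours:.0f} hours = {hours / 24:.1f} days")   # -> 175 hours = 7.3 days (about a week)

manual_checks = collection_files * qa_false_different_goal
print(f"{manual_checks:.0f} files to check manually") # -> 3500 two-hour wav files
{code}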

This table shows the results of the evaluation of the mp3 to wav migration and QA performed on the SB Hadoop Cluster.



h5. Assessment of measurable points

|| Metric || Description || Metric baseline || Metric goal || 2014 April 8th\* || 2014 June 17th-23rd\*\* ||
| NumberOfObjectsPerHour | *Performance efficiency - Capacity / Time behaviour* | 18 (9th-13th November 2012) | 1000 | 204 | 223 |
| QAFalseDifferentPercent | *Functional suitability - Correctness* | 0.412% (5th-9th November 2012) | 2% | | \~8.7 |


\*Based on the small-scale April experiment with max-split-size 128 (see below).

\*\*Based on the large-scale June experiment (see below).

The conclusion is that the workflow does scale\! We will not be able to migrate the collection in one week, but we will be able to do it in one month on the SB Hadoop cluster, which is considerably better than the one year needed without Hadoop (last evaluation). The conclusion on correctness is more of a discussion. The measure _QAFalseDifferentPercent_ is defined as the percentage of migrated files that the automatic QA reports as "different" although the migration is in fact correct; here it is computed as the number of comparisons reported "different" divided by the total number of migrated files.
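
In the tables below this works out as a simple ratio; a minimal sketch (counts are the June 2014 totals, the function name is ours):

{code}
# Sketch: QAFalseDifferentPercent as computed from the experiment tables below.
# 435 failed comparisons out of 4998 files are the June 2014 totals.

def qa_false_different_percent(failed, total):
    """Percentage of migrations the automatic QA flags as 'different'."""
    return 100.0 * failed / total

print(round(qa_false_different_percent(435, 4998), 1))  # -> 8.7
{code}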



h6. Small Experiments April 2014

All experiments were run on a file list of *58 files (7.2GB in total)*.

|| max-split-size (bytes) || duration || launched maps ||
| 1024 | 37m 58.593s = 2278.593s | 3, 3, 7 |
| 512 | 24m 1.9s = 1441.9s | 6, 6, 14 |
| 256 | 18m 17.917s = 1097.917s | 12, 12, 28 |
| 128 | 17m 3.176s = 1023.176s | 24, 24, 57 |
| 64 | 16m 54.703s = 1014.703s | 47, 47, 113 |
| 32 | 17m 29.96s = 1049.96s | 93, 93, 225 |
The small experiments were mainly run to decide an optimal max-split-size (or, equivalently, an optimal number of map-reduce map tasks). The exact number of MR map tasks seems not to have a big influence on performance, as long as there are more than 12, that is, as long as the max-split-size is at most 256 bytes on an input file list of 58 files. We note that we get approximately twice as many launched maps for the waveform-compare Hadoop job, simply because its input list is approximately twice as big, as it is a list of pairs. We could of course adjust this to get approximately the same number of maps, but it does not seem to be important for performance.
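
For reference, the April _NumberOfObjectsPerHour_ figure (204) in the assessment table above follows directly from the max-split-size 128 run (a minimal sketch, names are ours):

{code}
# Sketch: NumberOfObjectsPerHour from the April max-split-size 128 run.
files, seconds = 58, 1023.176
print(round(files / (seconds / 3600)))  # -> 204
{code}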

The large-scale experiments were held up for a while due to too few connections to storage. As this job writes a lot of data, the Isilon disk I/O and CPU use were being maxed out, even though we were trying to "play nice" and only run 24 maps concurrently. When the small-scale experiments were run, the number of connections to the 16-node Isilon storage solution at SB was two. It was increased to five connections before we ran the large-scale experiments.



h6. Large Scale Experiments June 2014

This line of tests focuses on scalability. If max-split-size=256 (bytes) gives us 2*12 maps on an input txt file listing 58 files, then 256 bytes corresponds to approximately 58/12 = 4.8333 lines, so one line is approximately 256/4.8333 = 52.9655 bytes. If we then want approximately 2*12 maps on an input txt file listing 1000 files, we want max-split-size to be approximately 1000/12 * 52.9655 = 4413.7931 bytes.
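
The same estimate as a small sketch (the measurements are from the April experiments above; variable names are ours):

{code}
# Sketch: deriving a max-split-size that yields ~12 maps per job,
# based on the April measurements. Variable names are illustrative.

files_small, maps_small, split_small = 58, 12, 256         # April: 256 bytes -> 12 maps
bytes_per_line = split_small / (files_small / maps_small)  # ~52.97 bytes per list line

files_large, maps_wanted = 1000, 12
print(round(files_large / maps_wanted * bytes_per_line))   # -> 4414 bytes
{code}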

The jobs were run on file lists of approximately 1000 mp3 files (each file around 100MB); max-split-size was set to 4414; and each job writes approximately 3.1TB of intermediate and output wav files (plus some small log files).
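
As a rough cross-check of the output volume, assuming the two \~1.4GB wav files per mp3 mentioned in the collection note further below (names are ours):

{code}
# Sketch: rough cross-check of the ~3.1TB per-job output volume.
# Two wav files of ~1.4GB per migrated mp3, per the collection note below.
files_per_job, wav_gbytes = 1000, 1.4
print(files_per_job * 2 * wav_gbytes / 1000, "TB")  # -> 2.8 TB in wav files alone
{code}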


|| date || #mp3s (batch) || #mp3s (total) || duration (batch) || duration (total) || NumberOfObjectsPerHour || failures (batch) || failures (total) || QAFalseDifferentPercent ||
| 2014 Jun 17 | 1000 | 1000 | 4h, 33m | 4h, 33m | 220 | 63 | 63 | 6.3 |
| 2014 Jun 18 | 1000 | 2000 | 4h, 23m | 8h, 56m | 224 | 111 | 174 | 8.7 |
| 2014 Jun 19 | 999 | 2999 | 4h, 20m | 13h, 29m | 222 | 52 | 226 | \~7.5 |
| 2014 Jun 20 | 1000 | 3999 | 4h, 27m | 17h, 56m | 223 | 142 | 368 | \~9.2 |
| 2014 Jun 23 | 999 | 4998 | 4h, 28m | 22h, 24m | 223 | 67 | 435 | \~8.7 |
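
Both _NumberOfObjectsPerHour_ and _QAFalseDifferentPercent_ in the table are computed over the running totals. A minimal sketch reproducing the reported figures (values are copied from the "total" columns; names are ours):

{code}
# Sketch: reproducing the running-total metrics of the June table.
# Rows are (#mp3s total, total duration in hours, failures total).

rows = [
    (1000, 4 + 33 / 60, 63),
    (2000, 8 + 56 / 60, 174),
    (2999, 13 + 29 / 60, 226),
    (3999, 17 + 56 / 60, 368),
    (4998, 22 + 24 / 60, 435),
]
for files, hours, failures in rows:
    print(f"{files:5d} files: {files / hours:.0f} objects/hour, "
          f"{100 * failures / files:.1f}% false different")
# Last row -> 4998 files: 223 objects/hour, 8.7% false different
{code}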



h5. Assessment of non-measurable points

_ReliableAndStableAssessment_ *Reliability - Runtime stability*


h5. A note about omitted goals/objectives, and why

This evaluation covers _performance_, _reliability_ and _functional suitability_ to some extent. We did not look at the metrics _MaxObjectSizeHandledInGbytes_ and _MinObjectSizeHandledInMbytes_. These measures would certainly contribute to the evaluation, but our collection ([Danish Radio broadcasts, mp3|SP:Danish Radio broadcasts, mp3]) consists of mp3 files varying very little in size (approx. 2 hours each, average file size 118MB, largest file 124MB), and the workflow thus produces wav files varying very little in size (around 2 x 1.4GB per mp3 file). The test mp3 files used during development were of course considerably smaller (around 7MB) and produced smaller output (around 2 x 50MB per mp3 file).

We also did not look at the metrics _ThroughputGbytesPerMinute_, _ThroughputGbytesPerHour_, or _AverageRuntimePerItemInHours_, though all of these could be computed fairly easily from the numbers above.
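
For instance, a sketch of such a computation from the June figures, assuming the input mp3 size (average 118MB) is the relevant object size; these are throughput-based collection averages, not per-file wall-clock times (names are ours):

{code}
# Sketch: deriving the omitted throughput metrics from the June figures.
# 223 objects/hour and the 118MB average mp3 size are from the text above.

objects_per_hour = 223
avg_mp3_gbytes = 0.118

gb_per_hour = objects_per_hour * avg_mp3_gbytes
print(f"ThroughputGbytesPerHour:      {gb_per_hour:.1f}")           # ~26.3
print(f"ThroughputGbytesPerMinute:    {gb_per_hour / 60:.2f}")      # ~0.44
print(f"AverageRuntimePerItemInHours: {1 / objects_per_hour:.4f}")  # ~0.0045
{code}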


WORK IN PROGRESS


The evaluation does not cover:
* Organisational maturity
* Maintainability
* Planning and monitoring efficiency
* Commercial readiness

h2. Technical details



h5. WebDAV

We would like to store sufficient information about an experiment (Hadoop program, configuration, etc.), so that we are able to rerun it. For this purpose, ONB is providing a WebDAV server - if you have questions and need more information, please contact Sven or Reinhard at ONB.
Taverna workflows will still be stored on [myexperiment.org|http://www.myexperiment.org].

Link: [http://fue.onb.ac.at/scape-tb-evaluation|http://fue.onb.ac.at/scape-tb-evaluation]

Please use the following structure for storing experiment results:

{code}
http://fue.onb.ac.at/scape-tb-evaluation/{institutionid}/{storyid}/{experimentid}/{timestamp}/

Example:
http://fue.onb.ac.at/scape-tb-evaluation/onb/arc2warc/jwat/1374526050/

where institutionid = onb, storyid = arc2warc, experimentid = jwat, timestamp = 1374526050
{code}
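
A minimal sketch of building such a result URL, following the structure above (the helper name is ours, not part of any SCAPE API):

{code}
# Sketch: constructing a WebDAV result URL following the agreed structure.
# The helper name is illustrative.

BASE = "http://fue.onb.ac.at/scape-tb-evaluation"

def result_url(institutionid, storyid, experimentid, timestamp):
    """Build BASE/{institutionid}/{storyid}/{experimentid}/{timestamp}/"""
    return f"{BASE}/{institutionid}/{storyid}/{experimentid}/{timestamp}/"

print(result_url("onb", "arc2warc", "jwat", 1374526050))
# -> http://fue.onb.ac.at/scape-tb-evaluation/onb/arc2warc/jwat/1374526050/
{code}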


h2. Evaluation notes


_QAFalseDifferentPercent_: the measured value (\~8.7% over the June runs) is well above the 2% goal. The poor quality of the original mp3 files is a great challenge for the xcorrSound waveform-compare tool, and manually checking this share of the 175,000-file collection would be far too resource demanding.