
h1. Evaluation specs platform/system level

|| Field \\ || Datatype \\ || Value \\ || Description \\ ||
| Evaluation seq. num. \\ | int \\ | 1 \\ | For the first evaluation, leave this field at "1" \\ |
| Evaluator-ID | email | markus.raditsch@onb.ac.at \\ | Unique ID of the evaluator who carried out this specific evaluation. \\ |
| Evaluation description | text | The workflow has been implemented as a native Java map/reduce application. It uses the Apache Tika™ 1.0 API (detector call) to detect the MIME type of the input stream for each file inside the ARC.GZ container files. \\
To run over all items inside the ARC.GZ files, the native Java map/reduce program uses a custom RecordReader based on the Hadoop 0.20 API. The custom RecordReader enables the program to read the ARC.GZ files natively and to iterate over the archive file record by record (content file by content file). Each record is processed by a single map method call to detect its MIME type. \\
\\
Each test ARC.GZ file has a size of approximately 500 MB and contains around 30,000 files. \\
The 100 GB test sample (200 × 500 MB) is a subset of the original data set produced by a web crawler at the ONB. \\
\\
The output of the map/reduce program is a MIME type distribution list for the analyzed input, containing all identified MIME types plus the occurrence count for each identified MIME type. \\
\\
*Goal / Sub-goal:* \\
\\
Performance efficiency / Throughput \\
- The result has been measured in GB/min per platform. \\
\\
Reliability / Stability Indicators \\
- The processing application has been implemented as a Java map/reduce application (JAR). \\
- All needed components (program logic, Hadoop method implementations, dependencies, Apache Tika™ 1.0 JAR) are integrated. \\
- The result has been measured manually and is reflected as a boolean value (true = met the requirements). \\
\\
Reliability / Runtime stability \\
- Use the Hadoop admin interface to identify failed tasks. \\
- Use Hadoop's output to identify dropped records and any reported errors. \\
- The result has been measured as an integer value reflecting the number of identified runtime failures. | Textual description of the evaluation and the overall goals \\ |
| Evaluation-Date | DD/MM/YY | 28/08/12 \\ | Date of evaluation \\ |
| Platform-ID | string \\ | [Platform ONB 1|http://wiki.opf-labs.org/pages/viewpage.action?pageId=16714016] | Unique ID of the platform involved in the particular evaluation - see Platform page included below \\ |
| Dataset(s) | string \\ | 100GB sub set of [Austrian National Library - Web Archive|http://wiki.opf-labs.org/pages/viewpage.action?pageId=5701634] | Link to dataset page(s) on WIKI \\ |
| Workflow method | string \\ | Hadoop map/reduce application implemented in Java (JAR). | Taverna / Command line / Direct Hadoop etc. \\ |
| Workflow(s) involved \\ | URL(s) \\ | | Link(s) to MyExperiment *if applicable* \\ |
| Tool(s) involved \\ | URL(s) | Hadoop cluster, [tb-wc-hd-archd|https://github.com/openplanets/scape/tree/master/tb-wc-hd-archd], Apache Tika™ 1.0 API | Link(s) to distinct versions of specific components/tools in the component registry *if applicable* \\ |
| Link(s) to Scenario(s) | URL(s) \\ | [http://wiki.opf-labs.org/display/SP/WCT4+Web+Archive+Mime-Type+detection+at+Austrian+National+Library|http://wiki.opf-labs.org/pages/viewpage.action?pageId=12058871] | Link(s) to scenario(s) *if applicable* \\ |
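The MIME type distribution described above is essentially a word-count over detected MIME types: each map call detects one record's type, and the per-type occurrence counts are aggregated into the final list. A minimal, stdlib-only Java sketch of that aggregation step (class and method names are hypothetical, not taken from the actual tb-wc-hd-archd code, and the Tika detection itself is elided):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Aggregates MIME type occurrence counts, mirroring the reduce side of the
 * workflow: one add() per detected record, counts queried at the end.
 * Hypothetical sketch -- the real job runs as a Hadoop map/reduce application.
 */
public class MimeTypeDistribution {

    private final Map<String, Long> counts = new HashMap<>();

    /** Record one detected MIME type (as a single map() call would emit it). */
    public void add(String mimeType) {
        counts.merge(mimeType, 1L, Long::sum);
    }

    /** Occurrence count for one MIME type; 0 if never seen. */
    public long count(String mimeType) {
        return counts.getOrDefault(mimeType, 0L);
    }

    public static void main(String[] args) {
        MimeTypeDistribution dist = new MimeTypeDistribution();
        dist.add("text/html");
        dist.add("text/html");
        dist.add("image/png");
        // Print the distribution list, one "type count" line per MIME type
        dist.counts.forEach((type, n) -> System.out.println(type + " " + n));
    }
}
```

In the real application the counting happens across the cluster (map emits, reduce sums), but the resulting distribution list is the same shape as this in-memory map.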

\\
{include:Platform ONB 1}

h1. Evaluation points

Metrics must come from / be registered in the [metrics catalogue|Metrics Catalogue].

|| Metric || Baseline definition || Baseline value || Goal || Evaluation 1 (28/08/12) \\ || Evaluation 2 (date) \\ || Evaluation 3 (date) \\ ||
| ThroughputGbytesPerMinute | Virtual machine, Ubuntu Linux, 2 GB RAM, Core i5 2.5 GHz (single-processor VM configuration), Taverna Workbench workflow, Tika 0.7 in API mode. | 0.08 | 5 | 16.17 | | |
| ReliableAndStableAssessment | The workflow incorporates different technologies (script, JAR, Beanshell, Taverna, Unix tools), which makes it harder to implement reliable error handling (compared to a Java map/reduce implementation). | false \\ | true | true | | |
| NumberOfFailedFiles | n/a (much smaller data set) \\ | n/a | 0 | 0 | | |
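The throughput figures above can be cross-checked with simple arithmetic: at the measured 16.17 GB/min, the 100 GB sample implies roughly 6.2 minutes of processing, while the 0.08 GB/min baseline implies roughly 1,250 minutes (about 21 hours), a speed-up of about 200×. A small Java helper for such back-of-the-envelope checks (class and method names are hypothetical; the implied runtimes are derived here, not stated in the evaluation):

```java
/**
 * Back-of-the-envelope throughput arithmetic for the GB/min metric.
 * Hypothetical helper, not part of the evaluated workflow.
 */
public class Throughput {

    /** Throughput in GB/min, given gigabytes processed and elapsed minutes. */
    public static double gbPerMinute(double gigabytes, double minutes) {
        return gigabytes / minutes;
    }

    /** Elapsed minutes implied by a corpus size and a measured throughput. */
    public static double minutesFor(double gigabytes, double gbPerMin) {
        return gigabytes / gbPerMin;
    }

    public static void main(String[] args) {
        // 100 GB sample at the measured 16.17 GB/min -> roughly 6.2 minutes
        System.out.printf("cluster:  %.1f min%n", minutesFor(100, 16.17));
        // Same sample at the 0.08 GB/min baseline -> roughly 1250 minutes
        System.out.printf("baseline: %.1f min%n", minutesFor(100, 0.08));
    }
}
```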