
h1. Evaluation specs platform/system level

|| Field \\ || Datatype \\ || Value \\ || Description \\ ||
| Evaluation seq. num. \\ | int \\ | 1 \\ | For the first evaluation, leave this field at "1" \\ |
| Evaluator-ID | email | [email protected] | Unique ID of the evaluator who carried out this specific evaluation. \\ |
| Evaluation description | text | The web archiving team at the Austrian National Library produces information about the content of a web archive during the harvesting process. The result is stored as huge text files. \\
Each line of the log holds the metadata of one object. \\
The application reads the file content line by line, extracts the MIME type (“item 10 (Subtype)”) and counts all occurrences. A minimal sketch of such a job is shown below the table. \\
\\
*Goal / Sub-goal:* \\
\\
Performance efficiency / Throughput \\
- Processing these text files is very time-consuming and requires parallelised processing \\
- The workflow uses text files produced by the web crawler \\
- The Hadoop split size was 64 MB (the default value) \\
- The result was measured in GB/min per platform \\
\\
Reliability / Stability indicators \\
- No external tools are used \\
- The processing application is implemented as a Java map/reduce application packaged as a JAR \\
- All needed components (program logic, Hadoop method implementations, dependencies) are integrated \\
- The result was assessed "manually" and recorded as a boolean value (true = met the requirements) \\
\\
Reliability / Runtime stability \\
- Use the Hadoop admin interface to identify failed tasks. \\
- Use Hadoop's output to identify dropped records and any reported errors. \\
- The result was measured as an integer value reflecting the number of identified runtime failures. \\ | Textual description of the evaluation and the overall goals \\ |
| Evaluation-Date | DD/MM/YY | 20/08/12 | Date of evaluation \\ |
| Platform-ID | string \\ | [Platform ONB 1|http://wiki.opf-labs.org/pages/viewpage.action?pageId=16714016] | Unique ID of the platform involved in the particular evaluation - see Platform page included below \\ |
| Dataset(s) | string \\ | [Austrian National Library - Web Archive|http://wiki.opf-labs.org/pages/viewpage.action?pageId=5701634]\\ | Link to dataset page(s) on WIKI \\ |
| Workflow method | string \\ | Hadoop map/reduce application implemented in Java (JAR). \\ | Taverna / command line / direct Hadoop etc... \\ |
| Workflow(s) involved \\ | URL(s) \\ | | Link(s) to MyExperiment *if applicable* \\ |
| Tool(s) involved \\ | URL(s) | | Link(s) to distinct versions of specific components/tools in the component registry *if applicable* \\ |
| Link(s) to Scenario(s) | URL(s) \\ | [http://wiki.opf-labs.org/display/SP/WCT8+Huge+text+file+analysis+using+hadoop|http://wiki.opf-labs.org/pages/viewpage.action?pageId=12059899]\\ | Link(s) to scenario(s) *if applicable* \\ |
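For illustration, a minimal sketch of a Hadoop map/reduce job of this kind, written against the new {{org.apache.hadoop.mapreduce}} API common in 2012-era Hadoop. This is not the actual ONB implementation: the class names, the {{MALFORMED}} counter and the assumed field layout (whitespace-separated log fields with the MIME subtype as field 10) are assumptions for the sketch. It also shows how dropped records can be surfaced as a counter, which is the kind of number the runtime-stability metric below relies on.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MimeTypeCount {

    /** Hypothetical application counter for dropped records;
     *  appears next to Hadoop's built-in counters in the admin UI. */
    public enum Records { MALFORMED }

    public static class MimeMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        private static final LongWritable ONE = new LongWritable(1);
        private final Text mimeType = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumption: whitespace-separated log fields with the
            // MIME subtype as field 10 (0-based index 9).
            String[] fields = line.toString().split("\\s+");
            if (fields.length >= 10) {
                mimeType.set(fields[9]);
                context.write(mimeType, ONE);
            } else {
                // Count dropped records instead of silently ignoring them.
                context.getCounter(Records.MALFORMED).increment(1);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {

        private final LongWritable total = new LongWritable();

        @Override
        protected void reduce(Text mimeType, Iterable<LongWritable> counts,
                Context context) throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            total.set(sum);
            context.write(mimeType, total);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "mime type count");
        job.setJarByClass(MimeTypeCount.class);
        job.setMapperClass(MimeMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean success = job.waitForCompletion(true);

        // Runtime stability: read the number of dropped records back from
        // the job counters; failed tasks are additionally listed per job
        // in the JobTracker admin web interface.
        long dropped = job.getCounters()
                          .findCounter(Records.MALFORMED).getValue();
        System.out.println("dropped records: " + dropped);
        System.exit(success ? 0 : 1);
    }
}
{code}

Such a job would be run as, e.g., {{hadoop jar mimetypecount.jar MimeTypeCount <input dir> <output dir>}}; the per-MIME-type counts end up in the output directory.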

\\
{include:Platform ONB 1}

h1. Evaluation points

Metrics must come from / be registered in the [metrics catalogue|Metrics Catalogue].

|| Metric || Baseline definition \\ || Baseline value \\ || Goal || Evaluation 1 (20/8/2012) \\ || Evaluation 2 (date) \\ || Evaluation 3 (date) \\ ||
| ThroughputGbytesPerMinute \\ | Serial processing using Bash scripts, Unix tools and self-written Java helper tools on a quad-core 2.66 GHz processor. | 0.35 \\ | 5 \\ | 11.93 \\ | | |
| ReliableAndStableAssessment \\ | The baseline workflow incorporates different technologies (scripts, JARs, Unix tools), which makes it harder to implement reliable error handling than in a single Java map/reduce implementation. \\ | false \\ | true \\ | true \\ | | |
| NumberOfFailedFiles | A failure on the single input file can be monitored. \\ | 0 \\ | 0 \\ | 0 \\ | | |
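For a sense of scale (illustrative arithmetic only; 1 TB is a hypothetical input size, not the actual dataset): at the baseline throughput of 0.35 GB/min, 1,000 GB would take about 2,860 minutes (roughly 48 hours), whereas at the measured 11.93 GB/min it would take about 84 minutes, a speedup of roughly 34x (11.93 / 0.35 ≈ 34). With the default 64 MB split size, the same 1 TB input is divided into roughly 16,000 splits, i.e. map tasks that Hadoop can schedule in parallel across the cluster.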