
Evaluation specs platform/system level

Evaluation seq. num.
For the first evaluation leave this field at "1"
Evaluator-ID email [email protected]
Unique ID of the evaluator that carried out this specific evaluation.
Evaluation description text The workflow has been implemented as a native Java map/reduce application. It uses the Apache Tika™ 1.0 API (detector call) to detect the MIME type of the input stream of each file inside the ARC.GZ container files.
To process all items inside the ARC.GZ files, the map/reduce program uses a custom RecordReader based on the Hadoop 0.20 API. The custom RecordReader enables the program to read the ARC.GZ files natively and to iterate over each archive file record by record (content file by content file). Each record is processed by a single map method call, which detects its MIME type.
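
The record-by-record iteration described above can be illustrated without Hadoop. The following is a minimal sketch, not the custom RecordReader used in the evaluation: it assumes an already decompressed ARC stream in which each record consists of a space-separated header line whose last field is the payload length, followed by the payload and a newline. The class and method names (ArcRecordSketch, readRecords) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ArcRecordSketch {

    /** One archive record: the URL from the header line plus the raw payload. */
    public static final class Record {
        public final String url;
        public final byte[] payload;

        Record(String url, byte[] payload) {
            this.url = url;
            this.payload = payload;
        }
    }

    // Reads bytes up to the next '\n'; returns null only at end of stream.
    private static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                return new String(buf.toByteArray(), StandardCharsets.ISO_8859_1);
            }
            buf.write(b);
        }
        return buf.size() == 0 ? null : new String(buf.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    /** Iterates over the stream record by record (assumed layout, see lead-in). */
    public static List<Record> readRecords(InputStream in) throws IOException {
        List<Record> records = new ArrayList<>();
        String header;
        while ((header = readLine(in)) != null) {
            if (header.trim().isEmpty()) {
                continue; // separator line between two records
            }
            String[] fields = header.trim().split(" ");
            int length = Integer.parseInt(fields[fields.length - 1]);
            byte[] payload = new byte[length];
            int off = 0;
            while (off < length) {
                int n = in.read(payload, off, length - off);
                if (n < 0) {
                    throw new EOFException("truncated record: " + fields[0]);
                }
                off += n;
            }
            records.add(new Record(fields[0], payload));
        }
        return records;
    }

    /** Convenience overload for in-memory data; wraps I/O errors as unchecked. */
    public static List<Record> readRecords(byte[] data) {
        try {
            return readRecords(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a real ARC.GZ file the input would first have to be decompressed, for example with java.util.zip.GZIPInputStream. In the Hadoop setting each record would be handed to one map call instead of being collected in a list.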

Each test ARC.GZ file has a size of approximately 500MB and is the container for around 30000 files.
The 100GB test sample (200 x 500MB) is a subset of the original data set produced by a web crawler at the ONB.

Output of the map/reduce program is a MIME type distribution list of the analyzed input, containing all identified MIME types plus the occurrence count for each identified MIME type.
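
How such a MIME type distribution list comes about can be sketched in plain Java. This is only an illustration under stated assumptions: the detect method is a hypothetical stand-in that checks a few magic numbers and does not reproduce the Apache Tika detector, and the map and reduce phases are collapsed into a single in-memory loop. All names (MimeDistributionSketch, detect, distribution) are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MimeDistributionSketch {

    // Hypothetical stand-in for the Tika detector call: a few magic numbers only.
    static String detect(byte[] data) {
        if (startsWith(data, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        if (startsWith(data, "%PDF".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(data, "GIF8".getBytes(StandardCharsets.US_ASCII))) {
            return "image/gif";
        }
        return "application/octet-stream"; // fallback for unknown content
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * "Map" step: one detect call per record yields a MIME type.
     * "Reduce" step: occurrences are summed per MIME type.
     */
    public static Map<String, Long> distribution(List<byte[]> records) {
        Map<String, Long> counts = new TreeMap<>();
        for (byte[] record : records) {
            counts.merge(detect(record), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<byte[]> records = Arrays.asList(
                "%PDF-1.4".getBytes(StandardCharsets.US_ASCII),
                "GIF89a".getBytes(StandardCharsets.US_ASCII),
                "%PDF-1.7".getBytes(StandardCharsets.US_ASCII));
        // Prints one line per MIME type with its occurrence count.
        distribution(records).forEach((mime, n) -> System.out.println(mime + "\t" + n));
    }
}
```

In the actual application the per-record detection would run in the mappers and the summation in the reducer. The sketch keeps both in one method so that it stays self-contained.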

Goal / Sub-goal:

Performance efficiency / Throughput
) The result has been measured as GB/min/platform

Reliability / Stability Indicators
) The processing application has been implemented as a JAVA JAR map / reduce application
) All needed components (program logic, Hadoop method implementations, dependencies, Apache Tika™ 1.0 JAR) are integrated
) The result has been measured "manually" and reflected as a boolean value (true = met the requirements)

Reliability / Runtime stability
) Use Hadoop admin interface to identify failed tasks.
) Use Hadoop's output to identify dropped records and any reported errors.
) The result has been measured as an integer value reflecting the number of identified runtime failures.
Textual description of the evaluation and the overall goals
Evaluation-Date DD/MM/YY 28/08/12
Date of evaluation
Platform-ID string
Platform ONB 1 Unique ID of the platform involved in the particular evaluation - see Platform page included below
Dataset(s) string
100GB subset of Austrian National Library - Web Archive Link to dataset page(s) on WIKI
Workflow method string
Hadoop map / reduce application implemented in JAVA (jar). Taverna / Commandline / Direct hadoop etc...
Workflow(s) involved
  Link(s) to MyExperiment if applicable
Tool(s) involved
URL(s) Hadoop cluster, tb-wc-hd-archd, Apache Tika™ 1.0 API Link(s) to distinct versions of specific components/tools in the component registry if applicable
Link(s) to Scenario(s) URL(s) Link(s) to scenario(s) if applicable

Platform ONB 1

Platform-ID String ONB 1
Unique string that identifies this specific platform.
Use the platform name
Platform description String Experimental cluster (setup 06.2012).
Cloudera CDH3u5.
8 (HT) cores per node. Using max. 7 cores for map / reduce slots (one for the OS).
Map / reduce slots ratio 6 / 1.
Human readable description of the platform. Where is it located, contact info, etc.
Number of nodes integer 5 Number of hosts involved - could be both physical hosts as well as virtual hosts
Total number of physical CPUs integer 5 Number of CPUs involved
CPU specs string Xeon [email protected] Quadcore CPU Specification of CPUs
Total number of CPU-cores integer 40 Cores (5 * 8 Cores)
Number of CPU-cores involved (4 physical Cores + 4 HT Cores = 8 Cores)
Total amount of RAM in Gbytes
integer 80GB (5 * 16GB)
Total amount of RAM across all nodes
average CPU-cores for nodes
integer 8 Cores
Number of CPU-cores in average across all nodes
average RAM in Gbytes for nodes
integer 16 GB
Amount of memory in average across all nodes
Operating System on nodes
String Ubuntu 10.04.04 LTS (64bit)
Linux (specific distribution), Windows (specific distribution), other?
Storage system/layer String HDFS
NFS, HDFS, local files, ?
Disk subsystem
2 x 1TB DISKs; configured as RAID0 => 2TB effective disk space Disk subsystem on each node
HDFS replication factor
Network layer between nodes String The CONTROLLER and the NODEs are connected to a GBit high performance network switch (guarantees the full GBit performance for each port). Speed of network interfaces, general network speed
Controller: CPU specs String 2 x Xeon [email protected] Quadcore CPU
Controller: RAM integer
24 GB
Controller: Disk subsystem String 3 x 1TB DISKs; configured as RAID5 => 2TB effective disk space
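
The cluster-wide capacity implied by the platform description can be worked out with a few multiplications. The sketch below assumes that the "6 / 1" ratio means 6 map slots and 1 reduce slot on each of the 5 nodes. This reading is an interpretation of the description, not a value taken from the cluster configuration, and the class name is hypothetical.

```java
public class ClusterCapacitySketch {

    /** Total number of slots (or cores, or GB of RAM) across identical nodes. */
    public static int total(int nodes, int perNode) {
        return nodes * perNode;
    }

    public static void main(String[] args) {
        int nodes = 5;              // from the platform description
        int coresPerNode = 8;       // 4 physical + 4 HT cores
        int ramPerNodeGb = 16;
        int mapSlotsPerNode = 6;    // assumption: "6 / 1" is map / reduce slots per node
        int reduceSlotsPerNode = 1; // assumption, see above

        System.out.println("cores:        " + total(nodes, coresPerNode));
        System.out.println("RAM (GB):     " + total(nodes, ramPerNodeGb));
        System.out.println("map slots:    " + total(nodes, mapSlotsPerNode));
        System.out.println("reduce slots: " + total(nodes, reduceSlotsPerNode));
    }
}
```

Under this assumption at most 30 map tasks and 5 reduce tasks would run concurrently, with one core per node left to the operating system.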

Evaluation points

metrics must come from / be registered in the metrics catalogue

Metric | Baseline definition | Baseline value | Goal | Evaluation 1 (28/08/12) | Evaluation 2 (date) | Evaluation 3 (date)
ThroughputGbytesPerMinute | Virtual machine, Ubuntu Linux, 2GB RAM, Core i5 2,5GHz (single Processor VM configuration), Taverna Workbench workflow, TIKA 0.7 in API mode. | 0,08 | 5 | 16,17 | |
ReliableAndStableAssessment | The workflow incorporates different technologies (script, jar, beanshell, Taverna, unix tools) which makes it hard(er) to implement a reliable error handling (compared to a Java map/reduce implementation). | false | true | true | |
NumberOfFailedFiles | n/a (much smaller data set) | n/a | 0 | 0 | |
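
The throughput values reported above can be related to each other with simple arithmetic. The following sketch assumes that the test sample is exactly 100 GB. It derives the implied wall-clock time and the ratio between the measured value and the baseline. The class and method names are hypothetical, and the ratio is not a like-for-like comparison, since the baseline used a different platform, a much smaller data set and TIKA 0.7.

```java
public class ThroughputSketch {

    /** Throughput in GB/min for a given data volume and run time. */
    public static double gbPerMinute(double gigabytes, double minutes) {
        return gigabytes / minutes;
    }

    /** Run time in minutes needed for a data volume at a given throughput. */
    public static double minutesFor(double gigabytes, double gbPerMinute) {
        return gigabytes / gbPerMinute;
    }

    /** Ratio between a measured throughput and a baseline throughput. */
    public static double ratio(double measured, double baseline) {
        return measured / baseline;
    }

    public static void main(String[] args) {
        double sampleGb = 100.0;  // assumption: sample size taken as exactly 100 GB
        double baseline = 0.08;   // GB/min, baseline value from the table
        double goal = 5.0;        // GB/min, goal from the table
        double measured = 16.17;  // GB/min, Evaluation 1

        System.out.printf("implied run time: %.2f min%n", minutesFor(sampleGb, measured));
        System.out.printf("measured / baseline: %.1f%n", ratio(measured, baseline));
        System.out.printf("measured / goal: %.2f%n", ratio(measured, goal));
    }
}
```

With these inputs the implied run time for the sample is a little over six minutes, and the measured throughput is roughly 200 times the baseline value.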