Markus Test of WIKI-template
Link to the TB.WP4 SharePoint site: https://portal.ait.ac.at/sites/Scape/TB/TB.WP.4/default.aspx
Evaluation specs
Field | Description
---|---
Evaluator-ID | Evaluator1234
Evaluation description | Evaluate whether the results of the “experimental web content characterization workflows” are suitable to foster knowledge of how to deal with the provided web content data set, its processing, optimized workflow design, and tool usage / requirements.
Evaluation-Date | 09.02.2012
Platform-ID | Platform1234 (note: no ID found in the platform section)
Dataset-ID | Austrian National Library - Web Archive
Workflow(s) involved | "scanARC" (internally published on the SCAPE SharePoint together with a deployment guide for local setup)
Tool(s) involved | Apache Tika 0.7; DROID 6.0.1 by http://www.nationalarchives.gov.uk/; unARC tool by SB; TIFOWA tool by ONB; csv2TIFOWA tool by ONB; MergeTifowaReports tool by ONB
Link(s) to Scenario(s) | WCT4 Web Archive Mime-Type detection at Austrian National Library
Link(s) to relevant REF results / extracts / views | Currently not applicable.
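The scanARC workflow itself is only published internally, so its identification code cannot be shown here. Purely as an illustration of the per-object identification step such a workflow performs, here is a minimal sketch using the Apache Tika facade API (written against current Tika releases; the Tika 0.7 API used in the experiment may differ). The class name and directory layout are assumptions, not part of the workflow.

```java
import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;

// Illustrative sketch only, not the scanARC workflow code: detect the MIME
// type of every file in a directory of payloads that have already been
// unpacked from the ARC.GZ containers (e.g. by the unARC tool).
public class DetectMimeTypes {
    public static void main(String[] args) throws IOException {
        File payloadDir = new File(args.length > 0 ? args[0] : "unpacked-payloads");
        Tika tika = new Tika(); // facade over Tika's detectors
        File[] files = payloadDir.listFiles();
        if (files == null) {
            System.err.println("Not a directory: " + payloadDir);
            return;
        }
        for (File f : files) {
            if (f.isFile()) {
                // detect() combines the file name extension and the first
                // bytes of the content ("magic bytes") to guess a MIME type
                System.out.println(tika.detect(f) + "\t" + f.getName());
            }
        }
    }
}
```

The tab-separated "mime-type, file name" output is only a convenient shape for later aggregation; in the evaluated workflow, DROID 6.0.1 is used alongside Tika (see the tool list above).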
Platform specs
This part should not be filled out for every evaluation but "only" once per platform. A platform is an instance of the SCAPE platform and can be anything from a developer PC to the central SCAPE platform instance at IMF.
Field | Description
---|---
Platform-ID | Platform1234 (note: no ID found in the platform section)
Platform description | Local Test System
Number of nodes | 1
Total number of physical CPUs | 1
Total number of CPU-cores | 2
Total amount of RAM | 2 GB
Operating system | Ubuntu Linux 11.10 (32-bit)
Workflow Platform | Taverna Workbench 2.3.0
Dataset specs
This part should not be filled out for every evaluation but "only" once per dataset. It is important to link evaluations to datasets since we need to show progress from one evaluation to another during the project.
Field | Description
---|---
Dataset-ID | Austrian National Library - Web Archive
Dataset description | The web archive data is available in the ARC.GZ format. The size of the data set is approximately 2 GB, split over 20 ARC.GZ files.
Number of distinct file formats | 1 file format (ARC.GZ)
For each distinct file format in the dataset | 20 files in total; 2087.8 MB in total; largest file 129.6 MB; smallest file 40.9 MB. (Note: ARC.GZ is taken as the file format here. Listing the formats contained inside the ARC.GZ files would take several pages and would reflect the entire experiment outcome; for a very small data set it could look like the MIME-type distribution below.)
Dataset owner | Not freely available. Owned by ONB.
Dataset rights | ONB.
Contact Person | Prändl-Zika Veronika

MIME-type distribution for a very small sample data set:

TYPE | COUNT | PERCENTAGE
---|---|---
image/jpeg | 1809 | 36.420376
text/html;charset=iso-8859-1 | 765 | 15.401651
image/gif | 748 | 15.059392
text/html | 560 | 11.274411
application/xhtml+xml | 227 | 4.5701632
text/html;charset=utf-8 | 212 | 4.26817
text/plain | 168 | 3.3823233
image/png | 158 | 3.1809945
text/html;charset=windows-1252 | 138 | 2.778337
application/pdf | 48 | 0.9663781
text/html;charset=iso-8859-15 | 37 | 0.74491644
text/html;charset=windows-1251 | 31 | 0.62411916
application/x-shockwave-flash | 29 | 0.5838534
text/html;charset=latin-1 | 10 | 0.20132877
application/rss+xml | 6 | 0.12079726
application/xml | 5 | 0.100664385
image/x-icon | 4 | 0.08053151
text/css;charset=utf-8 | 3 | 0.06039863
text/html;charset=windows-1254 | 2 | 0.040265754
video/quicktime | 1 | 0.020132877
image/tiff | 1 | 0.020132877
audio/x-wav | 1 | 0.020132877
application/zip | 1 | 0.020132877
application/vnd.ms-powerpoint | 1 | 0.020132877
application/octet-stream | 1 | 0.020132877
application/msword | 1 | 0.020132877
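The TIFOWA / csv2TIFOWA / MergeTifowaReports tools that produced the distribution above are ONB-internal and not documented here. Purely to illustrate the aggregation step, the following sketch turns a list of detected MIME types (one per line, as printed by the detection sketch above) into a TYPE / COUNT / PERCENTAGE table of the same shape; the input file name and tab-separated format are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only, not the ONB reporting tools: aggregate
// one-detection-per-line output (e.g. "image/jpeg\t00001.dat") into the
// TYPE / COUNT / PERCENTAGE table shown in the dataset description.
public class MimeTypeDistribution {
    public static void main(String[] args) throws IOException {
        String input = args.length > 0 ? args[0] : "mime-types.tsv"; // assumed file name
        List<String> lines = Files.readAllLines(Paths.get(input));
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String type = line.split("\t")[0]; // MIME type is the first column
            counts.merge(type, 1, Integer::sum);
        }
        long total = lines.size();
        System.out.println("TYPE\tCOUNT\tPERCENTAGE");
        counts.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue()) // most frequent first
                .forEach(e -> System.out.printf("%s\t%d\t%f%n",
                        e.getKey(), e.getValue(), 100.0 * e.getValue() / total));
    }
}
```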
Evaluation areas
1. Performance measures (automated as much as possible)
This area should generally evaluate how the workflow / component(s) / platform instance perform.
Measure | Description | Goal | Result
---|---|---|---
Speed | We need precise measures for this. Currently we do not have precise measures, only some estimates from Taverna's progress report tab. | |
Overall runtime | | | 32 minutes
Objects per second | | | 45 objects / second
Technical measures | Can we have single measures for these kinds of things that cover an entire workflow (across multiple nodes)? | |
CPU-usage | During workflow design / development this should be available per node (including the controllers, not only the workers) to verify that scaling works correctly. | | Currently not available.
RAM-usage | During workflow design / development this should be available per node (including the controllers, not only the workers) to verify that scaling works correctly. | | Currently not available.
Network-usage | This needs to be fine-grained: worker node <=> controller, controller <=> storage (NAS, HDFS, ...), worker node <=> storage (NAS, HDFS, ...), depending on the infrastructure design. | | Currently not available.
Disk I/O usage | During workflow design / development this should be available per node (including the controllers, not only the workers). NAS might become the bottleneck. How can this be measured on HDFS? | | Currently not available.
Memory "paging" measures on OS level | To identify memory shortage. Might be easier to measure and more reliable than RAM usage in a very heterogeneous worker node environment. | |
Scalability | | | Not applicable for this experiment.
Number of objects processed | | | Not applicable for this experiment.
Size of objects | | | Not applicable for this experiment.
Complexity of objects | | | See the dataset specs for this experiment.
Heterogeneity of dataset | | | See the dataset specs for this experiment.
Robustness | | | No errors.
Number of objects "failed" | | | 0
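The speed figures above are rough estimates taken from Taverna's progress report tab rather than precise measurements (for reference, 45 objects per second over 32 minutes corresponds to roughly 86,000 processed objects). As a sketch of how a run could be timed outside Taverna to obtain reproducible overall-runtime and objects-per-second figures, the following hypothetical wrapper times an external command; the command line and the way the object count is obtained are assumptions. The per-node CPU, RAM, network, and disk figures marked "currently not available" could be collected with standard OS tools such as sar or vmstat running on each node during the workflow.

```java
import java.io.IOException;

// Illustrative sketch only: time an external characterization run and derive
// an objects-per-second figure. The command and the object count source are
// placeholders, not part of the published workflow.
public class ThroughputTimer {
    public static void main(String[] args) throws IOException, InterruptedException {
        long start = System.nanoTime();

        // Placeholder command: run the characterization step as an external
        // process, inheriting stdout/stderr so the tool output stays visible.
        Process p = new ProcessBuilder("sh", "-c", "./run-characterization.sh")
                .inheritIO()
                .start();
        int exitCode = p.waitFor();

        double seconds = (System.nanoTime() - start) / 1e9;
        // Placeholder: in practice, read the number of processed objects from
        // the tool's own report. 45 obj/s * 32 min is roughly 86,400 objects.
        long objectsProcessed = 86_400;
        System.out.printf("exit=%d  runtime=%.1f s  throughput=%.1f objects/s%n",
                exitCode, seconds, objectsProcessed / seconds);
    }
}
```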
2. Manual Assessment (curators)
Does the workflow / component(s) actually do the job from a human point of view?
Element | Description | Result
---|---|---
Issue solved? | The results can be used to better estimate the amount of storage needed during the processing phase of the extracted web content, the influence of the involved tools on performance, the error handling required for robust operation, and the impact of workflow design (parallelization) on the performance of the workflows. The solution gives a very good view of the content type distribution found in typical web archive containers. | DONE
Complexity of solution? | The solution can be used by a single person on a single machine after implementing the workflow based on the provided deployment instructions. |
3. SCAPE technical evaluation (TCC)
Does the workflow / component(s) comply with SCAPE technical standards?
This might be done as a kind of checklist.
See the SCAPE Functional Review Process: http://wiki.opf-labs.org/display/SP/The+SCAPE+Functional+Review+Process
Element | Description | Checked
---|---|---
Code checked into SCAPE Git | | pending
Solution documented on the WIKI | | pending
4. Integration evaluation (EXL and/or others - e.g. other repository owners)
How well can the workflow / component(s) be integrated into real-life systems / scenarios such as Rosetta?
This is also about industrial / commercial readiness.
It should be decided by a number of factors, including how easy the results are to take away, use, and productize.
Can the tools be moved into an existing product (commercial or otherwise)?
Evaluate whether the results of the “experimental web content characterization workflows” are suitable to foster knowledge of how to deal with the provided web content data set, its processing, optimized workflow design, and tool usage / requirements.