
Markus Test of WIKI-template

Link to TB.WP4 SharePoint site: https://portal.ait.ac.at/sites/Scape/TB/TB.WP.4/default.aspx

Evaluation specs

Evaluator-ID: Evaluator1234
Evaluation description: Evaluate whether the results of the "experimental web content characterization workflows" are suitable to foster knowledge of how to deal with the provided web content data set: its processing, optimized workflow design, and tool usage / requirements.
Evaluation-Date: 09.02.2012
Platform-ID: Platform1234 **** (did not find an ID in the platform section)
Dataset-ID: Austrian National Library - Web Archive **** (found "IDs" for Issues and Solutions, but not for Datasets)
Workflow(s) involved: "scanARC" (internally published on the SCAPE SharePoint together with a deployment guide for local setup)
Tool(s) involved:
  Apache Tika 0.7
  DROID 6.0.1 by The National Archives (http://www.nationalarchives.gov.uk/)
  unARC tool by SB
  TIFOWA tool by ONB
  csv2TIFOWA tool by ONB
  MergeTifowaReports tool by ONB
Link(s) to Scenario(s): WCT4 Web Archive Mime-Type detection at Austrian National Library
Link(s) to relevant REF results / extracts / views: Currently not applicable.

Platform specs

This part should not be filled out for every evaluation but only once for each platform. A platform is an instance of the SCAPE platform, anything from a developer PC to the central SCAPE platform instance at IMF.

Platform-ID: Platform1234 **** (did not find an ID in the platform section)
Platform description: Local Test System
Number of nodes: 1
Total number of physical CPUs: 1
Total number of CPU cores: 2
Total amount of RAM: 2 GB
Operating system: Ubuntu Linux 11.10 (32-bit)
Workflow platform: Taverna Workbench 2.3.0

Dataset specs

This part should not be filled out for every evaluation but only once for each dataset. It's important to link evaluations to datasets, since we need to show progress from one evaluation to another during the project.

Dataset-ID: Austrian National Library - Web Archive **** (found "IDs" for Issues and Solutions, but not for Datasets)
Dataset description: The web archive data is available in the ARC.GZ format. The size of the data set is approximately 2 GB, split over 20 ARC.GZ files.
Number of distinct file formats: 1 file format (ARC.GZ)
For each distinct file format in the dataset:
  • Number of files: 20
  • Number of bytes: 2087.8 MB in total
  • Largest file: 129.6 MB
  • Smallest file: 40.9 MB
  • Avg. file size: 104.4 MB


**** I took ARC.GZ as the file format. Listing the distinct formats of the files inside the ARC.GZ containers would take several pages and would reflect the entire experiment outcome, but it could look like this (for a very small data set; a sketch of how such a distribution can be computed follows after this section):

TYPE                               COUNT    PERCENTAGE
image/jpeg                          1809     36.420376
text/html;charset=iso-8859-1         765     15.401651
image/gif                            748     15.059392
text/html                            560     11.274411
application/xhtml+xml                227      4.5701632
text/html;charset=utf-8              212      4.26817
text/plain                           168      3.3823233
image/png                            158      3.1809945
text/html;charset=windows-1252       138      2.778337
application/pdf                       48      0.9663781
text/html;charset=iso-8859-15         37      0.74491644
text/html;charset=windows-1251        31      0.62411916
application/x-shockwave-flash         29      0.5838534
text/html;charset=latin-1             10      0.20132877
application/rss+xml                    6      0.12079726
application/xml                        5      0.100664385
image/x-icon                           4      0.08053151
text/css;charset=utf-8                 3      0.06039863
text/html;charset=windows-1254         2      0.040265754
video/quicktime                        1      0.020132877
image/tiff                             1      0.020132877
audio/x-wav                            1      0.020132877
application/zip                        1      0.020132877
application/vnd.ms-powerpoint          1      0.020132877
application/octet-stream               1      0.020132877
application/msword                     1      0.020132877
Dataset owner: Not freely available. Owned by ONB.
Dataset rights: ONB.
Contact Person: Prändl-Zika Veronika (ONB).
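
The distribution above was produced with the project's TIFOWA toolchain. As an illustration only, here is a minimal sketch (not the project's actual code) of how such a MIME-type distribution could be tallied with the Apache Tika detector. It assumes the files have already been unpacked from their ARC.GZ containers (e.g. with the unARC tool) and that the org.apache.tika.Tika facade class is on the classpath; the class name and directory argument are hypothetical.

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.tika.Tika;

    // Hypothetical sketch: tally a MIME-type distribution over already-unpacked files.
    public class MimeTypeDistribution {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();                        // Tika facade detector
            Map<String, Integer> counts = new HashMap<String, Integer>();
            int total = 0;
            for (File f : new File(args[0]).listFiles()) { // args[0]: directory of unpacked files
                if (!f.isFile()) continue;
                String type = tika.detect(f);              // detect by file name and magic bytes
                Integer c = counts.get(type);
                counts.put(type, c == null ? 1 : c + 1);
                total++;
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                System.out.printf("%-40s %8d %12.6f%n",
                        e.getKey(), e.getValue(), 100.0 * e.getValue() / total);
            }
        }
    }

In the actual workflow, the per-container reports would still have to be merged (as the csv2TIFOWA and MergeTifowaReports tools do) before a combined distribution like the table above can be printed.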

Evaluation areas

1. Performance measures (automated as much as possible)

This area should generally evaluate how the workflow / component(s) / platform instance perform.

Speed: We need precise measures for this. **** Currently we do not have precise measures, only some "estimates" from Taverna's progress report tab (a measurement sketch follows after this table).
Overall runtime: 32 minutes.
Objects per second: 45 objects / second.

Technical measures: Can we have single measures for these kinds of things that cover an entire workflow (across multiple nodes)?
CPU usage: Currently not available. **** During workflow design / development we should have this per node (including the controllers, not only the workers) to verify that scaling works correctly.
RAM usage: Currently not available. **** During workflow design / development we should have this per node (including the controllers, not only the workers) to verify that scaling works correctly.
Network usage: Currently not available. **** We need this at fine granularity: worker node <=> controller, controller <=> storage (NAS, HDFS, ...), and worker node <=> storage (NAS, HDFS, ...), depending on the infrastructure design.
Disk I/O usage: Currently not available. **** During workflow design / development we should have this per node (including the controllers, not only the workers). NAS might become the bottleneck. How to measure this on HDFS?
**** Memory "paging" measures on OS level: To identify memory shortage. Might be easier to measure and more reliable than RAM usage in a very heterogeneous worker node environment.

Scalability: Not applicable for this experiment.
Number of objects processed: Not applicable for this experiment.
Size of objects: Not applicable for this experiment.
Complexity of objects: See the dataset section for this experiment.
Heterogeneity of dataset: See the dataset section for this experiment.

Robustness: No errors.
Number of objects "failed": 0
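
Since precise speed and paging measures are still missing, the following is a minimal sketch of how overall runtime, objects per second, and OS-level swap activity could be captured around a batch run on a single Linux node. The runBatch() placeholder is hypothetical, and reading /proc/vmstat is Linux-specific; per-node collection across a cluster would need an agent on every worker and controller.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Hypothetical sketch: wall-clock timing plus swap counters around a batch run.
    public class ThroughputProbe {

        // Read cumulative swap-in/swap-out page counters from /proc/vmstat (Linux only).
        static long[] swapCounters() throws IOException {
            long in = 0, out = 0;
            BufferedReader r = new BufferedReader(new FileReader("/proc/vmstat"));
            try {
                String line;
                while ((line = r.readLine()) != null) {
                    String[] kv = line.split("\\s+");
                    if (kv[0].equals("pswpin"))  in  = Long.parseLong(kv[1]);
                    if (kv[0].equals("pswpout")) out = Long.parseLong(kv[1]);
                }
            } finally {
                r.close();
            }
            return new long[] { in, out };
        }

        // Placeholder for the real workflow invocation; returns the number of objects processed.
        static int runBatch() {
            return 0;
        }

        public static void main(String[] args) throws Exception {
            long[] before = swapCounters();
            long start = System.nanoTime();

            int objects = runBatch();

            double seconds = (System.nanoTime() - start) / 1e9;
            long[] after = swapCounters();
            System.out.printf("runtime: %.1f s, throughput: %.1f objects/s%n",
                    seconds, objects / seconds);
            System.out.printf("pages swapped in/out during run: %d / %d%n",
                    after[0] - before[0], after[1] - before[1]);
        }
    }

A swap-counter delta of (near) zero indicates the run was not memory-starved, which is exactly the shortage signal the "paging measures" row above asks for.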

2. Manual Assessment (curators)

Do the workflow / component(s) actually do the job from a human point of view?

Element: Issue solved?
Description: The results can be used to better estimate the amount of storage needed during the processing phase of the extracted web content, the influence of the involved tools on performance, the error handling needed for robust operation, and the impact of workflow design (parallelization) on the performance of the workflows. The solution gives a very good view of the content-type distribution found in typical web archive containers.
Result: DONE

Element: Complexity of solution?
Description: The solution can be used by a single person on a single machine after implementing the workflow based on the provided deployment instructions.

3. SCAPE technical evaluation (TCC)

Do the workflow / component(s) comply with SCAPE technical standards?

Might be done as a kind of checklist?

http://wiki.opf-labs.org/display/SP/The+SCAPE+Functional+Review+Process

Code checked into SCAPE Git: pending
Solution documented on the WIKI: pending

4. Integration evaluation (EXL and/or others - e.g. other repository owners)

How well can the workflow / component(s) be integrated into real-life systems / scenarios like Rosetta?

This is also about industrial / commercial readiness.

This should be decided by a number of factors, including how easy the results are to take away, use, and productize.

Move tools into an existing product (commercial or otherwise)?

Evaluate whether the results of the "experimental web content characterization workflows" are suitable to foster knowledge of how to deal with the provided web content data set: its processing, optimized workflow design, and tool usage / requirements.
