Asger Test of WIKI-template
Link to the TB.WP4 SharePoint site: https://portal.ait.ac.at/sites/Scape/TB/TB.WP.4/default.aspx
Evaluation specs
Field | Datatype | Description | Value |
---|---|---|---|
Evaluator-ID | | Unique ID of the evaluator who carried out this specific evaluation. Need system for identifying unique persons. Evaluators might need to be registered somewhere | [email protected] |
Evaluation description | text | Textual description of the evaluation and the overall goals | Evaluate the precision and speed of Tika |
Evaluation-Date | DD/MM/YY | Date of evaluation | 8/2/12 |
Platform-ID | | Unique ID of the platform involved in the particular evaluation - see Platform section | |
Dataset-ID | | Unique ID of the dataset involved in this evaluation - see Dataset section | |
Workflow(s) involved | | Link(s) to MyExperiment if applicable | |
Tool(s) involved | URL | Link(s) to tools if applicable. We most likely need a unique way of registering tools (in distinct versions?) to be able to compare results between tools and/or different versions of the same tool | http://wiki.opf-labs.org/display/TR/Tika released version 1.0 |
Link(s) to Scenario(s) | | Links to scenario(s) if applicable | |
Link(s) to relevant REF results / extracts / views | | Links to relevant REF results / extracts / views. REF will most likely need an interface to query and visualise results | |
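To make the fields above easier to compare between evaluations, they could be captured in a small machine-readable record. The following is only a sketch: the class name `EvaluationSpec`, its attribute names, the placeholder evaluator ID, and the helper `parse_evaluation_date` are hypothetical and not part of the template. It assumes dates follow the DD/MM/YY pattern given in the Datatype column.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import List


@dataclass
class EvaluationSpec:
    """Hypothetical record mirroring the evaluation spec fields."""
    evaluator_id: str
    description: str
    evaluation_date: date
    platform_id: str = ""   # empty until a platform registry exists
    dataset_id: str = ""    # empty until a dataset registry exists
    workflows: List[str] = field(default_factory=list)
    tools: List[str] = field(default_factory=list)


def parse_evaluation_date(text: str) -> date:
    """Parse a DD/MM/YY string such as '8/2/12' into a date."""
    return datetime.strptime(text, "%d/%m/%y").date()


if __name__ == "__main__":
    spec = EvaluationSpec(
        evaluator_id="evaluator-1",  # placeholder, not a real ID
        description="Evaluate the precision and speed of Tika",
        evaluation_date=parse_evaluation_date("8/2/12"),
        tools=["http://wiki.opf-labs.org/display/TR/Tika"],
    )
    print(spec)
```

Storing the date as a `date` object rather than as text avoids ambiguity between day-first and month-first notations when evaluations are compared later.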
Platform specs
This part should not be filled out for every evaluation but "only" once for each platform. A platform is any instance of the SCAPE platform, anything from a developer PC to the central SCAPE platform instance at IMF.
Field | Datatype | Description | Value |
---|---|---|---|
Platform-ID | | Unique string that identifies this specific platform. Need system to avoid clashes between instances | |
Platform description | text | Human readable description of the platform. Where is it located, contact info, etc. | State and University Library Blade server Array Iapetus |
Number of nodes | number | Number of hosts involved | 1 |
Number of CPU-cores in total | number | Number of CPU-cores in total | 12 |
CPU specs for each node | text | CPU make and model for each node | 2 of these in each node: Intel® Xeon® Processor X5670 (12M Cache, 2.93 GHz, 6.40 GT/s Intel® QPI) |
Memory spec for each node | text | Memory speed and size for each node | TODO. About 70GB for each node |
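The Platform-ID field asks for a unique string and a way to avoid clashes between instances. One possible approach, sketched here purely as an assumption (the template does not specify a scheme), is to generate a random UUID for each platform instance. The function name `new_platform_id` and the `platform` prefix are hypothetical.

```python
import uuid


def new_platform_id(prefix: str = "platform") -> str:
    """Return a hypothetical ID of the form '<prefix>-<random UUID>'."""
    # uuid4 is random, so independently registered platforms are very
    # unlikely to collide even without a central registry.
    return f"{prefix}-{uuid.uuid4()}"


if __name__ == "__main__":
    print(new_platform_id())
```

The same idea could apply to Evaluator-ID and Dataset-ID, although a central registry may still be preferable if human-readable IDs are wanted.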
Dataset specs
This part should not be filled out for every evaluation but "only" once for each dataset. It is important to link evaluations to datasets since we need to show progress from one evaluation to another during the project.
Field | Datatype | Description | Value |
---|---|---|---|
Dataset-ID | | Unique ID of a dataset. Needs a way to register datasets with this unique ID | |
Dataset description | text | Textual description of the dataset | http://digitalcorpora.org/corpora/files The 1000000 files dataset. Use this for description: http://digitalcorpora.org/corpora/files/govdocs1-simple-statistical-report |
Number of distinct file formats | number | e.g. based on PRONOM-ID | 213 |
Number of files in the dataset | number | | 986278 |
Total size of the dataset | size | | 466 GB |
For each distinct file format in the dataset | | Distinct numbers for each file format represented in the dataset. THIS WILL BE UTTERLY PAINFUL TO COMPLETE FOR HETEROGENEOUS DATASETS | |
Number of files for each filetype | list | Text File 88856 Graphics Interchange Format 36302 MS Windows Bitmap 72 PK Zip Archive 14 Targa Bitmap Image 1 DOS Batch File 3 AutoCAD Drawing 2 BASIC Script/Source Code 19 Gzip Unix Archive 14021 Arc Archive 0 Zoo Compressed Archive 1 Help File 7 dBase III/III+/IV/FoxBase+/FoxPro Database 2662 Dr. Halo Picture 0 AutoCAD Drawing Exchange (ASCII) 39 Lotus 123 Ver. 1 & 2 Worksheet 7 Lotus 123 Ver. 3 & 4 Worksheet 2 Lotus 123 Ver. 1 Worksheet 1 MS Excel Worksheet/Template (OLE) 62876 Encapsulated PostScript Preview 417 MS Works for DOS Document 1 Ventura Publisher File 2 WordPerfect Document 364 WordPerfect Support File 3 MS PowerPoint Slides/Add-on (OLE) 51292 Encapsulated PostScript Document 5079 X11 BitMap 7 JPEG File Interchange File 109283 HP Printer Control Language File 1 MS Excel Workspace/Workbook 1 ACT! 2.0 Report 1 MS Word for Macintosh Document 138 MS Word for DOS/Macintosh Document 54 MS Word for Windows Document (OLE) 76605 MS Windows MetaFile (placeable) 1 MS Windows 3.x Logo 1 Adobe Portable Document Format 231106 BinHex Archive 4 MS Rich Text Format Document 1067 MS Compound Document (OLE) 375 MS Windows Policy 2 Adobe PostScript Document 20630 MS Windows Shortcut/Link 2 HyperText Markup Language 202440 Tag Image File Format (Intel) 31 EXtensible Markup Language 32186 Adobe PhotoShop Image 2 Java Script Source Code File 379 Source Code Make File 59 C/C++ Source Code File 173 Printer Job Language Image 20 Adobe PostScript Document (PJL) 431 MS Developer Studio Project 2 Virtual Reality World (Binary) 12 Cascading Style Sheet 157 MS Visual C++ Resource Script 2 Java Source Code File 235 Shockwave Flash Object 3473 Adobe Illustrator Drawing 453 ANSI Text File 2 Active Server Page 149 Comma Separated Values Text File 18717 Setup Information 198 Initialization File 133 Printer Separator Page 25 Adobe Linguistics File 1 UHArc Compressed Archive 1 HTML + XML Namespace 4416 MS Visual Basic Class Module 1 Evolution 
Email Message 2 Horde Internet Messaging Program (IMP) Email Message 1 Mutt Email Message 1 EXtensible Style Language 10 MS Office Outlook 2003 Email Message 1 Microsoft Outlook 2000 IMO Email Message 6 MS Outlook Express Email Message 12 Pine Email Message 4 MS Office Macro Reference (OLE) 2 Common Gateway Interface Script 34 Code Page Translation File 1 MS Visual Studio Properties 2 Wise Installer Log 2 Log File (Unknown Source) 118 SGML Document Type Definition 255 AutoDesk Web Graphics Image 299 Internet Message 1251 MS PowerPoint Slides (XML) 115 Fractal Image File 3 Flexible Image Transport System Bitmap 1057 FrameMaker Document 28 GenePix Array List 146 MS Excel Graph 3 ISO 9660 CD-ROM Image (Data Mode 1) 1 Open Inventor 3d Scene (ASCII) 2 HP Printer Control Language Image (PJL) 3 Berkeley UNIX Mailbox Format 555 Eudora Mailbox 1 Monarch Graphic Image 1 Object Oriented Graphics Library: Quadrilaterals (ASCII) 4 Eudora Email Message 24 NASA Planetary Data Systems Image 162 Perl Application 66 AutoCAD Plot Drawing 1 Portable Network Graphics Bitmap 4125 MacPaint Bitmap 3 Lotus Freelance Graphics 97 File 2 Python Tkinter / UNIX Shell Script 220 XML Resource Description Framework 86 Semicolon Divided Values File 138 Standard Generalized Markup Lang 1641 ArcView GIS Shape 3 Structured Query Language Query 154 Structured Query Language Report / Program 5 Tape Archive (Compressed with Gzip) 5 Thumbs Plus Database 2 UU-Encoded File 1 MS Visual Basic Project 1 MS Visio 3/4 Document/Drawing/Shapes/Template 6 MS Write / Word Backup 35 MS Visual BASIC Source Code 5 MS Visual BASIC Form 2 MS Visual BASIC Script/Header 2 Text File: Unicode/DoubleByte/UTF-16LE 15 MS Excel Spreadsheet (XML) 507 MS Word Document (XML) 352 Text File (UTF-8) 24 Source Code (General) 1599 Tab Separated Values Text File 2822 MS Windows Media Active Stream 7 Pro/ENGINEER Geographic Image 1 Internet Message (MIME) 119 Adobe Portable Document (MacBinary) 168 Generic Sound Sample 1 NIST NSRL Hash 
Database 10 Bzip Archive V2 1 CPIO Archive 1 UFA Compressed Archive 35 PestPatrol Scan Strings 1 MS Access Report / Snapshot 26 Assembly Source Code File 256 MS C# Source Code 1 MS Outlook Rich Text Formatted Message 2 MathCaD Document 4 MIME HTML Web Page Archive 1 MapInfo Spatial Table 3 Virtual Calendar File 4 XML Schema 35 XML Paper Specification Document (Open XML) 1 VTeX Multiple Master Font Metrics 1 MS Visual Studio.NET Deployment Project 1 Web Service Description Language 3 MS Windows .NET Application Configuration 2 MS Word Document (Open XML) 164 MS Excel Spreadsheet (Open XML) 39 MS PowerPoint Presentation (Open XML) 220 UNIX Program/Program Library (32-bit) 1 UNIX Shell Archive 1 GNU Info Hypertext Document 1 EXtensible Markup Language (UTF-16LE) 24 EXtensible Markup Language (UTF-8) 960 ArcExplorer Project 2 Grace Project File 5 MS J# Source Code 123 Personal Home Page Script 10 Debian Linux Package 1 AppleSingle MIME Format 10 AppleDouble MIME Format 5 Google Earth Keyhole Markup Language 611 Medical Waveform Description 1 OpenDocument Text 2 AVS Field Data 1 Object Oriented Graphics Library: Objects (ASCII) 2 ACIS 3D Model 4 Facility for Interactive Generation File 5 MS Windows Media Player Play List 1 Perfect Office Document 2 The Bat! Email Message 2 Yahoo! 
Mail Email Message 1 OpenOffice Impress Presentation / Template 2 Applixware Graphic Image 2 MS Word for Windows Document (pre-OLE) 197 Adobe Acrobat Forms Data Format 2 LDAP Data Interchange Format 1 HyperText Markup Language (UTF-16BE) 1 HyperText Markup Language (UTF-16LE) 27 HyperText Markup Language (UTF-8) 27 MS InfoPath Document (XML) 1 Windows Policy Template 1 Affix File 14 NetCDF CDL Metadata 3 Logger Pro Data 133 MS Works Database 3 for Windows 4 Digital Asset Exchange File 1 Pretty Good Privacy Signed Message (ASCII) 7 Pretty Good Privacy Public Key Block (ASCII) 1 Linux Journalled Flash File System Image (JFFS,Intel) 5 dBase II Database 1 3D Systems Stereolithography CAD Image (Binary) 2 CGNS Advanced Data Format Database 18 ACE/gr Parameter Data (ASCII) 3 Palm OS Application 2 Palm OS Dynamic Library 1 Mobipocket eBook 1 MS Rich Text Format Document (Mac) 58 Wyko Vision Dataset (ASCII) 14 Google Earth Keyhole Markup Langage (Compressed) 660 MS FrontPage Document (XML) 81 Netscape Browser Bookmarks 6 Web Script Source Code File 17 Tgif Drawing 6 Apple Property List 2 ArcInfo Coverage Export 101 Earth Resource Mapping Satellite Image Header 1 |
Dataset owner | | If not freely available, who owns this dataset? | http://digitalcorpora.org/ |
Dataset rights | | Under which conditions is the dataset available inside SCAPE and/or publicly | Seems to be freely available, but I cannot find this information |
Contact Person | | Who to contact to get more information about this dataset | Leave comments on this page http://digitalcorpora.org/corpora/files |
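The per-format file counts above can be summarised automatically instead of by hand. The sketch below shows one way to compute how many of the most common formats are needed to cover a given share of the files, using the 68 / 95 / 99.7 thresholds that appear later on this page. The function name `formats_needed` and the toy counts are hypothetical; the real counts would come from the list above.

```python
from typing import Dict, Iterable, List, Tuple


def formats_needed(counts: Dict[str, int],
                   thresholds: Iterable[float] = (0.68, 0.95, 0.997)
                   ) -> List[Tuple[float, int]]:
    """For each threshold, return how many of the most common formats
    are needed to cover at least that share of all files."""
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    results = []
    for threshold in thresholds:
        covered = 0
        needed = 0
        for n in ordered:
            # Stop as soon as the formats taken so far cover enough files.
            if covered >= threshold * total:
                break
            covered += n
            needed += 1
        results.append((threshold, needed))
    return results


if __name__ == "__main__":
    # Toy counts for illustration only; not the real dataset figures.
    toy = {"pdf": 50, "html": 30, "jpeg": 15, "txt": 4, "gif": 1}
    for threshold, needed in formats_needed(toy):
        share_of_formats = needed / len(toy)
        print(f"{threshold:.1%} of files -> {needed} formats "
              f"({share_of_formats:.1%} of all formats)")
```

Dividing the number of formats needed by the number of distinct formats gives the "share of formats" figure, which makes the long tail of rare formats visible at a glance.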
Evaluation areas
1. Performance measures (automated as much as possible)
This area should evaluate how the workflow / component(s) / platform instance performs.
Measure | Description | Goal | Result |
---|---|---|---|
Speed | We need precise measures for this. | | |
Overall runtime | | As fast as possible | 169 minutes |
Objects per second | | As fast as possible | 96 files per second |
Technical measures | Can we have single measures for these kinds of things that cover an entire workflow (across multiple nodes)? E.g. average numbers across nodes and maximum load on a single node | | These were not measured for this experiment |
CPU-usage | Should this be per node, in total, a percentage, or some other measure? | | |
RAM-usage | Should this be per node, in total, a percentage, or some other measure? | | |
Network-usage | Should this be per node, in total, a percentage, or some other measure? | | |
Disk I/O usage | Should this be per node, in total, a percentage, or some other measure? | | |
Scalability | We need precise measures for this. 4 dimensions described in the DoW | | |
Number of objects processed | This is also described in the datasets section - but the same dataset might be used in different evaluations in different workflows/contexts | 977885 (only the objects which had a ground truth mimetype) | 977885 |
Size of objects | This is also described in the datasets section (largest file) | | ? |
Complexity of objects | This could also be described in the datasets section? Might be hard to measure. Number of known properties? Compound objects? | | 1. No interdependencies between objects. 2. Mostly well-known formats, but a long tail of obscure formats. 3. Many xml/html derivatives. 4. Many text derivatives |
Heterogeneity of dataset | This could also be described in the datasets section? http://en.wikipedia.org/wiki/68-95-99.7_rule | | 213 different formats. 68% of the dataset can be described with 5 formats = 2.35% of the formats. 95% of the dataset can be described with 11 formats = 5.16% of the formats. 99.7% of the dataset can be described with 49 formats = 23% of the formats. This means that the remaining 164 formats in the dataset only account for 0.3% of the files |
? | | | |
Robustness | We need precise measures for this. Also needs to cover: Stability / Error handling. Needs detailed output from the workflow execution | | |
Number of objects "failed" | | 0 | 48425 |
Number of crashes | | 0 (i.e. the identifier can fail to identify a file, which would be a "failure", but not a crash) | 0 |
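The speed and robustness figures in the table above can be cross-checked with simple arithmetic, and the open question about technical measures (average across nodes, maximum on a single node) can be expressed the same way. The sketch below is only an illustration: the function names are hypothetical, and the per-node CPU readings are made up, since no technical measures were recorded for this experiment.

```python
from statistics import mean
from typing import Dict, Sequence


def objects_per_second(objects: int, runtime_minutes: float) -> float:
    """Throughput derived from object count and overall runtime."""
    return objects / (runtime_minutes * 60)


def failure_rate(failed: int, processed: int) -> float:
    """Share of processed objects that failed, as a fraction of 1."""
    return failed / processed


def summarise_nodes(readings: Sequence[float]) -> Dict[str, float]:
    """Average across nodes and maximum on a single node."""
    return {"average": mean(readings), "max": max(readings)}


if __name__ == "__main__":
    # Figures taken from the table above.
    print(f"{objects_per_second(977885, 169):.1f} objects per second")
    print(f"{failure_rate(48425, 977885):.2%} of objects failed")
    # Made-up CPU percentages for three nodes, for illustration only.
    print(summarise_nodes([40.0, 55.0, 85.0]))
```

With the figures from the table, the throughput comes out at roughly 96 objects per second, consistent with the reported value, and the failed objects amount to just under 5% of those processed.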
2. Manual Assessment (curators)
Does the workflow / component(s) actually do the job from a human point of view?
element | description | result |
---|---|---|
Issue solved? | To what extent does the issue owner feel happy? (e.g. fully, partially, not at all) | fully |
Complexity of solution? | e.g. can the solution be used by small institutions with a minimum of professional FTEs (simple, moderately complex, complex) | simple |
? | | |
3. SCAPE technical evaluation (TCC)
Does the workflow / component(s) comply with SCAPE technical standards?
This might be done as a kind of checklist.
http://wiki.opf-labs.org/display/SP/The+SCAPE+Functional+Review+Process
element | description | checked |
---|---|---|
code checked into SCAPE Git | | X |
solution documented on the WIKI | | % |
? | | |
4. Integration evaluation (EXL and/or others - e.g. other repository owners)
How well can the workflow / component(s) be integrated into real-life systems / scenarios like Rosetta?
This is also about industrial / commercial readiness.
It should be decided by a number of factors, including how easy the results are to take away, use and productize.
Move tools into an existing product (commercial or otherwise)?