This page and its children contain, for now, rough and unstructured information. Consider this a drop box for issues and ideas.
Tools we'll work with:
- Tika: crashes during parsing (SPRUCE sample); needs investigation
- Tika: character encoding of the output (caused problems when processing results automatically)
- Identification improvements (SPRUCE sample: some files were identified only as octet-stream, despite being in a common format like msword)
- the UNIX file command
- ohcount (for detecting programming languages)
- Calibre (for DRM detection)?
- tools from TUB
- UIMA (Unstructured Information Management Architecture)
- The PDF format is important
- At SB we have lots and lots of audio and video objects
- The TUB efforts are mainly working with text objects
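The octet-stream fallback mentioned above can be illustrated with a minimal extension-based baseline. This is only a sketch: `identify` and `flag_generic` are hypothetical helpers, and real identification would inspect content with Tika, DROID, or `file` rather than guess from file names.

```python
import mimetypes

def identify(path):
    """Guess a MIME type from the file name; fall back to the generic
    type that signals 'unidentified' (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(path)
    return mime or "application/octet-stream"

def flag_generic(paths):
    """Return the paths a deeper tool (Tika, DROID, ...) should re-examine
    because extension-based identification gave only octet-stream."""
    return [p for p in paths if identify(p) == "application/octet-stream"]
```

Such a baseline makes it easy to measure how many files a given sample leaves unidentified before and after a tool improvement.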
Format naming schema:
- It is important to follow the agreed naming schema, not just within PC but also with PW.
- The agreed naming schema is an extension of the MIME type, e.g. video/avi;version=1.0
- All characterization tools, identification and deep characterization, should follow this schema. This includes Tika and FITS or analogous deep characterization tool.
- More information about the naming schema is in the Action Tools presentation from the Den Haag meeting.
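Assuming the agreed schema is a plain MIME type optionally extended with `;key=value` parameters (as in the video/avi;version=1.0 example above), identifiers could be built and parsed as in this sketch; the function names are illustrative only.

```python
def format_id(mime, version=None):
    """Build a format identifier in the agreed schema: a MIME type,
    optionally extended with a version parameter."""
    return f"{mime};version={version}" if version else mime

def parse_id(identifier):
    """Split a format identifier back into (mime type, version or None)."""
    mime, _, params = identifier.partition(";")
    version = None
    for param in params.split(";"):
        key, _, value = param.partition("=")
        if key.strip() == "version":
            version = value.strip()
    return mime.strip(), version
```

Having one shared pair of helpers like this would keep Tika, FITS, and the identification tools emitting and consuming identical strings.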
We'll improve the tool evaluation framework. We'll also look into integrating it with REF. In the long run we want an automatic tool evaluator for on-the-fly evaluation of our tools.
We need to discuss and involve the Evaluation of Results work package.
In a year we will repeat the (automatic) tool evaluation with the then-newest versions of DROID and FIDO and compare them to the present versions.
We need discussion regarding FITS and JHOVE2. What value can they add to this WP? We will evaluate the FITS output format for use for communication between sub projects.
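For the repeated DROID/FIDO evaluation, a simple per-tool accuracy over a ground-truth sample might be enough as a first metric. The data and names below are made up for illustration.

```python
def accuracy(results, ground_truth):
    """Fraction of files where a tool's identification matches the ground
    truth; both arguments map file name -> format identifier."""
    hits = sum(1 for f, fmt in ground_truth.items() if results.get(f) == fmt)
    return hits / len(ground_truth)

# hypothetical results from two tool versions against a tiny ground truth
truth = {"a.pdf": "application/pdf", "b.doc": "application/msword"}
old_run = {"a.pdf": "application/pdf", "b.doc": "application/octet-stream"}
new_run = {"a.pdf": "application/pdf", "b.doc": "application/msword"}
```

Running the same function over both tool versions gives directly comparable numbers for the year-on-year comparison.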
File format -> application mapping (creation software? rendering software?)
How are we going to characterise composite objects?
- unpacking versus streaming
- as this issue can be arbitrarily complex, we need to decide
- what we consider a composite object
- how deep we want characterisation to go
- what (kind of) characteristics we are interested in
- We need Scenarios involving complex/composite objects
- ISO/Image files
- Identification of contents (what file formats are on the image?)
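As one possible approach to the unpacking side, here is a minimal sketch that recursively lists the contents of a zip container and guesses member types by extension. A zip stands in here for any composite object; reading ISO images would need an external library, and a real pipeline would identify members by content.

```python
import io
import mimetypes
import zipfile

def list_members(data, prefix=""):
    """Recursively list (path, guessed MIME type) for every member of a
    zip container, descending into nested zips."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            path = prefix + name
            if name.lower().endswith(".zip"):
                # recurse into the nested container
                entries += list_members(zf.read(name), path + "/")
            else:
                mime, _ = mimetypes.guess_type(name)
                entries.append((path, mime or "application/octet-stream"))
    return entries
```

The recursion depth is exactly the "how deep do we go" decision above; a depth limit parameter would be the natural place to encode that policy.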
How are we going to detect and characterise DRM?
- How do we detect DRM
- Which characteristics are interesting regarding DRM?
- We need Scenarios pertaining to DRM
- How do we stand with regard to legal issues and DRM? Can we, if necessary, remove DRM? There are lots of tools for that kind of action, but when would we be allowed to use them? The actual removal of DRM belongs in the Action Components WP.
- DRM "formats" are often volatile but we need to provide access over long time spans.
- We have heard of a "Rights Expression Language" but is that in use by anyone?
- How do we deal with encryption?
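For one common case, a very naive first-pass detection heuristic is sketched below: a PDF whose trailer carries an /Encrypt entry uses the standard security handler (passwords, permission restrictions). This is illustrative only; real characterisation should use a proper parser such as Apache PDFBox or JHOVE.

```python
def pdf_looks_encrypted(data: bytes) -> bool:
    """Naive heuristic: flag PDFs that contain an /Encrypt entry anywhere
    in their raw bytes. False positives/negatives are possible; a real
    check must parse the trailer dictionary."""
    return data.startswith(b"%PDF") and b"/Encrypt" in data
```

Even a crude flag like this would let us count how much of a collection is DRM-affected before deciding on the legal and technical follow-up.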
- We need to begin working on policy-driven validation.
- We need to create Debian packages for the tools.
- We need to add the tools to the SCAPE Catalogue.
- We need to describe the approach devised by Andy, who used strace to analyse the technological dependencies of renderers.
- We need to define a format for communication with other WP.
- communicating results
- communicating tools
- All relevant scenarios will be taken into account during the work
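The strace-based dependency analysis mentioned above boils down to extracting which files a renderer actually opened. A sketch of the log-parsing half (the trace itself would come from something like `strace -f -e trace=open,openat <renderer> <file>`):

```python
import re

# matches successful open()/openat() calls in an strace log
_OPEN = re.compile(r'open(?:at)?\((?:[^,]+,\s*)?"([^"]+)"[^)]*\)\s*=\s*(-?\d+)')

def opened_files(strace_log):
    """Extract the paths a traced renderer successfully opened (return
    value >= 0), as a first cut at its technological dependencies."""
    files = []
    for line in strace_log.splitlines():
        m = _OPEN.search(line)
        if m and int(m.group(2)) >= 0:
            files.append(m.group(1))
    return files
```

Filtering the result for shared libraries and font/config files would give the dependency list we want to record per renderer.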
A concrete goal to reach before June:
- Run identification/characterisation on a set of files
- Load the result into REF
- Compare the result with existing results in REF
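The comparison step of this goal could start as a plain diff between the fresh run and what is already stored. A sketch, assuming both sides can be exported as file-to-format mappings (the names here are illustrative, not the REF API):

```python
def compare_runs(new, existing):
    """Compare a fresh identification run against previously stored
    results; both map file name -> format identifier. Returns the files
    that agree, those that differ (with both values), and unseen files."""
    agree = {f for f in new if existing.get(f) == new[f]}
    differ = {f: (existing[f], new[f])
              for f in new if f in existing and existing[f] != new[f]}
    unseen = {f for f in new if f not in existing}
    return agree, differ, unseen
```

The `differ` mapping is the interesting output for June: each entry is either a tool regression or an identification improvement to investigate.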
- How do we incorporate the SCAPE Platform/Hadoop into our work?
- How does this work package fit together with the PT sub-project?
- We need training in the SCAPE Platform (maybe at the upcoming workshop in June?)
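One low-effort way into Hadoop is Hadoop Streaming, which runs any executable as mapper/reducer over stdin/stdout. A format-counting job over identification results could be sketched like this; the `path<TAB>mime` record layout is an assumption, and the functions take iterables so the same code works with or without Hadoop.

```python
def mapper(lines):
    """Streaming-style mapper: each input line is assumed to be
    'path<TAB>mime' (hypothetical record layout); emit 'mime<TAB>1'."""
    for line in lines:
        path, _, mime = line.rstrip("\n").partition("\t")
        if mime:
            yield f"{mime}\t1"

def reducer(pairs):
    """Sum the counts per MIME type (input need not be pre-sorted here)."""
    counts = {}
    for pair in pairs:
        mime, _, n = pair.partition("\t")
        counts[mime] = counts.get(mime, 0) + int(n)
    return counts
```

Wired to `sys.stdin`, the same two functions become the scripts passed to the Hadoop Streaming jar.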
Some personal notes on a basic Hadoop experiment using web archive metadata:
Moved to and maintained at: Web Content Testbed - Next steps
- How does Taverna fit with this work package?
- How do Taverna and Hadoop (and the SCAPE Platform) fit together?
Regarding FITS or similar:
(just outlining some thoughts from PW; I will add more structured info soon if you consider this helpful)
- There is a need for more than just File Format/Mimetype identification.
- A normalized vocabulary of the known characterization properties.
- a set of properties
- data type/format
- Identical properties from different tools have to be normalized.
- Provenance information would be nice
- Conflicts as in FITS would be nice
- Confidence level of the values would be nice
- One tool/invocation point for all of this would be nice
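The wish list above (normalized vocabulary, provenance, conflicts) can be sketched as one merge step over per-tool reports. Everything here is illustrative: the alias table is a stand-in for the vocabulary still to be agreed, and the output layout only mimics the idea behind FITS, not its actual XML format.

```python
from collections import defaultdict

# mapping tool-specific property names to a normalized vocabulary
# (example aliases only; the real vocabulary is still to be agreed)
ALIASES = {"MIMEType": "mimetype", "mime-type": "mimetype"}

def merge(reports):
    """Merge per-tool reports [(tool, property, value), ...] into
    {normalized property: {"values": {value: [tools]}, "conflict": bool}},
    keeping provenance (which tool reported what), as FITS does."""
    merged = defaultdict(lambda: defaultdict(list))
    for tool, prop, value in reports:
        merged[ALIASES.get(prop, prop)][value].append(tool)
    return {prop: {"values": dict(values), "conflict": len(values) > 1}
            for prop, values in merged.items()}
```

Confidence levels would be a natural extension: weight each tool's vote instead of just listing it.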
Web Content Characterization:
Tasks and intended checkpoints
- Define set of source data types together with UG
- compile by end of May and disseminate on the mailing list
- Prioritize requirements from PW
- Comparatively evaluate IE extraction methods (ReVerb, N-ary relations)
- preliminary study finished and documented by mid May
- large scale study by mid autumn - depending on scalability issues
- Setup UIMA pipeline for extraction on small- then medium dataset
- focused crawls for different data types
- initial architecture by June/July
- Possible datasets:
- Focused crawls
- ClueWeb corpus (main reference corpus for work on Web documents)
- Gigaword corpus
- Corpora provided by partners
- Dissemination via a (perhaps publicly accessible) web demonstrator + blog posts, presentations
- Evaluation and integration to be discussed with PW
- Implementation of integration
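The planned UIMA pipeline chains analysis engines over a shared CAS. UIMA itself is Java; the following is only a conceptual Python illustration of that chaining idea, with toy annotators, not UIMA code.

```python
def run_pipeline(text, annotators):
    """Conceptual sketch of a UIMA-style pipeline: a shared artifact
    (the 'cas' dict, standing in for a UIMA CAS) is passed through a
    chain of annotators, each adding its own annotations."""
    cas = {"text": text, "annotations": []}
    for annotate in annotators:
        annotate(cas)
    return cas

def sentence_splitter(cas):
    """Toy annotator: mark sentence spans at '.' boundaries."""
    start = 0
    for i, ch in enumerate(cas["text"]):
        if ch == ".":
            cas["annotations"].append(("Sentence", start, i + 1))
            start = i + 2

def token_counter(cas):
    """Toy annotator: record a document-level token count."""
    cas["annotations"].append(("TokenCount", len(cas["text"].split())))
```

The point of the structure is that extraction components (e.g. the ReVerb-style relation extractors mentioned above) slot into the chain without knowing about each other, which is what makes the small-to-medium dataset scale-up incremental.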