Tools
We will analyse and improve a set of tools, but, just as importantly, we will
create the necessary software for integrating these tools into the SCAPE platform.
Selected tools
Tika
As described in the D9.1 report, we identified some issues with Tika that we need to address. The two most important are:
- support for format versioning
- support for regexp
During the latest SPRUCE event, two issues with Tika were observed that we need to investigate further:
- crashes: investigation of some large TIFF files has indicated that parsing can cause Tika to run out of heap space; raising the JVM heap limit (e.g. with -Xmx1024m) works around this, as shown in the sketch after this list
- we may still encounter other problems, however
- problems with the encoding of output
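A minimal sketch of invoking Tika's type detection from Java (assuming tika-core is on the classpath; the class name and file argument are illustrative). Launching the JVM with -Xmx1024m gives it enough heap for the large-TIFF case noted above:

    import java.io.File;
    import java.io.IOException;
    import org.apache.tika.Tika;

    public class TikaDetect {
        public static void main(String[] args) throws IOException {
            // Run as e.g. "java -Xmx1024m TikaDetect large.tiff" so the JVM
            // has enough heap for the large-TIFF parsing problem noted above.
            Tika tika = new Tika();
            System.out.println(tika.detect(new File(args[0])));
        }
    }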
UNIX file
- how can we include this in SCAPE?
- can we utilise magiclib in other tools, or contribute to it?
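One option, sketched below under the assumption that the file(1) binary is on the PATH, is simply to wrap the command line tool from Java:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class FileCommand {
        /** Returns the MIME type reported by the UNIX file(1) tool. */
        public static String mimeType(String path)
                throws IOException, InterruptedException {
            // --brief suppresses the leading file name; --mime-type prints
            // only the detected MIME type.
            Process p = new ProcessBuilder(
                    "file", "--brief", "--mime-type", path).start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String mime = r.readLine();
                p.waitFor();
                return mime;
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println(mimeType(args[0]));
        }
    }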
FITS
- FITS dispatches characterisation to a series of (old) tools. We are not sure whether we will use FITS as is, but we will investigate whether it would be feasible to use its output format.
ffprobe
- used at SB for characterisation of video files prior to ingest into the DOMS.
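A sketch of how ffprobe might be wrapped from Java (assuming the ffprobe binary is on the PATH); the JSON output options shown are part of the standard ffprobe command line:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class FfprobeWrapper {
        /** Returns ffprobe's JSON description of the given media file. */
        public static String probe(String path)
                throws IOException, InterruptedException {
            // -print_format json gives machine-readable output; -show_format
            // and -show_streams cover container- and stream-level properties.
            ProcessBuilder pb = new ProcessBuilder("ffprobe", "-v", "quiet",
                    "-print_format", "json", "-show_format", "-show_streams",
                    path);
            pb.redirectErrorStream(true);
            Process p = pb.start();
            StringBuilder out = new StringBuilder();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    out.append(line).append('\n');
                }
            }
            p.waitFor();
            return out.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(probe(args[0]));
        }
    }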
ohcount
- for detecting and identifying programming languages
PDFBox
- e.g. for detecting DRM
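For PDFs, encryption and usage restrictions can be read directly via the PDFBox API. A minimal sketch (the class name and file argument are hypothetical):

    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.encryption.AccessPermission;

    public class DrmCheck {
        public static void main(String[] args) throws IOException {
            PDDocument doc = PDDocument.load(new File(args[0]));
            try {
                if (doc.isEncrypted()) {
                    // The access permissions encode the DRM restrictions.
                    AccessPermission ap = doc.getCurrentAccessPermission();
                    System.out.println("encrypted; printing allowed: "
                            + ap.canPrint() + ", text extraction allowed: "
                            + ap.canExtractContent());
                } else {
                    System.out.println("no encryption dictionary present");
                }
            } finally {
                doc.close();
            }
        }
    }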
New Zealand Metadata Extraction Tool
UIMA
- Unstructured Information Management Architecture
Calibre (for DRM)
Remarks
In the D9.1 report we analysed FIDO and DROID in addition to Tika. We will not
work further with these two tools. If either tool is released in a newer
version prior to our next deliverable, we will re-evaluate it and, based on
that evaluation, reconsider it.
One of the bigger players in the DP field is JHOVE2. Due to multiple issues, we
will not work with that tool in this SCAPE year.
Overview of the tasks to complete for each tool
- Create Debian packages
- Create toolspec
- Create a component specification as described on the page Preservation Components, Workflows, Planning and Platform integration
- Normalise input/output, maybe using the FITS format for output
- We need to discuss how to run multiple tools on the same digital object. We will not re-implement FITS, but need something similar.
- Add the tools to the central instances (AIT and IM) and registries
- Describe provenance data available from the tools
- Gather conflicts between the tools as per FITS
- Hadoop/SCAPE readiness: API (if available) / command line usage (performance); how does it fit into the map/reduce paradigm (one map task per record)? See the mapper sketch after this list.
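A minimal sketch of the one-map-task-per-record idea, assuming each input record holds an HDFS path and that Tika is used for detection (class and field names are illustrative):

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.tika.Tika;

    public class CharacterisationMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        private final Tika tika = new Tika();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each record is one HDFS path; open it and detect the MIME type.
            Path path = new Path(value.toString());
            FileSystem fs = FileSystem.get(context.getConfiguration());
            InputStream in = fs.open(path);
            try {
                String mime = tika.detect(in, path.getName());
                context.write(value, new Text(mime));
            } finally {
                in.close();
            }
        }
    }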
Scenarios
Everything we do will still be correlated with relevant scenarios, such as
those listed below.
Formats and features of files
Formats
- Microsoft Office formats
- JPEG 2000
- TIFF
- audio and video
- text
- ISO images, etc.
- RAW image files
Web content Characterisation
Web content is somewhat special, as we have efforts directed specifically
towards text analysis and text mining. This effort is primarily driven by TUB
and will be guided by the roadmap below.
Tasks and intended checkpoints
- Define a set of source data types together with UG
- compile by end of May and disseminate on the mailing list
- Prioritize requirements from PW
- May-June
- Comparatively evaluate information extraction (IE) methods (ReVerb, N-ary relations)
- preliminary study finished and documented by mid May
- large scale study by mid autumn - depending on scalability issues
- Set up a UIMA pipeline for extraction on a small, then a medium-sized dataset
- focused crawls for different data types
- initial architecture by June/July
- Possible datasets:
- Focused crawls
- ClueWeb corpus (main reference corpus for work on Web documents)
- Gigaword corpus
- Corpora provided by partners
- Dissemination via a (perhaps publicly accessible) web demonstrator, plus blog posts and presentations
- Evaluation and integration to be discussed with PW
- Implementation of integration
Features
DRM
- How do we detect DRM?
- Which characteristics are interesting regarding DRM?
- Where do we stand with regard to legal issues and DRM?
- Can we, if necessary, remove DRM?
- The actual removal of DRM belongs in the Action Components WP
- We need scenarios pertaining to DRM
- DRM "formats" are often volatile, but we need to provide access over long time spans.
- We have heard of a "Rights Expression Language", but is it in use by anyone?
- How do we deal with encryption?
Composite/complex objects
- scripts in an HTML page
- embedded objects, e.g. an MP3 file in a docx file
- ZIP, TAR, GZ, etc.
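Since many container formats (including docx) are ZIP-based, a first pass at surfacing embedded objects can be as simple as listing the archive entries. A minimal sketch using only the standard library (the class name is illustrative):

    import java.io.IOException;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    public class ContainerListing {
        public static void main(String[] args) throws IOException {
            // A .docx file is itself a ZIP container, so embedded objects
            // (e.g. media files) show up as entries under word/media/.
            try (ZipFile zip = new ZipFile(args[0])) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = entries.nextElement();
                    System.out.println(entry.getName()
                            + " (" + entry.getSize() + " bytes)");
                }
            }
        }
    }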
Integration with the rest of SCAPE
Integration
For the Planning and Watch SP we need to supply a schema of properties, i.e. a
schema that describes the kind of data a given tool will be able to deliver.
The section below outlines some ideas from Luis Faria, KEEPS.
I guess we will need 4 interfaces between PC.CC and PW:
- Identifier scheme for formats: based on the MIME type, extended with version
information via parameters, as defined by Andy at
http://wiki.opf-labs.org/display/SP/Proposal+-+Extended+MIME+Type+Identifiers
(I think this was agreed upon at the PC-SP meeting; it still needs to be
defined exactly which parameters to use and whether to add other info like
codec or other properties, without making the identifier ambiguous. Maybe SB
should lead this, as it will be the one that produces this information.) See
the example identifiers after this list.
- Identifier scheme for tools: The proposed format is based on Debian
packaging, extended to add operating system information, but there is no
written work about this (except the KEEPS presentation at the PC-SP meeting)
and actually no consensus due to lack of time. KEEPS will continue to engage
other partners in discussion and can lead this, as it will be produced as part
of the toolspec.
- Deep characterization output format: There is nothing defined about this,
and the only similar work that I know of is in FITS
(http://code.google.com/p/fits/), which does some normalization of the output.
I don't know much about this; maybe Petar can give you more details, as he is
the one in Watch who is using FITS and will need to know more about this
format. Nevertheless, I think SB should lead this.
- Tool information: This will be defined in the toolspec, to be updated by
KEEPS with the proposals presented at the meeting and with other requirements
that will come from PW. The toolspec schema is available here:
https://github.com/openplanets/scape/blob/master/scape-core/src/main/resources/eu/scape_project/core/model/toolspec/toolspec.xsd
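As a rough illustration of the extended MIME type identifiers mentioned in the first point above (the exact parameter names are still to be agreed, so these examples are purely illustrative):

    image/tiff; version=6.0
    application/pdf; version=1.4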
The SCAPE Platform
- Hadoop
- The central instances at AIT and IM
REF
- integrate the evaluation framework with REF
- automate evaluation of tools using a given corpus with a ground truth
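A minimal sketch of what such automated evaluation could look like, assuming a hypothetical ground-truth CSV with lines of the form path,expectedMimeType,detectedMimeType (the class name and file layout are assumptions):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class GroundTruthEvaluation {
        public static void main(String[] args) throws IOException {
            int total = 0, correct = 0;
            try (BufferedReader r = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = r.readLine()) != null) {
                    // Compare a tool's detected type against the ground truth.
                    String[] fields = line.split(",");
                    total++;
                    if (fields[1].equals(fields[2])) {
                        correct++;
                    }
                }
            }
            System.out.printf("accuracy: %d/%d (%.1f%%)%n",
                    correct, total, 100.0 * correct / total);
        }
    }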
Other WP
- Policy driven validation
- SB has started looking into this
- Evaluation of results
- Automated Watch
- Create training material
- Do we have enough effort for this?
- We need a training manager either in each WP or one (supported by a small
group) at the SP level.