
Tools

We will analyse and improve a set of tools but, just as importantly, we will
create the software needed to integrate these tools into the SCAPE platform.

Selected tools

Tika

As described in the D9.1 report, we identified some issues with Tika that we need to address. The two most important are:

  • support for format versioning
  • support for regexp

During the latest SPRUCE event, two issues with Tika were observed that we need to investigate further:

  • some crashes
    • Investigation of some large TIFF files has indicated that parsing can cause Tika to run out of heap space. Adding -Xmx1024m to the JVM options fixes this.
    • We may still encounter other problems, however.
  • problem with encoding of output
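
As a rough illustration of the heap-space workaround, the sketch below builds a command line that runs the Tika application jar with a larger heap and an explicit output encoding. The jar path `tika-app.jar` and the helper names `tika_command` and `run_tika` are assumptions made for this example. `-Xmx1024m` is the value mentioned above, and `--metadata` and `--encoding` are options of the Tika command-line application.

```python
import subprocess


def tika_command(path, jar="tika-app.jar", heap="1024m", encoding="UTF-8"):
    """Build the argument list for one Tika metadata extraction run.

    The jar location is an assumption; adjust it to the local installation.
    """
    return [
        "java",
        f"-Xmx{heap}",             # raise the JVM heap limit for large TIFF files
        "-jar", jar,
        f"--encoding={encoding}",  # request a fixed output encoding
        "--metadata",              # print only the extracted metadata
        path,
    ]


def run_tika(path, **kwargs):
    """Run Tika and return its decoded output; raises if Tika exits non-zero."""
    result = subprocess.run(tika_command(path, **kwargs),
                            capture_output=True, check=True)
    return result.stdout.decode(kwargs.get("encoding", "UTF-8"))
```

Keeping the command construction separate from the call makes it easy to log the exact invocation when a crash needs to be reproduced.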

UNIX file

  • How can we include this in SCAPE?
  • Can we use libmagic (the library behind file) in other tools? Can we add to it?

FITS

  • FITS dispatches characterisation to a series of (old) tools. We are not sure whether we will use FITS as is, but we will investigate whether it would be feasible to use its output format.

ffprobe

  • used at SB for characterisation of video files prior to ingest into the DOMS.

ohcount

  • for detecting and identifying programming languages

PDFBox

  • e.g. for detecting DRM

New Zealand Metadata Extraction Tool

UIMA

  • Unstructured Information Management Architecture

Calibre (for DRM)

Remarks

In the D9.1 report, in addition to Tika, we analysed FIDO and DROID. We will not work further
with these two tools. If any of the tools have been released in newer
versions prior to our next deliverable, we will re-evaluate them and, based on
that evaluation, reconsider them.

One of the bigger players in the DP field is JHOVE2. Due to multiple issues, we
will not work with that tool in this SCAPE year.

Overview of the tasks to complete for each tool

  • Create Debian packages
  • Create toolspec
  • Create components specification as described on page Preservation Components, Workflows, Planning and Platform integration
  • Normalised input/output, maybe using FITS format for output
  • We need to discuss how to run multiple tools on the same digital object. We will not re-implement FITS, but need something similar.
  • Add the tools to the central instances (AIT and IM) and registries
  • Describe provenance data available from the tools
  • Gather conflicts between the tools as per FITS
  • Hadoop/SCAPE readiness: API (if available) or command-line usage (performance); how does it fit into the map/reduce paradigm (1 map task per record)?
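
To make the items on normalised output and on gathering conflicts more concrete, here is a minimal sketch of merging the results of several tools for one digital object. It assumes each tool's output has already been normalised into a flat dictionary of property names and hashable values. The function name `merge_tool_outputs` and the property names in the usage example are hypothetical, and this is not how FITS itself is implemented.

```python
from collections import defaultdict


def merge_tool_outputs(outputs):
    """Merge per-tool property dictionaries for one digital object.

    `outputs` maps a tool name to a {property: value} dict. Returns
    (agreed, conflicts): properties on which all reporting tools agree,
    and properties for which tools reported different values.
    """
    seen = defaultdict(dict)  # property -> {tool: value}
    for tool, properties in outputs.items():
        for name, value in properties.items():
            seen[name][tool] = value

    agreed, conflicts = {}, {}
    for name, by_tool in seen.items():
        if len(set(by_tool.values())) == 1:
            agreed[name] = next(iter(by_tool.values()))
        else:
            conflicts[name] = dict(by_tool)
    return agreed, conflicts
```

For example, if two tools report the same MIME type but different page counts, the MIME type ends up in `agreed` and the page count in `conflicts`, keyed by the tool that reported each value. That per-tool record is also a starting point for the provenance data mentioned above.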

Scenarios

Everything we do will still be tied to relevant scenarios, such as those listed
below.

Formats and features of files

Formats

  • PDF
  • Microsoft Office formats
  • JPEG 2000
  • TIFF
  • audio and video
  • text
  • ISO images etc
  • RAW image files

Web content Characterisation

Web content is a special case, as we have effort directed specifically towards
text analysis and text mining. This effort is primarily driven by TUB and will
be guided by the following roadmap.

Tasks and intended checkpoints
  • Define set of source data types together with UG
    • compile by end of May and disseminate on the mailing list
  • Prioritize requirements from PW
    • May-June
  • Comparatively evaluate information extraction (IE) methods (ReVerb, N-ary relations)
    • preliminary study finished and documented by mid May
    • large scale study by mid autumn - depending on scalability issues
  • Set up a UIMA pipeline for extraction on a small, then a medium-sized dataset
    • focused crawls for different data types
    • initial architecture by June/July
  • Possible datasets:
    • Focused crawls
    • ClueWeb corpus (main reference corpus for work on Web documents)
    • Gigaword corpus
    • Corpora provided by partners
  • Dissemination via a (perhaps publicly accessible) web demonstrator + blogposts, presentations
  • Evaluation and integration to be discussed with PW
    • Implementation of integration

Features

DRM

  • How do we detect DRM?
  • Which characteristics are interesting regarding DRM?
  • How do we stand with regard to legal issues and DRM?
    • Can we, if necessary, remove DRM?
    • The actual removal of DRM belongs in the Action Components WP
  • We need Scenarios pertaining to DRM
  • DRM "formats" are often volatile but we need to provide access over long time spans.
  • We have heard of a "Rights Expression Language" but is that in use by anyone?
  • How do we deal with encryption?

Composite/complex objects

  • scripts in an HTML page
  • embedded objects, e.g. an MP3 file in a docx file
  • ZIP, TAR, GZ, etc
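
As a small sketch of how container formats might be opened up for characterisation, the function below lists the members of a ZIP or TAR (optionally gzip-compressed) container using only the Python standard library. The function name `list_members` is hypothetical. Since a docx file is itself a ZIP container, an embedded MP3 would show up as one of its members. Single-file .gz streams and scripts inside HTML are not covered here.

```python
import io
import tarfile
import zipfile


def list_members(data):
    """Return the member names of a ZIP or TAR(.gz) container held in bytes.

    Returns an empty list when the data is not a recognised container.
    """
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        buffer.seek(0)
        with zipfile.ZipFile(buffer) as archive:
            return archive.namelist()
    buffer.seek(0)
    try:
        # The default mode auto-detects gzip/bzip2 compression.
        with tarfile.open(fileobj=buffer) as archive:
            return archive.getnames()
    except tarfile.TarError:
        return []
```

Each listed member could then be passed to the characterisation tools in turn, which is one possible way to treat a composite object as a set of simpler ones.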

Integration with the rest of SCAPE

Integration

For the Planning and Watch SP we need to supply a schema of properties, i.e. a
schema that describes the kind of data a given tool will be able to deliver.

The section below outlines some ideas from Luis Faria, KEEPS.

I guess we will need 4 interfaces between PC.CC and PW:

  • Identifier scheme for formats: based on the MIME type, extended with version
    information via parameters, as defined by Andy at
    http://wiki.opf-labs.org/display/SP/Proposal+-+Extended+MIME+Type+Identifiers
    (I think this was agreed by consensus at the PC-SP meeting. It still needs to be
    defined exactly which parameters to use and whether to add other information,
    such as codec or other properties, without making the identifier ambiguous.
    Maybe SB should lead this, as it will be the one producing this information.)
  • Identifier scheme for tools: The proposed format is based on Debian
    packaging, extended to add operating system information, but there is no
    written work about this (except the KEEPS presentation at the PC-SP meeting) and
    no actual consensus, due to lack of time. KEEPS will continue to engage other
    partners in discussion and can lead this, as it will be produced as part of
    toolspec.
  • Deep characterization output format: Nothing has been defined for this,
    and the only similar work that I know of is FITS
    (http://code.google.com/p/fits/), which does some normalization of the output. I
    don't know much about this; maybe Petar can give you more details, as he is the
    one in Watch who is using FITS and will need to know more about this format.
    Nevertheless, I think SB should lead this.
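
To illustrate the first interface, the sketch below parses and serialises a MIME type extended with parameters. The exact parameter set is still undefined, so the `version` parameter in the usage example and the helper names `parse_extended_mime` and `format_extended_mime` are assumptions for illustration only. The parser is deliberately naive and would not handle quoted values that contain semicolons.

```python
def parse_extended_mime(identifier):
    """Split 'type/subtype; key=value; ...' into (mime_type, parameters)."""
    mime_type, *raw_params = [part.strip() for part in identifier.split(";")]
    parameters = {}
    for raw in raw_params:
        if not raw:
            continue
        key, _, value = raw.partition("=")
        parameters[key.strip().lower()] = value.strip().strip('"')
    return mime_type.lower(), parameters


def format_extended_mime(mime_type, parameters):
    """Serialise back to one identifier with parameters in sorted order."""
    parts = [mime_type.lower()]
    parts += [f'{key}="{value}"' for key, value in sorted(parameters.items())]
    return "; ".join(parts)
```

With these helpers, `application/pdf; version="1.4"` parses into the type `application/pdf` and a `version` parameter of `1.4`. Emitting parameters in a fixed order is one possible way to keep the identifier unambiguous when further properties such as codec are added.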

The SCAPE Platform

  • Hadoop
  • The central instances at AIT and IM
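
As a hedged sketch of what Hadoop readiness could look like for a command-line tool, the code below follows the Hadoop Streaming convention of reading one record per line from standard input and writing tab-separated key/value lines to standard output. Here a record is assumed to be a file path. The `characterise` function is only a stand-in that guesses from the file name; a real job would call one of the selected tools instead. How records are grouped into map tasks is left to the job configuration.

```python
import mimetypes
import sys


def characterise(path):
    """Stand-in for a real tool call; guesses the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(path)
    return mime_type or "application/octet-stream"


def map_records(lines):
    """Yield one tab-separated 'path<TAB>mime type' line per input record."""
    for line in lines:
        path = line.strip()
        if path:
            yield f"{path}\t{characterise(path)}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for use as a streaming mapper."""
    for record in map_records(stdin):
        stdout.write(record + "\n")
```

To use this as a streaming mapper, call `main()` at the end of the script. Because the mapping logic takes any iterable of lines, it can be exercised locally on a small list of paths before running it on the cluster.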

REF

  • integrate the evaluation framework with REF
  • automate evaluation of tools using a given corpus with a ground truth
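
A minimal sketch of such an automated evaluation, assuming both the tool results and the ground truth are available as mappings from file name to format identifier. The function name `evaluate_identification` is hypothetical and the sketch says nothing about how REF itself stores or exchanges this data.

```python
def evaluate_identification(results, ground_truth):
    """Compare a tool's identifications with a ground truth.

    Both arguments map a file name to a format identifier. Files missing
    from `results` count as wrong. Returns (accuracy, mismatches), where
    mismatches maps a file name to (reported, expected).
    """
    if not ground_truth:
        raise ValueError("ground truth is empty")
    mismatches = {
        name: (results.get(name), expected)
        for name, expected in ground_truth.items()
        if results.get(name) != expected
    }
    accuracy = 1 - len(mismatches) / len(ground_truth)
    return accuracy, mismatches
```

Returning the mismatches alongside the accuracy keeps the individual failures available for inspection, which is usually more useful for improving a tool than the single score.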

Other WP

  • Policy driven validation
    • SB has started looking into this
  • Evaluation of results
  • Automated Watch
  • Create training material
    • Do we have enough effort for this?
    • We need a training manager in either each WP or one (supported by a small
      group) at the SP level.