We will analyse and improve a set of tools, but just as importantly we will
create the necessary software for integrating these tools into the SCAPE platform.
As described in the D9.1 report, we identified some issues with Tika that we need to address. The two most important are:
- support for format versioning
- support for regexp
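To make these two requirements concrete, here is a minimal sketch of signature matching with regular expressions over the leading bytes of a file, where the match can also capture a version. The signature table and version strings are illustrative only, not Tika's actual rules:

```python
import re

# Illustrative byte-signature table: pattern -> MIME type.
# Real signatures (e.g. PRONOM's or Tika's) are far more involved.
SIGNATURES = [
    (re.compile(rb'^%PDF-(\d\.\d)'), 'application/pdf'),
    (re.compile(rb'^\x89PNG\r\n\x1a\n'), 'image/png'),
]

def identify(leading_bytes):
    """Return (mime, version-or-None) for the first matching signature."""
    for pattern, mime in SIGNATURES:
        m = pattern.match(leading_bytes)
        if m:
            version = m.group(1).decode() if m.groups() else None
            return mime, version
    return 'application/octet-stream', None
```

With this shape, format versioning falls out of the signature itself: `identify(b'%PDF-1.4\n...')` yields `('application/pdf', '1.4')`.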
During the latest SPRUCE event, two issues with Tika were observed that we need to investigate further:
- some crashes
- Investigation of some large TIFF files has indicated that parsing can cause Tika to run out of heap space; adding -Xmx1024m to the JVM options fixes this.
- We may still encounter other problems however
- problem with encoding of output
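As a concrete illustration of the heap workaround, a sketch of assembling a Tika CLI invocation with an enlarged heap; the jar path and file name are placeholders, not paths from this project:

```python
import subprocess  # noqa: F401  (shown for the commented-out run below)

def tika_command(jar_path, target, heap='1024m'):
    """Build a Tika CLI metadata invocation with an explicit max-heap setting."""
    return ['java', f'-Xmx{heap}', '-jar', jar_path, '--metadata', target]

cmd = tika_command('tika-app.jar', 'large-scan.tiff')
# subprocess.run(cmd) would execute it; omitted here since the jar path
# is only a placeholder.
```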
- how can we include this in SCAPE
- can we utilise the magiclib in other tools? Add to it?
- FITS dispatches characterisation to a series of (old) tools. We are not sure if we will use FITS as is, but we will investigate whether it would be feasible to use its format.
- used at SB for characterisation of video files prior to ingest into the DOMS.
- for detecting and identifying programming languages
- e.g. for detecting DRM
- New Zealand Metadata Extraction Tool
- Unstructured Information Management Architecture
- Calibre (for DRM)
In the D9.1 report, in addition to Tika, we analysed FIDO and DROID. We will not work further
with these two tools. If either tool is released in a newer version prior to our
next deliverable, we will re-evaluate it and, based on that evaluation,
reconsider it.
One of the bigger players in the DP field is JHOVE2. Due to multiple issues, we
will not work with that tool in this SCAPE year.
Overview of the tasks to be completed for each tool
- Create Debian packages
- Create toolspec
- Create a component specification as described on the page Preservation Components, Workflows, Planning and Platform integration
- Normalised input/output, possibly using the FITS format for output
- We need to discuss how to run multiple tools on the same digital object. We will not re-implement FITS, but we need something similar.
- Add the tools to the central instances (AIT and IM) and registries
- Describe provenance data available from the tools
- Gather conflicts between the tools as per FITS
- Hadoop/SCAPE readiness (API (if available) / command-line usage (performance); how does it fit into the map/reduce paradigm (one map task per record))
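A minimal sketch of what "gathering conflicts as per FITS" could look like: merge per-tool property maps and record the properties on which tools disagree. The tool and property names are invented for illustration:

```python
from collections import defaultdict

def merge_outputs(tool_outputs):
    """Merge {tool: {property: value}} maps; flag properties where tools disagree."""
    merged = defaultdict(dict)          # property -> {tool: value}
    for tool, props in tool_outputs.items():
        for prop, value in props.items():
            merged[prop][tool] = value
    conflicts = {p: v for p, v in merged.items() if len(set(v.values())) > 1}
    return dict(merged), conflicts

outputs = {
    'toolA': {'mimetype': 'image/tiff', 'well-formed': 'true'},
    'toolB': {'mimetype': 'image/tif'},
}
merged, conflicts = merge_outputs(outputs)
# 'mimetype' ends up in conflicts because toolA and toolB disagree;
# 'well-formed' does not, since only one tool reported it.
```

The same merged structure doubles as normalised output, which is one reason the FITS format is worth investigating.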
Everything we do will still be correlated to relevant scenarios such as:
- LSDRT3 Do acquired files conform to an agreed technical profile, are they valid and are they complete?
Formats and features of files
- Microsoft Office formats
- audio and video
- ISO images etc
- RAW image files
Web content characterisation
Web content is a bit special, as we have efforts directed especially towards
text analysis and text mining. This effort is primarily driven by TUB and will
be guided by this roadmap.
Tasks and intended checkpoints
- Define set of source data types together with UG
- compile by end of May and disseminate in the mailing list
- Prioritize requirements from PW
- Comparatively evaluate IE extraction methods (ReVerb, N-ary relations)
- preliminary study finished and documented by mid May
- large scale study by mid autumn - depending on scalability issues
- Setup UIMA pipeline for extraction on small- then medium dataset
- focused crawls for different data types
- initial architecture by June/July
- Possible datasets:
- Focused crawls
- ClueWeb corpus (main reference corpus for work on Web documents)
- Gigaword corpus
- Corpora provided by partners
- Dissemination via a (perhaps publicly accessible) web demonstrator + blogposts, presentations
- Evaluation and integration to be discussed with PW
- Implementation of integration
- How do we detect DRM?
- Which characteristics are interesting regarding DRM?
- How do we stand with regard to legal issues and DRM?
- Can we, if necessary, remove DRM?
- The actual removal of DRM belongs in the Action Components WP
- We need Scenarios pertaining to DRM
- DRM "formats" are often volatile but we need to provide access over long time spans.
- We have heard of a "Rights Expression Language" but is that in use by anyone?
- How do we deal with encryption?
- scripts in an HTML page
- embedded objects, e.g. an MP3 file in a docx file
- ZIP, TAR, GZ, etc
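Many of these container cases overlap: a docx file is itself a ZIP, so embedded objects can at least be enumerated with standard library tooling. A sketch, where the suspect extension list is illustrative and not a policy:

```python
import io
import zipfile

EMBEDDED_SUSPECTS = ('.mp3', '.exe', '.js')  # illustrative, not exhaustive

def embedded_entries(zip_bytes):
    """List entries in a ZIP-based container (docx, epub, ...) whose
    names match one of the suspect extensions."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [name for name in zf.namelist()
                if name.lower().endswith(EMBEDDED_SUSPECTS)]
```

For example, a docx containing `word/media/clip.mp3` would be flagged, while `word/document.xml` would not. Real characterisation would also need to inspect entry contents, since extensions can lie.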
Integration with the rest of SCAPE
For the Planning and Watch SP we need to supply a schema of properties. I.e. a
schema that describes the kind of data a given tool will be able to deliver.
The section below outlines some ideas from Luis Faria, KEEPS.
I guess we will need 4 interfaces between PC.CC and PW:
- Identifier scheme for formats: based on the MIME type, extended with version
information via parameters, as defined by Andy at
(I think this was consensual in the PC-SP meeting; it still needs to be defined
exactly which parameters to use and whether to add other info such as codec or
other properties without making the identifier ambiguous. Maybe SB should lead
this, as it will be the one that produces this information.)
- Identifier scheme for tools: The proposed format is based on Debian
packaging, extended to add operating system information, but there is no
written work about this (except the KEEPS presentation at the PC-SP meeting) and
actually no consensus due to lack of time. KEEPS will continue to engage other
partners for discussion and can lead this as it will be produced as part of
- Deep characterisation output format: There is nothing defined about this,
and the only similar work that I know of is in FITS
(http://code.google.com/p/fits/), which does some normalisation of the output. I
don't know much about this; maybe Petar can give you more details, as he is the
one in Watch who is using FITS and will need to know more about this format.
Nevertheless, I think SB should lead this.
- Tool information: This will be defined in the toolspec, to be updated by KEEPS
with the proposals presented at the meeting and with other requirements that will
come from PW. The toolspec schema is available here:
The SCAPE Platform
- The central instances at AIT and IM
- integrate the evaluation framework with REF
- automate evaluation of tools using a given corpus with a ground truth
- Policy driven validation
- SB has started looking into this
- Evaluation of results
- Automated Watch
- Create training material
- Do we have enough effort for this?
- We need a training manager in either each WP or one (supported by a small
group) at the SP level.