This wiki page describes the implementation and packaging work for SCAPE checkpoint CP070. The work has been carried out in the Characterisation Components work package of the Preservation Components sub-project.
The following institutions have been involved in meeting this checkpoint:
- British Library (BL)
- Microsoft Research Cambridge (MSR)
- Österreichische Nationalbibliothek (ONB)
- Open Planets Foundation (OPF)
- Statsbiblioteket (SB)
- Technische Universität Berlin (TUB)
The work was broken up into four distinct sub-projects, for which each responsible institution reports the following.
A System for Unsupervised Discovery of Relations and Discriminative Extraction Patterns from the WWW
Work done by Technische Universität Berlin.
- Short description of the system and its relevance to SCAPE/digital preservation:
- "Klump" (working title only) is a system that automatically collects preservation-relevant information from the Web and stores it in machine-readable structured data.
- The system is intended to monitor and validate policies, or to gather indicators about them, and to provide this information to the Watch component, i.e. Scout.
- Source code and release tag, or an explanation of why they cannot be provided:
- The system is a research prototype currently undergoing heavy refactoring and development.
- Publications:
- Prototype and initial evaluation paper: Alan Akbik and Alexander Löser. "KrakeN: N-ary Facts in Open Information Extraction." NAACL-HLT 2012 Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX). Available at http://www.aclweb.org/anthology/W/W12/W12-30.pdf#page=62
- Full system paper: Alan Akbik, Larysa Visengeriyeva, Priska Herger, Holmer Hemsen, and Alexander Löser. "Unsupervised Discovery of Relations and Discriminative Extraction Patterns." 24th International Conference on Computational Linguistics (COLING) 2012. http://www1.ccls.columbia.edu/~habash/coling-2012-citations/C01.pdf, page 17 ff.
- Web demonstrators:
- Technical demonstrator: http://cluster-inspector.appspot.com/ClusterInspector.html
- Demonstration of the evaluation tool: http://cluster-inspector.appspot.com
The SCAPE Data Publication Platform
Description
The Data Publication Platform (DPP) provides a means to preserve, publish, and query the results of SCAPE experiments and workflows.
The DPP also records provenance, temporal and version information, i.e. who published the results and when.
Specifically, the DPP provides an infrastructure that implements the LDS3 specification (http://www.lds3.org/Specification), providing:
- an OAuth2 module for registration and provision of authentication key-pairs
- document annotation via the Graphite library
- an RDF quad store, 4store
- HTTP-based query (SPARQL) and retrieval of datasets via Puelia
The platform currently holds a reference dataset, showing the results of different versions of format identification tools running against an open data set.
It is intended that SCAPE data sets cited in publications will be made publicly available through the DPP, allowing others to study and reproduce the results.
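Since datasets in the quad store are queried over HTTP via SPARQL, a query against the reference dataset might look like the sketch below. The ex: vocabulary and the endpoint URL in the trailing comment are illustrative assumptions only; the DPP's actual predicates are determined by the LDS3 specification and the loaded data.

```shell
# Hypothetical SPARQL query listing identification results per tool
# version. The ex: vocabulary is an invented placeholder, not the
# DPP's real schema.
QUERY='
PREFIX ex: <http://example.org/scape#>
SELECT ?file ?format ?toolVersion
WHERE {
  ?result ex:file ?file ;
          ex:identifiedAs ?format ;
          ex:toolVersion ?toolVersion .
}
LIMIT 10'
echo "$QUERY"

# Against a live endpoint the query would be submitted over HTTP, e.g.:
# curl -G "$ENDPOINT/sparql" --data-urlencode "query=$QUERY"
```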
Source code
The SCAPE-specific patches to the Puelia project for the DPP can be found on GitHub.
The code used to process and load the results of the format identification experiments is also on GitHub.
Public Example
The results of the format identification experiments, showing how format identification results vary with tool version and signature file version, are available online.
Characterisation of office documents on the Azure platform
Work done by Microsoft Research Cambridge.
- A short description and a link to a longer description if possible.
Over the first 18 months Microsoft Research Cambridge has designed and implemented an Azure based architecture with functions that support a four step workflow for batch-mode document conversion: ingest and characterization of document collections, conversion, comparison, and reporting.
Initially, we focused on the conversion of common proprietary formats into XML-based formats. However, we expanded the set of converters to support representations of documents in multiple formats in order to increase the value users can derive from digital content: value increases with use in multiple scenarios and on multiple platforms.
- Links to any available source and release tags, or a description of why none can be provided:
- A description of the state of the system as of the 31st of January. This stands in place of a release and should be read as such.
We have implemented a Web User Interface to SCAPE Azure services to support batch mode content conversion. SCAPE Azure v.2.0 is designed to record data processing information and use it to provide reporting services.
One important aspect of the UI is the comparison tool, which enables the user to view a representation of both the original and the converted document, typically the XPS representation of each.
Information about the differences between the two documents is derived from an analysis of the rendered documents and indicated on the document viewers. Furthermore, statistics on various aspects of the documents are presented and compared.
The SCAPE Azure portal is still under development; it will be made available to SCAPE partners soon, though the date is still TBD.
Evaluation and improvements of existing identification and characterisation tools
This work was done by the British Library, the Österreichische Nationalbibliothek, and Statsbiblioteket. Each institution had a tool to package and release. For each tool a SCAPE tool specification was created; that specification was then used as input to the SCAPE tool wrapper, which produced as output a Debian package and a Taverna workflow incorporating the tool.
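To give a feel for what such a tool specification contains, here is a schematic sketch. The element and attribute names are approximations for illustration only; the authoritative schema and real examples live in the openplanets/scape-toolspecs repository linked below.

```xml
<!-- Schematic toolspec sketch; element names are approximate. -->
<tool name="droid" version="6.1">
  <operations>
    <operation name="identify-folder2csv">
      <description>Recursively identify all files in a folder and write a CSV report</description>
      <command>droid-folder2csv.sh ${input} ${output}</command>
    </operation>
  </operations>
</tool>
```

The tool wrapper reads such a description and generates both the Debian packaging and a Taverna activity around the named command.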
Uploading a Debian package to deb.openplanetsfoundation.org is described in the SCAPE Developer Guide: http://wiki.opf-labs.org/display/SP/Submitting+Your+Package.
Apache Tika
Work done by British Library.
1. Apache Tika is a toolkit that detects and extracts metadata and structured text content from documents using parser libraries.
Three Debian packages have been produced for the Tika tool. Two of these are wrapper packages created from toolspecs using the ToolWrapper tool:
digital-preservation-characterisation-tika-app-parse2text
This takes a single input parameter, usually a filename* or URL, and outputs the extracted text content to standard output.
Example
digital-preservation-characterisation-tika-app-parse2text -i <filename>
*The input parameter may also be a Tika command line option e.g. “-i --g” will start the Tika GUI, “-i --V” will show the version number. Refer to the Apache Tika documentation for all options.
digital-preservation-characterisation-tika-app-wrapper
This is a more general version of the above which allows additional parameters, using the --p option, to be specified e.g.
Example
digital-preservation-characterisation-tika-app-wrapper -i <filename> -p -d
This outputs just the type of the specified input file e.g. text/plain.
See point 6 for the location of the toolspecs for the above packages.
Both wrapper packages are dependent on the third Debian package, tika-app-cli, which is a wrapper for the Tika command line interface (see point 5).
2. The homepage for Apache Tika is http://tika.apache.org/
3. The source code can be found at http://tika.apache.org/download.html and https://github.com/openplanets/tika
4. Release tag https://svn.apache.org/repos/asf/tika/tags/1.2/
5. A Debian package, tika-app-cli, has been created to wrap the Tika jar file, https://github.com/openplanets/scape/tree/master/pc-as/debians/tika.
This consists of the Tika jar and a shell script, executeJar.sh that runs Tika.
6. Two toolspecs have been created,
7. The three Debian packages are called
tika-app-cli_1.2_all.deb
digital-preservation-characterisation-tika-app-parse2text_1.0_all.deb
digital-preservation-characterisation-tika-app-wrapper_1.0_all.deb
and can be found at <to be added>
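The executeJar.sh script mentioned in point 5 is essentially a thin launcher around the Tika jar. A minimal sketch of such a wrapper is shown below; the jar path is an assumption, so refer to the linked repository for the actual script.

```shell
# Write a sketch of a jar-launching wrapper in the spirit of
# executeJar.sh. The jar location below is an assumed path, not the
# one used by the real tika-app-cli package.
cat > /tmp/executeJar-sketch.sh <<'EOF'
#!/bin/sh
# Pass all command-line arguments straight through to the Tika jar.
exec java -jar /usr/share/java/tika-app-1.2.jar "$@"
EOF
chmod +x /tmp/executeJar-sketch.sh
cat /tmp/executeJar-sketch.sh
```

The wrapper packages then only need to supply the appropriate arguments (e.g. -d for type detection) to this launcher.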
DROID
Work done by Österreichische Nationalbibliothek
- DROID is a software tool developed by The National Archives to perform automated batch identification of file formats. DROID (Digital Record Object IDentification) is a tool that attempts to identify digital objects using PRONOM format signatures (‘magic numbers’) and/or known file extensions. The identification results are reported as PRONOM-compliant Persistent Unique Identifiers (PUIDs). DROID is an open-source, platform-independent Java application. It can be used directly from the command line, or, alternatively, using a graphical user interface.
Up to this checkpoint, DROID has been used in stand-alone mode, without the SCAPE Platform; only the command-line version of the application with its standard interfaces has been used. To enable integration with the Taverna workflow design and execution workbench, scripts have been created that combine the several DROID execution steps necessary to identify single files, and that also allow directories to be traversed recursively.
- The source code and further information about the project can be found at http://www.nationalarchives.gov.uk/information-management/projects-and-work/droid.htm
- The sources of the scripts for identifying the content of a folder recursively using the DROID tool can be found at http://digital-preservation.github.com/droid/
- The DROID release used for this work is https://github.com/digital-preservation/droid/tree/droid-6.1. A list of all DROID tags is available at https://github.com/digital-preservation/droid/tags
- The commit at the point of releasing this package on GitHub is https://github.com/openplanets/scape-toolspecs/commit/c09b59a69618d4d0815a0598200d9a25adb23ce5
- The script that executes the Droid identification and which is packaged for the SCAPE Platform is available at https://github.com/openplanets/scape-toolspecs/blob/master/digital-preservation-identification-droid-folder2csv.sh
- The tool description that allows packaging the Droid tool for the SCAPE Platform is available at https://github.com/openplanets/scape-toolspecs/blob/master/digital-preservation-identification-droid-folder2csv.xml
- A Debian package built according to the tool specification file can be found at http://deb.openplanetsfoundation.org/pool/main/d/droid/
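The folder2csv script emits one CSV row per identified file. As a sketch of how such a report might be summarised downstream, the fragment below tallies files per PUID; the columns and rows are simplified, made-up sample data rather than real DROID output.

```shell
# Simplified, made-up sample of a folder2csv-style report; real DROID
# CSV output carries additional columns (URI, method, MIME type, ...).
cat > /tmp/droid-sample.csv <<'EOF'
FILE_PATH,PUID,FORMAT_NAME
/data/a.pdf,fmt/18,Acrobat PDF 1.4
/data/b.pdf,fmt/18,Acrobat PDF 1.4
/data/c.tif,fmt/353,TIFF
EOF

# Count identified files per PRONOM PUID, skipping the header row.
awk -F, 'NR > 1 { count[$2]++ } END { for (p in count) print p, count[p] }' /tmp/droid-sample.csv
```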
ffprobe
Work done by Statsbiblioteket
- Short description: ffprobe gathers information from multimedia streams and prints it in human- and machine-readable form.
- Home page of the tool:
- Source of the tool: git clone git://source.ffmpeg.org/ffmpeg.git ffmpeg
- Release tag used for this package:
- Binary of the tool used for this package: we used the OS-specific package management system to obtain the release of ffmpeg mentioned above, albeit with the caveat noted in the last item below.
- Tool spec:
- SCAPE Debian package (if available on deb.openplanetsfoundation.org): XML-formatted output first appeared in version 0.10 of ffprobe. Debian Squeeze ships version 0.SVN92, which lacks this output format. Until we have a robust method of making the required version of ffprobe available, we will not release a Debian package to the repository.
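Once a sufficiently recent ffprobe is available, its XML output can be post-processed with standard tools. The snippet below operates on a hand-written approximation of ffprobe's -print_format xml -show_streams output; the stream values are made-up sample data.

```shell
# Hand-written approximation of ffprobe XML stream output; the codec
# values are made-up sample data.
cat > /tmp/ffprobe-sample.xml <<'EOF'
<ffprobe>
  <streams>
    <stream index="0" codec_name="h264" codec_type="video"/>
    <stream index="1" codec_name="aac" codec_type="audio"/>
  </streams>
</ffprobe>
EOF

# Extract one codec name per stream. A crude grep is used here; a
# real workflow would prefer an XML-aware tool such as xmlstarlet.
grep -o 'codec_name="[^"]*"' /tmp/ffprobe-sample.xml | cut -d'"' -f2
# prints:
# h264
# aac
```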