Tools developed in other digital preservation projects, such as PLANETS and CRiB, do not cope well with all
aspects of large-scale processing. Fault tolerance and workflow provenance, for example, must be part of what
the SCAPE tools offer. This task adapts existing tools (identified in Task 1 of WP10) to be compatible with the
large-scale characteristics of the SCAPE platform.
Based on what is known about the PT characteristics, a general approach based on the toolwrapper was chosen. Each tool is described in a machine-readable language (XML, conforming to an XML schema), and from that single description different outputs can be generated for use in different contexts, such as the command line or web services. The generation is automated (scripts and other tools).
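The idea of generating a wrapper from a single description can be sketched in a few lines of shell. This is only an illustration and not the toolwrapper's actual implementation: the real toolwrapper reads an XML toolspec, whereas here the description is reduced to two hypothetical values (a wrapper name and a command template). The names generate_wrapper and demo-txt2upper are made up for the example.

```shell
#!/bin/sh
# Sketch only: the description is two plain strings instead of an XML toolspec.

# generate_wrapper NAME COMMAND_TEMPLATE TARGET_DIR
# Writes an executable wrapper script and prints its path.
# COMMAND_TEMPLATE may refer to "$1" (input file) and "$2" (output file).
generate_wrapper() {
    wrapper_path="$3/$1"
    {
        printf '#!/bin/sh\n'
        printf '# Wrapper generated from the description of %s\n' "$1"
        printf 'if [ "$#" -ne 2 ]; then\n'
        printf '    echo "usage: %s INPUT OUTPUT" >&2\n' "$1"
        printf '    exit 1\n'
        printf 'fi\n'
        printf '%s\n' "$2"
    } > "$wrapper_path"
    chmod +x "$wrapper_path"
    printf '%s\n' "$wrapper_path"
}

# Hypothetical description: a "migration" that upper-cases a text file.
out_dir=$(mktemp -d)
wrapper=$(generate_wrapper demo-txt2upper \
    'tr "[:lower:]" "[:upper:]" < "$1" > "$2"' "$out_dir")

# Run the generated wrapper on a small input file.
printf 'scape project\n' > "$out_dir/in.txt"
"$wrapper" "$out_dir/in.txt" "$out_dir/out.txt"
cat "$out_dir/out.txt"
```

The same description could feed other generators (for example one that emits a man page), which is the point of keeping the tool description separate from any single output format.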
New version of the toolwrapper: https://github.com/openplanets/scape/tree/master/pc-as/toolwrapper/
For now, only bash wrappers are generated. They are distributed as Debian packages, each containing the bash wrapper, a man page and a single-step Taverna workflow.
These installation notes were tested on Debian 6.0.5.
1) Add the KEEPS Debian package repository (the final versions of the tools will also be added to the OPF Debian repository):
2) Add the debian-multimedia repository (needed to install handbrake-cli):
3) Update the list of packages known by apt (to add the packages from the recently added repositories):
4) Install all migration tools (using a metapackage):
5) See which migration tools have been installed:
Option 1 (use bash completion or a similar shell feature):
Option 2 (use apt-cache):
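Both options can be sketched as follows. The sketch assumes that every wrapper and package name starts with the prefix digital-preservation-migration- (as in the example tool name used further down this page). The helper names list_wrappers_on_path and list_packages are hypothetical. Scanning PATH approximates what bash shows when TAB is pressed after typing the prefix.

```shell
#!/bin/sh
# Assumed common prefix of the migration wrappers and their packages.
PREFIX='digital-preservation-migration-'

# Option 1 equivalent: print every executable on PATH whose name starts with PREFIX.
list_wrappers_on_path() {
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        IFS=$old_ifs
        # An empty PATH entry means the current directory.
        for candidate in "${dir:-.}"/"$PREFIX"*; do
            if [ -f "$candidate" ] && [ -x "$candidate" ]; then
                basename "$candidate"
            fi
        done
    done
    IFS=$old_ifs
}

# Option 2: ask apt for package names that start with PREFIX.
list_packages() {
    if command -v apt-cache >/dev/null 2>&1; then
        apt-cache search --names-only "^$PREFIX"
    else
        echo "apt-cache not available on this system" >&2
        return 1
    fi
}

# Prints nothing if no wrapper with this prefix is installed.
list_wrappers_on_path | sort -u
```

Note that apt-cache lists packages known to apt, installed or not, while the PATH scan only shows wrappers that are actually present on the system.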
1) Create a PDF containing the sentence "SCAPE project":
2) Create a PDF containing the sentence "SCAPE project" with a single command using pipes:
3) Test the identity function using PDFBox, i.e., create a PDF from the sentence "SCAPE project" and convert the resulting PDF back into text, so that the original text can be compared with the text extracted from the PDF:
Note: At first glance, the original text and the output of the previous command look equal, but they are not. Piping the result to "od -c" shows that a space and a line ending were added, in this case by PDFBox while converting text to PDF.
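The difference is easy to miss on a terminal, so a byte-level comparison helps. The sketch below does not call the wrappers themselves. It uses two files as stand-ins: one with the original sentence and one with the sentence plus the extra space and line ending described in the note. The file names and the same_bytes helper are illustrative.

```shell
#!/bin/sh
work_dir=$(mktemp -d)

# Original text, written without a trailing newline.
printf 'SCAPE project' > "$work_dir/original.txt"
# Stand-in for the text extracted from the PDF: one extra space and a line ending.
printf 'SCAPE project \n' > "$work_dir/roundtrip.txt"

# same_bytes FILE_A FILE_B: succeeds only if both files are byte-for-byte identical.
same_bytes() {
    cmp -s "$1" "$2"
}

if same_bytes "$work_dir/original.txt" "$work_dir/roundtrip.txt"; then
    echo "identical"
else
    echo "different"
    # od -c prints every byte, so the added space and \n become visible.
    od -c "$work_dir/original.txt"
    od -c "$work_dir/roundtrip.txt"
fi
```

Comparing through files or pipes matters here, because shell command substitution strips trailing newlines and would hide part of the difference.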
Workflows are located under /usr/share/doc/, each in a folder named exactly after its Debian package.
Therefore, for an action tool called digital-preservation-migration-office-abiword-doc2html, the workflow can be retrieved by issuing the following shell command:
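A minimal sketch of such a lookup, assuming only what is stated above (one folder under /usr/share/doc/ named after the package). The exact file name of the workflow is not given here, so the sketch lists the folder's contents rather than guessing it. DOC_ROOT, workflow_dir and list_workflow_files are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Root folder for package documentation; overridable for testing.
DOC_ROOT=${DOC_ROOT:-/usr/share/doc}

# workflow_dir PACKAGE: print the folder that should hold the package's workflow.
workflow_dir() {
    printf '%s/%s\n' "$DOC_ROOT" "$1"
}

# list_workflow_files PACKAGE: list the files shipped in that folder, if it exists.
list_workflow_files() {
    dir=$(workflow_dir "$1")
    if [ -d "$dir" ]; then
        find "$dir" -type f
    else
        echo "no such folder: $dir" >&2
        return 1
    fi
}

# Print the folder for the example tool named above.
workflow_dir digital-preservation-migration-office-abiword-doc2html
```

On a machine where the package is installed, calling list_workflow_files with the same package name shows the files in that folder, including the Taverna workflow.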
- Develop a script that publishes tools to the SCAPE component catalogue based on their toolspecs.
- Improve the wrapped tools by adding more parameters (only a few tools currently accept parameters, because parameters were not described in the tool descriptions).
- Wrap more action tools, especially CC and QA tools.
- Perform a gap analysis to obtain metrics for assessing the need for new or improved tools.