Tell me about this digital object…
Does it contain known preservation risks?
Is it valid by the spec? Or my profile spec?
Preservation Action Components
Transform this digital object into this format…
Repair links, or remove preservation risks
Quality Assurance Components
Assess the differences between these two objects…
Assess the Preservation Actions
We want to reliably wrap command-line and Java-based tools in a lightweight manner, and then re-use the same wrapping at multiple levels:
- Local tool invocation - for development and testing.
- Distributed tool invocation - on the full SCAPE platform.
- Remote web service access - for Taverna and the testbeds.
- Remote RESTful access - for future Taverna and broad integration.
We aim to achieve this by writing simple structured documents that describe how to invoke a particular tool for a particular purpose - called tool specs. These provide command-line templates that can be re-used across the different contexts. Any extra work required to make the invocation succeed in a particular context will be written as deployment wrappers/client code that only needs to be modified per context and action type, not per tool.
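A minimal sketch of the template-expansion idea behind tool specs — the `%{...}` placeholder syntax, class name, and method are illustrative assumptions, not the actual tool spec format:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: expand a tool-spec command-line template. */
public class ToolSpecTemplate {
    // Assumed placeholder syntax: %{name}
    private static final Pattern PARAM = Pattern.compile("%\\{(\\w+)\\}");

    /** Substitute named arguments into the template, failing on any missing parameter. */
    public static String expand(String template, Map<String, String> args) {
        Matcher m = PARAM.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = args.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Missing parameter: " + m.group(1));
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> a = new HashMap<>();
        a.put("input", "in.tiff");
        a.put("output", "out.jp2");
        // Expands to: convert in.tiff out.jp2
        System.out.println(expand("convert %{input} %{output}", a));
    }
}
```

The point is that the same expanded command line can then be executed locally, scheduled on the platform, or run behind a web service, with only the surrounding wrapper differing per context.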
To ensure that the semantics of the operations are understood across the different contexts, we build upon the Planets model of standardising the interfaces to the different actions, so that clients can consume a wide range of tools more easily. This is not strictly necessary for Taverna, but is very useful in the other contexts. However, if we simply adopted the same custom-parameter system as under Planets, this would actively work against Taverna, as its existing WSDL-probing code would not be able to expose the web service's parameters. Thus, on the web-service layer, we will make the WSDL expose the parameters more explicitly. Due to the way most web service frameworks are designed, this involves writing code that generates code, but this complexity can be minimised by re-using code templates. Fortunately, allowing extensible parameters and results is somewhat easier in the other contexts.
One of the most complex issues in the original Planets framework was the way we handled the actual digital object payload itself. We encouraged data to be passed directly to the service, and did not fully explore the options for passing data by reference. In SCAPE, we will pass by reference (URI) by default, and only support passing by value if we find we must. By adopting standard references (URIs), we can defer the optimal data handling to the scheme and so allow the set of supported schemes to vary over time without modifying the interfaces themselves. Initial schemes to support will be file:, data: (thus allowing small-scale pass-by-value), http:, https:, and some kind of HBase and/or HDFS scheme for the Platform context.
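The scheme-based dispatch described above could look roughly like this — a sketch only, assuming a hypothetical `UriResolver` helper; `data:` has no built-in Java stream handler, so it is parsed by hand, while `file:`/`http(s):` fall through to the standard URL machinery:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical sketch: resolve a pass-by-reference URI to a payload stream. */
public class UriResolver {

    public static InputStream open(URI uri) throws Exception {
        switch (uri.getScheme()) {
            case "data":
                // data:[<mediatype>][;base64],<data> -- minimal RFC 2397 parse
                String ssp = uri.getRawSchemeSpecificPart();
                int comma = ssp.indexOf(',');
                if (comma < 0) {
                    throw new IllegalArgumentException("Malformed data: URI");
                }
                String meta = ssp.substring(0, comma);
                String payload = ssp.substring(comma + 1);
                byte[] bytes = meta.endsWith(";base64")
                        ? Base64.getDecoder().decode(payload)
                        : URLDecoder.decode(payload, "UTF-8")
                                    .getBytes(StandardCharsets.UTF_8);
                return new ByteArrayInputStream(bytes);
            case "file":
            case "http":
            case "https":
                // Delegate to the JDK's built-in protocol handlers.
                return uri.toURL().openStream();
            default:
                // HBase/HDFS handlers would be registered here for the Platform context.
                throw new IllegalArgumentException("Unsupported scheme: " + uri.getScheme());
        }
    }
}
```

Because the dispatch lives in one place, adding a platform-specific scheme later means extending this resolver, not changing any action interface.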
We have defined the command-line as the primary interface, and will map each action onto a Java interface, and then generate the clients and the WSDL and REST services from them. One tricky point here is how the Java interfaces should be handed the bitstream (or bitstreams). For simple bitstreams, one could pass them as a local File or as an InputStream. The former has the advantage of being seekable, and can be used to define JHOVE2 'clumps', as there is a directory concept at this level. However, in the case where the source is not a File (e.g. retrieved over some protocol), an InputStream is a much more appropriate mapping. It also means that the remote file does not have to be fully retrieved before processing can begin. Similarly, command line tools can be wrapped as pipes or as operating on files, but which is optimal will depend on the context.
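One way to offer both mappings without duplicating logic is to make the stream form primary and derive the file form from it — a sketch under the assumption of a hypothetical `Identify` action returning a format URI:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

/** Hypothetical sketch: one action, two payload mappings. */
public interface Identify {

    /** Stream form: works for non-file sources and allows processing to start early. */
    URI identify(InputStream stream) throws IOException;

    /** File convenience form: seekable, directory-aware callers can use this. */
    default URI identify(File file) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            return identify(in);
        }
    }
}
```

Implementations that genuinely need seekability (or the surrounding directory) would instead override the File form directly; the default just keeps simple implementations to a single method.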
The way we handle multi-bitstream entities may be the critical case that we design around. How would this work over HTTP or HBase? Can it work at all? Where is the 'folder', or do those clumps only make sense on filesystems? If so, how are we supposed to pass a 'folder' to a remote service?
For Validation, see Unicorn http://code.w3.org/unicorn/wiki/Documentation/Observer
And they may be planning a validation appliance, http://lists.w3.org/Archives/Public/www-validator/2010Oct/0028.html
Hadoop data integration
Types & Interfaces
CLI and Java interfaces
Extensible Java method signatures & CLI templates
e.g. Identify must accept at least a digital object, and return at least a URI
Extra parameters may exist, but must have sane defaults
More flexible than in Planets
But tight enough that clients can call easily
Coded for local data and/or streams
More constrained than ‘vanilla’ Taverna use
Should align with Taverna Component efforts
Standard, extensible interfaces
Standard processes may include:
Identify, Characterize, Validate
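The "extra parameters may exist, but must have sane defaults" rule above could be enforced with a simple merge of caller-supplied values over declared defaults — a sketch only; the parameter names and defaults are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: optional action parameters with sane defaults. */
public class IdentifyDefaults {

    // Hypothetical defaults a tool spec might declare for an Identify action.
    private static final Map<String, String> DEFAULTS = new HashMap<>();
    static {
        DEFAULTS.put("maxBytes", "65536");  // only sniff the first 64K by default
        DEFAULTS.put("algorithm", "magic"); // illustrative default strategy
    }

    /** Merge caller-supplied parameters over the declared defaults. */
    public static Map<String, String> effectiveParams(Map<String, String> supplied) {
        Map<String, String> merged = new HashMap<>(DEFAULTS);
        if (supplied != null) {
            merged.putAll(supplied);
        }
        return merged;
    }
}
```

A client that knows nothing about the tool can pass no parameters at all and still get a valid invocation, which is what makes the interfaces "tight enough that clients can call easily".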
We should document the logic on the SCAPE wiki.
Deployment helpers wrap this up to make it deploy in different contexts
CLI Invoker for local development and testing
JAX-RS RESTful service mapping
Also wrap benchmarking code around invocation
Interoperability: Data Handling
Planets defaulted to pass-by-value
SCAPE will default to pass-by-reference (URI)
Leverage URI schemes to delegate issues like encoding and authentication to the transport layer.
More modular design, leveraging standard transports.
Java/CLI will expect local files or streams
Wrapper layer handles retrieving items via URI
Separation of concerns – wrapper could support e.g. HTTP(S), SMB/CIFS, FTP, SFTP/SCP, HBase URI, etc.
May modify or re-use JHOVE2 Source/Input objects
Interoperability: Data Formats
The required arguments passed to tools will be standardized via Java/JAXB
e.g. JHOVE2 property tree as Characterization result, mapped to and from XML
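For illustration, a JAXB-serialized characterization result might look roughly like this — the element and property names below are hypothetical placeholders, not the actual JHOVE2 schema:

```xml
<!-- Hypothetical sketch only: names are illustrative, not the JHOVE2 schema. -->
<characterizationResult>
  <source uri="http://example.org/objects/1234.tiff"/>
  <property uri="info:example/property/format" name="format">image/tiff</property>
  <property uri="info:example/property/imageWidth" name="imageWidth">2480</property>
  <property uri="info:example/property/wellFormed" name="wellFormed">true</property>
</characterizationResult>
```

Since each property carries its own URI, the same tree maps naturally onto an RDF representation as noted later.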
Some other concepts will also need standardization
Service description for discovery (WADL?)
The optional arguments need a declaration
Format identifiers for supported input/output formats
Passed through the TCC to review and disseminate
Interoperability: Sharing Concepts
Common concepts shared on the SCAPE wiki
Tool interface definitions
Both linked to the source code, headed for the JavaDoc
A SCAPE/OPF Registry
First understand what we really need for tool discovery and use, based on initial integration plan.
Then mix-in wider integration issues.
Define only format identifiers, or do more?
Track and merge with UDFR effort? Now or later?
CC Development Plan
Develop FITS, DROID, file etc.
For identification (including conflict resolution via FITS) and brief characterization
Do not support compound objects well
Develop JHOVE2 modules
For deep characterization, profile analysis, etc.
Supports compound objects
FITS as a JHOVE2 identification module?
CC Integrated Deployment
FITS and JHOVE2 have CLI interfaces; wrap them as tool specs
Source URI in, Properties out
Property data to follow JHOVE2 form
e.g. normalize output using the JHOVE2 property language
Properties have URIs
RDF approach is compatible
CC Validation Interface
Re-use JHOVE2 assessment language for profile validation, if appropriate
If we need Validation over REST, consider re-using the W3C Unicorn Validator interface: http://code.w3.org/unicorn/wiki/Documentation/Observer
PA Integration Plan
Develop as standalone tools
Improving existing tools or making new ones
Initially Web Services
As Sven has been doing
Wrap the standalone tool in a tool spec, which specifies input and output formats, etc.
Use src parameter to pass input & create new resource
Return alone or with a report via Content Negotiation
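A sketch of the resulting HTTP exchange — the endpoint, parameter names, and URIs are all hypothetical; the point is that the `src` parameter passes the input by reference, and the Accept header negotiates between the migrated object alone and a wrapping report:

```http
POST /migrate?src=http://example.org/objects/1234.tiff&format=image/jp2 HTTP/1.1
Accept: application/xml

HTTP/1.1 201 Created
Location: http://example.org/results/5678.jp2
Content-Type: application/xml
```

Sending `Accept: image/jp2` instead would return the converted payload directly, with no report.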
QA Integration Plan
Develop standalone tools
Improving existing tools or making new ones
Re-use JHOVE2 property language for comparative properties.
Re-use JHOVE2 assessment language for profile validation?
RESTful Compare interface
Two URIs in: src1 & src2
Properties out: re-using JHOVE2 model.
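The Compare interface above might look like this on the wire — a sketch with hypothetical endpoint and example URIs; the response body would carry the comparative properties in the JHOVE2-style property form:

```http
GET /compare?src1=http://example.org/a.tiff&src2=http://example.org/b.jp2 HTTP/1.1
Accept: application/xml

HTTP/1.1 200 OK
Content-Type: application/xml
```

Both inputs are passed by reference, consistent with the data-handling approach adopted for the other services.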