Current by Andrew Jackson
on Jan 06, 2012 11:26.

EARLY DRAFT/NOTES.

h2. Tool Integration

{color:#333333}Characterization Components{color}

{color:#333333}Tell me about this digital object…{color}
{color:#333333}Does it contain known preservation risks?{color}
{color:#333333}Is it valid by the spec? Or my profile spec?{color}

{color:#333333}Preservation Action Components{color}
{color:#333333}Transform this digital object into this format…{color}
{color:#333333}Repair links, or remove preservation risks{color}

{color:#333333}Quality Assurance Components{color}
{color:#333333}Assess the differences between these two objects…{color}
{color:#333333}Assess the Preservation Actions{color}

We want to reliably wrap command-line and Java-based tools in a lightweight manner, and then re-use the same wrapping at multiple levels.

* Local tool invocation - for development and testing.
* Distributed tool invocation - on the full SCAPE platform.
* Remote web service access - for Taverna and the testbeds.
* Remote RESTful access - for future Taverna and broad integration.



We aim to achieve this by writing simple structured documents that describe how to invoke a particular tool for a particular purpose - called tool specs. These provide command-line templates that can be re-used across the different contexts.  Any extra work required to make the invocation succeed in a particular context will be written as deployment wrappers/client code that only needs to be modified per context and action type, not per tool.
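As a rough illustration of the command-line template idea, the sketch below expands {{$\{name\}}} placeholders into an argument list. The class name, the placeholder syntax, and the example tool and flags are all assumptions made for illustration; the actual tool spec format is not defined here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: expands a tool-spec command-line template into an argument list. */
final class ToolSpecTemplate {
    // Assumed placeholder syntax: ${name}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([A-Za-z0-9_]+)\\}");

    private final String template;

    public ToolSpecTemplate(String template) {
        this.template = template;
    }

    /**
     * Splits the template on whitespace first, then substitutes placeholders
     * inside each token, so a value containing spaces stays a single argument.
     */
    public List<String> expand(Map<String, String> values) {
        List<String> args = new ArrayList<>();
        for (String token : template.trim().split("\\s+")) {
            Matcher m = PLACEHOLDER.matcher(token);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                String value = values.get(m.group(1));
                if (value == null) {
                    throw new IllegalArgumentException("No value for placeholder: " + m.group(1));
                }
                // quoteReplacement stops '$' and '\' in the value being treated specially
                m.appendReplacement(sb, Matcher.quoteReplacement(value));
            }
            m.appendTail(sb);
            args.add(sb.toString());
        }
        return args;
    }
}
```

The resulting list could be handed to {{java.lang.ProcessBuilder}} for local invocation, while other deployment wrappers could fill the same template with context-specific paths.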

To ensure that the semantics of the operations are understood across the different contexts, we build upon the Planets model of standardising the interfaces to the different actions, so that clients can consume a wide range of tools more easily. This is not strictly necessary for Taverna, but is very useful in the other contexts. However, if we adopted exactly the same custom-parameter system as in Planets, it would actively work against Taverna, because its existing WSDL-probing code would not be able to expose the web service parameters. Thus, at the web-service layer, we will make the WSDL expose the parameters more explicitly. Because of the way most web service frameworks are designed, this involves writing code that generates code, but the complexity can be minimised by re-using code templates. Fortunately, allowing extensible parameters and results is somewhat easier in the other contexts.

[http://fue.onb.ac.at/scape-services/services/SCAPEOpenJPEG14Service?wsdl]


One of the most complex issues in the original Planets framework was the way we handled the actual digital object payload itself. We encouraged data to be passed directly to the service, and did not fully explore the options for passing data by reference. In SCAPE, we will pass by reference (URI) by default, and only support passing by value if we find we must. By adopting standard references (URIs), we can defer the optimal data handling to the scheme and so allow the set of supported schemes to vary over time without modifying the interfaces themselves. Initial schemes to support will be file:, data: (thus allowing small-scale pass-by-value), http:, https:, and some kind of HBase and/or HDFS scheme for the Platform context.
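One possible shape for the scheme dispatch is sketched below. The class and method names are hypothetical. The data: handling is a simplified reading of RFC 2397 (base64 or percent-encoded ASCII payloads only), and the HBase/HDFS schemes are left out because they are not yet defined.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.Locale;

/** Hypothetical sketch: opens a digital object by reference, dispatching on the URI scheme. */
final class UriResolver {

    public static InputStream open(URI uri) throws IOException {
        String scheme = uri.getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("URI has no scheme: " + uri);
        }
        switch (scheme.toLowerCase(Locale.ROOT)) {
            case "file":
                return Files.newInputStream(Paths.get(uri));
            case "data":
                // Small-scale pass-by-value: the payload is carried in the URI itself
                return new ByteArrayInputStream(decodeDataUri(uri));
            case "http":
            case "https":
                return uri.toURL().openStream();
            default:
                // New schemes can be added here without changing the caller's interface
                throw new IllegalArgumentException("Unsupported scheme: " + scheme);
        }
    }

    /** Decodes the payload of a data: URI (base64 or percent-encoded). */
    public static byte[] decodeDataUri(URI uri) {
        String ssp = uri.getRawSchemeSpecificPart();
        int comma = ssp.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("Malformed data: URI");
        }
        String header = ssp.substring(0, comma);
        String payload = ssp.substring(comma + 1);
        if (header.endsWith(";base64")) {
            return Base64.getDecoder().decode(payload);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < payload.length(); i++) {
            char c = payload.charAt(i);
            if (c == '%' && i + 2 < payload.length()) {
                out.write(Integer.parseInt(payload.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c); // assumes the unescaped characters are ASCII
            }
        }
        return out.toByteArray();
    }
}
```

Because callers only ever see an {{InputStream}}, the set of supported schemes can grow inside the {{switch}} without touching the action interfaces.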

We have defined the command-line as the primary interface, and will map each action onto a Java interface, and then generate the clients and the WSDL and REST services from them. One tricky point here is how the Java interfaces should be handed the bitstream (or bitstreams). A simple bitstream could be passed either as a local File or as an InputStream. The former has the advantage of being seekable, and can be used to define JHOVE2 'clumps', as there is a directory concept at this level. However, where the source is not a File (e.g. it is retrieved over some protocol), an InputStream is a much more appropriate mapping. It also means that the remote file does not have to be fully retrieved before processing can begin. Similarly, command-line tools can be wrapped as pipes or as operating on files, but which is optimal will depend on the context.
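Where a tool can only operate on files but the source arrives as a stream, a wrapper could spool the stream to a temporary file first, at the cost of losing the streaming benefit. A minimal sketch, with a hypothetical class and method name:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch: bridges a stream source to a tool that needs a seekable local File. */
final class StreamSpooler {

    /** Copies the whole stream into a temporary file and returns that file. */
    public static File spoolToTempFile(InputStream in) throws IOException {
        File tmp = File.createTempFile("bitstream-", ".tmp");
        tmp.deleteOnExit(); // best-effort clean-up; a real wrapper should delete explicitly
        // createTempFile has already created the file, so it must be replaced
        Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }
}
```

The reverse mapping (File to InputStream) is trivial, which suggests keeping the stream form as the common denominator and spooling only when a tool requires it.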

The way we handle multi-bitstream entities may be the critical case that we design around. How would this work over http/hbase? Can it work? Where is the 'folder', or do those clumps only make sense on filesystems? If so, how are we supposed to pass a 'folder' to a remote service?

REST

For Validation, see Unicorn [http://code.w3.org/unicorn/wiki/Documentation/Observer]

[http://code.w3.org/unicorn/wiki/Documentation/Observer/Response]

[http://code.w3.org/unicorn/wiki/Documentation/Observer/Contract]

[http://code.w3.org/unicorn/wiki/Documentation/Observer/Tutorial]

And they may be planning a validation appliance, [http://lists.w3.org/Archives/Public/www-validator/2010Oct/0028.html]

Hadoop data integration

[http://www.cloudera.com/blog/2009/02/the-small-files-problem/|http://www.cloudera.com/blog/2009/02/the-small-files-problem/]

[http://www.exmachinatech.net/01/forqlift/|http://www.exmachinatech.net/01/forqlift/]

Taverna Notes

[http://www.mygrid.org.uk/dev/wiki/display/developer/Calling+external+commands+from+Taverna]


[http://code.google.com/p/taverna/source/browse/taverna/engine/net.sf.taverna.t2.activities/branches/maintenance/external-tool-activity/pom.xml|http://code.google.com/p/taverna/source/browse/taverna/engine/net.sf.taverna.t2.activities/branches/maintenance/external-tool-activity/pom.xml]

[http://www.mygrid.org.uk/maven/repository/net/sf/taverna/t2/activities/|http://www.mygrid.org.uk/maven/repository/net/sf/taverna/t2/activities/]

[http://www.mygrid.org.uk/maven/snapshot-repository/net/sf/taverna/t2/activities/external-tool-activity/|http://www.mygrid.org.uk/maven/snapshot-repository/net/sf/taverna/t2/activities/external-tool-activity/]

h3. {color:#000000}{*}Types & Interfaces{*}{color}

{color:#333333}CLI and Java interfaces{color}

{color:#333333}Extensible Java method signatures & CLI templates{color}
{color:#333333}e.g. Identify must accept at least a digital object, and return at least a URI{color}
{color:#333333}Extra parameters may exist, but must have sane defaults{color}
{color:#333333}More flexible than in Planets{color}
{color:#333333}But tight enough that clients can call easily{color}
{color:#333333}Coded for local data and/or streams{color}
{color:#333333}More constrained than ‘vanilla’ Taverna use{color}
{color:#333333}Should align with Taverna Component efforts{color}
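The "required inputs plus defaulted extras" idea above might look roughly like this in Java. The interface shape, the parameter name {{unknownFormat}}, and the {{urn:example:}} identifiers are all placeholders invented for this sketch, not agreed SCAPE definitions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.Collections;
import java.util.Map;

/** Hypothetical sketch of an extensible Identify interface. */
interface Identify {
    /** Full form: the required digital object plus optional, tool-specific parameters. */
    URI identify(InputStream digitalObject, Map<String, String> parameters) throws IOException;

    /** Minimal form: generic clients pass only the object and rely on the tool's defaults. */
    default URI identify(InputStream digitalObject) throws IOException {
        return identify(digitalObject, Collections.<String, String>emptyMap());
    }
}

/** Toy implementation that only recognises the "%PDF" magic bytes. */
final class MagicNumberIdentify implements Identify {
    // Placeholder identifiers; real format URIs are still to be agreed
    static final String PDF = "urn:example:format:pdf";
    static final String DEFAULT_UNKNOWN = "urn:example:format:unknown";

    @Override
    public URI identify(InputStream in, Map<String, String> parameters) throws IOException {
        byte[] head = new byte[4];
        int filled = 0;
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n < 0) {
                break; // stream ended before four bytes were available
            }
            filled += n;
        }
        boolean isPdf = filled == 4
                && head[0] == '%' && head[1] == 'P' && head[2] == 'D' && head[3] == 'F';
        // The optional parameter has a sane default, so callers may omit it entirely
        String unknown = parameters.getOrDefault("unknownFormat", DEFAULT_UNKNOWN);
        return URI.create(isPdf ? PDF : unknown);
    }
}
```

A client that knows nothing about the tool calls the one-argument form; a client that knows the tool can tune it through the parameter map without the interface changing.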

{color:#333333}Standard, extensible interfaces{color}

{color:#333333}Standard processes may include:{color}
{color:#333333}Identify, Characterize, Validate{color}
{color:#333333}Migrate/Transform/Convert{color}
{color:#333333}Compare, Assess{color}
{color:#333333}We should document the logic on the SCAPE wiki.{color}
{color:#333333}Deployment helpers wrap this up to make it deploy in different contexts{color}
{color:#333333}CLI Invoker for local development and testing{color}
{color:#333333}JAX-RS RESTful service mapping{color}
{color:#333333}Also wrap benchmarking code around invocation{color}
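Wrapping benchmarking around an invocation could be as simple as timing a {{Callable}} in the deployment helper, so that individual tools need no benchmarking code of their own. The class below is a hypothetical sketch that records wall-clock time only; memory or CPU accounting would need more than this.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch: pairs an invocation result with its wall-clock duration. */
final class TimedInvocation<T> {
    public final T result;
    public final long elapsedNanos;

    private TimedInvocation(T result, long elapsedNanos) {
        this.result = result;
        this.elapsedNanos = elapsedNanos;
    }

    /** Runs the action once and records how long it took. */
    public static <T> TimedInvocation<T> run(Callable<T> action) throws Exception {
        long start = System.nanoTime(); // monotonic clock, suitable for measuring intervals
        T result = action.call();
        return new TimedInvocation<>(result, System.nanoTime() - start);
    }
}
```

The same wrapper can sit around a local CLI call or a remote service call, since both reduce to a {{Callable}}.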

{color:#333333}Interoperability, data Handling{color}

{color:#333333}Planets defaulted to pass-by-value{color}
{color:#333333}Cumbersome, brittle.{color}
{color:#333333}SCAPE will default to pass-by-reference (URI){color}
{color:#333333}Leverage URI schemes to delegate issues like encoding and authentication to the transport layer.{color}
{color:#333333}More modular design, leveraging standard transports.{color}
{color:#333333}Java/CLI will expect local files or streams{color}
{color:#333333}Wrapper layer handles retrieving items via URI{color}
{color:#333333}Separation of concerns -- wrapper could support e.g. HTTP(S), SMB/CIFS, FTP, SFTP/SCP, HBase URI, etc.{color}
{color:#333333}May modify or re-use JHOVE2 Source/Input objects{color}

{color:#333333}Interoperability, Data Formats{color}

{color:#333333}The required arguments passed to tools will be standardized via Java/JAXB{color}
{color:#333333}e.g. JHOVE2 property tree as Characterization result, mapped to and from XML{color}
{color:#333333}Some other concepts will also need standardization{color}
{color:#333333}Service description for discovery (WADL?){color}
{color:#333333}The optional arguments need a declaration{color}
{color:#333333}Format identifiers for supported input/output formats{color}
{color:#333333}Passed through the TCC to review and disseminate{color}

{color:#333333}Interoperability: Sharing Concepts{color}

{color:#333333}Common concepts shared on the SCAPE wiki{color}
{color:#333333}Tool interface definitions{color}
{color:#333333}Data definitions{color}
{color:#333333}Both linked to the source code, headed for the JavaDoc{color}
{color:#333333}A SCAPE/OPF Registry{color}
{color:#333333}First understand what we really need for tool discovery and use, based on initial integration plan.{color}
{color:#333333}Then mix-in wider integration issues.{color}
{color:#333333}Define only format identifiers, or do more?{color}
{color:#333333}Track and merge with UDFR effort? Now or later?{color}

{color:#333333}CC Development Plan{color}
{color:#333333}Develop FITS, DROID, file etc.{color}
{color:#333333}For identification (including conflict resolution via FITS) and brief characterization{color}
{color:#333333}Do not support compound objects well{color}
{color:#333333}Develop JHOVE2 modules{color}
{color:#333333}For deep characterization, profile analysis, etc.{color}
{color:#333333}Supports compound objects{color}
{color:#333333}FITS as a JHOVE2 identification module?{color}

{color:#333333}CC Integrated Deployment{color}
{color:#333333}CLI{color}
{color:#333333}FITS and JHOVE2 have CLI interfaces, wrap as Tool Specs{color}
{color:#333333}REST API{color}
{color:#333333}Source URI in, Properties out{color}
{color:#333333}Property data to follow JHOVE2 form{color}
{color:#333333}e.g. normalize output using the JHOVE2 property language{color}
{color:#333333}Properties have URIs{color}
{color:#333333}RDF approach is compatible{color}

{color:#333333}CC Validation Interface{color}
{color:#333333}Format/profile validation{color}
{color:#333333}Re-use JHOVE2 assessment language for profile validation, if appropriate{color}
{color:#333333}RESTful version{color}
{color:#333333}If we need Validation over REST, consider re-using the W3C Unicorn Validator interface.{color} [http://code.w3.org/unicorn/wiki/Documentation/Observer]

PA Integration Plan
{color:#333333}Develop as standalone tools{color}
{color:#333333}Improving existing tools or making new ones{color}
{color:#333333}Initially Web Services{color}
{color:#333333}As Sven has been doing{color}
{color:#333333}CLI{color}
{color:#333333}Wrap standalone tool in Tool Spec, specifies input and output formats etc.{color}
{color:#333333}REST{color}
{color:#333333}Use src parameter to pass input & create new resource{color}
{color:#333333}Return alone or with a report via Content Negotiation{color}

{color:#333333}QA Integration Plan{color}
{color:#333333}Develop standalone tools{color}
{color:#333333}Improving existing tools or making new ones{color}
{color:#333333}Re-use JHOVE2 property language for comparative properties.{color}
{color:#333333}Re-use JHOVE2 assessment language for profile validation?{color}
{color:#333333}RESTful Compare interface{color}
{color:#333333}Two URIs in: src1 & src2{color}
{color:#333333}Properties out: re-using JHOVE2 model.{color}
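To make the "two sources in, properties out" shape concrete, the sketch below compares two bitstreams that the wrapper layer is assumed to have already retrieved from src1 and src2. It does not reproduce the JHOVE2 property model; the property URIs are {{urn:example:}} placeholders, and the class name is hypothetical.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a Compare action returning URI-identified properties. */
final class ByteCompare {
    // Placeholder property identifiers, invented for this example
    public static final URI IDENTICAL = URI.create("urn:example:property:identical");
    public static final URI SIZE_DIFFERENCE = URI.create("urn:example:property:sizeDifference");

    /** Compares two bitstreams and reports the result as a property map. */
    public static Map<URI, String> compare(byte[] src1, byte[] src2) {
        Map<URI, String> properties = new LinkedHashMap<>();
        properties.put(IDENTICAL, String.valueOf(Arrays.equals(src1, src2)));
        // Positive when src2 is larger than src1
        properties.put(SIZE_DIFFERENCE, String.valueOf(src2.length - src1.length));
        return properties;
    }
}
```

Keying the results by URI keeps them compatible with an RDF-style representation, and a REST layer would only need to serialise the map.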