Deprecated
This wiki page reflects a development version of D2.2. It has been superseded by D2.3

Executive Summary

This report provides a detailed outline of the functional components that comprise the architecture of the SCAPE project, in particular trying to ascertain the interfaces required between each of these functional entities.  It should be read in conjunction with the Technical Implementation Guidelines [D2.1], which define project-wide technologies, coding best practices and recommendations for development environments and tools, and also in conjunction with other SCAPE deliverables as referenced throughout this document.  This report is primarily aimed at those wishing to get an understanding of the general architecture of the SCAPE platform and how all the various development activities combine together, but it also serves as a means to capture the direction of development and ensure all developers understand the interconnections surrounding their work.


Introduction

The SCAPE project is developing scalable tools, services and infrastructure for the efficient planning and execution of preservation strategies for large-scale, heterogeneous collections of complex digital objects. Through this, digital preservation state-of-the-art will be enhanced in three ways:

  • by developing infrastructure and tools for scalable preservation actions; 
  • by developing a framework for automated, quality assured preservation workflows; 
  • by integrating these components with a policy-based preservation planning and watch system. 

To achieve these advances, work is broadly divided into a number of key sub-projects: Planning and Watch, Platform development, and Preservation Components development. These are coupled with validation of the developments through three testbed repositories.

Planning and Watch has two main aims: firstly, to provide a mechanism for developing, monitoring, executing and evolving preservation plans; secondly, to provide the mechanisms to monitor the content itself, designated user communities and other systems in order to provide actionable triggers for the planning component.

Platform development is building the infrastructure, aiming to enhance the computational throughput and storage capacity of digital object management systems by varying the number of compute nodes available in such systems. Such parallel approaches to data processing are imperative, as sequential processing cannot handle large data sets in a reasonable time.

Finally, the Preservation Components sub-project is developing and enhancing tools and preservation action services to meet the scalability challenges and address the needs of the SCAPE platform architecture.

This report aims to provide an overview and direction of the technical status of the SCAPE project, giving overview details about the main components and how they are expected to interact. In particular it aims to identify the main interface points between the various components based on the architecture as understood at the time of writing. It does not attempt to capture all background knowledge and experimentation used to drive the technical choices; instead, associated background documents and reports are referenced where appropriate.

The Project Architecture chapter provides a component-oriented overview of the SCAPE project architecture, highlighting the main functional components along with the expected interfaces between them. Each of these main functional components is then described in further detail in the subsequent chapters, providing relevant technical information, where known, and referencing specifications where these have already been defined. In particular, each component's main interfaces are described, details are given about how it is expected to be packaged and deployed, and a brief overview is provided of future SCAPE milestones and deliverables of relevance.

Open Source Development

Where possible, and where it makes sense, existing software and tools shall be developed and enhanced to add the functionality required by SCAPE. Changes should be offered back to the original development branch of the software, where it makes sense to do so, in order to encourage wider community support for maintaining the enhancements and to enable them to persist beyond the scope of the SCAPE project.  As a good example, SCAPE have already enhanced Apache Tika and successfully pushed these enhancements back into the main Apache Tika release.

Some enhancements may not be pertinent for such wider release, however; in particular, preservation-specific enhancements which are not aligned with the original open source tool's agenda or roadmap. In such cases, SCAPE will have to develop and maintain its own fork of the code base, ensuring that the fork is kept synchronised with the original code base.

Agile Development

Many of the SCAPE software development work packages employ iterative development practices closer to Agile software development methodologies than to a more traditional, so-called waterfall methodology. The project is not prescriptive as to the methods individual work packages or institutions employ, preferring to allow them to work as they would normally. However, the deliverables, milestones and checkpoints from the project's Description of Work favour the release of early, simple prototypes which are refined through iterative cycles of development and testing, as well as integration testing with other components where required.

The main reason for this is the research nature of much of the project's software development. It would be difficult to be confident in a complete, up-front design process when so many of the activities are trying to establish what is possible in terms of scale and functionality. The other reason is simply that many of the partners are more comfortable working in this manner.

This should be borne in mind when reading this report, as there are areas where the exact manner in which a piece of functionality will be implemented, or the precise definition of an API, has yet to be finalised. Many of the sub-projects and work packages have produced, or are in the process of producing, early iterations of components. Testing of these, both individually and as an integrated whole, will inform and shape the final architecture.

In this spirit, this Architectural Report is not final but a snapshot in time. The document will be updated as progress is made and individual component designs are adapted and finalised. There will be a second official release of the Architecture Document in M30 of the project (Deliverable D2.3).

Abbreviations

The following abbreviations are used throughout this report.

Table 1: Abbreviations

Abbreviation Description
AIP Archival Information Package
API Application Programming Interface
CQL Contextual Query Language
CSV Comma Separated Values
DIP Dissemination Information Package
DOM Digital Object Model
DOR Digital Object Repository
DROID Digital Record Object IDentification
GUI Graphical User Interface
HDFS Hadoop Distributed File System
HTTP HyperText Transfer Protocol
JSON JavaScript Object Notation
METS Metadata Encoding and Transmission Standard
PPL Program for parallel Preservation Load
PREMIS PREservation Metadata: Implementation Strategies
REST REpresentational State Transfer
SCAPE SCAlable Preservation Environments
SDK Software Development Kit
SIP Submission Information Package
SOAP Simple Object Access Protocol
SRU Search/Retrieve via URL
SSH Secure SHell
TCK Technology Compatibility Kit
URI Uniform Resource Identifier
WSDL Web Services Description Language
XML eXtensible Markup Language

Project Architecture

This section provides an overview of the SCAPE platform architecture, as perceived at the time of writing. It describes the top-level and sub-level components envisaged to be needed and, importantly, the interfaces required between them. These interfaces and components are described in detail in the following sub-sections. Again it should be kept in mind that the components are currently being developed and are therefore subject to change.

At a high level the SCAPE project consists of a number of interconnecting components each handling specific aspects of functionality to ensure the preservation of digital objects stored within a Digital Object Repository (DOR). The Execution Platform manages and runs parallelised scientific workflows responsible for expedient and reliable execution of preservation actions on some data set within the DOR, and ensuring the validity of the outcome. For example, a workflow might migrate all files of one format to another format and ensure that the relevant significant information is maintained; or it might perform file identification and characterisation on a large set of files using a tool such as DROID or Apache Tika. The outputs of such executions will filter back to the DOR (i.e. files of new formats) and/or be stored and published in a repository, i.e. the SCAPE Data Endpoint.

Both repositories are used as sources of information for the Preservation Watch component (SCOUT), which builds and monitors a view of the world based on its input sources in relation to institutional policies. Watch constantly updates information from its sources, reassesses its world view and notifies the Planning tool if a predefined threshold for some criterion is met; for example, the cost of some specific migration software may have reduced to an acceptable level, or the number of files of a particular format may have increased to a level where appropriate preservation planning action should be triggered. The responsibility for setting these criteria, as well as creating, monitoring and testing preservation plans, lies with the Planner(s) via the PLATO planning tool, the Web-based Analysis Tool, and the Repository Simulator.

Preservation Plans are built from preservation tools and workflows (Components) defined and stored in the SCAPE Component Catalogue (myExperiment) and tested on the Taverna execution environment. The outcome of tested plans can be compared and "successful" plans can be parallelised and uploaded to the Digital Object Repository for execution. Components are created through the Taverna Workbench Workflow Modelling Environment and added to the Component Catalogue where they can be searched for.

The Digital Object Repository is responsible for initiating the execution of a Preservation Plan on some dataset it contains. A SCAPE Plan Management GUI is defined to enable user interaction to control such executions, along with a Loader application to upload information to the DOR in accordance with the SCAPE Digital Object Model for data.

Figure [1] captures these components and sub-components that form the SCAPE platform, along with the components they interact with. Such interactions are defined through the various Interfaces (APIs) that can be seen. Components coloured light red are Sources to the Watch component and therefore also implement the Sources' API (unless otherwise stated in the discussion that follows). The green "User Agents" box indicates that the components it contains are key components offering interaction (GUI, command line interfaces, etc.) with the user via exposed APIs. This does not preclude other components from also offering user interaction (for example, PLATO or its web-based analysis tool).

Figure 1: SCAPE Component Architectural Diagram [v13, 1/09/2012]

To highlight the responsibility for each API, Figure [2] shows a top-level component architecture, with the interfaces colour coded to match the component responsible for their implementation. For example, the Digital Object Repository is responsible for the Report, Plan Management and Data Connector APIs. It should also be noted that, unlike Figure [1], this diagram does not show which components are Sources also implementing the Source Access Pull API.

Figure 2: Colour coded top-level architectural diagram highlighting which component is responsible for which API [v2, 14/08/2012]

Automated Watch Component

Introduction

The aim of the Automated Watch work package is to substantially improve automated support for effective digital preservation watch. Specifically, its aims are:

  • To increase the breadth and scope of collected digital preservation information.
  • To normalize and structure gathered information into a queryable Knowledge Base.
  • To allow both human and automated systems to query and monitor information in the Knowledge Base.
  • To develop software components that monitor information in the Knowledge Base for significant events.

Functional Overview

The Automated Watch component is an automated information gathering and monitoring system that:

  • Gathers information relevant to digital preservation activities from potentially any source of information of interest to the planner, e.g. repositories or technical registries.
  • Allows software agents and human operators to add relevant information.
  • Allows software agents and human operators to ask questions about gathered information through Watch Requests.
  • Provides an automated monitoring system that looks for changing information and re-answers questions to a schedule.
  • Assesses answers against conditions and triggers submitted in Watch Requests, and alerts external agents when conditions are met.
  • Provides a Repository Simulator that analyses the information gathered from a user's repository and projects its future state, facilitating timely Preservation Planning for upcoming requirements.

Technical Overview

A three-tier architecture has been developed, depicted below:

The Watch component comprises a number of sub-components that each add specific functionality to achieve the goal of monitoring the "state of the world" through various Sources of information and providing notifications to the planner. Information is gathered via pull adaptors, developed to normalize and aggregate data from external sources; alternatively, sources can push information to the Watch system via the push source API. Adding new sources to the system means developing a compliant adaptor. The system adds provenance information and references to equivalent entities and properties held in the Knowledge Base. The Knowledge Base also contains pre-defined questions that can be answered from the information it holds.

A planner can submit a Watch Request, either synchronously or asynchronously, via a Client Service in order to query specific measurements of interest and receive notification when conditions are met. The query, conditions and means of notification are all parts of a Watch Request. Synchronous Watch Requests are used to query for a specific measurement at a specific point in time, blocking the requesting client until the response is returned. Asynchronous requests tell the Watch component to monitor for changes in specific measurements (by specifying Conditions), triggering a Notification, such as an email, to the requesting client when such a change is detected. This approach does not block the requesting client.
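To make the structure of a Watch Request concrete, the following is a minimal sketch in Java; the class and field names are assumptions for illustration and the actual SCOUT data model may differ. It captures the three parts described above: a question (query), trigger conditions, and a notification target, plus the synchronous/asynchronous mode.

```java
import java.util.List;

// Hypothetical, simplified model of a Watch Request; all names are illustrative only.
public class WatchRequest {

    public enum Mode { SYNCHRONOUS, ASYNCHRONOUS }

    private final Mode mode;                  // blocking query vs. ongoing monitoring
    private final String question;            // e.g. a SPARQL query over the Knowledge Base
    private final List<Condition> conditions; // thresholds to assess against the answer
    private final String notifyEmail;         // where to send a Notification when a condition is met

    public WatchRequest(Mode mode, String question,
                        List<Condition> conditions, String notifyEmail) {
        this.mode = mode;
        this.question = question;
        this.conditions = conditions;
        this.notifyEmail = notifyEmail;
    }

    // A simple boolean condition, e.g. "objectCount > 10000".
    public static class Condition {
        private final String property;   // measured property, e.g. "objectCount"
        private final String operator;   // e.g. ">", "<", "=="
        private final double threshold;

        public Condition(String property, String operator, double threshold) {
            this.property = property;
            this.operator = operator;
            this.threshold = threshold;
        }
    }
}
```

An asynchronous request of this shape would be monitored continuously by the Monitor sub-component, whereas a synchronous one would be answered immediately from the current Knowledge Base content.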

Further details about the Automated Watch system are provided in WATCH CONCEPTUAL DELIVERABLE, ICADL PAPER.

The following sections discuss the various sub-components involved in the Watch Component and how they interact.

Sources

Although not strictly a part of the Watch component, Sources are described here to aid understanding of the Watch sub-components. A Source represents specific aspects of the world and provides measurements of the properties associated with it, and can be internal or external to the project. Key sources currently considered are:

  • Format Registries
  • SCAPE Preservation Components catalogue (MyExperiment)
  • Policy models
  • Repositories
  • Experimental Results
  • Content Profiles
  • Human Knowledge
  • Web Browser snapshots (being developed within Watch)
  • A Repository Simulator that predicts upcoming preservation issues based upon repository trends (being developed within Watch)

Sources are coloured pink in Figure 1, indicating that, although they may also connect to other SCAPE components, they interact with Source Adapters through either the Source Access Pull API or the REST Source Push API. An exception is the Digital Object Repository, which implements a Report API for interaction with the relevant Source Adapter.

Source Adapters

Functional Overview

A Source Adapter gathers information from a Source and transforms it to the Entity/Property model adopted for the Knowledge Base. There are two approaches to achieving this, push or pull; the choice of which to use will depend on multiple factors, such as whether the Source is Watch component agnostic or whether it is possible to create software to run on the Source.
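The pull side of this abstraction might look like the following minimal sketch; the interface and type names are assumptions for illustration and do not correspond to the actual SCOUT interfaces. A pull adaptor implements fetch() and is invoked on a schedule, whereas a push source bypasses the adaptor layer and calls the REST Push Source API directly.

```java
import java.util.List;

// Illustrative sketch of a pull Source Adaptor; all names are hypothetical.
public interface SourceAdapter {

    /** Human-readable name of the external Source, e.g. "PRONOM". */
    String getSourceName();

    /**
     * Query the external Source and return its current state, normalised to the
     * Entity/Property model used by the Knowledge Base. Called periodically
     * according to the scheduling configuration held by the adaptor manager.
     */
    List<PropertyValue> fetch() throws Exception;

    // Minimal Entity/Property value holder, again purely illustrative.
    class PropertyValue {
        public String entityType;  // e.g. "format"
        public String entityName;  // e.g. "JPEG2000"
        public String property;    // e.g. "toolSupport"
        public String value;       // e.g. "limited"
    }
}
```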

The Source Adapters employed should map to the Sources being used. The SCOUT preservation watch project contains two reference adaptors described below.

PRONOM Adaptor - A Reference Format Registry Adaptor

A PRONOM source adaptor has been developed as part of the Preservation Watch work. The adaptor queries the PRONOM Linked Data SPARQL endpoint and transforms the returned JSON into Entities/Properties for passing on to the Merging and Linking component. The adaptor code is part of the SCOUT project on GitHub.
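As a hedged illustration of the pull step only, the following sketch uses Apache Jena to run a SPARQL SELECT query against a remote endpoint. The endpoint URL, vocabulary and query are placeholders, not those used by the actual PRONOM adaptor, whose query and mapping logic live in the SCOUT code base.

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class PronomQueryExample {
    public static void main(String[] args) {
        // Placeholder endpoint and query; the real adaptor targets the PRONOM
        // Linked Data SPARQL endpoint and uses its own vocabulary.
        String endpoint = "http://example.org/pronom/sparql";
        String sparql = "SELECT ?format ?name "
                + "WHERE { ?format <http://example.org/vocab#name> ?name } LIMIT 10";

        Query query = QueryFactory.create(sparql);
        QueryExecution exec = QueryExecutionFactory.sparqlService(endpoint, query);
        try {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                // Each row would be mapped to an Entity/Property pair and handed
                // to the Data Merging and Linking component.
                System.out.println(row.get("format") + " -> " + row.get("name"));
            }
        } finally {
            exec.close();
        }
    }
}
```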

C3PO Adaptor - A Reference Content Profile Adaptor

C3PO is a content profiling tool developed outside of the SCAPE project. It does not perform any characterisation itself; instead it parses output from the FITS tool and aggregates it into a MongoDB document database. There is also a tool that retrieves FITS records from the RODA repository for consumption by C3PO. C3PO additionally provides a web-based tool to view the aggregated data and a REST API for retrieving aggregated profile data. The C3PO project can be found on GitHub.

The C3PO adaptor reads and parses the XML data retrieved from the C3PO REST API and retrieves a subset of the content profile data. The data is converted into the Watch Entities and Properties before being passed to the Data Merging and Linking component, which enriches the data before inserting it into the Watch Knowledge Base.
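The sketch below illustrates the parsing step only, assuming a hypothetical content-profile XML layout; the URL, element and attribute names are invented for illustration and the real C3PO profile format differs in detail.

```java
import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ContentProfileParserExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL for the C3PO REST API; the real path is configured per deployment.
        URL profileUrl = new URL("http://localhost:8080/c3po/api/profile?collection=test");
        InputStream in = profileUrl.openStream();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);

            // Invented element/attribute names: <format puid="fmt/44" count="1234"/>
            NodeList formats = doc.getElementsByTagName("format");
            for (int i = 0; i < formats.getLength(); i++) {
                Element format = (Element) formats.item(i);
                String puid = format.getAttribute("puid");
                long count = Long.parseLong(format.getAttribute("count"));
                // Each (puid, count) pair would become a Property value on a
                // "collection profile" Entity passed to Data Merging and Linking.
                System.out.println(puid + " = " + count);
            }
        } finally {
            in.close();
        }
    }
}
```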

Both of these adaptors are in an early state of development, but they provide examples of how to develop a plug-in adaptor for the Preservation Watch system.

Other Source Adaptors

While there could be adaptors written for many different sources of information, the following will be developed as part of Automated Watch:

  • RODA Adaptor & eSciDoc Adaptor
    These are both repository adaptors that will gather information about a user's collection, e.g. Content Profiles.
  • Component Catalogue Adaptor
    The component catalogue adaptor will gather data from the SCAPE Preservation Component Catalogue via the Component Lookup API. The information gathered in the Knowledge Base will allow planners to find tools that meet their planning requirements, or alert them when tools become available that provide new functionality required for preservation activities.
  • Policy Model Adaptor
    A policy model adaptor is assumed to be the means by which the Machine Interpretable Policy Model is incorporated into the Automated Watch Knowledge Base. This source will provide data that allows the planner, or the Automated Planning system, to monitor an organisation's preservation policy and receive notifications when there are significant policy changes.

Source Adaptor Manager

Functional Overview

In order to control the scheduling of Source Adaptors used by a particular Automated Watch instance a Source Adaptor Manager is being developed. This manager will allow an operator to:

  • Install new adaptors.
  • Upgrade installed adaptors.
  • Enable or disable installed adaptors.
  • Manage the scheduling of adaptors, i.e. how often the external sources are queried.

Technical Overview

ToDo
Ask Luis upon his return whether this is just an API or will it be supported by a management GUI also.
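Pending that clarification, the following sketch shows what a minimal programmatic management interface covering the operations listed above might look like. All names are assumptions, since the actual manager API has not yet been defined.

```java
// Illustrative management operations for Source Adaptors; all names are hypothetical.
public interface SourceAdaptorManager {

    /** Register a new adaptor implementation (e.g. by its class name) with this Watch instance. */
    void install(String adaptorId, String adaptorClassName);

    /** Replace an installed adaptor with a newer version. */
    void upgrade(String adaptorId, String newAdaptorClassName);

    /** Temporarily stop an adaptor from being scheduled. */
    void disable(String adaptorId);

    /** Resume scheduling of a previously disabled adaptor. */
    void enable(String adaptorId);

    /** Set how often, in minutes, the adaptor polls its external Source. */
    void setPollingInterval(String adaptorId, int minutes);
}
```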

Data Merging and Linking

Functional Overview

This sub-component acts as a processing layer between the Source Adapters and the Knowledge Base. The source adaptors convert data to fit the internal data model, but different source adaptors may present contradictory or incompatible data. The Merging and Linking component further processes incoming data by:

  • Merging data.
  • Resolving inconsistencies between data sources.
  • Providing additional cross-references between entities and properties.

It is the addition of the cross-references that will enable rich queries of the Knowledge Base. A simple example of the value added by the Merging and Linking component can be given by considering three simple sources:

  1. A MIME based format registry
  2. The PRONOM format registry
  3. A collection profile consisting of file format distribution of the collection

The format records gathered from the registries would need to be linked so that PRONOM IDs were linked with the appropriate MIME records. Additionally the file format distribution records would be linked to the appropriate format entities gathered from the format registries.
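The sketch below illustrates this kind of cross-referencing with plain Java maps standing in for records from the three sources; the record layouts and the PUID shown are purely illustrative, and the real component operates on the Knowledge Base's Entity/Property model rather than on maps.

```java
import java.util.HashMap;
import java.util.Map;

public class FormatLinkingExample {
    public static void main(String[] args) {
        // Source 1: a MIME-based registry entry.
        Map<String, String> mimeRecord = new HashMap<String, String>();
        mimeRecord.put("mime", "image/jp2");
        mimeRecord.put("name", "JPEG 2000");

        // Source 2: a PRONOM registry entry for the same format (PUID shown for illustration).
        Map<String, String> pronomRecord = new HashMap<String, String>();
        pronomRecord.put("puid", "x-fmt/392");
        pronomRecord.put("mime", "image/jp2");

        // Source 3: a collection profile entry counting objects of that format.
        Map<String, String> profileRecord = new HashMap<String, String>();
        profileRecord.put("puid", "x-fmt/392");
        profileRecord.put("count", "25000");

        // Linking step: the shared MIME type ties the PRONOM record to the MIME record,
        // and the shared PUID ties the profile entry to both, so a query such as
        // "how many objects of format JPEG 2000 are held?" spans all three sources.
        boolean sameFormat = pronomRecord.get("mime").equals(mimeRecord.get("mime"))
                && profileRecord.get("puid").equals(pronomRecord.get("puid"));
        if (sameFormat) {
            System.out.println(mimeRecord.get("name") + " ("
                    + pronomRecord.get("puid") + "): "
                    + profileRecord.get("count") + " objects");
        }
    }
}
```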

Technical Overview

The component provides an internal Delegate Data interface used by Source Adapters and the Push Source API to push information to it, and makes use of the Knowledge Base's Submit Data API to submit data for permanent storage in the Knowledge Base. These internal APIs are being defined by Watch.

Questions
How will the interaction between the source adaptors and the Merging and Linking layer be managed? Specifically, when a new Adaptor is developed, or an unknown source uses the Push API, how will the Merging and Linking layer know:
  • How new entities / properties relate to others in the Knowledge Base?
  • Which sources take precedence over others with regard to resolving inconsistencies?

Knowledge Base

Functional Overview

The Knowledge Base is responsible for storing representation information about the world using a model based on Entities and Property Values. Ultimately, each Entity describes a specific set of values that are measurements of each Property at a specific moment in time. For example, for a "format" Entity, relevant Properties might be "name" (e.g. JPEG2000), "version" (e.g. 1.0), or "tool support" (e.g. limited); over time, the tool support for JPEG2000 may increase, therefore at a later point a new Entity may indicate "tool support" as "widespread". Relevant internal APIs are provided to store and retrieve data from the Knowledge Base, namely Submit Data and Access Data.

A history of all knowledge gathered is kept in order to allow the Knowledge Base to be queried for past data thereby enabling repeatability of the decision making process. The Knowledge Base also stores all of the questions posed by software agents or external users.

Technical Overview

It is planned to use RDF Linked Data as the model for storing data in the Knowledge Base, as this enables a simpler, more generic and more flexible data representation than a relational data model. Ontology stores already implement useful features such as boolean and algebraic logic and, because of their suitability for capturing concepts and relationships, support complex queries, which will be useful for framing and answering the Watch Request questions. The SPARQL query language is planned to be used to represent Watch Request questions.

The Knowledge Base uses Apache Jena, a Java framework for storing and querying large RDF datasets. Jena also provides support for OWL ontologies and a rule-based inference engine for reasoning with RDF and OWL data sources.
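As a hedged sketch of how the JPEG2000 example above might be represented and queried with Jena, the following builds a small in-memory model and runs a SPARQL question against it; the namespace and property URIs are invented, and the actual SCOUT vocabulary differs.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;

public class KnowledgeBaseExample {
    public static void main(String[] args) {
        String ns = "http://example.org/scout#"; // invented namespace

        // A "format" Entity with a few Property values, as in the JPEG2000 example.
        Model model = ModelFactory.createDefaultModel();
        Resource jpeg2000 = model.createResource(ns + "JPEG2000");
        jpeg2000.addProperty(model.createProperty(ns, "name"), "JPEG2000");
        jpeg2000.addProperty(model.createProperty(ns, "version"), "1.0");
        jpeg2000.addProperty(model.createProperty(ns, "toolSupport"), "limited");

        // A Watch Request question expressed in SPARQL against the model.
        String sparql = "PREFIX s: <" + ns + "> "
                + "SELECT ?support WHERE { ?f s:name \"JPEG2000\" ; s:toolSupport ?support }";

        QueryExecution exec = QueryExecutionFactory.create(sparql, model);
        try {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                System.out.println("tool support: " + results.next().get("support"));
            }
        } finally {
            exec.close();
        }
    }
}
```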

Watch Client Web GUI

Functional Overview

This is a web interface which provides the following functionality for planners:

  • The manual addition of information to the knowledge base.
  • Browsing of the knowledge base.
  • The submission of Watch Requests to the Automated Watch system.

Technical Overview

The Watch Client GUI is a Java Web application, packaged as part of the Automated Watch web application. The GUI provides an interface that allows the user to browse the information in the knowledge base and to add new information through the Watch Push API.

The GUI also provides a means by which human operators can submit Watch Requests. Watch requests consist of:

  • One or more pre-defined Questions that assess some aspect of the world.
  • One or more Triggers that define boolean conditions to be tested against the answers to the Watch Request's question set.
  • One or more notifications that will alert external agents if the trigger conditions are met.

Monitor Services

Functional Overview

Monitoring services observe one or more information sources and re-calculate answers to questions held in the Knowledge Base when the results rely upon external information that has changed. These questions are predefined points of interest related to the information gathered from the sources. An example based upon a Component Catalogue source adaptor might be the number of tools fulfilling particular criteria. As new tools are added to the catalogue, the information will be gathered by the source adaptor, which in turn will be picked up by a Monitor Service. The service will then recalculate the answer to the question based upon the new information in the knowledge base.

Technical Overview

The Monitor sub-component provides a mechanism for continuously watching the Knowledge Base for changes to specific Watch Requests the client is interested in. To do this, it provides a Data or Question Changed interface for being notified about changes to the underlying data or the Watch Requests themselves. Upon receiving such an update, this sub-component will identify which Watch Requests require re-evaluation and instigate this re-evaluation through the Assessment Service.

Assessment Service

Functional Overview

The Assessment Service is responsible for evaluating Watch Requests using the latest information from the Knowledge Base and the conditions associated with the Watch Request via its triggers. There are two types of assessment. A preliminary assessment is a simple test of a boolean condition contained in a trigger; this may be all that is required for basic Watch Requests. If the trigger conditions are met by this simple test, the trigger will fire and notify external agents that an external assessment is required.

Technical Overview

The Assessment Service is a part of the automated watch Java Web Application (the SCOUT project).

Access to the Knowledge Base is provided by the internal Access Data interface, and the information received is compared against a Watch Request Trigger to determine if a significant event has occurred. In many cases the conditions to be assessed will be more complex than this, requiring an external assessment service such as that offered by the Automated Planning component through its Assessment API. This allows an existing preservation plan to be re-evaluated in the light of the new information, and an assessment to be made of whether the new state requires action.
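The preliminary (boolean) assessment step could be as simple as the following sketch; the names and the greater-than semantics are invented for illustration, and external assessment is, by contrast, delegated to the Planning component via its Assessment API.

```java
// Illustrative preliminary assessment of a trigger condition; names are hypothetical.
public class PreliminaryAssessmentExample {

    /** A trigger condition such as "objectsInFormat > 10000". */
    public static class Condition {
        final String property;
        final double threshold;

        Condition(String property, double threshold) {
            this.property = property;
            this.threshold = threshold;
        }
    }

    /** Returns true if the latest measurement from the Knowledge Base exceeds the threshold. */
    public static boolean assess(Condition condition, double latestMeasurement) {
        return latestMeasurement > condition.threshold;
    }

    public static void main(String[] args) {
        Condition condition = new Condition("objectsInFormat:JPEG2000", 10000);
        double measurement = 25000; // value obtained via the internal Access Data interface
        if (assess(condition, measurement)) {
            // Fire the trigger: hand over to the Notification Service, or request
            // an external assessment from the Planning component.
            System.out.println("Trigger fired for " + condition.property);
        }
    }
}
```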

Notification Service

Functional Overview

The Notification Service is responsible for informing external entities of significant events as defined by the monitoring and assessment services. When the Monitor sub-component detects a significant event, based upon the questions and conditions stored in the Knowledge Base, the Notification Service is used to alert interested parties. An interested party might be a human planner informed by email, or a software agent, typically the Automated Planning component.

Technical Overview

The Notification Service is again being developed in Java, as part of the Automated Watch SCOUT project. The notification component is extensible, allowing different types of notification to be offered, for example email or an HTTP API.
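That extensibility point might look like the following sketch; the interface and class names are illustrative only and the actual SCOUT notification classes may be structured differently.

```java
// Hypothetical notification abstraction with two example channels.
public interface Notification {
    void send(String recipient, String message) throws Exception;
}

class EmailNotification implements Notification {
    public void send(String recipient, String message) throws Exception {
        // A real implementation would use JavaMail (javax.mail) to send the
        // message to the planner's email address.
        System.out.println("EMAIL to " + recipient + ": " + message);
    }
}

class HttpNotification implements Notification {
    public void send(String recipient, String message) throws Exception {
        // A real implementation would POST the message to a callback URL
        // registered by a software agent such as the Automated Planning component.
        System.out.println("HTTP POST to " + recipient + ": " + message);
    }
}
```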

The Repository Simulator

Functional Overview

The Automated Watch work package is also developing a Repository Simulator. This component will analyse repository metadata held in the Watch Knowledge Base and project the future state of the repository. Trends that might be detected include accelerating storage requirements, or an increasing number of items in a particular format. Information about the computational resources required to execute a preservation action across a set of content might also be analysed, to establish how long it would take to run the action, or indeed whether such a course of action is feasible at all.

Technical Overview

Required Interfaces

The Automated Watch component must implement two external facing APIs: the Push Source Adaptor API and the Watch Request API.

Push Source Adaptor API

The Automated Watch Push API provides a means for third-party software agents to add information to the Watch Knowledge Base without the development of a Source Adaptor. Note that push sources will not be controlled by the Source Adaptor Manager, so scheduling, and indeed enabling or disabling unwanted push sources, will have to be done by other means. The API may also be used internally by the Watch Client Web GUI to add new information via the web front end.

In the push model, the Source will send information to the Watch component as and when it becomes available. Software must be developed for the Source component to achieve this, which in some circumstances may not be possible. The pull model ideally relies on the Source component providing a network-accessible API to enable a relevant Source Adapter to request information directly, most likely on a periodic basis; however, if no such API exists, then the adapter will have to extract information from whatever format the Source makes available (for example, HTML parsing of a web page). The frequency with which data is requested by a Source Adapter is controlled by the Monitor sub-component through the internal Adapter Configure interface.

Question
How will push sources be controlled? Specifically:
  • Will there be a security layer to stop unauthorised services from pushing data to the Knowledge Base, and what form will this take, e.g. Basic HTTP Authentication?
  • How will the frequency of push requests be handled to stop external agents from "spamming" the Knowledge Base with over-frequent updates?
  • Can the push API at least be turned off if necessary, due to faults in external agents pushing unwanted information?

Watch Request API

This API will be used by external software agents to submit Watch Requests to the Automated Watch system. Typically the software agents will be:

  1. The Watch Client Web GUI.
  2. The Automated Planning System.

A Watch Request is made up of a number of pre-defined questions, drawn from the Watch Knowledge Base, and a number of triggers. Triggers are assessed against the answers to those questions and, if the trigger conditions set in the Watch Request are satisfied, the planner or software agent is notified, for example by email.

Both APIs are being implemented as RESTful services deployed with the Automated Watch Java Web Application.
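As an illustration of how a software agent might call such a RESTful service, the following sketch uses the standard JAX-RS 2.0 client API to submit a Watch Request; the resource path and JSON payload are assumptions, since the final API is still being defined.

```java
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

public class WatchRequestClientExample {
    public static void main(String[] args) {
        // Placeholder base URL and resource path for the Watch Request API.
        String watchRequestUrl = "http://localhost:8080/scout/api/watchrequests";

        // Hypothetical JSON body: one question, one trigger condition, one notification target.
        String json = "{ \"question\": \"toolSupport(JPEG2000)\","
                + " \"condition\": \"== widespread\","
                + " \"notify\": \"planner@example.org\" }";

        Client client = ClientBuilder.newClient();
        Response response = client.target(watchRequestUrl)
                .request(MediaType.APPLICATION_JSON)
                .post(Entity.json(json));

        System.out.println("Watch Request submitted, HTTP status: " + response.getStatus());
        client.close();
    }
}
```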

Packaging and Deploying

The Automated Watch system is a Java Web Application built from the GitHub OpenPlanets SCOUT Maven project and deployed as a Web Application Resource (WAR). The project relies upon the JBOSS Java EE 6 library, so it may require a dedicated JBOSS server rather than a Tomcat servlet container.
RESTful services are provided through Jersey, an implementation of REST services for Java and part of the GlassFish project.

ToDo
Find out whether the WAR will deploy on a servlet container, i.e. Tomcat, or only on a JBOSS server

Roadmap

Table : Upcoming Automated Watch Milestones/Deliverables

Milestone/Deliverable Description Due
MS55 First prototype of the simulation environment M20
MS56 First version of the preservation watch core services M22
MS57 First prototype of the watch component delivered including adaptors for repositories, and Web content M28
D12.2 Final version of the Preservation Watch Component M38
D12.3 Final version of the Simulation Environment M42

Automated Planning Component

Introduction

With current state-of-the-art procedures in digital preservation we can define organisational constraints and we can create plans that treat a homogeneous sub-set of a large repository. PLANETS defined a Preservation Plan as follows:

A preservation plan defines a series of preservation actions to be taken by a responsible institution due to an identified risk for a given set of digital objects or records (called collection). The Preservation Plan takes into account the preservation policies, legal obligations, organisational and technical constraints, user requirements and preservation goals and describes the preservation context, the evaluated preservation strategies and the resulting decision for one strategy, including the reasoning for the decision. It also specifies a series of steps or actions (called preservation action plan) along with responsibilities and rules and conditions for execution on the collection. Provided that the actions and their deployment as well as the technical environment allow it, this action plan is an executable workflow definition. [REF]

PLANETS also produced a preservation planning methodology, a structured workflow for creating, testing and evaluating preservation plans. The PLATO planning tool developed within PLANETS follows this workflow to build preservation plans. PLATO produces an executable preservation plan along with audit evidence documenting the decision-making procedures used in creating the plan [REF]. However the plans:

  • Were largely constructed manually, which could be a time-intensive procedure.
  • Were not normally applicable to all of an organisation's holdings, but were restricted to a, normally homogeneous, sub-set of a collection.
  • Were not deployed and executed automatically in a repository.
  • Had to be monitored manually for changes in best practice, collection profile, etc.

Further, no mechanism exists to relate preservation policies to preservation plans; correlation has to be done manually.

The goals of SCAPE are to provide an automated planning component that is informed by:

  1. The accumulated knowledge of previous preservation plans.
  2. An organisation's digital preservation policy.
  3. An organisation's digital collections.
  4. Other queries performed on the Automated Watch Knowledge Base, e.g. queries of File Format Registry information.

Overview

The Automated Planning component comprises three sub-components:

  1. The Plato Planning Tool.
    Building upon the existing PLATO tool but using the Watch component, the Policy Model and content profiles to automate the creation of preservation plans.
  2. A machine interpretable model of preservation policy elements.
    Modelling preservation policies from the top down as a catalogue of higher-level policy elements, and from the bottom up as a machine interpretable model of actionable low-level policy elements, in order to inform and automate the planning process and provide information to the Watch Knowledge Base.
  3. A web-based analysis tool for mining the results of previous preservation plans.
    A web GUI that can be used to query past preservation plans and provide decision support to the planning process.

The Automated Planning Tool

Functional Overview

The SCAPE planning component continues the development of the PLATO planning tool used in the PLANETS project. As described above, the PLANETS PLATO tool was capable of producing executable preservation plans according to an established preservation planning methodology, but these plans had the various shortcomings described in the introduction.

The planning tool addresses these by adding automated support throughout the established process. The manual GUI will still be supported for low-level plan editing where necessary, or indeed if the user prefers it. However, modules are being developed that will populate forms with recommended content during each planning step. This will mean a lighter planning process with less labour-intensive form filling, and a lighter GUI.

A fast-track planning mode will also be introduced; this reduces user choice and condenses the 14 planning steps into 4 phases, with a single GUI page for each phase.

Technical Overview

The key to a quicker, simpler planning process is the ability to import relevant information from other sources to support or automate the decision-making process at the appropriate stage. For example:

  • a policy model and a Watch trigger provide the planner with all of the information required to describe the plan's institutional context, the first planning step.
  • an XML content profile can be uploaded that completes the PLATO "define samples" page.

The business logic in Plato has been refactored so that flexible configurations of the workflow are quicker to implement, giving the option to create the lightweight planning options described in the Functional Overview.

The PLATO tool developed in PLANETS contained an embedded tool execution engine, known as minimee. This allowed users to test and compare different tools, or different configurations of a tool, against their planning requirements. While this will be part of the initial iterations of the planning tool, the aim is to integrate with the SCAPE Component Catalogue to encourage the use of established or experimental Taverna workflows developed by the Testbeds. Currently minimee offers real measurement of resource usage by tools, information not yet available from the Component Catalogue or Data Publication Platform. As richer information becomes available from SCAPE components, it is envisaged that the minimee platform will be switched off.

The Automated Planning work package is responsible for the development of the software component that imports instances of the machine interpretable policy model into the planning tool.

Web-based Analysis Tool

Functional Overview

This web-based tool supports the systematic and repeatable assessment of decision criteria and is fully compatible with the Plato planning tool. It enables decision makers to share their experiences and in turn build upon knowledge shared by others. Preservation plans are loaded from the planning tool's knowledge base, processed and anonymised, before being presented to the planner along with a number of features facilitating systematic analysis.

Technical Overview

The analysis tool is a separate GUI from the planning tool, and while the indicators offered by the tool will be used for some of the planning automation modules, the GUI itself will not be part of the planning workflow.

Machine Interpretable Policy Model

Functional Overview

Preservation policies are governance statements that constrain or drive operational Preservation Planning, but they may also have effects outside of operational planning. For Planning and Watch, policy elements have been divided into three classes:

  1. Guidance Policies
    • strategic, high level policies
    • are expressed in natural language
    • can't be expressed in machine interpretable form and require human interpretation
  2. Procedural Policies:
    • model the relation between guidance policies and control policies
    • can be represented in a formal model as the relation between guidance and control policies
  3. Control Policies:
    • are specific and can be represented in a semantic model

Only the control policies are guaranteed to be represented in the machine interpretable policy model. The development of the machine interpretable policy model is led by the development of a catalogue of policy elements.

Technical Overview

The policy element catalogue provides a semantic representation of generic policy elements that is understandable by preservation systems. The initial version of the policy catalogue lists a set of Guidance Policies as a table in the deliverable; by definition, these will not appear in the machine interpretable model in their full form. Instead they must be broken down into sets of Procedural Policies, which in turn will be represented by sets of Control Policies that will be used to create the machine interpretable policy model. The iterative process of refining the catalogue will be undertaken by using the catalogue to express the real preservation policies of three partners representing the needs of Large Scale Digital Repositories, Web Archives, and Scientific Data Sets. Once validated, the catalogue will be used to develop the machine interpretable model.

The machine interpretable policy model provides a source for the Automated Watch system and will inform the Automated Planning system. Standard tools such as RDF/OWL [REF] will be used to define the terms used to describe and represent Control Policies and to support policy reasoning. Similarly to the catalogue, the policy model will undergo an iterative process of testing and refinement while being used to model the various Testbed scenarios.
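A hedged sketch of how a single Control Policy might be expressed with Jena's ontology API follows; the namespace, class and property names are invented for illustration, and the real vocabulary is the one defined in the policy model ontology on GitHub.

```java
import org.apache.jena.ontology.Individual;
import org.apache.jena.ontology.OntClass;
import org.apache.jena.ontology.OntModel;
import org.apache.jena.rdf.model.ModelFactory;

public class ControlPolicyExample {
    public static void main(String[] args) {
        String ns = "http://example.org/policy#"; // invented namespace

        OntModel model = ModelFactory.createOntologyModel();
        OntClass controlPolicy = model.createClass(ns + "ControlPolicy");

        // Example control policy: master images must be stored in a format with
        // widespread tool support.
        Individual policy = controlPolicy.createIndividual(ns + "MasterImageFormatPolicy");
        policy.addProperty(model.createProperty(ns, "appliesTo"), "master images");
        policy.addProperty(model.createProperty(ns, "constrainsProperty"), "toolSupport");
        policy.addProperty(model.createProperty(ns, "requiredValue"), "widespread");

        // Serialise as RDF/XML; such statements would feed both Watch and Planning.
        model.write(System.out, "RDF/XML");
    }
}
```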

One purpose of the policy catalogue is to guide organisations in creating their own complete policy model. The iterative processes described above will be used to improve this process, either by extending an existing tool or, if feasible, by developing a simple editor or a domain-specific language that is easier to write. The process is complex; the aim is to simplify it as much as possible.

Required Interfaces

The Automated Planning component must implement two APIs: the Notify API and the Assessment API.

There are no recognised interfaces developed as part of the policy modelling work package. The Automated Watch and Automated Planning components are both responsible for developing software components that will interpret the model and base decisions upon the policy elements. The Policy Modelling work package is responsible for ensuring technical interoperability between these components and the policy model.

Notify API

The Notify API will be used by the Automated Watch component to inform the Planning component when significant events, defined through Watch Requests, occur. Examples would be a notification that a plan requires updating due to a change in the state of a repository, or that a new tool is available that provides previously unavailable functionality required by a preservation plan.

Assessment API

The Assessment API will be used by the Automated Watch component to make external assessments that are more complex than the boolean conditions checked by Automated Watch's own assessment services. The API will provide access to existing preservation plans, in order to match the simple Questions and their conditions to criteria and to evaluate whether the changes warrant a re-evaluation of alternatives. The basis for the assessment will rely on a utility function provided by the planning component.

Packaging and Deploying

The Automated Planning Tool is a JBOSS SEAM application packaged as a Java Enterprise Application Resource (.ear file).
There is a central instance of the Planning Tool hosted by TUWIEN, which is currently running version 3.0.1. The latest release will continue to be hosted there.

Organisations wishing to host their own PLATO instance will first require a separately installed and configured JBOSS server on which to host the application.

The Web Analysis module is a separate Web Application Resource file within the Planning Suite .ear. It will be available through the PLATO central instance, as this is the only instance that hosts the collection of previous Preservation Plans to analyse.

ToDo
The details of the central instance should be double checked before the doc goes to the EC.

There is a GitHub project where the semantic model of low-level Control Policies is being developed. The project contains:

  • The current version of the policy model ontology.
  • Some example properties, criteria, objectives, and scenarios.
  • Some experimental queries developed in Java.

Roadmap

Table : Upcoming Automated Planning Milestones/Deliverables

Milestone/Deliverable Description Due
MS61 Initial version of automated policy-aware planning component M18
D13.1 Final version of policy specification model M30
MS62 Automated policy-aware planning component v2 with full lifecycle support M32
D13.2 Catalogue of preservation policy elements M36
MS63 Report on compliance validation M40
D14.2 Final version of automated policy-aware planning component M42

SCAPE Component Catalogue

Preservation Plans developed and tested within the Planning Component utilise SCAPE Components to provide preservation tools and actions. A SCAPE Component is a Workflow designed for execution on the SCAPE platform that, most likely, wraps a tool execution. For example, a SCAPE Component may exist to run DROID or Apache Tika file identification over a digital object. These Components are stored in the SCAPE Component Catalogue for: i) monitoring by the Watch component (it is therefore a Watch Source); ii) discovery and use by the Planning Component; and iii) compilation into Parallel Preservation Components for execution by the Execution Platform.

SCAPE are tasked with producing these Components so that tools and workflows can be used on the SCAPE infrastructure. This may simply be a matter of providing a SCAPE Component (i.e. a workflow) to run DROID, for example; however, SCAPE also has to address scalability issues in order to enhance the state-of-the-art in digital preservation. Part of this is to develop and enhance the tools for scalable preservation actions: whilst many tools already exist that can be applied to aspects of digital preservation, such as JHOVE, DROID and FIDO, these existing tools have not been designed with the SCAPE execution platform in mind; they need adapting and enhancing so that they work effectively with the SCAPE platform. In addition, there is the potential need for new or enhanced tools as required by new workflows developed by the Testbeds work packages, for example to enhance Quality Assurance for some specific dataset (e.g. audio files from radio broadcasts).

Within SCAPE, Component development is broadly split into three sub-groups: Characterisation Components, Action Services Components, and Quality Assurance Components. Characterisation Components look to develop and enhance tools to identify, validate and characterise files. Current tools are only able to detect a limited number of formats, operating locally and at small scale; to perform against large, heterogeneous collections, the tools need to be adapted to fit SCAPE's distributed and parallel architecture. Action Services Components focus on analysing and improving the interfaces and internal functionality of existing preservation action tools to enable them to cope with real-sized collections as well as with compound objects (such as container objects). They will also develop new tools where necessary and ensure that these and existing tools can be deployed and used within the SCAPE platform. Finally, the Quality Assurance Components' focus is to develop automated, scalable quality assurance methods and tools for a range of workflows defined by the Testbeds group. This will help automate the quality assurance processes, removing reliance on human intervention and enabling the scaling required to handle large and heterogeneous collections.

Workflow

A workflow is a sequence of steps or operations on some input that execute according to the defined flow and combine to perform some complex operation. For example, a workflow may take a file location URL as input and pass this to DROID to identify the specified file, the output of which is returned to the user. Building upon this, such a workflow may be used as part of another workflow where the identification output is used to control flow within the larger workflow; for example, if a file is identified as an image file, it may undergo optical character recognition before and after file format migration, with a comparison providing some metric on the quality of the migration.

SCAPE use Taverna Workflows [23], making use of the Taverna Workflow Modelling Environment for users to produce workflows using a GUI. In general, these workflows can invoke SOAP/WSDL or REST web services, local Java code, external tools via SSH, or other sub-workflows. SCAPE have developed a "toolwrapper" to wrap local tools as web services for use within workflows [REF TO GITHUB REPO].

Taverna Workflows are hosted in the SCAPE Component Catalogue (myExperiment [8]), where they will be a Source for the Watch component and can be searched for and utilised by the Planning tool. In particular the Planning tool will be integrated with a Taverna engine to allow testing of different workflows to identify the most effective (according to some criteria) for a particular preservation concern. For example, what is the best approach to migrating some JPG image files to an appropriate preservation format? This Planning Tool will utilise a Platform node (not a cluster) to run Taverna Workflows using the Taverna Engine. When a decision has been made as to the appropriate workflow to run, the workflow is transformed into Parallel Preservation Components (e.g. MapReduce programs) for execution on the SCAPE Execution Platform.

Component Profiles

To enable interoperability between tools, automation of preservation processes, and discoverability by planning and watch, SCAPE Components need to provide a standardised interface. Such an interface is provided by the Taverna Workflows adhering to defined input/output interfaces and by annotating them with a common, standardised vocabulary. Defined combinations of interfaces and annotations form SCAPE Preservation Component Profiles; SCAPE have already defined a number of these profiles, extending common ports and annotations, for: migration action components; characterisation components; quality assurance object comparison components; quality assurance property comparison components; validation components; and executable plans. These Profiles are defined in [15].

A Profile has four different areas to check:

  1. Input ports: Expected input ports of the workflow;
  2. Output ports: Expected output ports of the workflow;
  3. Taverna Activities: Taverna activities that must be present for the workflow, e.g. external tool services that are used;
  4. Annotations: Workflow level annotations.

As an example, consider a Migration Action Component. With respect to annotations, it builds upon the common elements such as the Component's name, version and ID (a full list can be seen in [15]) with details about the Migration Paths that this component supports, i.e. the file types this component can migrate from and to. The profile also specifies the particular input ports that must be defined, where path_from and path_to specify the path of the file to migrate and the path to migrate it to, as well as a parameter input port to detail any specific options to apply (e.g. tool-specific command line options/flags). Output ports are also specified, specifically path_from and path_to, which have the same meaning as before. Finally, the Taverna Activity defines the external tool service used to perform the migration.

The Taverna Workflow Modelling Environment will provide a means, the Component Profile Validator, to validate these components against the defined profiles.
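To make the port-checking idea concrete, here is a minimal sketch of the kind of check such a validator performs; the names are illustrative, and the actual validator works on Taverna's workflow object model and the full profile definitions in [15], not on plain string sets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ProfilePortCheckExample {

    /** Returns the required port names missing from the workflow's declared ports. */
    public static Set<String> missingPorts(Set<String> workflowPorts, List<String> requiredPorts) {
        Set<String> missing = new HashSet<String>(requiredPorts);
        missing.removeAll(workflowPorts);
        return missing;
    }

    public static void main(String[] args) {
        // Required input ports for a Migration Action Component, per the profile.
        List<String> required = Arrays.asList("path_from", "path_to", "parameter");

        // Input ports actually declared by a candidate workflow.
        Set<String> declared = new HashSet<String>(Arrays.asList("path_from", "parameter"));

        Set<String> missing = missingPorts(declared, required);
        if (!missing.isEmpty()) {
            System.out.println("Workflow does not satisfy the profile; missing ports: " + missing);
        }
    }
}
```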

Required Interfaces

The Component Catalogue must implement two APIs: the Component Lookup API and the Component Registration API. Both these APIs will be integrated into myExperiment's (the SCAPE Component Catalogue's) REST API [16].

Component Lookup API

The Component Lookup API provides a mechanism for the Planning and Execution Platform components to discover and access SCAPE components. This has yet to be defined.

Component Registration API

Provides a means to register SCAPE Components (Taverna Workflows) in the SCAPE Component Catalogue.  This will be achieved using the existing myExperiment plugin for Taverna which utilises the functionality defined in [17].

Packaging and Deploying

SCAPE will use a dedicated version of the myExperiment platform [8] to host and share SCAPE Components.  This web-based site enables a common, accessible location for uploading, discovering and retrieving SCAPE components without the need for institutions to separately install their own catalogue.

The Components themselves are workflows which wrap locally and/or remotely installed software applications (tools), therefore a major challenge is how to make these workflows, and more specifically the tools they rely upon, available on the scalable Execution Platform. More specifically, any such solution should ensure reliable and convenient installation across multiple computing nodes that form the Execution Platform, including automatic resolution and installation of all necessary tool dependencies.

Packaging and Deploying Component Tools

As defined by the Component Profiles, SCAPE Component workflows must provide annotations that indicate which tools they depend upon. This information can be used to determine which tools must be deployed to the Execution Platform.

To enable easy distribution, installation and updates to the tools that SCAPE Components depend upon, the Debian (Linux) software packaging and package management system will be employed. This provides a standardised and integrated way to manage and install software ensuring that all dependencies of that tool are also installed. Through this process an end-user can easily install software through a single command (or click in a GUI), passing responsibility to the package manager to download the package, resolve dependencies, and install the software. Such a system can also be configured to enable automatic updates and removal of software packages.

The process for building a Debian package and deploying it to a package repository is fully described in [D5.1].

Roadmap

This section briefly outlines upcoming and future work that needs addressing and relates to the SCAPE Component Catalogue. Table 2 provides a summary of these upcoming milestones and deliverables.

Specifically, provenance information about Workflow runs should be recorded and persisted in order to perform provenance analyses on the data in the main repository, for example enabling a trace of the set of transformations applied to an image. The requirements of this work are captured in [22] and will be implemented in Taverna, with a design document due in M20. Workflows will be stored and shared via the SCAPE Component Catalogue, based on the myExperiment site, with design and implementation documentation due in M24 and in deliverable D7.3 (M40). In particular the Component Lookup API needs defining in coordination with the Execution Platform and Planning components.

Table 2: Upcoming SCAPE Component Catalogue Milestones/Deliverables 

Milestone/Deliverable Description Due
MS40 Design and Implementation of the Component Catalogue M24
MS41 Final Preservation Workflow Sharing Platform M42
D7.1 Design of Provenance Component M20
D7.3 Design and implementation of the preservation component catalogue M40

Execution Platform

The SCAPE Execution Platform provides the necessary infrastructure to execute preservation plans and store appropriate digital objects in a scalable manner to aid execution. The goal is to enhance the scalability of storage capacity and computational throughput based on the utilisation of clusters of computational nodes, rather than single machines. These clusters, with appropriate control and workflows, will enable fast and efficient parallel processing of large numbers of digital objects by enabling tools (e.g. a file identification tool) to execute on multiple digital objects at the same time. As such, the platform is designed to support the coordinated and parallel execution of existing preservation tools and workflows, albeit these tools may require appropriate adaptations/compilation to enable effective parallel integration with the Execution Platform.

Workflows are used to wrap tool executions for storage in the SCAPE Component Catalogue and discovery by the Planning Tools and Execution Platform. Workflows can also be built up to define more complex workflows (these are also stored in the SCAPE Component Catalogue).  It is these workflows that are used by a Preservation Plan. 

As can be seen in Figure [1], the Execution Platform consists of three main sub-components: Parallel Preservation Components, the Parallel Execution System, and the Job Execution Service. Workflows developed within SCAPE (e.g. by the Testbeds work packages; see, for example, the LSDR Executable Workflows for Experimental Execution deliverable [14]) are made available to the Execution Platform through the SCAPE Component Catalogue's Component Lookup API. These workflows are constructed using the Taverna workbench environment [Taverna Workflow Modelling Environment]; however, they are not optimised for execution on the Parallel Execution System and as such need to be pre-compiled (using the TavernaToHadoop Compiler) into Parallel Preservation Components. These components are hosted by the Execution Platform (the PPC Store in Figure [1]) and executed on the Parallel Execution System. The platform administrator is responsible for deploying such components to the platform, along with any tools they depend upon. A simple registry will be maintained to indicate which components are supported.

The Parallel Execution System sub-component provides the infrastructure for performing data-intensive computations by supporting the execution of Parallel Preservation Components. It makes use of multiple nodes for storing and processing data in order to increase the computational throughput, whilst maintaining coordination over the tasks to be completed. These connected nodes are known as a cluster. The SCAPE Parallel Execution System is built on Apache Hadoop integrated with the Apache Hadoop Distributed File System to provide flexible, scalable and reliable storage. This combination enables close proximity between the data and processing nodes, reducing transport overhead thereby enabling high computational throughput.

Execution of the Parallel Preservation Components is initiated and managed by clients through the Job Execution Service and its external RESTful interface (see the Job Execution Service API). Specifically, it understands SCAPE concepts such as Preservation Components, Data Connector API URLs, and (potentially) the SCAPE Data Endpoint. The Job Execution Service will not, however, try to resolve the data to be operated on; it will merely generate an appropriate input file, understandable by a Parallel Preservation Component, based on the input URI provided to it by the DOR. The user is responsible for ensuring that the data, i.e. the digital objects, is accessible to the Execution Platform. In terms of implementation, it is unclear (at present) how much will be provided by Hadoop. The latest release of Hadoop MapReduce (called YARN) provides support for service-based job submission and resource management which could possibly be used; this is currently being investigated to determine its usefulness and the extent to which it would need to be extended.

An alternative execution platform service, based on the Microsoft Azure platform [REF], is also under development. In essence, this provides a similar processing concept to the SCAPE Hadoop-based Execution System, whereby multiple computational nodes are utilised to increase computational throughput through parallelisation. Azure provides the ability to store data reliably (replicated across three machines in the Azure data centre) close to these computational nodes, along with the ability to define and manage the applications that process this data.

The following two subsections provide further details about Hadoop and Microsoft Azure with specific details (as they are currently known) relevant to SCAPE.

Apache Hadoop

Apache Hadoop primarily consists of two main sub-projects: MapReduce and the Hadoop Distributed File System (HDFS). MapReduce provides a parallel-processing mechanism that allows Hadoop to process large data sets in a relatively short time. It has components to manage MapReduce jobs, aiming to ensure that computation occurs on the same node where the data is stored or, failing that, on a node as close as possible, in order to minimise network latency. Data storage is managed by HDFS, a Java-based distributed file system that provides reliable data storage across commodity hardware. Importantly, it stores data on the same nodes that perform the computation, thereby boosting performance.

MapReduce

MapReduce is a framework for parallel processing of large datasets across a large number of computers, or nodes. It is divided into two steps: the Map step is where the input dataset is divided and shared out amongst worker nodes, with each worker node computing an answer to part of the problem; the Reduce step then collects and combines all the partial answers into one.
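As a concrete (non-SCAPE) illustration of the two steps, the following minimal Java sketch counts file extensions in a text input listing one file path per line, using the standard Hadoop MapReduce API; the class names and the choice of input are assumptions made purely for illustration.

// Illustrative only: a minimal MapReduce job that counts file extensions listed
// in a text input (one file path per line).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExtensionCount {

  // Map step: each worker node receives a split of the input and emits
  // (extension, 1) for every file path it sees.
  public static class ExtMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String path = value.toString().trim();
      int dot = path.lastIndexOf('.');
      String ext = (dot >= 0) ? path.substring(dot + 1).toLowerCase() : "none";
      context.write(new Text(ext), ONE);
    }
  }

  // Reduce step: the partial answers for each extension are collected and summed.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "extension-count");
    job.setJarByClass(ExtensionCount.class);
    job.setMapperClass(ExtMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}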

Further details about MapReduce can be found in [13].

Hadoop Distributed File System (HDFS)

HDFS is a distributed and scalable file system designed to run on a cluster of machines. A cluster typically comprises a Namenode server, which manages the cluster's file system and access to the files therein, and a number of Datanodes (typically one per node), which manage the storage on each node.

Files are split into one or more blocks, where the block size is typically 64MB (or a multiple thereof), and stored across multiple Datanodes. Reliability is achieved by replicating individual blocks across multiple nodes.

The same nodes are also used for computation in the MapReduce cluster, and because of the close connectivity between these layers, MapReduce jobs can often be scheduled to execute on the same nodes as the actual data, thereby reducing the amount of network data traffic and improving performance.
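For illustration, the sketch below shows how a Java client might read and write HDFS files through Hadoop's FileSystem API; the Namenode address and paths are hypothetical and would normally come from the cluster configuration.

// Illustrative only: reading and writing HDFS files via org.apache.hadoop.fs.FileSystem.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The Namenode address would normally come from core-site.xml;
    // it is set explicitly here for illustration (hypothetical host).
    conf.set("fs.default.name", "hdfs://namenode.example.org:8020");
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS splits larger files into blocks and
    // replicates each block across several Datanodes.
    Path path = new Path("/scape/example/hello.txt");
    FSDataOutputStream out = fs.create(path, true);
    out.write("hello from SCAPE\n".getBytes("UTF-8"));
    out.close();

    // Read the file back.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
    System.out.println(in.readLine());
    in.close();
    fs.close();
  }
}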

Hadoop Version used within SCAPE

SCAPE, in particular the Central Instances, currently uses the patched distribution of Apache Hadoop provided by Cloudera, specifically CDH3 update 2. The Cloudera distribution is used because it is kept up to date with patches resolving various bugs and providing security/performance improvements ahead of major Apache releases. Furthermore, it is well documented.

CDH3 update 2 provides:

  • Hadoop version 0.20.2
  • HBase version 0.90.4
  • Zookeeper version 3.3.3
  • Hoop

Later updates to the Cloudera distribution now exist (CDH3 is currently on update 5), and there is also a CDH4 release which makes use of the latest Apache Hadoop 2.0.0 release. There have been significant changes to Apache Hadoop between version 0.20.2 and version 2.0.0, so an update within SCAPE should be considered based on SCAPE partners' current and expected deployments.

Hadoop/Taverna Workflow Integration

Hadoop is designed to operate on large data files rather than many small files, raising questions over its performance when processing large SCAPE datasets made up of numerous small objects. A number of experiments have been performed [20, 21] using Hadoop to ascertain the effects of file size versus number of files on Hadoop performance. One study [20] looked at file identification performance for files contained within ARC archive files, comparing ARC file size versus the number of ARC files, but also looking at the performance impact of executing tools via a Java API or via the command line (through direct tool execution or via a JAR file). The results indicate that: a) increased data file size offers improved processing performance compared with smaller, more numerous files; and b) MapReduce jobs using a tool's Java API provide significantly better performance than invoking a command-line tool (either a program or a JAR file). This is likely due to the start-up costs incurred when initiating an external tool, for example the cost of starting a JVM to execute a JAR. Where possible, tool development should therefore focus on creating Components that utilise tool APIs for execution.
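The difference between the two invocation styles can be illustrated with the (non-SCAPE) sketch below, which identifies a file once via the Apache Tika Java API and once by spawning the Unix file command as an external process; the point is that the second form pays a process (and, for a JAR, a JVM) start-up cost on every call.

// Illustrative comparison only: in-JVM tool API (Apache Tika) versus an
// external command-line invocation for file identification.
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import org.apache.tika.Tika;

public class IdentifyComparison {
  public static void main(String[] args) throws Exception {
    File input = new File(args[0]);

    // a) In-JVM API call: the Tika instance is created once and can be reused
    //    across millions of objects within the same mapper JVM.
    Tika tika = new Tika();
    System.out.println("Tika API:     " + tika.detect(input));

    // b) External process: every call pays the process start-up cost.
    Process p = new ProcessBuilder("file", "--mime-type", "-b",
        input.getAbsolutePath()).start();
    BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
    System.out.println("file command: " + out.readLine());
    p.waitFor();
  }
}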

This study is complemented by [21], which investigates the best approach for applying workflows to the Hadoop execution platform. Two possibilities present themselves: i) use Taverna as a scheduler and execute Taverna activities (sub-components of a workflow), compiled as MapReduce applications, on the Hadoop cluster; or ii) use Hadoop as a scheduler to run entire workflows, compiled for the platform, on the cluster. There are advantages and disadvantages to both approaches, such as the passing of file references between Hadoop and Taverna or the level of integration with Taverna; performance-wise, however, using Hadoop as the scheduler and running entire, compiled workflows is significantly faster (despite the need to create an initial sequence file for processing) than using Taverna as the scheduler (see [21] for further details). Under this approach, Taverna workflows need to be compiled to Parallel Preservation Components (through a Taverna-to-Hadoop compiler) for execution on the SCAPE Parallel Execution System.

Microsoft Azure

Windows Azure is an open and flexible cloud computing platform (Platform as a Service) that is used to build, host and scale applications across a global network of Microsoft-managed datacentres. It is possible to build applications using any language, tool or framework with features and services being exposed via open REST protocols. Azure provides a robust messaging system that allows for existing IT infrastructures to be integrated with applications running within the Azure environment, enabling the creation of scalable distributed applications and hybrid solutions that run across both cloud and on-premise environments.

Azure allows applications to be scaled to any size, with resource usage management available in real time. Application code can be reliably hosted and scaled, either vertically or horizontally, within compute roles. Data storage is available via relational SQL databases, NoSQL table stores or unstructured Blob stores, with the option to use Hadoop and business intelligence services to mine this data. Further details about Azure can be found at [18].

Microsoft Azure within SCAPE

Within SCAPE, Windows Azure is used as another SCAPE platform. SCAPE Preservation Actions are run within Azure Worker Roles and communicate via internal endpoints. A Worker Role can be thought of as a process within an OS that is managed by Windows Azure (i.e. updated, spawned, etc.); as the name suggests, these are typically used for background processing of data. In addition to these Preservation Actions, Word Automation Services are run within Virtual Machine (VM) Roles to provide efficient batch processing of Word-format conversions. VM Roles can be considered more like dedicated instances of an OS that users need to manage and maintain themselves.

Further details about the use of Microsoft Azure within SCAPE are still to be defined.

Required Interfaces

The Execution Platform must implement the Job Execution Service API.

Job Execution Service API

The Job Execution Service API provides a REST interface for executing and monitoring Parallel Preservation Components on the Parallel Execution System. The Digital Object Repository acts as a client to this service and is responsible for initiating execution of Parallel Preservation Components, as defined by a preservation plan, against the data that it manages. This data should reside on the Parallel Execution System's distributed storage prior to execution; the user is responsible for ensuring that the data is accessible from the Execution Platform prior to plan execution.

A Job Execution Service can be used by multiple clients enabling one platform to provide execution services for multiple Digital Object Repositories.

The API is yet to be defined and documented, with Milestone 32 and Deliverable 5.2 providing focus for this work.
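Purely as an illustration of the intended interaction pattern, the sketch below shows how a client such as the DOR might submit a job over HTTP; the endpoint, payload and headers are invented for this example and will be superseded by the actual API definition in D5.2.

// Purely hypothetical sketch: submitting a job to the Job Execution Service.
// The URL, payload structure and response handling are invented for illustration.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SubmitJobSketch {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://platform.example.org/jobexec/jobs"); // hypothetical endpoint
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/xml");

    // Hypothetical payload: which Parallel Preservation Component to run and a
    // Data Connector API URL identifying the objects to operate on.
    String payload =
        "<job>" +
        "  <component>file-identification</component>" +
        "  <input>http://repository.example.org/dor/entities?query=collection=web</input>" +
        "</job>";
    OutputStream out = conn.getOutputStream();
    out.write(payload.getBytes("UTF-8"));
    out.close();

    // A 201 Created response with a job URI in the Location header would allow
    // the client to poll for job status.
    System.out.println(conn.getResponseCode() + " " + conn.getHeaderField("Location"));
  }
}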

Packaging and Deploying

No specific deployment or infrastructure is prescribed by SCAPE; indeed, the intention is for the platform to be versatile enough to suit individual institutions' needs. The system may be hosted on private or institutionally shared hardware, it may be hosted by an external data centre, or it may be deployed on an IaaS infrastructure through virtualisation.

Platform Releases

The Platform Concept Release software was released in summer 2012, consisting of the Central Instance platforms and a MapReduce tool wrapper that enables a user to easily execute command line applications as MapReduce jobs. The software is currently available from the SCAPE GitHub repository at https://github.com/downloads/openplanets/scape/pt-mapred-demo.tar.gz

This tool is used with a Hadoop installation (for example, Hadoop can be installed on a PC using a virtual machine such as VirtualBox running Ubuntu) and can execute command line applications such as the Unix File command or FITS file identification as a MapReduce job. The command to be executed is specified in a toolspec file (and passed in as an argument when executing the MapReduce job).

The first platform release is due in M24.

Platform Instances

There are two types of platform instance currently envisaged within SCAPE: Central Instances and Local Instances.

Central Instances

Central Instances are designed to provide SCAPE participants with pre-configured infrastructure upon which to experiment with platform software, to test and benchmark tools, workflows and Testbed scenarios, as well as to provide a platform for public demonstrations. Two instances are currently available, one from AIT and the other from IMF.

The AIT instance initially comprises a cluster of 10 virtual nodes (total 10 CPU cores) with an aggregated HDFS capacity of about 4TB (maximum 400GB per node). The platform is running Apache Hadoop (0.20.2-cdh3u2). A Fedora Commons-based repository is being added. Further details about how to connect to this cluster are described in [11].

The IMF instance consists of three dual-core AMD 1.6GHz (total 6 CPU cores), low consumption nodes, each with 8GB RAM and 15TB storage (5x 3TB HDDs). Details about how to connect and use this cluster are described in [12].

Local Instances

Local Instances are platform instances set up and maintained by an institution, primarily to evaluate their own data sets. This typically occurs when an institution has licensing restrictions on the data preventing it from being uploaded to a public repository. By implementing a platform instance, institutions will be able to validate SCAPE's component-oriented architecture and the ability to deploy the SCAPE platform across various hardware and software platforms (e.g. using DORs other than the SCAPE reference implementations).

Such instances may or may not be available to other SCAPE members.

Roadmap

The Execution Platform component forms a large and important section of work. A previous milestone delivered an initial platform concept release, consisting of the Central Instance platforms and the MapReduce tool wrapper software. A first platform release is due in M24, followed by deliverable D4.1, providing details on the design of the Execution Platform including its main components, layering and interactions.

Table 3: Upcoming Execution Platform Milestones/Deliverables

Milestone/Deliverable | Description | Due
MS27 | First Platform Release | M24
D4.1 | Architecture Design | M26
D4.2 | Final Release | M36

Of concern to the Execution Platform is the means to execute workflows with high performance in order to process the large datasets exposed by SCAPE partners. The approach taken within SCAPE (based on experimental evidence) is to compile workflows into Parallel Preservation Components for execution on the Hadoop-based Parallel Execution System, which requires a Taverna-to-Hadoop compiler. An initial version of this compiler is due in M20 and will continue to be developed thereafter. This Parallel Preservation sub-component will need to integrate with the SCAPE Component Catalogue, so it is imperative that an appropriate Component Lookup API is defined in a timely fashion.

Table 4: Upcoming Parallel Preservation Component Milestones/Deliverables

Milestone/Deliverable | Description | Due
MS34 | Initial Translator for Taverna Workflows into PPL Algebra | M20
MS35 | Executing PPL on Hadoop | M21
MS36 | Enhanced compiler and optimiser for Taverna Workflows | M30
MS37 | Final evaluation of parallelisation approaches for preservation | M38
D6.1 | Report on the Feasibility of Parallelising Preservation Processes | M26
D6.2 | Demonstrator and Report on Workflow Compilation and Parallel Execution | M34
D6.3 | Optimisation of preservation processes | M38

Finally, the Execution Platform component is responsible for providing an interface for initiating execution of Preservation Plans and monitoring their progress. The Job Execution Service API necessary for this is yet to be defined, although a prototype is due in M24.

Table 5: Upcoming Job Execution Service Component Milestones/Deliverables

Milestone/Deliverable | Description | Due
MS32 | Job Execution Service Prototype | M24
D5.2 | Job Submission and Language Interface | M28

The full details of how Microsoft Azure will integrate within SCAPE are currently unclear, although further details are expected at the upcoming Platform Ramp-up meeting (5-7th September, STFC).

One further consideration that needs to be addressed is recommendation 6 from SCAPE's first year review, which suggests considering the use of generally available cloud and/or grid infrastructures in order to make the SCAPE infrastructure available to a wider set of content providers. This is currently being considered within the SCAPE Platform community.

Digital Object Repository (DOR)

A Digital Object Repository (DOR) is an OAIS compliant repository, providing a data management solution for storing the content and metadata of digital objects as well as preservation plans, and is responsible for helping its user community deposit, curate, preserve and access such content. It exposes its services through well-defined APIs, enabling it to interact with the Execution Platform in order to carry out preservation actions, or to interact with external (to the Platform) components, such as Planning and Watch, triggering execution of preservation plans or reporting information back to the Watch component.

Digital objects are comprised of content, the actual data to be preserved such as images or audio/video files, and metadata representing the technical, administrative, structural and preservation information. Semantic relationships between digital objects are possible, and these are represented using RDF and stored within a triplestore. A DOR must therefore provide the means to store the contents and metadata of digital objects, along with any relationships, and in particular, make this information accessible to other components, such as the Execution Platform. The latter requirement is achieved through a HTTP based interface, known as the Data Connector API.

Accessing digital objects through a HTTP interface has performance implications that should be considered, however. The request duration overhead when requesting binary content via HTTP varies depending on the size of the requested content: with small content the overhead is negligible, but with large binary content it becomes significant. To accommodate this, SCAPE defines two strategies, letting stakeholders make the most appropriate choice to suit their needs: a Managed Content approach, whereby files are accessible only through the Data Connector API; or a Referenced Content approach, whereby files are stored in a file system directly accessible by the SCAPE platform and the Data Connector API merely passes references to this content. The former approach is not suited to large amounts of data, or to situations where storage and computation are geographically separated, because of the IO overhead of data retrieval; the latter, on the other hand, is suitable for large files as they can be handled (by reference) without having to move them between machines, although it does mean that the storage file system must be directly accessible to the platform.

In order to provide efficient computation, the DOR may store (or replicate) its content directly to the Execution Platform's storage system. It may also store outcomes of workflows (or parts thereof) that have been executed against the DOR's contents, so it is vital that the DOR employ a suitable data model and scalable object store. Transfer of data to the Execution Platform's storage system (i.e. HDFS) is the administrator's responsibility.

Batch loading of data into a DOR will be supported by a Loader Application, which handles validation, error logging and retrying, and makes use of a HTTP endpoint for ingesting objects into the repository. Authentication is achieved through HTTP Basic Authentication, with encrypted communication using HTTP over SSL/TLS being highly recommended. Full details about the RESTful API are described in the Connector API specification [4].
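The sketch below illustrates, in Java, what such a deposit might look like: a METS-based SIP is streamed to the repository over HTTPS with HTTP Basic Authentication. The endpoint path and credentials are hypothetical; the authoritative endpoints are those defined in [4].

// Illustrative only: depositing a SIP over HTTPS with HTTP Basic Authentication.
// The endpoint path and credentials below are hypothetical.
import java.io.FileInputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.bind.DatatypeConverter;

public class DepositSip {
  public static void main(String[] args) throws Exception {
    URL url = new URL("https://repository.example.org/scape/entity"); // hypothetical endpoint
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "text/xml");

    // HTTP Basic Authentication: "user:password", Base64 encoded, over TLS.
    String credentials = DatatypeConverter.printBase64Binary("loader:secret".getBytes("UTF-8"));
    conn.setRequestProperty("Authorization", "Basic " + credentials);

    // Stream the METS-based SIP (path given on the command line) to the repository.
    FileInputStream sip = new FileInputStream(args[0]);
    OutputStream out = conn.getOutputStream();
    byte[] buffer = new byte[8192];
    int read;
    while ((read = sip.read(buffer)) != -1) {
      out.write(buffer, 0, read);
    }
    out.close();
    sip.close();
    System.out.println("Repository responded: " + conn.getResponseCode());
  }
}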

Digital Object Model

Existing repositories already provide their own Digital Object Model for effective storage of digital content and metadata. Such diversity is a hindrance to the SCAPE platform in terms of being able to integrate successfully with every repository; instead, a common Digital Object Model is required.

The SCAPE Digital Object Model is described in detail in [REF]; essentially, however, it is based on a combination of a METS XML container and PREMIS preservation metadata. Each Intellectual Entity is represented by one METS file, and each Representation and File will be described by administrative metadata.

The OAIS model describes, at an abstract level, the requirements that a long-term preservation archival system must fulfil. Within this model is the notion of three Information Packages: Submission Information Package (SIP); Archival Information Package (AIP); and Dissemination Information Package (DIP). Within SCAPE, these packages are METS files adhering to the profile defined in [REF], which defines the mandatory, optional and forbidden elements along with the metadata schemas that should be used for metadata (e.g. descriptive metadata must only use Dublin Core terms, and rights metadata must only use PREMIS rights schema). Each METS document must be assigned a globally resolvable, persistent and unique identifier (recorded in the OBJID attribute), although no specific schema is prescribed.

As an ingestion package, the SIP is slightly more flexible, in terms of the minimum elements that should be present in the METS file, than the AIP or DIP. For example, no <amdSec> element is required in a SIP. Furthermore, no METS identifier is needed assuming that one will be assigned to the AIP by the repository. Both the AIP and DIP however, have the same profile containing technical and digital preservation metadata and potentially information about the preservation plan associated with the Intellectual Entity.

Preservation Plans

Preservation plans can be serialised to XML based on the PLATO XML Schema definition [19]; this schema needs updating to reflect changes required within SCAPE. The plan itself is stored as an AIP in the repository. Executed plans have their provenance information and plan execution details stored in the digital provenance section of the AIP.

Reference Repositories

Four repositories are targeted as reference implementations for the SCAPE repository:

  • Fedora/eSciDoc
  • Fedora/DOMS
  • Fedora/RODA
  • Rosetta

The Fedora-based eSciDoc repository will be used as a reference implementation, with DOMS and RODA implementing the necessary functionality based on this reference implementation.

Required Interfaces

To be used within the SCAPE platform, repository systems must implement three APIs: the Data Connector API, the Report API and the Plan Management API. Any repository implementing these APIs should be usable in a SCAPE platform.

Data Connector API

The Data Connector API integrates different repositories with the various SCAPE components, allowing these components to access the repository content and preservation plans. It does this by exposing a well defined RESTful interface via HTTP services. Discovery of objects is via an SRU (Search/Retrieve via URL) [2] search endpoint. This Data Connector API is defined in [4].

Report API

The Report API enables communication between a DOR and the Watch component.

The Watch component must monitor repositories, amongst other sources, for information about their contents and the actions that take place on them. In general terms, the Watch component defines Source Adapters to collect information from each source; however, as each repository has its own internal information structure and naming schemes, the Watch component would have to create a new Source Adapter for each repository. To prevent this, integration between Watch and repositories is split into two parts: a Report API, implemented by every repository, which provides a unified interface enabling Watch to retrieve information about events taking place in the repository; and a repository Source Adapter, implemented by the Watch component, which connects to the Report API.

From the repository's point of view, the Report API is sufficient to enable the repository to be used as a Watch input, i.e. the repository does not need to implement the ISourceAccess API.

The events exposed and the methods that must be implemented by this Report API are defined in [5] and based upon the OAI-PMH protocol [6].
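As an illustration of how a Watch Source Adapter might poll the Report API, the following Java sketch issues an OAI-PMH style ListRecords request; the base URL and metadata prefix are assumptions for this example rather than values taken from [5].

// Illustrative only: polling a Report API endpoint using standard OAI-PMH
// request parameters. The base URL and metadata prefix are hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ReportApiPoll {
  public static void main(String[] args) throws Exception {
    String baseUrl = "http://repository.example.org/report"; // hypothetical base URL
    // Standard OAI-PMH parameters: harvest records describing repository
    // events (e.g. ingests) since the last poll.
    URL request = new URL(baseUrl
        + "?verb=ListRecords"
        + "&metadataPrefix=premis-event"      // hypothetical prefix
        + "&from=2012-09-01T00:00:00Z");

    BufferedReader in = new BufferedReader(
        new InputStreamReader(request.openStream(), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line); // in practice the XML response would be parsed, not printed
    }
    in.close();
  }
}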

Plan Management API

The Plan Management API provides HTTP endpoints for the retrieval and management of preservation plans from the SCAPE digital object repository. Plans are represented using XML and can be searched for, based on their significant properties, using SRU (Search/Retrieve via URL) [2] through the relevant endpoint. Queries are expressed using the Contextual Query Language (CQL) [3].

Endpoints are defined in [1], along with relevant HTTP status codes.
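To illustrate the search mechanism, the Java sketch below constructs an SRU searchRetrieve request carrying a CQL query; the endpoint path and the queried plan property are hypothetical, with the authoritative definitions given in [1].

// Illustrative only: an SRU searchRetrieve request [2] carrying a CQL query [3].
// The endpoint path and the plan property used in the query are hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class PlanSearchSketch {
  public static void main(String[] args) throws Exception {
    String endpoint = "http://repository.example.org/scape/sru/plans"; // hypothetical endpoint
    String cql = "planState = ENABLED"; // hypothetical CQL query on a plan property

    URL request = new URL(endpoint
        + "?operation=searchRetrieve"
        + "&version=1.2"
        + "&maximumRecords=20"
        + "&query=" + URLEncoder.encode(cql, "UTF-8"));

    BufferedReader in = new BufferedReader(
        new InputStreamReader(request.openStream(), "UTF-8"));
    for (String line; (line = in.readLine()) != null; ) {
      System.out.println(line); // the XML response would normally be parsed
    }
    in.close();
  }
}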

Roadmap

Of particular importance as a SCAPE output is the DOR reference implementation, which will provide insight and guidance on how to implement the three main APIs required by a DOR, as well as demonstrate the data structure ('content model') support needed by DORs in order to trace provenance information and digital object versions. The reference implementation will be based on eSciDoc, with DOMS and RODA implementing the necessary functionality based on this reference implementation in order to demonstrate the adaptability of Fedora-based repositories and give credence to the reference implementation's approach. The upcoming milestones and deliverables reflect this work.

In addition, a Technology Compatibility Kit (TCK) will also be developed to enable repository developers to test their implementation of the Data Connector API.  The TCK will consist of a HTTP Client to test the various Data Connector endpoints, essentially mocking an implementation of a client and testing creation, retrieval and updating of objects within the repository, in accordance with the specification.

Table 6: Upcoming Digital Object Repository Milestones/Deliverables

Milestone/Deliverable | Description | Due
MS43 | Preservation-Aware Content Models Reference Implementation | M30
MS44 | Reference Implementation with Interface to Executable Workflows | M36
D8.1 | Recommendations for Preservation-aware Content Models | M36
D8.2 | Reference implementation of DOR with interfaces to preservation components, workflows, and execution | M42

SCAPE Data Publication Platform

Functional Overview

The SCAPE Data Publication Platform provides a scalable means to publish linked-data results from experiments and workflows whilst recording provenance and versioning information about the results, e.g. who published the results, when were they published, what tools were used. Providing this additional metadata establishes trust in the data, and provides access to historical information enabling decision processes based on this data to be reviewed.

This repository and publishing point for SCAPE experimental results allows them to be historically referenced. As an example, consider a workflow executing the DROID file identification tool over a sample file set. When executed with particular versions of DROID, or with different signature files, DROID may incorrectly identify specific file formats (e.g. Microsoft Word docx); as tool and signature file development iterates, inaccuracies will be corrected (although new ones may be introduced). The file format identification coverage of any specific version of the software, or of a signature file, is therefore hard to ascertain without referencing experimental results.

Another example of experimental results for publication is comparable metrics for digital preservation tools or workflows. These metrics could be performance based, e.g. tool X takes two hours to convert data set Y to PDF while tool Z takes four hours, or quality based, e.g. tool X lost the headers and footers from the document pages while tool Y retained them. As the SCAPE Preservation Component work packages and TestBeds continue to develop new tools and workflows, they aim to produce exactly this type of data.

The SCAPE Data Publication Platform aims to store experimental results with additional temporal information making it possible to capture and publish changing tool behaviour in a form where the associated risks can be discovered and reported by the Watch component. Data could either be pushed from the Publication Platform to the Watch Knowledge Base via the Watch Push API, or a Watch Source Adaptor for the experimental data could be developed.

Why Linked Data?

Automating the Watch component as much as possible, in particular the access and retrieval of Source information, would greatly improve the scalability (and reliability) of this component. Therefore, in terms of accessing experimental results data, such information ideally needs to be in a self-describing form capable of being consumed by other computing components (i.e. Watch). This open, identifiable data enables the generation of new knowledge through linking multiple datasets and complex reasoning; for example, P2's linking of PRONOM and DBpedia enabled answers to questions such as "What tool can open a particular file?".

Linked data could therefore be of major benefit to the Watch component, but there are well-known challenges with using linked data, especially when concerned with digital preservation. In particular, trust and provenance information are hard to come by; data is represented by RDF triples which describe the relationship (predicate) between some subject and an object (value), however there is no notion of who published this information and when. Relatedly, most data in linked datasets represents only the current knowledge - it is hard to get historic data. To help overcome these challenges the Linked Data Simple Storage Specification has been defined and shall be used as a convenient, scalable means to store and publish SCAPE workflow data.

The Data Publication Process

A little needs to be said about the process of publishing SCAPE experimental data. The Data Publication Platform is not intended to store the type of temporary data that is generated while developing and testing a tool or workflow. It is designed to provide a permanent home for significant experimental data that is of value to others in the digital preservation field, e.g. preservation planners or tool developers. The experimental results published should be the results of reproducible digital preservation experiments performed on open data sets. The first part of the process is the gathering of experimental data; the form of the data is not important as long as it is machine interpretable, i.e. CSV, XML, JSON, etc. are all suitable. It is important that the experiment is performed on a data set that can be openly shared, for example the GovDocs corpus. Experiments performed on private data sets are not reproducible and the results will not be considered for publication.

The data set and results can then be considered for publication. Details of the data set, and where it can be obtained, will be published; the exact publication location is still to be confirmed.


Specific loaders will have to be developed to convert the data into a form suitable for loading into the LDS3 (see below) store.

Technical Overview

Linked Data Simple Storage Specification

Building on the P2-Registry [REF], the Linked Data Simple Storage Specification (LDS3) provides a system for automating the process of publishing data, whilst helping to maintain trust and versioning information. It does this by extending the triple-based RDF model to a quad model, known as a named graph, utilising the fourth dimension to convey facts about the author, publisher, publication time, etc. This is enforced by LDS3, which automatically annotates hosted data with publisher and publication time, alleviating the user of this task. Resources (e.g. people, file formats, etc.) cannot be directly created, updated or deleted; instead they have to be described in a published document, i.e. a named graph.
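The following Java sketch (using Apache Jena, which is not part of the LDS3 reference implementation) is intended only to illustrate the quad model: the experimental fact sits inside a named graph, while statements about the graph URI itself capture publisher and publication time; all URIs and values are invented for the example.

// Illustrative only: a fact stored inside a named graph, with publication
// metadata attached to the graph URI itself (the "fourth dimension").
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.vocabulary.DCTerms;

public class NamedGraphSketch {
  public static void main(String[] args) {
    Dataset dataset = DatasetFactory.create();
    String graphUri = "http://data.example.org/doc/droid-results-2012-09"; // hypothetical

    // The experimental fact lives inside the named graph.
    Model graph = dataset.getNamedModel(graphUri);
    graph.createResource("http://example.org/file/report.docx")
         .addProperty(DCTerms.format, "fmt/412"); // illustrative PRONOM format id

    // Provenance about the publication is attached to the graph URI itself.
    Model defaultModel = dataset.getDefaultModel();
    defaultModel.createResource(graphUri)
                .addProperty(DCTerms.publisher, "SCAPE Data Publication Platform")
                .addProperty(DCTerms.created, "2012-09-04T10:00:00Z");

    RDFDataMgr.write(System.out, dataset, Lang.TRIG); // serialise as TriG quads
  }
}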

Named graphs are versioned through a combination of GUID and time stamp in the URI scheme used to reference data publications. In this manner, both specific time stamped versions of a publication can be retrieved from storage, as well as the latest version (no time stamp specified).

LDS3 provides a HTTP CRUD (Create, Retrieve, Update, Delete) based interface. Data is HTTP POSTed to the server, which returns the location of the created resource. An additional (edit-)IRI is also returned that is used to update or delete the document; this, coupled with the fact that users can only manipulate data through published documents, means that such amendments are restricted to only that data which a specific user added. All HTTP REST requests must be signed as per the approach employed by Amazon's Simple Storage Service (S3) [REF]. Only the request portion of a transaction is signed, meaning there is no performance degradation as only uni-directional communication is required from client to server.
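A minimal sketch of S3-style request signing is shown below; the string-to-sign layout and header format are assumptions for illustration only, with the authoritative details given in the LDS3 specification.

// Illustrative only: an HMAC signature over selected request elements, sent in
// the Authorization header. The string-to-sign layout below is an assumption.
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import javax.xml.bind.DatatypeConverter;

public class SignedRequestSketch {
  public static void main(String[] args) throws Exception {
    String keyId = "demo-key-id";   // hypothetical credentials
    String secret = "demo-secret";

    // Assumed string-to-sign: HTTP method, content type, date and resource path.
    String stringToSign = "POST\napplication/rdf+xml\n"
        + "Tue, 04 Sep 2012 10:00:00 GMT\n/documents/";

    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(new SecretKeySpec(secret.getBytes("UTF-8"), "HmacSHA1"));
    String signature = DatatypeConverter.printBase64Binary(
        mac.doFinal(stringToSign.getBytes("UTF-8")));

    // Only the request is signed, so no additional round trip is required.
    System.out.println("Authorization: " + keyId + ":" + signature);
  }
}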

The full specification is available at [http://www.lds3.org/Specification].

Reference Implementation

A reference implementation of the LDS3 specification has been developed, utilising existing libraries where possible. The OAuth2 module [REF] is used for users to register and obtain authentication key-pairs used in authenticating requests. Document annotation is performed by the Graphite library [REF]. The quad store, 4store [REF], is used to store the quads, enabling their indexing and querying. A patched version of the Puelia-PHP application [REF] (which is a PHP implementation of the Linked-Data API [http://code.google.com/p/linked-data-api/]) is used to handle incoming requests in accordance with a dataset configuration file, which details a URI pattern to match and a corresponding SPARQL query to execute; the patch enables retrieval of named graphs from the document URL, a dated URI or an edit-IRI.

Linking with SCAPE

The LDS3 specification imposes some restrictions on clients [TODO: Check LDS3 spec for example]. Whilst creating and updating publications within the LDS3 server can be done through HTTP requests (that conform to the specification), the Watch component expects Sources to implement a Push or Pull interface, which is likely to differ from the LDS3 REST-based API. A simple adapter may therefore be required to interface between the Watch component and the SCAPE Data Endpoint. An LDS3 client module will also be required on the Platform to publish data.

Open question: given that the experimental data will take different forms (performance metric data is not of the same form as format identification data, for example), creating a single, generalised Source Adaptor may prove difficult. If each result set requires a bespoke data transformation, use of the Push API may be more practical.

Roadmap

A reference implementation of the LDS3 specification has been developed. Appropriate connection with the Watch component needs to be considered, potentially requiring a simple adapter to provide the interface.

Table 7: Upcoming SCAPE Data Endpoint Milestones

Milestone/Deliverable | Description | Due
MS89 | Result Evaluation Framework (REF) containing Identification Data | M25

User Agents

User Agents are key components offering user interaction through graphical user interfaces or command line interfaces. The term is principally a collective one used to group together UI components existing outside of other components (which interact through exposed APIs); other user interfaces may therefore also exist, for example a PLATO planning UI.

Taverna Workflow Modelling Environment

Taverna Workbench [23] is a Java-based open source tool for designing and executing scientific workflows. It comprises a graphical workbench for creating and modifying workflows and the Taverna engine for executing them. The engine is also part of the Taverna Server, enabling remote execution of a workflow, and it is additionally available standalone or as a command line utility. This environment will be used to create SCAPE Component Workflows for storage in the SCAPE Component Catalogue.

SCAPE Plan Management GUI

As a means to view and control execution of preservation plans on the SCAPE Execution Platform, some form of user interface is required. This does not have to be a graphical interface; however, SCAPE are working on an example GUI [14].

The SCAPE Plan Management GUI will utilise the Plan Management API provided by the DOR to manage the plans available, and to initiate their execution. By utilising only this API, it will be possible for this GUI to be used by any SCAPE compliant DOR. It is feasible (and acceptable) however, for a DOR to implement its own user interface, bypassing or augmenting the Plan Management API to potentially provide enhanced support that is specific to that DOR. Any such implementation is outside the scope of SCAPE.

Loader Application

The Loader Application provides a means for an administrator to load digital objects into a DOR. It uses the Data Connector API provided by the DOR to enable the loader application to work with any DOR. As per the SCAPE Plan Management GUI, it is possible for a DOR to implement its own Loader Application, however this is outside the scope of SCAPE.

A reference implementation will be developed within SCAPE, resulting in an SDK that can be wrapped by a GUI or accessed through a command line interface. This implementation will address two main use cases: the ingest of managed content (where the SIP includes the metadata and binary object files); and the ingest of referenced content (where the SIP includes only the metadata and has URI references to previously uploaded binary object files).

SIPs are expected to be created prior to uploading, in accordance with the SCAPE Digital Object Model. They can be created manually or by a SCAPE SIP creation tool (still to be developed). The application is designed to support deposit of digital objects regardless of their size, allowing either for a SIP to be POSTed to the repository, or for a reference to its location to be POSTed so that the repository can retrieve it directly.

The Loader Application is specified in [13].

Roadmap

The Taverna Workbench is available for use already, however enhancements are needed for SCAPE use, in particular, capturing provenance information about digital objects as well as capturing and validating Component Profiles.

An example SCAPE Plan Management GUI is available [14], however this is currently only a front-end GUI with no connection to the DOR. An appropriate user interface (GUI or otherwise) is required to manage Preservation Plan executions via the Plan Management API.

A reference Loader Application will be developed as a means to load digital objects into a DOR. This will make use of the Data Connector API.

Table 8: Upcoming User Agents Milestones/Deliverables

Milestone/Deliverable | Description | Due
MS42 | Loader Application Reference Implementation Deployed on Shared TestBed | M24
D7.2 | Workflow Modelling Environment | M36

References

  1. "Plan Management API", F. Asseg, M. Hahn, 2012, SCAPE, https://portal.ait.ac.at/sites/Scape/PT/Shared%20Documents/PT.WP.5%20Repository%20Integration/Plan%20Management%20API.docx
  2. http://www.loc.gov/standards/sru/
  3. http://www.loc.gov/standards/sru/specs/cql.html
  4. "Connector API", F. Asseg, M. Hahn, 2012, SCAPE, https://portal.ait.ac.at/sites/Scape/PT/Shared%20Documents/PT.WP.5%20Repository%20Integration/SCAPE-Connector-API-final.doc
  5. "Report API Specification", R. Castro, M. Ferreira, L. Faria, F. Asseg, P. Petrov, 2012, SCAPE, https://portal.ait.ac.at/sites/Scape/PT/Shared%20Documents/PT.WP.5%20Repository%20Integration/SCAPE_ReportAPI.docx
  6. http://www.openarchives.org/OAI/openarchivesprotocol.html
  7. http://www.myexperiment.org
  8. "Preservation Components Profile", SCAPE wiki, v. 25, http://wiki.opf-labs.org/display/SP/Preservation+Component+Profiles
  9. http://portal.ait.ac.at/sites/Scape/PT/Shared%20Documents/PT.WP1%20Architecture/scape_logging_into_the_ait_cluster.txt
  10. http://portal.ait.ac.at/sites/Scape/PT/Shared%20Documents/PT.WP1%20Architecture/PT.WP1_SCAPE_Infrastructure_V0.4
  11. http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
  12. "Plan Management Mock-Up", SCAPE wiki, v. 4, http://wiki.opf-labs.org/display/SP/Plan+Management+Mock-Up
  13. "SCAPE - Loader Application", Y. Brama, R. Castro, F. Asseg, M. Hahn, 2012, SCAPE, http://portal.ait.ac.at/sites/Scape/PT/Shared%20Documents/PT.WP.5%20Repository%20Integration/SCAPE_Loader_Application-FINAL.docx
  14. "LSDR Executable Workflows for Experimental Execution", D16.1, C. Wilson, P. May, S. Schlarb, B. Jurik, https://portal.ait.ac.at/sites/Scape/Shared%20Documents/Deliverables/Final/SCAPE_D16.1_BL_V1.0.pdf
  15. http://wiki.opf-labs.org/display/SP/Preservation+Component+Profiles
  16. http://wiki.myexperiment.org/index.php/Developer:API
  17. http://wiki.myexperiment.org/index.php/Developer:WorkflowsResource#Create_workflow
  18. http://www.windowsazure.com/en-us/develop/overview/
  19. http://www.its.tuwien.ac.at/dp/plato/schemas/plato-3.0.1.xsd
  20. https://portal.ait.ac.at/sites/Scape/Shared%20Documents/Sub-Projects/Testbeds/TB.WP.1%20Web%20Content%20Testbed/work/2012.06%20-%20ARG.GZ-TIKA%20Hadoop%20experiment.doc
  21. https://portal.ait.ac.at/sites/Scape/Shared%20Documents/Sub-Projects/Platform/String%20Evaluation%20v1.1.pdf
  22. http://wiki.opf-labs.org/display/SP/PT.WP.4+Task+2+CP046+Requirements+documents+for+provenance+component
  23. http://www.taverna.org.uk/