Current state-of-the-art digital preservation procedures create plans that specify the preservation actions to be applied to well-understood and homogeneous parts of the content held in a repository, whilst conforming to specified objectives and constraints. A key goal of the SCAPE project is to develop appropriate mechanisms in order to help automate the initiation, monitoring, execution and evolution of such plans and help react to a dynamically changing environment and user behaviour. That is, to advance the control of digital preservation actions from ad-hoc decision making to a proactive, continuous preservation management.
The aim of the Automated Watch work package is to substantially improve automated support for effective digital preservation watch. To do this it is:
- Increasing the breadth and scope of collected digital preservation information.
- Normalizing and structuring gathered information into a queryable Knowledge Base.
- Allowing both human operators and automated systems to ask questions about information in the Knowledge Base.
- Developing software components to monitor information in the Knowledge Base for significant events.
The Automated Watch component is an automated information gathering and monitoring system that:
- Gathers information relevant to digital preservation activities from a wide variety of sources, e.g. repositories and technical registries.
- Allows software agents and human operators to add relevant information.
- Allows software agents and human operators to ask questions about gathered information through Watch Requests.
- Provides an automated monitoring system that looks for changing information and re-answers questions to a schedule.
- Assesses answers against conditions and triggers, and alerts external agents when conditions are met.
- Provides a simulator that analyses the information gathered from a user's repository and projects the future state of the repository to facilitate Preservation Planning for future needs.
A three-tier architecture has been developed, as depicted below:
Information is gathered via pull adaptors, which are developed to normalize and aggregate data from external sources. Alternatively, sources can push information to the watch system via the push source API. Adding a new pull source to the system means developing a compliant adaptor.
The Watch component comprises a number of sub-components that each add specific functionality to achieve the goal of monitoring the "state of the world" through various Sources of information and providing notifications to the planner. A planner is able to make a Watch Request, either synchronous or asynchronous, to the Watch component via a Client Service in order to query and be notified about some specific measurement(s) of interest. Synchronous Watch Requests are used to query for a specific measurement at a specific point in time, blocking the requesting client until the response is returned. Asynchronous requests are used to set the Watch component to monitor for changes in specific measurements (by specifying Conditions), triggering a Notification, for example an email, to the requesting client when such a change is detected. This approach does not block the requesting client. The notification type can be set when initiating the Watch Request.
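The difference between the two request styles can be illustrated with a minimal sketch. It is not the SCOUT implementation: the class `WatchComponent`, its methods `query`, `watch` and `update`, and the measurement name `jpeg2000_tool_count` are hypothetical names chosen for illustration, and a plain callback stands in for a real notification channel such as email.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Condition = Callable[[float], bool]
Notifier = Callable[[str], None]


@dataclass
class WatchComponent:
    """Hypothetical stand-in for the Watch component's client-facing service."""
    measurements: Dict[str, float] = field(default_factory=dict)
    subscriptions: List[Tuple[str, Condition, Notifier]] = field(default_factory=list)

    def query(self, name: str) -> float:
        # Synchronous request: the caller blocks until the value is returned.
        return self.measurements[name]

    def watch(self, name: str, condition: Condition, notify: Notifier) -> None:
        # Asynchronous request: register a Condition and return immediately.
        self.subscriptions.append((name, condition, notify))

    def update(self, name: str, value: float) -> None:
        # New information arrives; notify every client whose Condition is met.
        self.measurements[name] = value
        for watched, condition, notify in self.subscriptions:
            if watched == name and condition(value):
                notify(f"{name} changed to {value}")


component = WatchComponent({"jpeg2000_tool_count": 3})

# Synchronous: ask for the measurement as it stands now.
current = component.query("jpeg2000_tool_count")

# Asynchronous: ask to be notified once at least five tools are available.
inbox: List[str] = []
component.watch("jpeg2000_tool_count", lambda v: v >= 5, inbox.append)
component.update("jpeg2000_tool_count", 4)  # Condition not met, no notification
component.update("jpeg2000_tool_count", 6)  # Condition met, notification sent
```

In this sketch the notification type is fixed when `watch` is called, mirroring the statement that it is set when the Watch Request is initiated.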
The following sub-sections discuss the various sub-components involved in the Watch Component and how they interact.
Although not strictly a part of the Watch component, Sources are described here to aid understanding of the Watch sub-components. A Source represents specific aspects of the world for which there is a way of measuring the properties associated with it, and can be internal or external to the project. Key sources currently considered are:
- Format Registries
- SCAPE Preservation Components catalogue (MyExperiment)
- Policy models
- Experiment Results
- Content Profiles
- Human Knowledge
- Web Browser snapshots (being developed within Watch)
- Simulator to assess Planning and Watch decisions (being developed within Watch)
Sources are coloured pink in Figure 1 implying that, although they may also connect to other SCAPE components, they will interact with Source Adapters through either the Source Access Pull API or the REST Source Push API. An exception would be the Digital Object Repository which implements a Report API for interaction with a relevant Source Adapter.
A Source Adapter gathers information from a Source and transforms it to fit the Entity/Property model adopted for the Knowledge Base. There are two approaches to achieving this, push or pull. The choice between them depends on multiple factors, such as whether the Source is Watch component agnostic and whether it is possible to create software to run on the Source.
The Source Adapters employed should map to the Sources being used. The SCOUT preservation watch project contains two reference adaptors described below.
There is a PRONOM source adaptor developed as part of the Preservation Watch process. The adaptor queries the PRONOM Linked Data SPARQL endpoint and transforms the returned JSON into Entities/Properties for passing on to the Merging and Linking component. The adaptor code is part of the SCOUT project and can be found here.
C3PO is a content profiling tool developed outside of the SCAPE project. It doesn't perform any characterisation; instead it parses output from the FITS tool and aggregates it into a MongoDB document database. There is also a tool that retrieves FITS records from the RODA repository for consumption by C3PO. C3PO also provides a web based tool to view the aggregated data and a REST API for retrieving the aggregated profile data. The C3PO project can be found on GitHub here.
The C3PO adaptor reads and parses the XML data generated by the C3PO REST API and retrieves a subset of the content profile data. The data is then converted into the Watch Entities and Properties before being passed to the Data Merging and Linking component, which enriches the data before inserting it into the Watch Knowledge Base.
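The transformation step performed by such an adaptor can be sketched as follows. The XML layout in `SAMPLE_PROFILE`, the function name `profile_to_property_values` and the record keys are all assumptions made for illustration; they do not reproduce the actual output of the C3PO REST API or the SCOUT data model.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Union

# Hypothetical profile document; the real C3PO XML schema may differ.
SAMPLE_PROFILE = """
<profile collection="demo-collection">
  <property id="format">
    <item value="PDF" count="120"/>
    <item value="TIFF" count="30"/>
  </property>
</profile>
"""

Record = Dict[str, Union[str, int]]


def profile_to_property_values(xml_text: str) -> List[Record]:
    """Convert a content profile into flat Entity/Property/value records."""
    root = ET.fromstring(xml_text.strip())
    entity = root.get("collection")
    records: List[Record] = []
    for prop in root.findall("property"):
        for item in prop.findall("item"):
            records.append({
                "entity": entity,
                "property": f"{prop.get('id')}:{item.get('value')}",
                "value": int(item.get("count")),
            })
    return records


records = profile_to_property_values(SAMPLE_PROFILE)
```

Only a subset of the profile is extracted here (the format distribution), in line with the adaptor retrieving a subset of the content profile data.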
Both of these adaptors are in an early state of development, but they provide examples of how to develop a plug-in adaptor for the Preservation Watch system.
While there could be adaptors written for many different sources of information, the following will be developed as part of Automated Watch:
- RODA Adaptor & eSciDoc Adaptor
These are both repository adaptors that will gather information about a user's collection, e.g. Content Profiles.
- Component Catalogue Adaptor
The component catalogue adaptor will gather data from the SCAPE Preservation Component Catalogue via the Component Lookup API. The information gathered in the Knowledge Base will allow planners to find tools that meet their planning requirements, or alert them when tools become available that provide functionality required for preservation activities that was not previously available.
- Policy Model Adaptor
A policy model adaptor is assumed to be the means by which the Machine Interpretable Policy Model will be incorporated into the Automated Watch Knowledge Base. This source will allow the Automated Watch system to query an organisation's preservation policy and generate notifications when there are significant policy changes.
In order to control the scheduling of Source Adaptors used by a particular Automated Watch instance, a Source Adaptor Manager is being developed. This manager will allow an operator to:
- Install new adaptors.
- Upgrade installed adaptors.
- Enable or disable installed adaptors.
- Manage the scheduling of adaptors, i.e. how often the external sources are queried.
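The four operator actions listed above can be sketched as a small in-memory manager. This is a sketch under stated assumptions, not the manager being developed: the class names, the default interval of 24 hours and the use of integer hours as a clock are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AdaptorState:
    version: str
    interval_hours: int = 24          # assumed default query interval
    enabled: bool = True
    last_run: Optional[int] = None    # hour of the last query, None if never run


class SourceAdaptorManager:
    """Hypothetical manager covering install, upgrade, enable/disable, scheduling."""

    def __init__(self) -> None:
        self.adaptors: Dict[str, AdaptorState] = {}

    def install(self, name: str, version: str) -> None:
        if name in self.adaptors:
            raise ValueError(f"{name} is already installed")
        self.adaptors[name] = AdaptorState(version)

    def upgrade(self, name: str, version: str) -> None:
        self.adaptors[name].version = version

    def set_enabled(self, name: str, enabled: bool) -> None:
        self.adaptors[name].enabled = enabled

    def schedule(self, name: str, interval_hours: int) -> None:
        self.adaptors[name].interval_hours = interval_hours

    def mark_run(self, name: str, now: int) -> None:
        self.adaptors[name].last_run = now

    def due(self, now: int) -> List[str]:
        # Enabled adaptors that have never run or whose interval has elapsed.
        return sorted(
            name for name, state in self.adaptors.items()
            if state.enabled
            and (state.last_run is None or now - state.last_run >= state.interval_hours)
        )


manager = SourceAdaptorManager()
manager.install("pronom", "1.0")
manager.install("c3po", "0.1")
manager.schedule("pronom", 168)       # query the registry weekly
manager.set_enabled("c3po", False)    # disabled adaptors are never due

due_at_start = manager.due(0)
manager.mark_run("pronom", 0)
due_next_day = manager.due(24)
due_next_week = manager.due(168)
```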
This sub-component acts as a processing layer between the Source Adapters and the Knowledge Base. The source adaptors convert data to fit the internal data model, but different source adaptors may present contradictory or incompatible data. The merging and linking component further processes incoming data by:
- Merging data.
- Resolving inconsistencies between data sources.
- Providing additional cross-references between entities and properties.
It is the addition of the cross-references that will enable rich queries of the Knowledge Base. A simple example of the value added by the merging and linking component can be given by considering 3 simple sources:
- A MIME based format registry
- The PRONOM format registry
- A collection profile consisting of file format distribution of the collection
The format records gathered from the registries would need to be linked so that PRONOM IDs were linked with the appropriate MIME records. Additionally the file format distribution records would be linked to the appropriate format entities gathered from the format registries.
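The linking described in this example can be sketched as below. The record layouts, the function name `link_formats` and the sample values are assumptions for illustration only; the real component works on the Watch Entity/Property model rather than on plain dictionaries.

```python
from typing import Dict, List, Optional


def link_formats(
    pronom_records: List[Dict[str, str]],
    mime_records: List[Dict[str, str]],
    distribution: Dict[str, int],
) -> Dict[str, Dict[str, Optional[object]]]:
    """Cross-reference PRONOM records with MIME records and collection counts."""
    mime_by_type = {record["mime"]: record for record in mime_records}
    linked: Dict[str, Dict[str, Optional[object]]] = {}
    for record in pronom_records:
        mime = mime_by_type.get(record["mime"])
        linked[record["puid"]] = {
            "name": record["name"],
            "mime": record["mime"],
            # Link to the MIME registry entry when one exists.
            "mime_description": mime["description"] if mime else None,
            "object_count": 0,
        }
    # Attach the collection profile's format distribution to the format entities.
    for puid, count in distribution.items():
        if puid in linked:
            linked[puid]["object_count"] = count
    return linked


# Sample values for illustration only.
pronom = [
    {"puid": "fmt/18", "name": "Acrobat PDF 1.4", "mime": "application/pdf"},
    {"puid": "fmt/11", "name": "PNG 1.0", "mime": "image/png"},
]
mime_registry = [{"mime": "application/pdf", "description": "Portable Document Format"}]
profile = {"fmt/18": 120}

linked = link_formats(pronom, mime_registry, profile)
```

Once linked in this way, a single lookup on a PRONOM ID returns the MIME information and the number of matching objects in the collection, which is the kind of cross-source question the Knowledge Base is meant to answer.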
The component provides an internal Delegate Data interface used by Source Adapters and the Push Source API to push information to it, and makes use of the Knowledge Base's Submit Data API to submit data for permanent storage in the Knowledge Base. These internal APIs are being defined by Watch.
The Knowledge Base is responsible for storing representation information about the world using a model based on Entities and Property Values. Ultimately, each Entity describes a specific set of values that are measurements of each Property at a specific moment in time. For example, for a "format" Entity, relevant Properties might be "name" (e.g. JPEG2000), "version" (e.g. 1.0), or "tool support" (e.g. limited); over time, the tool support for JPEG2000 may increase, therefore at a later point a new Entity may indicate "tool support" as "widespread". Relevant internal APIs are provided to store and retrieve data from the Knowledge Base, namely Submit Data and Access Data.
A history of all knowledge gathered is kept in order to allow the Knowledge Base to be queried for past data thereby enabling repeatability of the decision making process. The Knowledge Base also stores all of the questions posed by software agents or external users.
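A minimal sketch of this model, assuming each measurement is stored with a timestamp and never overwritten, shows how past states can be queried. The names `PropertyValue`, `submit_data` and `access_data` echo the internal API names mentioned above but their signatures are hypothetical, and years are used as timestamps purely for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PropertyValue:
    entity: str        # e.g. a format
    prop: str          # e.g. "tool support"
    value: str
    measured_at: int   # moment of measurement (a year, in this sketch)


class KnowledgeBase:
    """Append-only store: new measurements never replace old ones."""

    def __init__(self) -> None:
        self._values: List[PropertyValue] = []

    def submit_data(self, value: PropertyValue) -> None:
        self._values.append(value)

    def access_data(self, entity: str, prop: str,
                    at: Optional[int] = None) -> Optional[PropertyValue]:
        # Latest measurement taken at or before `at` (or the latest overall).
        candidates = [
            v for v in self._values
            if v.entity == entity and v.prop == prop
            and (at is None or v.measured_at <= at)
        ]
        return max(candidates, key=lambda v: v.measured_at, default=None)


kb = KnowledgeBase()
kb.submit_data(PropertyValue("JPEG2000", "tool support", "limited", 2011))
kb.submit_data(PropertyValue("JPEG2000", "tool support", "widespread", 2014))

latest = kb.access_data("JPEG2000", "tool support")
as_of_2012 = kb.access_data("JPEG2000", "tool support", at=2012)
before_any = kb.access_data("JPEG2000", "tool support", at=2000)
```

Querying with `at` set to the date of an earlier decision returns the information that was available at that time, which is what makes the decision making process repeatable.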
It is planned to use RDF Linked Data as the model for storing data in the Knowledge Base, as this enables a simpler, more generic and more flexible data representation than a relational data model. Ontology stores already implement useful features such as boolean and algebraic logic, and support complex queries because they capture concepts and relationships, which will be useful for framing and answering the Watch Request questions. It is planned to use the SPARQL query language to represent Watch Request questions.
The Knowledge Base will also store the questions submitted to the Automated Watch system through the REST API. The questions may have been submitted by a human operator using the web interface or an automated component calling the REST API directly.
The Knowledge Base uses Apache Jena, a Java framework for storing and querying large RDF datasets. Jena also provides support for OWL ontologies and a rule-based inference engine for reasoning with RDF and OWL data sources.
This is a web interface which provides the following functionality for planners:
- The manual addition of information to the knowledge base.
- Browsing of the knowledge base.
- The submission of Watch Requests to the Automated Watch system.
The Watch Client GUI is a Java Web application, packaged as part of the Automated Watch web application. The GUI provides an interface that allows the user to browse the information in the knowledge base and to add new information through the Watch Push API.
The GUI also provides a means by which human operators can submit Watch Requests. Watch requests consist of:
- One or more pre-defined Questions that assess some aspect of the world.
- One or more Triggers that define rules for returning answers to the Watch Request.
Monitoring services observe one or more information sources and re-calculate answers to questions held in the Knowledge Base when the results rely upon external information that has changed. The questions are those added by human operators or automated systems as part of a Watch Request.
The Monitor sub-component provides a mechanism for continuously watching the Knowledge Base for changes to specific Watch Requests the client is interested in. To do this, it provides a Data or Question Changed interface for being notified about changes to the underlying data or the Watch Requests themselves. Upon receiving such an update, this sub-component will identify which Watch Requests require re-evaluation and instigate this re-evaluation through the Assessment Service.
[Watch Component: Monitor] Given that the Watch component architecture has evolved, are these four monitoring services still planned: Repository Monitor, Format Registry Monitor, Component Catalogue Monitor, and Policy Model Monitor?
The assessment service is responsible for evaluating Watch Requests using the latest information from the Knowledge Base, and for providing the Monitor with information about whether to send a Notification.
It is instigated by the Monitor sub-component in response to updated data being available in the Knowledge Base or an update to the Watch Request. Access to the Knowledge Base is provided by the internal Access Data interface, and the information received is compared against a Watch Request Trigger to determine if a significant event has occurred.
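The interplay between the Monitor and the Assessment Service can be sketched as follows. This is an illustrative sketch only: `WatchRequest`, `assess` and `on_data_changed` are hypothetical names, questions and triggers are modelled as plain Python callables rather than SPARQL queries, and the Knowledge Base is reduced to a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Knowledge = Dict[str, int]


@dataclass
class WatchRequest:
    name: str
    depends_on: Set[str]                    # properties the question reads
    question: Callable[[Knowledge], int]    # computes an answer from the data
    trigger: Callable[[int], bool]          # decides whether the answer is significant


def assess(request: WatchRequest, knowledge: Knowledge) -> Tuple[int, bool]:
    """Assessment: answer the question and compare the answer with the trigger."""
    answer = request.question(knowledge)
    return answer, request.trigger(answer)


def on_data_changed(changed: str, requests: List[WatchRequest],
                    knowledge: Knowledge) -> List[str]:
    """Monitor: re-evaluate only the requests affected by the changed property."""
    notifications: List[str] = []
    for request in requests:
        if changed not in request.depends_on:
            continue
        answer, significant = assess(request, knowledge)
        if significant:
            notifications.append(f"{request.name}: {answer}")
    return notifications


knowledge = {"tiff_object_count": 1200, "pdf_object_count": 40}
requests = [
    WatchRequest("large-tiff-share", {"tiff_object_count"},
                 lambda kb: kb["tiff_object_count"], lambda n: n > 1000),
    WatchRequest("large-pdf-share", {"pdf_object_count"},
                 lambda kb: kb["pdf_object_count"], lambda n: n > 1000),
]

tiff_notifications = on_data_changed("tiff_object_count", requests, knowledge)
pdf_notifications = on_data_changed("pdf_object_count", requests, knowledge)
```

Tracking which properties each question depends on lets the Monitor avoid re-evaluating every stored Watch Request whenever any data changes.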
When the Monitor sub-component detects a significant event, based upon the questions and conditions stored in the Knowledge Base, the Notification Service is used to alert interested parties.
The Automated Watch component must implement two external facing APIs: the Push Source Adaptor API and the Watch Request API.
The Automated Watch Push API provides a means for third party software agents to add information to the Watch Knowledge Base without the development of a Source Adaptor. Note that push sources will not be controlled by the Source Adaptor Manager, so scheduling, and indeed enabling or disabling an unwanted push source, will have to be done by other means. The API may also be used internally by the Watch Client Web GUI to add new information via the web front end.
In the push model, the Source will send information to the Watch component as and when it becomes available. Software must be developed for the Source component to achieve this, which in some circumstances may not be possible. The pull model ideally relies on the Source component providing a network accessible API to enable a relevant Source Adapter to request information directly, most likely on a periodic basis. If no such API exists, the adapter will have to extract information from whatever format the Source makes available (for example, by parsing the HTML of a web page). The frequency with which data is requested by a Source Adapter is controlled by the Monitor sub-component through the internal Adapter Configure interface.
This API will be used by external software agents to submit Watch Requests to the Automated Watch system. Typically the software agents will be:
- The Watch Client Web GUI.
- The Automated Planning System.
A watch request is made up of a number of pre-defined questions, drawn from the watch knowledge base, and a number of triggers. Triggers are assessed against questions and, if the trigger conditions set in the watch request are satisfied, the planner or software agent is notified, for example by email.
Both APIs are being implemented as RESTful services deployed with the Automated Watch Java Web Application.
The Automated Watch System is a Java Web Application built from the GitHub OpenPlanets SCOUT Maven Project and deployed as a Web Application Resource. The project relies upon the JBOSS Java EE 6 library, so it may require a dedicated JBOSS server rather than a Tomcat servlet container.
RESTful services are provided through Jersey, an implementation of RESTful web services for Java and part of the GlassFish project.
With current state-of-the-art procedures in digital preservation we can define organisational constraints and create plans that treat a homogeneous sub-set of a large repository. PLANETS defined a Preservation Plan as follows:
A preservation plan defines a series of preservation actions to be taken by a responsible institution due to an identified risk for a given set of digital objects or records (called collection). The Preservation Plan takes into account the preservation policies, legal obligations, organisational and technical constraints, user requirements and preservation goals and describes the preservation context, the evaluated preservation strategies and the resulting decision for one strategy, including the reasoning for the decision. It also specifies a series of steps or actions (called preservation action plan) along with responsibilities and rules and conditions for execution on the collection. Provided that the actions and their deployment as well as the technical environment allow it, this action plan is an executable workflow definition.REF
PLANETS also produced a preservation planning methodology, a structured workflow for creating, testing and evaluating preservation plans. The PLATO planning tool developed within PLANETS follows this workflow to build preservation plans. PLATO produces an executable preservation plan along with audit evidence documenting the decision making procedures used in creating the planREF. However the plans were:
- Largely constructed manually, which could be a time intensive procedure.
- Not normally applicable to all of an organisation's holdings, but restricted to a, normally homogeneous, sub-set of a collection.
- Not deployed and executed automatically in a repository.
- Monitored manually for changes in best practice, collection profile, etc.
Furthermore, no mechanism exists to relate preservation policies to preservation plans, so correlation has to be done manually.
The goals of SCAPE are to provide an automated planning component that is informed by:
- The accumulated knowledge of previous preservation plans
- An organisation's digital preservation policy
- An organisation's digital collections
- Other queries performed on the Automated Watch Knowledge Base, e.g. queries of File Format Registry information.
The Automated Planning component comprises three sub-components:
- The Plato Planning Tool.
Building upon the existing PLATO tool but using the Watch Component, the Policy Model and content profiles to automate the creation of preservation plans.
- A machine interpretable model of preservation policy elements.
Modelling preservation policies from the top down as a catalogue of higher level policy elements, and from the bottom up as a machine interpretable model of actionable low level policy elements, in order to inform and automate the planning process and provide information to the Watch Knowledge Base.
- A web-based analysis tool for mining the results of previous preservation plans.
A web GUI that can be used to query past preservation plans and provide decision support to the planning process.
The SCAPE planning component continues the development of the PLATO Planning tool used in the PLANETS project. As described above, the PLANETS PLATO tool was capable of producing executable preservation plans according to an established preservation planning methodology. These plans had the shortcomings described in the introduction; the aim of the automated planning tool is to address them. The automated planning tool is a web based
The PLATO Knowledge Base is based upon the accumulated experience of preservation plans. Plans from different organisations
This web-based tool supports the systematic and repeatable assessment of decision criteria and is fully compatible with the Plato planning tool. It enables decision makers to share their experiences and in turn build upon knowledge shared by others. Preservation plans are loaded from the planning tool's knowledge base, processed and anonymised, before being presented to the decision maker along with a number of features facilitating systematic analysis.
The Automated Planning work package is responsible for the development of the
Preservation policies are governance statements that constrain or drive operational Preservation Planning, but may also have other effects outside of operational planning. For Planning and Watch, policy elements have been divided into 3 classes:
- Guidance Policies:
  - are strategic, high level policies
  - are expressed in natural language
  - can't be expressed in machine interpretable form and require human interpretation
- Procedural Policies:
  - model the relation between guidance policies and control policies
  - can be represented in a formal model as the relation between guidance and control policies
- Control Policies:
  - are specific and can be represented in a semantic model
Only the control policies are guaranteed to be represented in the machine interpretable policy model. The development of the machine interpretable policy model is led by the development of a catalogue of policy elements.
The policy element catalogue provides a semantic representation of generic policy elements that is understandable by preservation systems. The initial version of the policy catalogue lists, as a table in Deliverable, a set of Guidance Policies which, by definition, will not appear in the machine interpretable model in their full form. Instead these must be broken down into sets of Procedural Policies, which in turn will be represented by sets of Control Policies that will be used to create the machine interpretable policy model. The catalogue will be refined iteratively by using it to express the real preservation policies of three partners representing the needs of Large Scale Digital Repositories, Web Archives, and Scientific Data Sets. Once validated, the catalogue will be used to develop the machine interpretable model.
The machine interpretable policy model provides a source for the Automated Watch system and will inform the Automated Planning system. Standard tools such as RDF/OWLREF will be used to define the terms used to describe and represent Control Policies and support policy reasoning. Like the catalogue, the policy model will undergo an iterative process of testing and refinement while being used to model the various Testbed scenarios.
There is a GitHub project where the semantic model of low-level Control Policies is being developed. The project contains:
- The current version of the policy model ontology.
- Some example properties, criteria, objectives, and scenarios.
- Some experimental queries developed in Java.
There are no recognised interfaces developed as part of the policy modelling work package. The Automated Watch and Automated Planning components are both responsible for developing software components that will interpret the model and base decisions upon the policy elements. The Policy Modelling work package is responsible for ensuring technical interoperability between these components and the policy model.
The Automated Planning Tool is a JBOSS SEAM based web-application.
There is a central instance of the Planning Tool hosted by TUWIEN, which is currently running version 3.0.1. New releases will continue to be hosted here.
Organisations wishing to host their own PLATO instance first require a JBOSS server, installed and set up separately, on which to host the application.
Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A., and Hofman, H. Systematic planning for digital preservation: Evaluating potential strategies and building preservation plans. International Journal on Digital Libraries (IJDL), December 2009. http://publik.tuwien.ac.at/files/PubDat_180752.pdf
Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A., and Hofman, H. Plato: A Service Oriented Decision Support System for Preservation Planning. http://publik.tuwien.ac.at/files/PubDat_170832.pdf