Current state-of-the-art digital preservation procedures create plans that specify the preservation actions to be applied to well-understood and homogeneous parts of the content held in a repository, whilst conforming to specified objectives and constraints. A key goal of the SCAPE project is to develop appropriate mechanisms in order to help automate the initiation, monitoring, execution and evolution of such plans and help react to a dynamically changing environment and user behaviour. That is, to advance the control of digital preservation actions from ad-hoc decision making to a proactive, continuous preservation management.
The aim of the Automated Watch work package is to substantially improve automated support for effective digital preservation watch. To do this it is:
- Increasing the breadth and scope of collected digital preservation information.
- Normalizing and structuring gathered information into a queryable Knowledge Base.
- Allowing both human users and automated systems to ask questions about information in the Knowledge Base.
- Developing software components to monitor information in the Knowledge Base for significant events.
The Automated Watch component is an automated monitoring system that identifies:
- Preservation risks.
- Curatorial opportunities (e.g. cost reduction).
- Possible shortcomings in current preservation actions.
This component provides the mechanisms for gathering information from various sources, including digital content and repositories, institutional policies, designated user communities, and other systems. A three-tier architecture has been developed, depicted below:
Information is gathered via pull adaptors, which are developed to normalize and aggregate data from external sources. Alternatively, sources can push information to the watch system via the push source API. Adding new sources to the system means developing a compliant adaptor.
The Watch component comprises a number of sub-components that each add specific functionality to achieve the goal of monitoring the "state of the world" through various Sources of information and providing notifications to the planner. A planner is able to make a Watch Request, either synchronous or asynchronous, to the Watch component via a Client Service in order to query and be notified about some specific measurement(s) of interest. Synchronous Watch Requests are used to query for a specific measurement at a specific point in time, blocking the requesting client until the response is returned. Asynchronous requests are used to set the Watch component to monitor for changes in specific measurements (by specifying Conditions), triggering a Notification, for example an email, to the requesting client when such a change is detected. This approach does not block the requesting client. The notification type can be set when initiating the Watch Request.
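The difference between the two request types can be illustrated with a minimal sketch. Everything here is hypothetical: the class name `WatchService`, its methods, and the measurement name `jpeg2000_share` are assumptions made for illustration and do not reflect the actual Client Service API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the Watch component; not the real Client Service API.
@dataclass
class WatchService:
    measurements: Dict[str, float] = field(default_factory=dict)
    subscriptions: List[Tuple[str, Callable, Callable]] = field(default_factory=list)

    def query(self, name: str) -> float:
        """Synchronous Watch Request: the caller waits for the current value."""
        return self.measurements[name]

    def subscribe(self, name: str, condition: Callable, notify: Callable) -> None:
        """Asynchronous Watch Request: register a Condition and return at once."""
        self.subscriptions.append((name, condition, notify))

    def update(self, name: str, value: float) -> None:
        """A Source reports a new measurement; matching Conditions are checked."""
        self.measurements[name] = value
        for sub_name, condition, notify in self.subscriptions:
            if sub_name == name and condition(value):
                notify(name, value)


service = WatchService({"jpeg2000_share": 0.10})
alerts = []  # stands in for a notification channel such as email

# Asynchronous: be notified once the share exceeds 25%.
service.subscribe("jpeg2000_share", lambda v: v > 0.25,
                  lambda n, v: alerts.append((n, v)))

# Synchronous: read the value as it is right now.
current = service.query("jpeg2000_share")

service.update("jpeg2000_share", 0.20)  # Condition not met, no notification
service.update("jpeg2000_share", 0.30)  # Condition met, notification sent
```

In this sketch the notification is a plain callback; in the described system the notification type would be chosen when the Watch Request is created.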
The following sub-sections discuss the various sub-components involved in the Watch Component and how they interact.
Although not strictly a part of the Watch component, Sources are described here to aid understanding of the Watch sub-components. A Source represents specific aspects of the world for which there is a way of measuring the properties associated with it, and can be internal or external to the project. Key sources currently considered are:
- Format Registries
- SCAPE Preservation Components catalogue (MyExperiment)
- Policy models
- Experiment Results
- Content Profiles
- Human Knowledge
- Web Browser snapshots (being developed within Watch)
- Simulator to assess Planning and Watch decisions (being developed within Watch)
Sources are coloured pink in Figure 1 implying that, although they may also connect to other SCAPE components, they will interact with Source Adapters through either the Source Access Pull API or the REST Source Push API. An exception would be the Digital Object Repository which implements a Report API for interaction with a relevant Source Adapter.
A Source Adapter gathers information from a Source and transforms it to fit the Entity/Property model adopted for the Knowledge Base. There are two approaches to achieving this: push and pull. The choice between them depends on factors such as whether the Source is Watch component agnostic and whether it is possible to create software to run on the Source.
In the push model, the Source sends information to the Watch component as and when it becomes available. Software must be developed for the Source component to achieve this, which in some circumstances may not be possible. The pull model ideally relies on the Source component providing a network-accessible API so that a relevant Source Adapter can request information directly, most likely on a periodic basis. If no such API exists, the adapter has to extract information from whatever format the Source makes available (for example, by parsing the HTML of a web page). The frequency with which data is requested by a Source Adapter is controlled by the Monitor sub-component through the internal Adapter Configure interface.
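As a rough sketch of the two models, the following shows a pull adapter that is polled and a push endpoint that receives records. The class names, the record shape and the `interval_seconds` parameter are assumptions for illustration only; they are not taken from the Watch code base.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List

Record = Dict[str, str]  # assumed shape of one normalised entity/property record


class PullAdapter(ABC):
    """Pull model: the Watch side asks the Source for data periodically."""

    def __init__(self, interval_seconds: int) -> None:
        # In the described design this frequency is set by the Monitor.
        self.interval_seconds = interval_seconds

    @abstractmethod
    def fetch(self) -> List[Record]:
        ...


class ListBackedAdapter(PullAdapter):
    """Toy adapter that 'pulls' from an in-memory list instead of a remote API."""

    def __init__(self, rows: List[Dict[str, str]], interval_seconds: int = 3600) -> None:
        super().__init__(interval_seconds)
        self._rows = rows

    def fetch(self) -> List[Record]:
        return [{"entity": r["id"], "property": "name", "value": r["name"]}
                for r in self._rows]


class PushEndpoint:
    """Push model: the Source decides when to send a record."""

    def __init__(self, delegate: Callable[[List[Record]], None]) -> None:
        self._delegate = delegate

    def receive(self, record: Record) -> None:
        self._delegate([record])


received: List[Record] = []  # stands in for the downstream processing layer

adapter = ListBackedAdapter([{"id": "fmt/18", "name": "Acrobat PDF 1.4"}])
received.extend(adapter.fetch())  # one polling cycle

PushEndpoint(received.extend).receive(
    {"entity": "collection-1", "property": "size", "value": "42"})
```

Both paths end in the same downstream handler, which mirrors the idea that pushed and pulled data are processed identically once they reach the Watch component.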
The Source Adapters employed should map to the Sources being used. The SCOUT preservation watch project contains two reference adaptors described below.
A PRONOM source adaptor has been developed as part of the Preservation Watch process. The adaptor queries the PRONOM Linked Data SPARQL endpoint and transforms the returned JSON into Entities/Properties for passing on to the Merging and Linking component. The adaptor code is part of the SCOUT project.
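The transformation step might look roughly like the sketch below. The input follows the standard SPARQL 1.1 Query Results JSON layout (`head.vars` and `results.bindings`); the variable names `puid`, `name` and `version`, the sample row and the output shape are illustrative assumptions, not the adaptor's actual query or data model.

```python
import json

# Illustrative sample in the SPARQL 1.1 JSON results layout (not real endpoint output).
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["puid", "name", "version"]},
  "results": {"bindings": [
    {"puid":    {"type": "literal", "value": "fmt/18"},
     "name":    {"type": "literal", "value": "Acrobat PDF 1.4"},
     "version": {"type": "literal", "value": "1.4"}}
  ]}
}
"""


def bindings_to_entities(sparql_json: str, key_var: str) -> dict:
    """Group each result row into an entity keyed by `key_var`,
    with the remaining variables stored as its properties."""
    document = json.loads(sparql_json)
    entities = {}
    for row in document["results"]["bindings"]:
        key = row[key_var]["value"]
        properties = entities.setdefault(key, {})
        for variable, cell in row.items():
            if variable != key_var:
                properties[variable] = cell["value"]
    return entities


entities = bindings_to_entities(SAMPLE_RESPONSE, key_var="puid")
```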
C3PO is a content profiling tool developed outside of the SCAPE project. It doesn't perform any characterisation; instead it parses output from the FITS tool and aggregates it into a MongoDB document database. There is also a tool that retrieves FITS records from the RODA repository for consumption by C3PO. C3PO also provides a web-based tool to view the aggregated data and a REST API for retrieving the aggregated profile data. The C3PO project can be found on GitHub.
The C3PO adaptor reads and parses the XML data generated by the C3PO REST API and retrieves a subset of the content profile data. The data is then converted into the Watch Entities and Properties before being passed to the Data Merging and Linking component, which enriches the data before inserting it into the Watch Knowledge Base.
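A minimal sketch of this parse-and-convert step is given below. The XML layout shown is invented for illustration and is not the actual C3PO profile format; the element names, attribute names and the output tuples are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile document; the real C3PO XML layout may differ.
SAMPLE_PROFILE = """
<profile collection="demo" count="3">
  <property id="format">
    <item value="PDF" count="2"/>
    <item value="TIFF" count="1"/>
  </property>
</profile>
"""


def profile_to_properties(xml_text: str) -> list:
    """Extract a subset of the profile as (entity, property, value) tuples."""
    root = ET.fromstring(xml_text.strip())
    entity = root.get("collection")
    result = [(entity, "object_count", int(root.get("count")))]
    for prop in root.findall("property"):
        distribution = {item.get("value"): int(item.get("count"))
                        for item in prop.findall("item")}
        result.append((entity, prop.get("id") + "_distribution", distribution))
    return result


properties = profile_to_properties(SAMPLE_PROFILE)
```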
The adaptor is in an early state of development, but provides an example of how to develop a plug-in adaptor for the Preservation Watch system.
- RODA Adaptor
- eSciDoc Adaptor
- Component Catalogue Adaptor
The component catalogue adaptor will gather data from the SCAPE Preservation Component Catalogue via the Component Lookup API. The information gathered in the Knowledge Base will allow planners to find tools that meet their planning requirements, or alert them when tools become available that provide functionality required for preservation activities that was not previously available.
- Policy Model Adaptor
A policy model adaptor is assumed to be the form of Preservation Watch component required to incorporate the Machine Interpretable Policy Model into the Automated Watch Knowledge Base. This source will allow the Automated Watch system to query an organisation's preservation policy and generate notifications when there are significant policy changes.
This sub-component acts as a processing layer between the Source Adapters and the Knowledge Base. The source adaptors convert data to fit the internal data model, but different source adaptors may present contradictory or incompatible data. The merging and linking component further processes incoming data by:
- Merging data.
- Resolving inconsistencies between data sources.
- Providing additional cross-references between entities and properties.
It is the addition of the cross-references that will enable rich queries of the Knowledge Base. A simple example of the value added by the merging and linking component can be given by considering three simple sources:
- A MIME based format registry
- The PRONOM format registry
- A collection profile consisting of file format distribution of the collection
The format records gathered from the registries would need to be linked so that PRONOM IDs were linked with the appropriate MIME records. Additionally the file format distribution records would be linked to the appropriate format entities gathered from the format registries.
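The linking described in this example can be sketched as follows. The record layouts, the field names and the sample values are assumptions for illustration, and `fmt/0` is a made-up identifier used only to show an unlinked profile entry.

```python
# Three toy sources; layouts and values are illustrative assumptions.
mime_registry = {"application/pdf": {"name": "Portable Document Format"}}
pronom_registry = {"fmt/18": {"name": "Acrobat PDF 1.4", "mime": "application/pdf"}}
collection_profile = {"fmt/18": 120, "fmt/0": 3}  # format id -> number of objects


def link_sources(mime_registry: dict, pronom_registry: dict, profile: dict):
    """Cross-reference PRONOM records with MIME records and profile counts.

    Returns the linked format entities and the profile ids that could not
    be matched to any registry record.
    """
    linked = {}
    for puid, record in pronom_registry.items():
        mime = record.get("mime")
        linked[puid] = {
            "name": record["name"],
            # Keep the link only if the MIME registry knows this type.
            "mime": mime if mime in mime_registry else None,
            "object_count": profile.get(puid, 0),
        }
    unlinked = sorted(puid for puid in profile if puid not in pronom_registry)
    return linked, unlinked


linked, unlinked = link_sources(mime_registry, pronom_registry, collection_profile)
```

Once the records are linked, a single lookup answers a question that spans all three sources, for example how many objects in the collection have a given MIME type.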
The component provides an internal Delegate Data interface used by Source Adapters and the Push Source API to push information to it, and makes use of the Knowledge Base's Submit Data API to submit data for permanent storage in the Knowledge Base. These internal APIs are being defined by Watch.
The Knowledge Base is responsible for storing representation information about the world using a model based on Entities and Property Values. Ultimately, each Entity describes a specific set of values that are measurements of each Property at a specific moment in time. For example, for a "format" Entity, relevant Properties might be "name" (e.g. JPEG2000), "version" (e.g. 1.0), or "tool support" (e.g. limited); over time, the tool support for JPEG2000 may increase, therefore at a later point a new Entity may indicate "tool support" as "widespread". Relevant internal APIs are provided to store and retrieve data from the Knowledge Base, namely Submit Data and Access Data.
A history of all knowledge gathered is kept in order to allow the Knowledge Base to be queried for past data thereby enabling repeatability of the decision making process.
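One way to picture a store that keeps history and answers point-in-time queries is the sketch below. It is an in-memory illustration only: the class and the method names `submit` and `access` loosely echo the Submit Data and Access Data interfaces, but their signatures are assumptions, and integer years stand in for timestamps.

```python
from bisect import bisect_right
from collections import defaultdict


class KnowledgeBase:
    """Toy store: every submitted value is kept, never overwritten."""

    def __init__(self) -> None:
        # (entity, property) -> list of (timestamp, value), kept sorted by time
        self._history = defaultdict(list)

    def submit(self, entity: str, prop: str, value, timestamp: int) -> None:
        series = self._history[(entity, prop)]
        series.append((timestamp, value))
        series.sort(key=lambda pair: pair[0])

    def access(self, entity: str, prop: str, at=None):
        """Return the latest value, or the value that was valid at time `at`."""
        series = self._history.get((entity, prop), [])
        if not series:
            return None
        if at is None:
            return series[-1][1]
        index = bisect_right([t for t, _ in series], at)
        return series[index - 1][1] if index else None


kb = KnowledgeBase()
kb.submit("JPEG2000", "tool support", "limited", timestamp=2011)
kb.submit("JPEG2000", "tool support", "widespread", timestamp=2014)

latest = kb.access("JPEG2000", "tool support")
as_of_2012 = kb.access("JPEG2000", "tool support", at=2012)
before_any_data = kb.access("JPEG2000", "tool support", at=2010)
```

Being able to ask what was known at an earlier date is what makes a past decision repeatable: the same question evaluated against the same point in time gives the same answer.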
It is planned to use RDF Linked Data as the model for storing data in the Knowledge Base, as this enables a simpler, more generic and more flexible data representation than a relational data model. Ontology stores already implement useful features such as boolean and algebraic logic and, because they are designed to capture concepts and relationships, support the complex queries needed for framing and answering Watch Request questions. The SPARQL query language is planned to be used to represent Watch Request questions.
The Knowledge Base will also store the questions submitted to the Automated Watch system through the REST API. The questions may have been submitted by a human operator using the web interface or an automated component calling the REST API directly.
The Knowledge Base uses Apache Jena, a Java framework for storing and querying large RDF datasets. Jena also provides support for OWL ontologies and a rule-based inference engine for reasoning with RDF and OWL data sources.
The Monitor sub-component provides a mechanism for continuously watching the Knowledge Base for changes to specific Watch Requests the client is interested in. To do this, it provides a Data or Question Changed interface for being notified about changes to the underlying data or the Watch Requests themselves. Upon receiving such an update, this sub-component will identify which Watch Requests require re-evaluation and instigate this re-evaluation through the Assessment Service.
Monitoring services frequently reassess q
This sub-component is responsible for evaluating Watch Requests utilising the latest information from the Knowledge Base and providing the Monitor with information about whether to send a Notification. It is instigated by the Monitor sub-component in response to updated data being available in the Knowledge Base or an update to the Watch Request. Access to the Knowledge Base is provided by the internal Access Data interface, and the information received is compared against a Watch Request Trigger to determine if a significant event has occurred.
When the Monitor sub-component detects a significant event, based upon the questions and conditions stored in the Knowledge Base, the Notification Service is used to alert interested parties.
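The interplay between monitoring, assessment and notification described above can be sketched as a single change handler. The names `WatchRequest`, `on_data_changed` and the use of a plain dictionary as the knowledge store are illustrative assumptions, not the actual internal interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]  # (entity, property)


@dataclass
class WatchRequest:
    request_id: str
    measurement: Key                   # the data this request depends on
    trigger: Callable[[object], bool]  # returns True for a significant event


def on_data_changed(changed: Key,
                    knowledge: Dict[Key, object],
                    requests: List[WatchRequest],
                    notify: Callable[[str, object], None]) -> None:
    """Re-evaluate only the requests that depend on the changed data."""
    for request in requests:
        if request.measurement != changed:
            continue                    # unaffected request, skip re-evaluation
        value = knowledge.get(changed)  # assessment reads the latest value
        if request.trigger(value):
            notify(request.request_id, value)


knowledge = {("JPEG2000", "tool support"): "limited"}
requests = [
    WatchRequest("r1", ("JPEG2000", "tool support"), lambda v: v == "widespread"),
    WatchRequest("r2", ("PDF", "tool support"), lambda v: v == "limited"),
]
sent = []  # stands in for the Notification Service


def record(request_id, value):
    sent.append((request_id, value))


on_data_changed(("JPEG2000", "tool support"), knowledge, requests, record)
knowledge[("JPEG2000", "tool support")] = "widespread"
on_data_changed(("JPEG2000", "tool support"), knowledge, requests, record)
```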
This is a web interface which provides the following functionality:
- The manual addition of information to the knowledge base.
- Browsing of the knowledge base.
- Querying of the knowledge base, or asking questions.
- Creation of conditions so that users are notified when significant events occur.
Externally, this tool presents two main interfaces:
The REST API provides the same functionality as the Client Web GUI, but to software components.
The Preservation Watch Push API provides an interface for external systems (sources) to submit information to the knowledge base directly, as an alternative to developing an adaptor that is managed by the Watch Component.
With current state-of-the-art procedures in digital preservation we can define organisational constraints and we can create plans that treat a homogeneous subset of a large repository. PLANETS defined a Preservation Plan as follows:
A preservation plan defines a series of preservation actions to be taken by a responsible institution due to an identified risk for a given set of digital objects or records (called collection). The Preservation Plan takes into account the preservation policies, legal obligations, organisational and technical constraints, user requirements and preservation goals and describes the preservation context, the evaluated preservation strategies and the resulting decision for one strategy, including the reasoning for the decision. It also specifies a series of steps or actions (called preservation action plan) along with responsibilities and rules and conditions for execution on the collection. Provided that the actions and their deployment as well as the technical environment allow it, this action plan is an executable workflow definition. REF
PLANETS also produced a preservation planning methodology, a structured workflow for creating, testing and evaluating preservation plans. The PLATO planning tool developed within PLANETS follows this workflow to build preservation plans. PLATO produces an executable preservation plan along with audit evidence documenting the decision making procedures used in creating the plan REF. However the plans were:
- Largely constructed manually, which could be a time intensive procedure.
- Not normally applicable to all of an organisation's holdings, but restricted to a, normally homogeneous, subset of a collection.
- Not deployed and executed automatically in a repository.
- Monitored manually for changes in best practice, collection profile, etc.
Furthermore, no mechanism exists to relate preservation policies to preservation plans; the correlation has to be done manually.
The goals of SCAPE are to provide an automated planning component that is informed by:
- The accumulated knowledge of previous preservation plans
- An organisation's digital preservation policy
- An organisation's digital collections
The SCAPE planning component continues the development of the PLATO planning tool used in the PLANETS project. As described above, the PLANETS PLATO tool was capable of producing executable preservation plans according to an established preservation planning methodology.
This web-based tool supports the systematic and repeatable assessment of decision criteria and is fully compatible with the Plato planning tool. It enables decision makers to share their experiences and in turn build upon knowledge shared by others. Preservation plans are loaded from the planning tool's knowledge base, processed and anonymised, before being presented to the decision maker along with a number of features facilitating systematic analysis.
The Automated Planning work package is responsible for the development of the
Preservation policies are governance statements that constrain or drive operational Preservation Planning but may also have other effects outside of operational planning. For Planning and Watch, policy elements have been divided into three classes:
- Guidance Policies:
  - strategic, high level policies
  - are expressed in natural language
  - can't be expressed in machine interpretable form and require human interpretation
- Procedural Policies:
  - model the relation between guidance policies and control policies
  - can be represented in a formal model as the relation between guidance and control policies
- Control Policies:
  - are specific and can be represented in a semantic model
Only the control policies are guaranteed to be represented in the machine interpretable policy model. The development of the machine interpretable policy model is led by the development of a catalogue of policy elements.
The policy element catalogue provides a semantic representation of generic policy elements that is understandable by preservation systems. The initial version of the policy catalogue, presented as a table in Deliverable, lists a set of Guidance Policies which, by definition, will not appear in the machine interpretable model in their full form. Instead these must be broken down into sets of Procedural Policies, which in turn will be represented by sets of Control Policies that will be used to create the machine interpretable policy model. The catalogue will be refined iteratively by using it to express the real preservation policies of three partners representing the needs of Large Scale Digital Repositories, Web Archives, and Scientific Data Sets. Once validated, the catalogue will be used to develop the machine interpretable model.
The machine interpretable policy model provides a source for the Automated Watch system and will inform the Automated Planning system. Standard tools such as RDF/OWL REF will be used to define the terms used to describe and represent Control Policies and support policy reasoning. Like the catalogue, the policy model will undergo an iterative process of testing and refinement while being used to model the various Testbed scenarios.
There is a GitHub project where the semantic model of low-level Control Policies is being developed. The project contains:
- The current version of the policy model ontology.
- Some example properties, criteria, objectives, and scenarios.
- Some experimental queries developed in Java.
There are no recognised interfaces developed as part of the Policy Modelling work package. The Automated Watch and Automated Planning components are both responsible for developing software components that will interpret the model and base decisions upon the policy elements. The Policy Modelling work package is responsible for ensuring technical interoperability between these components and the policy model.
Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A., and Hofman, H. Systematic planning for digital preservation: Evaluating potential strategies and building preservation plans. International Journal on Digital Libraries (IJDL), December 2009. http://publik.tuwien.ac.at/files/PubDat_180752.pdf
Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A., and Hofman, H. Plato: A Service Oriented Decision Support System for Preservation Planning. http://publik.tuwien.ac.at/files/PubDat_170832.pdf