Planning and Watch
Current state-of-the-art digital preservation procedures create plans that specify the preservation actions to be applied to well-understood and homogeneous parts of the content held in a repository, whilst conforming to specified objectives and constraints. A key goal of the SCAPE project is to develop mechanisms that help automate the initiation, monitoring, execution and evolution of such plans, and that help react to a dynamically changing environment and user behaviour. The aim is to advance the control of digital preservation actions from ad-hoc decision making to proactive, continuous preservation management.
The Preservation Watch component is an automated monitoring system that identifies:
- Preservation risks.
- Curatorial opportunities (e.g. cost reduction).
- Possible shortcomings in current preservation actions.
This component provides the mechanisms for gathering information from various sources, including digital content and repositories, institutional policies, designated user communities, and other systems. A three-tier architecture has been developed, depicted below:
Information is gathered via pull adaptors, which are developed to normalize and aggregate data from external sources. Alternatively, sources can push information to the Watch system via the push source API. Adding new sources to the system means developing a compliant adaptor.
The Watch component comprises a number of sub-components that each add specific functionality to achieve the goal of monitoring the "state of the world" through various Sources of information and providing notifications to the planner. A planner is able to make a Watch Request, either synchronous or asynchronous, to the Watch component via a Client Service in order to query and be notified about some specific measurement(s) of interest. Synchronous Watch Requests are used to query for a specific measurement at a specific point in time, blocking the requesting client until the response is returned. Asynchronous requests are used to set the Watch component to monitor for changes in specific measurements (by specifying Conditions), triggering a Notification, for example an email, to the requesting client when such a change is detected. This approach does not block the requesting client. The notification type can be set when initiating the Watch Request.
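The difference between the two request styles can be illustrated with a minimal sketch. Every name here (`WatchClient`, `AsyncWatchRequest`, `query`, `subscribe`, `update`) is hypothetical and chosen for illustration only; none of it is the actual Client Service API, and the measurement key is an invented example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AsyncWatchRequest:
    """Hypothetical asynchronous Watch Request."""
    measurement: str                       # what to watch
    condition: Callable[[object], bool]    # Condition that triggers a Notification
    notify: Callable[[str, object], None]  # notification type, e.g. an email sender


@dataclass
class WatchClient:
    """Hypothetical stand-in for the Client Service."""
    measurements: Dict[str, object] = field(default_factory=dict)
    requests: List[AsyncWatchRequest] = field(default_factory=list)

    def query(self, measurement: str) -> object:
        # Synchronous request: the caller waits for the current value.
        return self.measurements[measurement]

    def subscribe(self, request: AsyncWatchRequest) -> None:
        # Asynchronous request: register it and return immediately.
        self.requests.append(request)

    def update(self, measurement: str, value: object) -> None:
        # New data arrives; fire notifications whose Condition now holds.
        self.measurements[measurement] = value
        for req in self.requests:
            if req.measurement == measurement and req.condition(value):
                req.notify(measurement, value)


received = []
client = WatchClient({"format/JPEG2000/tool_support": "limited"})
client.subscribe(AsyncWatchRequest(
    measurement="format/JPEG2000/tool_support",
    condition=lambda v: v == "widespread",
    notify=lambda name, value: received.append((name, value)),
))
current = client.query("format/JPEG2000/tool_support")       # synchronous
client.update("format/JPEG2000/tool_support", "widespread")  # triggers notify
```

In this sketch the synchronous call returns a value to the waiting caller, whereas the asynchronous request only produces output later, when the monitored measurement changes and satisfies the Condition.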
The following sub-sections discuss the various sub-components involved in the Watch Component and how they interact.
Although not strictly a part of the Watch component, Sources are described here to aid understanding of the Watch sub-components. A Source represents specific aspects of the world for which there is a way of measuring the properties associated with it, and can be internal or external to the project. Key sources currently considered are:
- Format Registries
- SCAPE Preservation Components catalogue (MyExperiment)
- Policy models
- Experiment Results
- Content Profiles
- Human Knowledge
- Web Browser snapshots (being developed within Watch)
- Simulator to assess Planning and Watch decisions (being developed within Watch)
Sources are coloured pink in Figure 1 to indicate that, although they may also connect to other SCAPE components, they interact with Source Adapters through either the Source Access Pull API or the REST Source Push API. An exception is the Digital Object Repository, which implements a Report API for interaction with a relevant Source Adapter.
A Source Adapter gathers information from a Source and delivers it, in a standardised form, to the Watch component for insertion into the Knowledge Base. There are two approaches to achieving this: push or pull. The choice between them depends on factors such as whether the Source is Watch component agnostic and whether it is possible to create software to run on the Source.
In the push model, the Source sends information to the Watch component as and when it becomes available. This requires relevant software on the Source component, which is not possible in all circumstances, so a push model cannot always be employed. The pull model ideally relies on the Source component providing a network-accessible API that a relevant Source Adapter can use to request information directly, most likely on a periodic basis. If no such API exists, the adapter has to extract information from whatever format the Source makes available (for example, by parsing the HTML of a web page). The frequency with which a Source Adapter requests data is controlled by the Monitor sub-component through the internal Adapter Configure interface.
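The two models can be contrasted in a small sketch. The classes `KnowledgeSink`, `PullAdapter` and `PushEndpoint` are hypothetical names used only for illustration, and the `interval_seconds` field merely stands in for whatever the Adapter Configure interface actually sets; no real Watch API is implied.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class KnowledgeSink:
    """Stand-in for the Watch component's intake; collects normalised records."""

    def __init__(self) -> None:
        self.records: List[Record] = []

    def submit(self, record: Record) -> None:
        self.records.append(record)


class PullAdapter:
    """Pull model: the adapter asks the Source for data."""

    def __init__(self, fetch: Callable[[], List[Record]],
                 sink: KnowledgeSink, interval_seconds: int) -> None:
        self.fetch = fetch                        # wraps the Source's API or a scraper
        self.sink = sink
        self.interval_seconds = interval_seconds  # assumed to be set by the Monitor

    def run_once(self) -> int:
        # One polling cycle; a scheduler would call this every interval_seconds.
        records = self.fetch()
        for record in records:
            self.sink.submit(record)
        return len(records)


class PushEndpoint:
    """Push model: the Source calls in whenever it has something new."""

    def __init__(self, sink: KnowledgeSink) -> None:
        self.sink = sink

    def receive(self, record: Record) -> None:
        self.sink.submit(record)


sink = KnowledgeSink()
adapter = PullAdapter(
    fetch=lambda: [{"source": "registry", "format": "JPEG2000"}],
    sink=sink,
    interval_seconds=3600,
)
pulled = adapter.run_once()
PushEndpoint(sink).receive({"source": "repository", "objects": "42"})
```

Both paths end at the same intake, which reflects the point that the choice between push and pull is about where the software runs, not about what data reaches the Knowledge Base.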
The Source Adapters employed should map to the Sources being used. The SCOUT preservation watch project contains two reference adaptors, which can be found on the integration branch on GitHub.
There is a PRONOM source adaptor developed as part of the Preservation Watch process. The adaptor queries the PRONOM Linked Data SPARQL endpoint and transforms the returned JSON into a format for passing on to the Merging and Linking components.
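As a rough illustration of the transformation step, the sketch below flattens a response in the standard SPARQL 1.1 Query Results JSON Format into plain records. The variable names (`puid`, `name`, `version`), the hand-written sample response and the function name `flatten_bindings` are assumptions made for this example; the actual query and output format used by the PRONOM adaptor are not shown here.

```python
import json
from typing import Dict, List

# Hand-written sample in the SPARQL 1.1 JSON results layout.
# Values are illustrative and were not retrieved from PRONOM.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["puid", "name", "version"]},
  "results": {"bindings": [
    {"puid": {"type": "literal", "value": "x-fmt/392"},
     "name": {"type": "literal", "value": "JP2 (JPEG 2000 part 1)"}},
    {"puid": {"type": "literal", "value": "fmt/43"},
     "name": {"type": "literal", "value": "JPEG File Interchange Format"},
     "version": {"type": "literal", "value": "1.01"}}
  ]}
}
"""


def flatten_bindings(response_text: str) -> List[Dict[str, str]]:
    """Turn SPARQL JSON bindings into simple {variable: value} records."""
    doc = json.loads(response_text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables (e.g. from OPTIONAL clauses) are absent from a binding.
        rows.append({var: binding[var]["value"]
                     for var in variables if var in binding})
    return rows


rows = flatten_bindings(SAMPLE_RESPONSE)
```

Records in this flat form would then be handed on for merging and linking; the shape expected by that component is not specified here.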
C3PO is a content profiling tool developed outside of the SCAPE project. It does not perform any characterisation itself; instead it parses output from the FITS tool and aggregates it into a MongoDB document database. There is also a tool that retrieves FITS records from the RODA repository for consumption by C3PO. C3PO also provides a web-based tool to view the aggregated data and a REST API for retrieving the aggregated profile data. C3PO can be found on GitHub.
The C3PO adaptor reads and parses the XML data generated by the C3PO REST API and retrieves a subset of the content profile data, which is then passed to the Data Merging and Linking component before being pushed to the Data Layer.
The adaptor is in an early state of development, but provides an example of how to develop a plug-in adaptor for the Preservation Watch system.
The Data Merging and Linking sub-component acts as a processing layer between the Source Adapters and the Knowledge Base. It converts data to fit the internal data model, combines information from multiple sources, and links data together to infer new knowledge. It provides an internal Delegate Data interface used by Source Adapters to push information to it, and makes use of the Knowledge Base's Submit Data API to submit data for permanent storage in the Knowledge Base. These internal APIs are being defined by Watch.
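A minimal sketch of the combining step is given below, assuming that records from different Source Adapters share an identifier field. The function name `merge_by_key`, the field names and the sample values are all hypothetical, and the "first source wins" rule on conflicting values is an arbitrary choice for the example rather than the component's documented behaviour.

```python
from collections import defaultdict
from typing import Dict, Iterable


def merge_by_key(records: Iterable[Dict[str, str]],
                 key: str) -> Dict[str, Dict[str, str]]:
    """Combine records that describe the same thing, as identified by `key`."""
    merged: Dict[str, Dict[str, str]] = defaultdict(dict)
    for record in records:
        if key not in record:
            continue  # cannot be linked to anything; skipped in this sketch
        target = merged[record[key]]
        for name, value in record.items():
            # Keep the first value seen for a property (illustrative policy).
            target.setdefault(name, value)
    return dict(merged)


registry_record = {"format_id": "x-fmt/392", "name": "JP2"}
profile_record = {"format_id": "x-fmt/392", "object_count": "1200"}
unlinked_record = {"name": "orphan"}

combined = merge_by_key([registry_record, profile_record, unlinked_record],
                        key="format_id")
```

Once registry and profile data sit in one record, a later step could relate them, for example the number of held objects in a format to that format's registry entry.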
The Knowledge Base is responsible for storing representation information about the world using a model based on Entities and Property Values. Ultimately, each Entity describes a specific set of values that are measurements of each Property at a specific moment in time. For example, for a "format" Entity, relevant Properties might be "name" (e.g. JPEG2000), "version" (e.g. 1.0), or "tool support" (e.g. limited); over time, the tool support for JPEG2000 may increase, therefore at a later point a new Entity may indicate "tool support" as "widespread". Relevant internal APIs are provided to store and retrieve data from the Knowledge Base, namely Submit Data and Access Data.
A history of all knowledge gathered is kept in order to allow the Knowledge Base to be queried for past data thereby enabling repeatability of the decision making process.
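The idea of keeping every measurement so that past states can be queried can be sketched as follows. This is an in-memory illustration only: the class `HistoryStore` and its methods `submit` and `value_at` are hypothetical names that loosely echo the Submit Data and Access Data interfaces, and the dates are invented.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Optional, Tuple


class HistoryStore:
    """Keeps every measurement of an (entity, property) pair with its timestamp."""

    def __init__(self) -> None:
        self._history: Dict[Tuple[str, str], List[Tuple[datetime, str]]] = defaultdict(list)

    def submit(self, entity: str, prop: str, value: str, measured_at: datetime) -> None:
        series = self._history[(entity, prop)]
        series.append((measured_at, value))
        series.sort(key=lambda item: item[0])  # keep chronological order

    def value_at(self, entity: str, prop: str, when: datetime) -> Optional[str]:
        """Return the most recent value measured at or before `when`."""
        series = self._history.get((entity, prop), [])
        times = [t for t, _ in series]
        idx = bisect_right(times, when)
        return series[idx - 1][1] if idx else None


store = HistoryStore()
store.submit("format:JPEG2000", "tool support", "limited", datetime(2011, 1, 1))
store.submit("format:JPEG2000", "tool support", "widespread", datetime(2013, 6, 1))

past = store.value_at("format:JPEG2000", "tool support", datetime(2012, 1, 1))
latest = store.value_at("format:JPEG2000", "tool support", datetime(2014, 1, 1))
before_any = store.value_at("format:JPEG2000", "tool support", datetime(2010, 1, 1))
```

Because older values are never overwritten, a decision made in the past can be re-evaluated against the data that was available at that time.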
It is planned to use RDF Linked Data as the model for storing data in the Knowledge Base, as this enables a simpler, more generic and more flexible data representation than a relational data model. Ontology stores already implement useful features such as boolean and algebraic logic and, because they are designed to capture concepts and relationships, they support complex queries, which will be useful for framing and answering the Watch Request questions. It is planned to use the SPARQL query language to represent Watch Request questions.
The Knowledge Base uses Apache Jena, a Java framework for storing and querying large RDF datasets. Jena also provides support for OWL ontologies and a rule-based inference engine for reasoning with RDF and OWL data sources.
The Monitor sub-component provides a mechanism for continuously watching the Knowledge Base for changes to specific Watch Requests the client is interested in. To do this, it provides a Data or Question Changed interface for being notified about changes to the underlying data or the Watch Requests themselves. Upon receiving such an update, this sub-component will identify which Watch Requests require re-evaluation and instigate this re-evaluation through the Assessment Service.
Monitoring services frequently reassess q
The Assessment Service is responsible for evaluating Watch Requests utilising the latest information from the Knowledge Base and providing the Monitor with information about whether to send a Notification. It is instigated by the Monitor sub-component in response to updated data being available in the Knowledge Base or an update to the Watch Request. Access to the Knowledge Base is provided by the internal Access Data interface, and the information received is compared against a Watch Request Trigger to determine if a significant event has occurred.
When the Monitor sub-component detects a significant event, based upon the questions and conditions stored in the Knowledge Base, the Notification Service is used to alert interested parties.
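The flow from a data change, through re-assessment, to a notification can be sketched as below. All names (`WatchRequest`, `requests_to_reassess`, `assess`, `on_data_changed`) are hypothetical, the Knowledge Base is reduced to a dictionary, and the measurement keys and thresholds are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class WatchRequest:
    request_id: str
    measurement: str                   # what the request depends on
    trigger: Callable[[object], bool]  # stand-in for the Watch Request Trigger


def requests_to_reassess(requests: Iterable[WatchRequest],
                         changed: Set[str]) -> List[WatchRequest]:
    """Monitor step: pick only the requests affected by the changed data."""
    return [r for r in requests if r.measurement in changed]


def assess(request: WatchRequest, knowledge: Dict[str, object]) -> bool:
    """Assessment step: compare the latest value against the trigger."""
    value = knowledge.get(request.measurement)
    return value is not None and request.trigger(value)


def on_data_changed(requests: Iterable[WatchRequest],
                    knowledge: Dict[str, object],
                    changed: Set[str],
                    notify: Callable[[str], None]) -> None:
    """Re-assess affected requests and notify when a trigger fires."""
    for request in requests_to_reassess(requests, changed):
        if assess(request, knowledge):
            notify(request.request_id)


sent: List[str] = []
requests = [
    WatchRequest("r1", "format/JPEG2000/tool_support", lambda v: v == "widespread"),
    WatchRequest("r2", "collection/size_gb", lambda v: v > 500),
]
knowledge = {"format/JPEG2000/tool_support": "widespread", "collection/size_gb": 900}
on_data_changed(requests, knowledge, {"format/JPEG2000/tool_support"}, sent.append)
```

In this run only `r1` is re-assessed and notified. `r2` would also satisfy its trigger, but it is not evaluated because the data it depends on was not reported as changed, which is the point of filtering before assessment.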
The Client Web GUI is a web interface that provides the following functionality:
- The manual addition of information to the Knowledge Base.
- Browsing of the Knowledge Base.
- Querying of the Knowledge Base, or asking questions.
- Creation of conditions, so that users are notified when significant events occur.
Externally, this tool presents two main interfaces:
The REST API provides the same functionality as the Client Web GUI, but to software components.
The Preservation Watch Push API provides an interface for external systems (sources) to submit information to the Knowledge Base directly, rather than requiring an adaptor to be developed and managed by the Watch component.
This web-based tool supports the systematic and repeatable assessment of decision criteria and is fully compatible with the Plato planning tool. It enables decision makers to share their experiences and in turn build upon knowledge shared by others. Preservation plans are loaded from the planning tool's knowledge base, processed and anonymised, before being presented to the decision maker along with a number of features facilitating systematic analysis.