Planning and Watch

Current state-of-the-art digital preservation procedures create plans that specify the preservation actions to be applied to well-understood and homogeneous parts of the content held in a repository, whilst conforming to specified objectives and constraints. A key goal of the SCAPE project is to develop appropriate mechanisms in order to help automate the initiation, monitoring, execution and evolution of such plans and help react to a dynamically changing environment and user behaviour. That is, to advance the control of digital preservation actions from ad-hoc decision making to a proactive, continuous preservation management.

h3. Watch Component
MS55 First prototype of the simulation environment due M20
MS56 First version of the preservation watch core services due M22
MS57 First prototype of the watch component delivered, including adaptors for repositories and Web content, due M28
D12.1 Identification of triggers and preservation watch component architecture, subcomponents and data model M12
D12.2 Final version of the Preservation Watch Component due M38
D12.3 Final version of the Simulation Environment due M42

The Preservation Watch component is an automated monitoring system that identifies:
* Preservation risks.
* Curatorial opportunities (e.g. cost reduction).
* Possible shortcomings in current preservation actions.

This component provides the mechanisms for gathering information from various sources, including digital content and repositories, institutional policies, designated user communities, and other systems. A three-tier architecture has been developed, depicted below:

!Watch Overview.jpg|border=1!

Information is gathered via pull adaptors, which are developed to normalize and aggregate data from external sources. Alternatively, sources can push information to the watch system via the push source API. Adding a new source to the system means developing a compliant adaptor.

# \[Watch Component\] What is the status of the Watch Component and its sub-components? *{_}We have implemented the PRONOM and Content Profile adaptors, the knowledge base, the email notification and the assessment. We are now implementing the monitoring, fitting the components together and developing the REST API.{_}*

The Watch component comprises a number of sub-components that each add specific functionality to achieve the goal of monitoring the "state of the world" through various Sources of information and providing notifications to the planner.  A planner is able to make a Watch Request, either synchronous or asynchronous, to the Watch component via a Client Service in order to query and be notified about some specific measurement(s) of interest. Synchronous Watch Requests are used to query for a specific measurement at a specific point in time, blocking the requesting client until the response is returned. Asynchronous requests are used to set the Watch component to monitor for changes in specific measurements (by specifying Conditions), triggering a Notification, for example an email, to the requesting client when such a change is detected. This approach does not block the requesting client. The notification type can be set when initiating the Watch Request.
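The difference between the two request styles can be illustrated with a minimal sketch. It is not the Watch Request API: the class and method names ({{WatchClient}}, {{query}}, {{watch}}, {{update}}) are hypothetical, and measurements are held in an in-memory dictionary purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical names throughout: this is not the SCAPE Watch API.
Condition = Callable[[object], bool]   # predicate over one measured value
Notifier = Callable[[str], None]       # e.g. a function that sends an email


@dataclass
class AsyncWatchRequest:
    prop: str
    condition: Condition
    notify: Notifier


@dataclass
class WatchClient:
    measurements: Dict[str, object] = field(default_factory=dict)
    pending: List[AsyncWatchRequest] = field(default_factory=list)

    def query(self, prop: str) -> object:
        """Synchronous request: the caller waits for the current value."""
        return self.measurements[prop]

    def watch(self, prop: str, condition: Condition, notify: Notifier) -> None:
        """Asynchronous request: register interest and return immediately."""
        self.pending.append(AsyncWatchRequest(prop, condition, notify))

    def update(self, prop: str, value: object) -> None:
        """Record a new measurement and notify requests whose condition holds."""
        self.measurements[prop] = value
        for request in self.pending:
            if request.prop == prop and request.condition(value):
                request.notify(f"{prop} is now {value!r}")


client = WatchClient({"tool support": "limited"})
inbox: List[str] = []  # stands in for an email inbox
client.watch("tool support", lambda v: v == "widespread", inbox.append)
current = client.query("tool support")        # answered straight away
client.update("tool support", "widespread")   # condition met, notification sent
```

Here the notification type is simply whichever callable is passed to {{watch}}, mirroring the idea that it is chosen when the request is initiated.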

The following sub-sections discuss the various sub-components involved in the Watch Component and how they interact.

h4. Sources

Although not strictly a part of the Watch component, Sources are described here to aid understanding of the Watch sub-components. A Source represents a specific aspect of the world whose properties can be measured, and can be internal or external to the project. Key sources currently considered are:
* Format Registries
* SCAPE Preservation Components catalogue (MyExperiment)
* Policy models
* Repositories
* Experiment Results
* Content Profiles
* Human Knowledge
* Web Browser snapshots (being developed within Watch)
* Simulator to assess Planning and Watch decisions (being developed within Watch)

Sources are coloured pink in Figure 1 implying that, although they may also connect to other SCAPE components, they will interact with Source Adapters through either the *{_}Source Access Pull API{_}* or the *{_}REST Source Push API{_}*. An exception would be the Digital Object Repository which implements a *{_}Report API{_}* for interaction with a relevant Source Adapter.


h4. Source Adapters

A Source Adapter gathers information from a Source and delivers it, in a standardised form, to the Watch component for insertion into the Knowledge Base. There are two approaches to achieving this: push or pull. The choice between them depends on multiple factors, such as whether the Source is Watch component agnostic or whether it is possible to create software to run on the Source.

In the push model, the Source sends information to the Watch component as and when it becomes available. This requires relevant software on the Source, which is not always possible, so the push model cannot always be employed. The pull model ideally relies on the Source providing a network-accessible API so that a Source Adapter can request information directly, most likely on a periodic basis. If no such API exists, the adapter has to extract information from whatever format the Source makes available (for example, by parsing the HTML of a web page). The frequency with which a Source Adapter requests data is controlled by the Monitor sub-component through the internal *{_}Adapter Configure{_}* interface.
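As a rough sketch of the two models, the following shows a pull adapter whose polling interval can be reconfigured and a push adapter that simply accepts what the Source sends. All names ({{Sink}}, {{PullAdapter}}, {{PushAdapter}}, {{configure}}, {{poll_once}}) are assumptions made for illustration and are not taken from the Watch code base; scheduling of the polls is deliberately left out.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class Sink:
    """Hypothetical stand-in for wherever adapters hand over normalised data."""

    def __init__(self) -> None:
        self.received: List[Record] = []

    def deliver(self, records: List[Record]) -> None:
        self.received.extend(records)


class PullAdapter:
    """Asks the Source for data; an external scheduler decides when."""

    def __init__(self, fetch: Callable[[], List[Record]], sink: Sink,
                 interval_seconds: int) -> None:
        self.fetch = fetch
        self.sink = sink
        self.interval_seconds = interval_seconds

    def configure(self, interval_seconds: int) -> None:
        """Change the polling frequency (illustrative configuration hook)."""
        if interval_seconds <= 0:
            raise ValueError("interval must be positive")
        self.interval_seconds = interval_seconds

    def poll_once(self) -> None:
        self.sink.deliver(self.fetch())


class PushAdapter:
    """Receives data whenever the Source decides to send it."""

    def __init__(self, sink: Sink) -> None:
        self.sink = sink

    def receive(self, records: List[Record]) -> None:
        self.sink.deliver(records)


sink = Sink()
pull = PullAdapter(lambda: [{"format": "JPEG2000"}], sink, interval_seconds=3600)
pull.configure(interval_seconds=86400)   # poll daily instead of hourly
pull.poll_once()
push = PushAdapter(sink)
push.receive([{"format": "PDF"}])
```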

The Source Adapters employed should map to the Sources being used. The SCOUT preservation watch project contains two reference adaptors {note:title=TODO}Link to PW adaptors ON the integration branch on GitHub.{note} described below.

h5. PRONOM Adaptor - A Reference Format Registry Adaptor

There is a PRONOM source adaptor developed as part of the Preservation Watch process. The adaptor queries the PRONOM Linked Data SPARQL endpoint and transforms the returned JSON into a format for passing on to the Merging and Linking components.
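The transformation step might look roughly like the sketch below, which flattens a response in the standard SPARQL JSON results layout ({{head.vars}} and {{results.bindings}}) into plain records. The sample response, its variable names and its URIs are invented for illustration and are not real PRONOM output; the function name {{flatten_bindings}} is likewise hypothetical.

```python
import json
from typing import Dict, List

# Made-up response in the SPARQL JSON results layout (not real PRONOM data).
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["format", "name", "version"]},
  "results": {"bindings": [
    {"format": {"type": "uri", "value": "http://example.org/format/1"},
     "name": {"type": "literal", "value": "Example Format"},
     "version": {"type": "literal", "value": "1.0"}},
    {"format": {"type": "uri", "value": "http://example.org/format/2"},
     "name": {"type": "literal", "value": "Other Format"}}
  ]}
}
"""


def flatten_bindings(response_text: str) -> List[Dict[str, str]]:
    """Turn each result binding into a flat variable -> value record."""
    document = json.loads(response_text)
    variables = document["head"]["vars"]
    records = []
    for binding in document["results"]["bindings"]:
        # Unbound variables are simply absent from a binding, so skip them.
        records.append(
            {var: binding[var]["value"] for var in variables if var in binding}
        )
    return records


records = flatten_bindings(SAMPLE_RESPONSE)
```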

h5. C3PO Adaptor - A Reference Content Profile Adaptor

C3PO is a content profiling tool developed outside of the SCAPE project. It does not perform any characterisation itself; instead it parses output from the FITS tool {note:title=TODO}add FITS link{note} and aggregates it into a MongoDB {note:title=TODO}add Mongo link{note} document database. There is also a tool that retrieves FITS records from the RODA repository for consumption by C3PO. C3PO also provides a web-based tool to view the aggregated data and a REST API for retrieving the aggregated profile data. C3PO can be found on GitHub {note:title=TODO}add GitHub link{note}.

The C3PO adaptor reads and parses the XML data generated by the C3PO REST API and retrieves a subset of the content profile data, which is then passed to the Data Merging and Linking component before being pushed to the Data Layer.

The adaptor is at an early stage of development, but provides an example of how to develop a plug-in adaptor for the Preservation Watch system.

h4. Data Merging and Linking

This sub-component acts as a processing layer between the Source Adapters and the Knowledge Base. It converts data to fit the internal data model, combines information from multiple sources, and links data together to infer new knowledge. It provides an internal *{_}Delegate Data{_}* interface used by Source Adapters to push information to it, and makes use of the Knowledge Base's *{_}Submit Data{_}* API to submit data for permanent storage in the Knowledge Base. These internal APIs are being defined by Watch.
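One simple way to combine information from several sources is to merge records that share an identifier. The sketch below does this under the assumption that every record carries a common key and that the first source to report a property wins; both the function {{merge_records}} and that conflict rule are illustrative choices, not the behaviour defined by Watch.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def merge_records(sources: Iterable[List[Record]], key: str) -> Dict[str, Record]:
    """Merge records from several sources that share the same key value."""
    merged: Dict[str, Record] = {}
    for records in sources:
        for record in records:
            target = merged.setdefault(record[key], {})
            for name, value in record.items():
                # Assumed rule: keep the value from the first source seen.
                target.setdefault(name, value)
    return merged


registry = [{"format": "JPEG2000", "version": "1.0"}]
profile = [{"format": "JPEG2000", "object count": "1200", "version": "unknown"}]
merged = merge_records([registry, profile], key="format")
```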

h4. Knowledge Base

The Knowledge Base is responsible for storing representation information about the world using a model based on Entities and Property Values. Ultimately, each Entity describes a specific set of values that are measurements of each Property at a specific moment in time. For example, for a "format" Entity, relevant Properties might be "name" (e.g. JPEG2000), "version" (e.g. 1.0), or "tool support" (e.g. limited); over time, the tool support for JPEG2000 may increase, therefore at a later point a new Entity may indicate "tool support" as "widespread". Relevant internal APIs are provided to store and retrieve data from the Knowledge Base, namely *{_}Submit Data{_}* and *{_}Access Data{_}*.

A history of all knowledge gathered is kept in order to allow the Knowledge Base to be queried for past data thereby enabling repeatability of the decision making process.
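The idea of keeping every measurement so that past states can be queried can be sketched with an append-only store. This is an in-memory illustration only: the names {{PropertyValue}}, {{KnowledgeBase}}, {{submit}} and {{access}} are hypothetical, and time is represented as a plain integer for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PropertyValue:
    entity: str        # e.g. "JPEG2000"
    prop: str          # e.g. "tool support"
    value: str
    measured_at: int   # simplified timestamp, e.g. a project month


class KnowledgeBase:
    """Append-only store: new measurements never overwrite old ones."""

    def __init__(self) -> None:
        self._history: List[PropertyValue] = []

    def submit(self, measurement: PropertyValue) -> None:
        self._history.append(measurement)

    def access(self, entity: str, prop: str,
               as_of: Optional[int] = None) -> Optional[str]:
        """Return the latest value, optionally as it was known at 'as_of'."""
        candidates = [
            m for m in self._history
            if m.entity == entity and m.prop == prop
            and (as_of is None or m.measured_at <= as_of)
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda m: m.measured_at).value


kb = KnowledgeBase()
kb.submit(PropertyValue("JPEG2000", "tool support", "limited", measured_at=12))
kb.submit(PropertyValue("JPEG2000", "tool support", "widespread", measured_at=30))
latest = kb.access("JPEG2000", "tool support")
earlier = kb.access("JPEG2000", "tool support", as_of=20)
```

Querying with {{as_of}} returns the answer that would have been given at that time, which is what makes an earlier decision repeatable.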

It is planned to use RDF Linked Data as the model for storing data in the Knowledge Base, as this enables a simpler, more generic and more flexible data representation than a relational data model. Ontology stores already implement useful features such as boolean and algebraic logic and, because they are designed to capture concepts and relationships, support complex queries, which will be useful for framing and answering the Watch Request questions. It is planned to use the SPARQL query language to represent Watch Request questions.

The Knowledge Base uses Apache Jena, a Java framework for storing and querying large RDF datasets. Jena also provides support for OWL ontologies and a rule-based inference engine for reasoning with RDF and OWL data sources.

# \[Knowledge Base: Linked Data\] I assume RDF/XML serialisation is what's being used?
The SCOUT project uses jenabean to transform Java Beans to

h4. Monitor

The Monitor sub-component provides a mechanism for continuously watching the Knowledge Base for changes to specific Watch Requests the client is interested in. To do this, it provides a *{_}Data or Question Changed{_}* interface for being notified about changes to the underlying data or the Watch Requests themselves. Upon receiving such an update, this sub-component will identify which Watch Requests require re-evaluation and instigate this re-evaluation through the *{_}Assessment Service{_}*.
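A minimal sketch of this selection step follows. It assumes each Watch Request can be reduced to the set of property names it depends on, which is a simplification; the class and method names are hypothetical, and the assessment step is represented by a plain callable.

```python
from typing import Callable, Dict, List, Set


class Monitor:
    """Tracks which properties each Watch Request depends on."""

    def __init__(self, assess: Callable[[str], None]) -> None:
        self._assess = assess                      # stand-in for the assessment step
        self._watched: Dict[str, Set[str]] = {}    # request id -> property names

    def question_changed(self, request_id: str, properties: Set[str]) -> None:
        """A request was added or edited: record it and re-evaluate it."""
        self._watched[request_id] = set(properties)
        self._assess(request_id)

    def data_changed(self, prop: str) -> List[str]:
        """New data arrived: re-evaluate only the requests that use it."""
        affected = sorted(
            request_id for request_id, props in self._watched.items()
            if prop in props
        )
        for request_id in affected:
            self._assess(request_id)
        return affected


assessed: List[str] = []
monitor = Monitor(assessed.append)
monitor.question_changed("wr-1", {"tool support"})
monitor.question_changed("wr-2", {"format count", "tool support"})
monitor.question_changed("wr-3", {"storage cost"})
assessed.clear()   # ignore the evaluations triggered by registration
affected = monitor.data_changed("tool support")
```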

# \[Watch Component: Monitor\] Given that the Watch component architecture has evolved, are these four monitoring services still planned: Repository Monitor, Format Registry Monitor, Component Catalogue Monitor, and Policy Model Monitor?

h4. Assessment Service

This sub-component is responsible for evaluating Watch Requests utilising the latest information from the Knowledge Base and providing the Monitor with information about whether to send a Notification.  It is instigated by the Monitor sub-component in response to updated data being available in the Knowledge Base or an update to the Watch Request. Access to the Knowledge Base is provided by the internal *{_}Access Data{_}* interface, and the information received is compared against a Watch Request Trigger to determine if a significant event has occurred.
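The comparison against a trigger could be as simple as the sketch below, which assumes a trigger is a numeric threshold on a single property. The {{Trigger}} structure, the {{assess}} function and the example property name are assumptions for illustration; the actual trigger model may be richer.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict

OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


@dataclass(frozen=True)
class Trigger:
    prop: str          # property to inspect
    op: str            # one of the keys in OPERATORS
    threshold: float


def assess(trigger: Trigger, latest_values: Dict[str, float]) -> bool:
    """Return True when the latest value satisfies the trigger."""
    if trigger.prop not in latest_values:
        return False   # nothing measured yet, so no significant event
    compare = OPERATORS[trigger.op]
    return compare(latest_values[trigger.prop], trigger.threshold)


trigger = Trigger("invalid object ratio", ">", 0.05)
should_notify = assess(trigger, {"invalid object ratio": 0.08})
stay_quiet = assess(trigger, {"invalid object ratio": 0.02})
```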

h4. Notification Service

When a



h4. Client Service


h4. Required Interfaces

Externally, this tool presents two main interfaces:

h5. Watch Request REST API

h5. REST Push API

h3. Planning Component
MS61 Initial version of automated policy-aware planning component M18
MS62 Automated policy-aware planning component v2 with full lifecycle support M32
MS63 Report on compliance validation M40
D14.1 Report on decision factors and their influence on planning M10
D14.2 Final version of automated policy-aware planning component due M42



h4. PLATO Planning Tool



h4. Web-based Analysis Tool

This web-based tool supports the systematic and repeatable assessment of decision criteria and is fully compatible with the Plato planning tool. It enables decision makers to share their experiences and in turn build upon knowledge shared by others. Preservation plans are loaded from the planning tool's knowledge base, processed and anonymised, before being presented to the decision maker (preservation manager?) along with a number of features facilitating systematic analysis.

Check terminology: decision maker = preservation manager? Prefer to limit, condense and be consistent with the terminology used.

h4. Policy Element Catalogue
MS58 List of high-priority policy elements that must be fed into preservation plans M6
MS59 Initial version of policy element catalogue available M12
MS60 Initial version of machine understandable policy specification model based on semantic technologies M15
D13.1 Final version of policy specification model due M30
D13.2 Catalogue of preservation policy elements due M36


h3. Packaging and Deploying


# \[Watch Component: Packaging and Deploying\] How are we intending to package/deploy the Watch Component and Planning Tool?