
Current state-of-the-art digital preservation procedures create plans that specify the preservation actions to be applied to well-understood and homogeneous parts of the content held in a repository, whilst conforming to specified objectives and constraints. A key goal of the SCAPE project is to develop appropriate mechanisms to help automate the initiation, monitoring, execution and evolution of such plans, and to help react to a dynamically changing environment and user behaviour; that is, to advance the control of digital preservation actions from ad-hoc decision making to proactive, continuous preservation management.


Automated Watch Component

Schedule

Milestones
  • MS55 First prototype of the simulation environment due M20
  • MS56 First version of the preservation watch core services due M22
  • MS57 First prototype of the watch component delivered, including adaptors for repositories and Web content, due M28
Deliverables
  • D12.1 Identification of triggers and preservation watch component architecture, subcomponents and data model due M12
  • D12.2 Final version of the Preservation Watch Component due M38
  • D12.3 Final version of the Simulation Environment due M42

Introduction

The aim of the Automated Watch work package is to substantially improve automated support for effective digital preservation watch. To do this it is:

  • Increasing the breadth and scope of collected digital preservation information.
  • Normalizing and structuring gathered information into a queryable Knowledge Base.
  • Allowing both human and automated systems to ask questions about information in the Knowledge Base.
  • Developing software components that monitor information in the Knowledge Base for significant events.

Functional Description

The Automated Watch component is an automated monitoring system that identifies:

  • Preservation risks.
  • Curatorial opportunities (e.g. cost reduction).
  • Possible shortcomings in current preservation actions.

This component provides the mechanisms for gathering information from various sources, including digital content and repositories, institutional policies, designated user communities, and other systems. A three-tier architecture has been developed, depicted below:

Information is gathered via pull adaptors, which are developed to normalize and aggregate data from external sources; alternatively, sources can push information to the Watch system via the push source API. Adding a new source to the system therefore means developing a compliant adaptor.

Questions:

  1. [Watch Component] What is the status of the Watch Component and its sub-components? We have implemented the PRONOM and Content Profile adaptors, the knowledge base, the email notification and the assessment. We are now implementing the monitoring, fitting the components together and developing the REST API.

The Watch component comprises a number of sub-components that each add specific functionality to achieve the goal of monitoring the "state of the world" through various Sources of information and providing notifications to the planner. A planner is able to make a Watch Request, either synchronous or asynchronous, to the Watch component via a Client Service in order to query and be notified about some specific measurement(s) of interest. Synchronous Watch Requests are used to query for a specific measurement at a specific point in time, blocking the requesting client until the response is returned. Asynchronous requests are used to set the Watch component to monitor for changes in specific measurements (by specifying Conditions), triggering a Notification, for example an email, to the requesting client when such a change is detected. This approach does not block the requesting client. The notification type can be set when initiating the Watch Request.
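To make the two request styles concrete, the minimal sketch below models them as a Java interface. The names (WatchClient, ask, watch) and the string-based entity/property keys are illustrative assumptions, not the actual SCOUT API.

```java
import java.util.function.Consumer;

/** Illustrative sketch only; these names are not the actual SCOUT API. */
interface WatchClient {
    /** Synchronous: query a measurement now, blocking until it is returned. */
    String ask(String entity, String property);

    /** Asynchronous: monitor the property and invoke the notification
     *  callback when its value changes; the caller is not blocked. */
    void watch(String entity, String property, Consumer<String> onChange);
}

class WatchRequestSketch {
    static void demo(WatchClient client) {
        // Synchronous Watch Request: one measurement at a point in time.
        String support = client.ask("format:JPEG2000", "tool_support");
        System.out.println("current tool support: " + support);

        // Asynchronous Watch Request: a Condition on the same property;
        // in practice the Notification might be an email to the planner.
        client.watch("format:JPEG2000", "tool_support",
                     v -> System.out.println("notify planner: now " + v));
    }
}
```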

The following sub-sections discuss the various sub-components involved in the Watch Component and how they interact.

Sources

Although not strictly a part of the Watch component, Sources are described here to aid understanding of the Watch sub-components. A Source represents specific aspects of the world for which there is a way of measuring the properties associated with it, and can be internal or external to the project. Key sources currently considered are:

  • Format Registries
  • SCAPE Preservation Components catalogue (MyExperiment)
  • Policy models
  • Repositories
  • Experiment Results
  • Content Profiles
  • Human Knowledge
  • Web Browser snapshots (being developed within Watch)
  • Simulator to assess Planning and Watch decisions (being developed within Watch)

Sources are coloured pink in Figure 1, indicating that, although they may also connect to other SCAPE components, they interact with Source Adaptors through either the Source Access Pull API or the REST Source Push API. An exception is the Digital Object Repository, which implements a Report API for interaction with the relevant Source Adaptor.

Source Adaptors

A Source Adaptor gathers information from a Source and transforms it to fit the Entity/Property model adopted for the Knowledge Base. There are two approaches to achieving this, push and pull; the choice between them depends on factors such as whether the Source is Watch-component agnostic and whether it is possible to create software to run on the Source.

In the push model, the Source sends information to the Watch component as and when it becomes available. Software must be developed for the Source component to achieve this, which in some circumstances may not be possible. The pull model ideally relies on the Source component providing a network-accessible API that enables the relevant Source Adaptor to request information directly, most likely on a periodic basis; if no such API exists, the adaptor has to extract information from whatever format the Source makes available (for example, by HTML parsing of a web page). The frequency with which data is requested by a Source Adaptor is controlled by the Monitor sub-component through the internal Adapter Configure interface.
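As a rough sketch of the pull model, the snippet below shows how a Monitor-controlled scheduler might invoke an adaptor periodically. The SourceAdaptor interface and the scheduling policy are assumptions for illustration, not the SCOUT implementation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative pull-model sketch; names are assumptions, not SCOUT code. */
interface SourceAdaptor {
    /** Fetch from the Source, normalise to Entities/Properties, pass on. */
    void pull();
}

class AdaptorScheduler {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** The Monitor would set the polling period via Adapter Configure. */
    void schedule(SourceAdaptor adaptor, long periodMinutes) {
        scheduler.scheduleAtFixedRate(
                adaptor::pull, 0, periodMinutes, TimeUnit.MINUTES);
    }
}
```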

The Source Adaptors employed should map to the Sources being used. The SCOUT preservation watch project contains two reference adaptors, described below.

PRONOM Adaptor - A Reference Format Registry Adaptor

A PRONOM source adaptor has been developed as part of the Preservation Watch work. The adaptor queries the PRONOM Linked Data SPARQL endpoint and transforms the returned JSON into Entities/Properties for passing on to the Merging and Linking component. The adaptor code is part of the SCOUT project and can be found here.
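The sketch below shows the general shape of such a pull adaptor using Apache Jena (the framework already used by the Knowledge Base); the endpoint URL and the query are placeholders, not the actual PRONOM service details.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

/** Sketch of a SPARQL pull adaptor; endpoint URL and query are placeholders. */
public class PronomQuerySketch {
    public static void main(String[] args) {
        String endpoint = "http://example.org/pronom/sparql"; // placeholder
        String query =
                "SELECT ?format ?name WHERE { " +
                "  ?format <http://www.w3.org/2000/01/rdf-schema#label> ?name " +
                "} LIMIT 10";

        try (QueryExecution qe =
                     QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                // Here each result row would be mapped to Entities/Properties
                // and handed to the Merging and Linking component.
                System.out.println(row.get("name"));
            }
        }
    }
}
```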

C3PO Adaptor - A Reference Content Profile Adaptor

C3PO is a content profiling tool developed outside of the SCAPE project. It doesn't perform any characterisation itself; instead it parses output from the FITS tool (http://code.google.com/p/fits/) and aggregates it into a MongoDB document database. There is also a tool that retrieves FITS records from the RODA repository for consumption by C3PO. C3PO also provides a web-based tool for viewing the aggregated data and a REST API for retrieving the aggregated profile data. The C3PO project can be found on GitHub here.

The C3PO adaptor reads and parses the XML data generated by the C3PO REST API and retrieves a subset of the content profile data. The data is then converted into the Watch Entities and Properties before being passed to the Data Merging and Linking component, which enriches the data before inserting it into the Watch Knowledge Base.

The adaptor is at an early stage of development, but it provides an example of how to develop a plug-in adaptor for the Preservation Watch system.
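The following sketch illustrates the kind of work the adaptor performs: parsing an XML content profile and extracting property values. The URL and the element and attribute names are invented placeholders rather than the real C3PO schema.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Sketch of profile parsing; URL and element names are placeholders. */
public class C3poParseSketch {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("http://localhost:8080/c3po/profile.xml"); // placeholder

        NodeList props = doc.getElementsByTagName("property"); // placeholder
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            // Each profile property would become a Watch Entity/Property
            // before being passed to Data Merging and Linking.
            System.out.println(p.getAttribute("id") + " = " + p.getTextContent());
        }
    }
}
```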

Other Source Adaptors
  • RODA Adaptor
  • eSciDoc Adaptor
  • Component Catalogue Adaptor
    The component catalogue adaptor will gather data from the SCAPE Preservation Component Catalogue via the Component Lookup API. The information gathered in the Knowledge Base will allow planners to find out when tools that provide new functionality become available.
  • Policy Model Adaptor
    A policy model adaptor is assumed to be the means by which the Preservation Watch component incorporates the Machine Interpretable Policy Model into the Automated Watch Knowledge Base. This source will allow the Automated Watch system to query an organisation's preservation policy and generate notifications when there are significant policy changes.
Push Source Adaptor API
TODO
Locate the push source adaptor code, or establish the timetable for its definition and development.

Data Merging and Linking

This sub-component acts as a processing layer between the Source Adaptors and the Knowledge Base. The source adaptors convert data to fit the internal data model, but different adaptors may present contradictory or incompatible data. The Merging and Linking component further processes incoming data by:

  • Merging data.
  • Resolving inconsistencies between data sources.
  • Providing additional cross-references between entities and properties.

It is the addition of cross-references that will enable rich queries of the Knowledge Base. A simple example of the value added by the Merging and Linking component can be given by considering three simple sources:

  1. A MIME-based format registry
  2. The PRONOM format registry
  3. A collection profile consisting of the file format distribution of the collection

The format records gathered from the two registries would need to be linked so that PRONOM IDs are associated with the appropriate MIME records. Additionally, the file format distribution records would be linked to the appropriate format entities gathered from the format registries.
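A hedged sketch of what such a cross-reference could look like, built with Apache Jena; the URIs and the linking property are illustrative assumptions, not the actual Watch data model.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;

/** Sketch of cross-referencing; URIs and the link property are placeholders. */
public class LinkingSketch {
    public static void main(String[] args) {
        Model kb = ModelFactory.createDefaultModel();
        String ns = "http://scape.example.org/watch#"; // placeholder namespace

        // The same format as seen by two different registries.
        Resource pronomJp2 = kb.createResource(ns + "format/pronom/jp2");
        Resource mimeJp2   = kb.createResource(ns + "format/mime/image-jp2");

        // The cross-reference added by Data Merging and Linking, enabling
        // queries that traverse both registries' records.
        pronomJp2.addProperty(kb.createProperty(ns, "sameFormatAs"), mimeJp2);

        kb.write(System.out, "TURTLE");
    }
}
```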

The component provides an internal Delegate Data interface used by Source Adapters and the Push Source API to push information to it, and makes use of the Knowledge Base's Submit Data API to submit data for permanent storage in the Knowledge Base. These internal APIs are being defined by Watch.

Knowledge Base

The Knowledge Base is responsible for storing representation information about the world using a model based on Entities and Property Values. Ultimately, each Entity describes a specific set of values that are measurements of each Property at a specific moment in time. For example, for a "format" Entity, relevant Properties might be "name" (e.g. JPEG2000), "version" (e.g. 1.0), or "tool support" (e.g. limited); over time, the tool support for JPEG2000 may increase, therefore at a later point a new Entity may indicate "tool support" as "widespread". Relevant internal APIs are provided to store and retrieve data from the Knowledge Base, namely Submit Data and Access Data.

A history of all gathered knowledge is kept so that the Knowledge Base can be queried for past data, thereby enabling repeatability of the decision-making process.

It is planned to use RDF Linked Data as the model for storing data in the Knowledge Base, as this enables a simpler, more generic and more flexible data representation than a relational data model. Ontology stores already implement useful features such as boolean and algebraic logic, and they support complex queries because they are well suited to capturing concepts and relationships; this will be useful for framing and answering Watch Request questions. The SPARQL query language is planned to be used to represent Watch Request questions.

The Knowledge Base will also store the questions submitted to the Automated Watch system through the REST API. The questions may have been submitted by a human operator using the web interface or an automated component calling the REST API directly.

The Knowledge Base uses Apache Jena, a Java framework for storing and querying large RDF datasets. Jena also provides support for OWL ontologies and a rule-based inference engine for reasoning with RDF and OWL data sources.
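As a small illustration of the Entity/Property model and a SPARQL-style question over it, the sketch below builds an in-memory Jena model and queries it. The namespace and property names are assumptions, not the real Watch vocabulary.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDFS;

/** Sketch of the Entity/Property model; namespace and names are placeholders. */
public class KnowledgeBaseSketch {
    public static void main(String[] args) {
        Model kb = ModelFactory.createDefaultModel();
        String ns = "http://scape.example.org/watch#"; // placeholder namespace

        // A "format" Entity with two Property values, as in the JPEG2000 example.
        Resource jp2 = kb.createResource(ns + "format/JPEG2000");
        jp2.addProperty(RDFS.label, "JPEG2000");
        jp2.addProperty(kb.createProperty(ns, "toolSupport"), "limited");

        // A Watch Request question as SPARQL: which formats have limited support?
        String q = "PREFIX w: <" + ns + "> "
                 + "SELECT ?f WHERE { ?f w:toolSupport \"limited\" }";
        try (QueryExecution qe = QueryExecutionFactory.create(q, kb)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}
```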

Questions:

  1. [Knowledge Base: Linked Data] I assume RDF/XML serialisation is what's being used?

Monitor

The Monitor sub-component provides a mechanism for continuously watching the Knowledge Base for changes to specific Watch Requests the client is interested in. To do this, it provides a Data or Question Changed interface for being notified about changes to the underlying data or the Watch Requests themselves. Upon receiving such an update, this sub-component will identify which Watch Requests require re-evaluation and instigate this re-evaluation through the Assessment Service.

Questions:

1. [Watch Component: Monitor] Given that the Watch component architecture has evolved, are these four monitoring services still planned: Repository Monitor, Format Registry Monitor, Component Catalogue Monitor, and Policy Model Monitor?

Monitoring services frequently reassess questions against the latest state of the Knowledge Base.

Assessment Service

This sub-component is responsible for evaluating Watch Requests utilising the latest information from the Knowledge Base and providing the Monitor with information about whether to send a Notification.  It is instigated by the Monitor sub-component in response to updated data being available in the Knowledge Base or an update to the Watch Request. Access to the Knowledge Base is provided by the internal Access Data interface, and the information received is compared against a Watch Request Trigger to determine if a significant event has occurred.
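A minimal sketch of the assessment step, assuming a simple value-changed trigger; the real Watch component supports richer Conditions, and these names are illustrative only.

```java
/** Sketch of a trivial assessment; names are illustrative, not SCOUT code. */
class AssessmentSketch {
    /** A simple Watch Request Trigger: fire when the latest measurement
     *  differs from the value recorded when the request was created. */
    static boolean significantEvent(String baseline, String latest) {
        return !baseline.equals(latest);
    }

    public static void main(String[] args) {
        String baseline = "limited";   // value when the Watch Request was made
        String latest = "widespread";  // latest value via the Access Data interface
        if (significantEvent(baseline, latest)) {
            System.out.println("instruct Monitor to send a Notification");
        }
    }
}
```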

Notification Service

When the Monitor sub-component detects a significant event, based upon the questions and conditions stored in the Knowledge Base, the Notification Service is used to alert interested parties.

Questions
Are interested parties always people? Presumably some are automated systems, e.g. informing a preservation-aware repository that a new format risk has emerged.

Client Service

This is a web interface which provides the following functionality:

  • Manual addition of information to the knowledge base.
  • Browsing of the knowledge base.
  • Querying of the knowledge base, i.e. asking questions.
  • Creation of conditions so that users are notified when significant events occur.

Required Interfaces

Externally, this tool presents two main interfaces:

Watch Request REST API

The REST API provides the same functionality as the Client Web GUI, but for software components.
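As an illustration of how a software component might call such an API, the sketch below POSTs a Watch Request as JSON; the resource path and the request body are invented placeholders, since the real API is still being defined.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Sketch of a REST client; the path and JSON body are invented placeholders. */
public class RestClientSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/watch/api/requests"); // placeholder
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);

        // An asynchronous Watch Request: monitor a property, notify by email.
        String body = "{\"entity\":\"format:JPEG2000\","
                    + "\"property\":\"tool_support\","
                    + "\"notification\":\"email:planner@example.org\"}";
        try (OutputStream os = con.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + con.getResponseCode());
    }
}
```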

Questions
Presumably the Client GUI simply makes REST API calls, avoiding duplicate implementations and associated maintenance issues.

REST Push API

The Preservation Watch Push API provides an interface through which external systems (sources) can submit information to the knowledge base directly, rather than requiring an adaptor to be developed and managed by the Watch Component.


Automated Planning Component

Schedule

Milestones
  • MS61 Initial version of automated policy-aware planning component due M18
  • MS62 Automated policy-aware planning component v2 with full lifecycle support due M32
  • MS63 Report on compliance validation due M40
Deliverables
  • D14.1 Report on decision factors and their influence on planning due M10
  • D14.2 Final version of automated policy-aware planning component due M42

Introduction

With current state-of-the-art procedures in digital preservation we can define organisational constraints, and we can create plans that treat a homogeneous sub-set of a large repository. PLANETS defined a Preservation Plan as follows:

A preservation plan defines a series of preservation actions to be taken by a responsible institution due to an identified risk for a given set of digital objects or records (called collection). The Preservation Plan takes into account the preservation policies, legal obligations, organisational and technical constraints, user requirements and preservation goals and describes the preservation context, the evaluated preservation strategies and the resulting decision for one strategy, including the reasoning for the decision. It also specifies a series of steps or actions (called preservation action plan) along with responsibilities and rules and conditions for execution on the collection. Provided that the actions and their deployment as well as the technical environment allow it, this action plan is an executable workflow definition. [APREF1]

PLANETS also produced a preservation planning methodology: a structured workflow for creating, testing and evaluating preservation plans. The PLATO planning tool developed within PLANETS follows this workflow to build preservation plans, producing an executable preservation plan along with audit evidence documenting the decision-making procedures used in creating the plan [2]. However, the plans:

  • Were largely constructed manually, which could be a time-intensive procedure.
  • Were not normally applicable to all of an organisation's holdings, but were restricted to a (normally homogeneous) sub-set of a collection.
  • Were not deployed and executed automatically in a repository.
  • Had to be monitored manually for changes in best practice, collection profile, etc.

Furthermore, no mechanism exists to relate preservation policies to preservation plans; the correlation has to be made manually.

Aims

The goal of SCAPE is to provide an automated planning component that is informed by:

  1. The accumulated knowledge of previous preservation plans
  2. An organisation's digital preservation policy
  3. An organisation's digital collections

The Automated Planning Tool

The SCAPE planning component continues the development of the PLATO planning tool from the PLANETS project. As described above, PLATO was capable of producing executable preservation plans following an established preservation planning methodology.

Web-based Analysis Tool

This web-based tool supports the systematic and repeatable assessment of decision criteria and is fully compatible with the Plato planning tool. It enables decision makers to share their experiences and in turn build upon knowledge shared by others. Preservation plans are loaded from the planning tool's knowledge base, processed and anonymised, before being presented to the decision maker (preservation manager?) along with a number of features facilitating systematic analysis.

Todo
Check terminology: decision maker = preservation manager? Prefer to limit, condense and be consistent with the terminology used.
Policy Aware Planning Component

The Automated Planning work package is responsible for the development of the policy-aware planning component.

Preservation Policy Modelling

Preservation policies are governance statements that constrain or drive operational Preservation Planning, but they may also have effects outside of operational planning. For Planning and Watch, policy elements have been divided into three classes:

  1. Guidance Policies:
    • strategic, high-level policies
    • are expressed in natural language
    • cannot be expressed in machine interpretable form and require human interpretation
  2. Procedural Policies:
    • model the relation between guidance policies and control policies
    • can be represented in a formal model as the relation between guidance and control policies
  3. Control Policies:
    • are specific and can be represented in a semantic model
Only the control policies are guaranteed to be represented in the machine interpretable policy model. The development of the machine interpretable policy model is led by the development of a catalogue of policy elements.
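To illustrate what a machine interpretable Control Policy might look like, the sketch below states one as RDF with Apache Jena; the vocabulary (appliesTo, constrainedProperty, requiredValue) and the example policy are assumptions, not taken from the actual policy model.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;

/** Sketch of one Control Policy as RDF; the vocabulary is a placeholder. */
public class ControlPolicySketch {
    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        String ns = "http://scape.example.org/policy#"; // placeholder namespace

        // "Master images must be held in TIFF format" as semantic statements.
        Resource policy = m.createResource(ns + "masterImageFormat");
        policy.addProperty(m.createProperty(ns, "appliesTo"), "master images");
        policy.addProperty(m.createProperty(ns, "constrainedProperty"), "format");
        policy.addProperty(m.createProperty(ns, "requiredValue"), "TIFF");

        m.write(System.out, "TURTLE");
    }
}
```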

Policy Element Catalogue
Schedule

Milestones
  • MS58 List of high-priority policy elements that must be fed into preservation plans due M6
  • MS59 Initial version of policy element catalogue available due M12
Deliverables
  • D13.2 Catalogue of preservation policy elements due M36

The policy element catalogue provides a semantic representation of generic policy elements that is understandable by preservation systems. The initial version of the policy catalogue, provided as a table in the associated deliverable, lists a set of Guidance Policies which, by definition, will not appear in the machine interpretable model in their full form. Instead these must be broken down into sets of Procedural Policies, which in turn will be represented by sets of Control Policies that will be used to create the machine interpretable policy model. The iterative process of refining the catalogue will be undertaken by using the catalogue to express the real preservation policies of three partners representing the needs of Large Scale Digital Repositories, Web Archives, and Scientific Data Sets. Once validated, the catalogue will be used to develop the machine interpretable model.

Questions:

The form of the validated policy catalogue is not completely clear. In the description of work the validated catalogue is referred to as the "starting point" for the machine interpretable model. Is the validated catalogue still to be a Word document, or something closer to the RDF/OWL machine interpretable model?

Machine Interpretable Policy Model
Schedule

Milestones
  • MS60 Initial version of machine understandable policy specification model based on semantic technologies due M15
Deliverables
  • D13.1 Final version of policy specification model due M30

The machine interpretable policy model provides a source for the Automated Watch system and will inform the Automated Planning system. Standard tools such as RDF/OWL [MIMPREF1] will be used to define the terms used to describe and represent Control Policies and to support policy reasoning. Like the catalogue, the policy model will undergo an iterative process of testing and refinement while being used to model the various Testbed scenarios.

Technical Components
There is an openplanets/policy GitHub project where the semantic model of low-level Control Policies is being developed. The project contains:

  • The current version of the policy model ontology.
  • Some example properties, criteria, objectives, and scenarios.
  • Some experimental queries developed in Java.
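In the spirit of those experimental Java queries, the sketch below loads an ontology with Jena and lists its classes; the file name is a placeholder and the snippet is not taken from the openplanets/policy code.

```java
import org.apache.jena.ontology.OntModel;
import org.apache.jena.ontology.OntModelSpec;
import org.apache.jena.rdf.model.ModelFactory;

/** Sketch of loading the policy ontology; the file name is a placeholder. */
public class PolicyModelSketch {
    public static void main(String[] args) {
        // An OWL model with simple RDFS inference enabled.
        OntModel model = ModelFactory.createOntologyModel(
                OntModelSpec.OWL_MEM_RDFS_INF);
        model.read("policy-model.owl"); // placeholder path

        // List the classes defined by the ontology.
        model.listClasses()
             .forEachRemaining(c -> System.out.println(c.getURI()));
    }
}
```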

Interfaces
There are no recognised interfaces developed as part of the policy modelling work package. The Automated Watch and Automated Planning components are both responsible for developing software that will interpret the model and base decisions upon the policy elements; the Policy Modelling work package is responsible for ensuring technical interoperability between these components and the policy model.

Packaging and Deploying

Questions:

  1. [Watch Component: Packaging and Deploying] How are we intending to package/deploy the Watch Component and Planning Tool?

References

#APREF1 Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A., and Hofman, H. "Systematic planning for digital preservation: Evaluating potential strategies and building preservation plans". International Journal on Digital Libraries (IJDL), December 2009. http://publik.tuwien.ac.at/files/PubDat_180752.pdf

#2 Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A., and Hofman, H. "Plato: A Service Oriented Decision Support System for Preservation Planning". http://publik.tuwien.ac.at/files/PubDat_170832.pdf

#MIMPREF1 W3C OWL Working Group
