The organisations participating in demonstration activities have created or are in the process of building the technical environments as a basis for installing the SCAPE Platform and selected preservation components. In this section, the architecture and hardware specifications of these so-called “local instances” will give an insight into the environment where the demonstration will take place.
|BL Hadoop Developer Platform||The British Library architecture allows for an incremental development cycle, starting with a small data set using a local Hadoop instance running within a virtual machine, up to large scale processing using the BL’s developer Hadoop cluster and beyond. Testbed content is held either in the Digital Preservation Team’s repository or on the NAS directly, as appropriate. The development Hadoop environment at the BL is a VMWare ESXi cluster with 32 CPUs, 224GB RAM and ~27TB HDD. It is currently configured to have 30 single-CPU nodes: 1 manager, 1 namenode/jobtracker and 28 datanode/tasktrackers, each with 1CPU/6GB RAM/500GB HDD.|| British Library (BL)
| Hadoop Cluster at Internet Memory Foundation
|| The local instance at IM comprises two distinct multi-node hardware clusters designed to retrieve and archive web content. The first cluster (in the figure on the left) represents the distributed crawler. It uses a distributed hash table to assign crawled URLs, and a DRUM data structure to manage them, across the nodes in the cluster. The crawler produces WARC file containers on local disks, which are then asynchronously transferred into HDFS on the second cluster for further processing and storage.
The second cluster of computers is dedicated to processing and storage of the crawled data. This system builds upon Apache Hadoop and HBase running collocated on the nodes. The in-house developed extraction platform enables the user to create a data-specific workflow of “extractions” used to derive further information from the “raw” data (detect MIME type, extract plain text, and detect news articles). The output of one extractor can serve as an input for another extractor. When the processing of a workflow is finished, the data is ordered and materialized in a bulk fashion in HBase.
| Internet Memory Foundation (IM)
| Hadoop Cluster at the Austrian National Library
||At the Austrian National Library, a dedicated experimental cluster has been set up for the SCAPE project. First, the hardware for the cluster, consisting of one controller and five worker nodes, was selected. Then the installation of the operating system together with a Cloudera Hadoop distribution (CDH) was prepared as a basis for a SCAPE Platform installation. In the following, an overview of the hardware, software and the SCAPE system for workflow demonstration is presented according to the current planning.|| Austrian National Library (ONB)
|Hadoop Cluster using Isilon at Statsbiblioteket, Denmark||The Hadoop cluster consists of four servers and an Isilon scale-out NAS solution from EMC for storage. The cluster uses the Cloudera 4.5 distribution.|| Statsbiblioteket, Denmark (SB)
| Hadoop Cluster at the Science and Technology Facilities Council
||The Hadoop cluster used so far in SCAPE at the Science and Technology Facilities Council (STFC) is maintained and provided by other members of the STFC Scientific Computing Department and is a test facility only available within the STFC firewall. The Hadoop cluster has six slaves providing 70TB of HDFS storage and 24 MapReduce slots; these are managed by two virtual machine head-nodes, the HDFS NameNode and the MapReduce JobTracker. Each slave machine runs both a MapReduce TaskTracker and an HDFS DataNode to minimise the movement of data to the compute resource. The OS used is Scientific Linux, which is a scientifically enhanced version of Red Hat Enterprise Linux.||The Science and Technology Facilities Council (STFC)|
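The IM crawler description above says that crawled URLs are assigned to cluster nodes through a distributed hash table. The following is a minimal sketch of the underlying idea only, assuming plain hash-modulo partitioning on the host name; the function name `node_for_url` and the node count are hypothetical, and the sketch is not taken from the IM implementation.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url: str, num_nodes: int) -> int:
    """Map a URL to a node index by hashing its host name.

    Hashing the host (rather than the full URL) keeps all pages of one
    site on the same node. This is an illustrative assumption.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_nodes


if __name__ == "__main__":
    for url in ("http://example.org/a", "http://example.org/b", "http://example.com/"):
        print(node_for_url(url, 4), url)
```

A production crawler would more likely use consistent hashing, so that adding or removing a node moves only a fraction of the URLs.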
The following tables provide an overview of the demonstration assets created in the SCAPE project. They include the SCAPE Platform, SCAPE Preservation Components, Preservation Watch and Commercial Products.
|SCAPE Execution Platform|| The SCAPE Platform consists of a basic Apache Hadoop installation together with a set of related components from the Apache Hadoop ecosystem, such as Apache HBase, Hive and Pig, and of SCAPE components providing additional interfaces and preservation-specific capabilities.
The SCAPE Preservation Components are divided into the following categories: Characterisation services, Action services and Quality assurance services.
||The FITS tool from Harvard University has been in extensive use by SB in connection with SCAPE. This tool supports the “characterisation” part of the User Story “File Format Identification and Characterisation”, but also the gathering of statistical data for Planning and Watch, as input for the Scout tool.||ONB, SB|
|raw2nx||The raw-to-nexus migration is an example for migrating large volumes of scientific data sets which, depending on the scientific instrument that creates the data, can be either a huge amount of small files or very large files.||STFC|
|Jpylyzer||Jpylyzer is a validator and feature extractor for JP2 images (the still image format that is defined by JPEG2000 Part 1 - ISO/IEC 15444-1)||BL, ONB|
|Matchbox||Matchbox is a computationally intensive quality assurance component which uses a SIFT feature detector to determine key points in an image, which are later used to compare the image with other images in the collection.||ONB|
|xcorrSound||xcorrSound is a package containing three tools used for audio file analysis. It consists of overlap-analysis which is a tool to find the overlap between two audio files, waveform-compare which is a tool that compares the content of two audio files and outputs the similarity, and sound-match which is a tool to find all occurrences of a shorter wav within a larger wav.||SB|
|Pagelyzer||Pagelyzer is a tool for comparing two versions of a web page in the context of web archiving. The tool is based on a combination of structural and visual comparison methods, a visual similarity measure designed for Web pages that improves change detection, and a supervised feature selection method adapted to Web archiving. A Support Vector Machine model is trained with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not.||IM|
|Plato||The preservation planning tool Plato is a decision support tool that implements a solid preservation planning process and integrates services for content characterisation, preservation action and automatic object comparison in a service-oriented architecture to provide maximum support for preservation planning endeavours.||ONB|
|Exlibris Rosetta||The Rosetta demonstration for SCAPE will be hosted at ExLibris and will allow users to view 4 newspapers (sample data from BL) that are saved as Intellectual Entities (IEs). Each IE has 2 different Representations.||EXL|
|Microsoft Azure||SCAPE Azure services will be available as a web portal with functions that support a four step workflow for batch-mode document conversion: ingest and characterisation of document collections, conversion, comparison, and reporting.||MSR|
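The xcorrSound entry in the table above describes sound-match as finding occurrences of a shorter WAV within a larger one. A minimal sketch of that kind of search, assuming normalised cross-correlation over plain lists of samples, is shown here; the function name `best_offset` is hypothetical and the code is not taken from xcorrSound, which operates on real audio files and is likely to use a far faster method.

```python
import math


def best_offset(long_signal, short_signal):
    """Return (offset, score) of the best match of short_signal in long_signal.

    The score is the normalised cross-correlation, which lies in [-1, 1];
    a score of 1.0 means the window is a positive multiple of the short signal.
    """
    n, m = len(long_signal), len(short_signal)
    if m == 0 or m > n:
        raise ValueError("short_signal must be non-empty and no longer than long_signal")
    short_norm = math.sqrt(sum(x * x for x in short_signal))
    best = (0, float("-inf"))
    for offset in range(n - m + 1):
        window = long_signal[offset:offset + m]
        window_norm = math.sqrt(sum(x * x for x in window))
        if window_norm == 0 or short_norm == 0:
            score = 0.0  # silent window or silent pattern: no evidence either way
        else:
            dot = sum(a * b for a, b in zip(window, short_signal))
            score = dot / (window_norm * short_norm)
        if score > best[1]:
            best = (offset, score)
    return best


if __name__ == "__main__":
    print(best_offset([0, 0, 1, -2, 3, 0, 0], [1, -2, 3]))
```

This brute-force loop costs O(n·m) operations; at audio sample rates an FFT-based correlation would be the usual choice.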
3. Preservation User Stories: Stories and Experiments
The following provides an overview of the User Stories for the different SCAPE Testbeds, which shall be implemented in the context of the described demonstrations.
The Web Content Testbed user stories represent the real world challenges in the area of web content preservation. The specific challenge of web archives compared to other application areas is the heterogeneity of its content, especially the huge variety of digital objects of different file formats because a web archive contains text content, images, audio, and video content where only a minor part strictly follows the corresponding file format specifications. Therefore these user stories explain the high level requirements that ensure the future accessibility to the content.
|ARC to WARC Migration||The ARC to WARC migration user story relates to the question of how web archive content should actually be stored for the long term. Originally, content was stored in the ARC format, a format developed by the Internet Archive together with the Heritrix Web Crawler software, which produced these files as the default persistent storage file format for crawled web sites. The format was designed to hold multiple web resources aggregated in a single – optionally compressed – container file. However, this format was not intended as an ideal format for long-term storage; for example, it lacked features for adding contextual information in a standardised way. For this reason, the new WARC format was created as an ISO standard to provide additional features, especially the ability to hold harvested content as well as any metadata related to it in a self-contained manner.||ONB|
|Comparison of Web snapshots||Web archiving consists of capturing web content, which is inherently heterogeneous, complex and highly ephemeral. Once captured, resources are stored in a standard archiving format, (W)ARC, and can be viewed online through access tools that recreate the look and feel of the website. Each of these steps poses challenges of its own that affect the quality of web archives.||SB, ONB|
|File Format Identification and Characterisation of Web Archives|| Web archive data is very heterogeneous. Memory institutions doing web archiving have an implicit or explicit policy that determines which type of material is collected. Therefore, data may be text documents in all kinds of text encoding, HTML content loosely following different HTML specifications, audio and video files that were encoded with a variety of codecs, etc.
In order to take any decisions in preservation, it is indispensable to have detailed information about the content in the web archive, especially those pieces of information that preservation tools depend on. This can lead to different views regarding which types of content, and which properties of that content, need special attention in preservation planning.
The main issue that we are dealing with in this deliverable is the question of how these actions can be achieved using workflows that process large amounts of web archive content at scale.
| ONB, SB, BL
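The identification user story above comes down to labelling each archived payload with a format and aggregating the labels across a large collection. A minimal sketch of that pattern follows, assuming signature ("magic number") matching on the first bytes and a map/reduce-style count; the names `identify`, `map_phase` and `reduce_phase` are hypothetical, and the short signature table is illustrative rather than a replacement for tools such as FITS.

```python
from collections import Counter

# Well-known leading byte signatures; deliberately incomplete.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def identify(payload: bytes) -> str:
    """Return a MIME type guessed from the leading bytes of a payload."""
    for magic, mime in SIGNATURES:
        if payload.startswith(magic):
            return mime
    return "application/octet-stream"


def map_phase(payloads):
    """Emit one (mime_type, 1) pair per payload, as a mapper would."""
    for payload in payloads:
        yield identify(payload), 1


def reduce_phase(pairs):
    """Sum the counts per MIME type, as a reducer would."""
    totals = Counter()
    for mime, count in pairs:
        totals[mime] += count
    return dict(totals)


if __name__ == "__main__":
    sample = [b"%PDF-1.4 ...", b"GIF89a...", b"<html>", b"\xff\xd8\xff\xe0..."]
    print(reduce_phase(map_phase(sample)))
```

On a Hadoop cluster the two phases would run as separate distributed tasks over the records of many (W)ARC files; here they are chained in one process to keep the sketch self-contained.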
The Large Scale Digital Repositories user stories represent the real world challenges in the area of preserving large collections of digital objects in content repositories. The specific challenge of this Testbed is the large number of items contained in digital collections which are managed and preserved by a repository software with defined data ingest and data manipulation procedures.
|Large Scale Audio Migration||As the owner of a large audio collection, I need a digital preservation system that can migrate large numbers of audio files from one format to another and ensure that the migration is a good and complete copy of the original||SB|
|Large Scale Image Migration||As a curator of image files, I need a digital preservation system that can migrate a large number of images from one format to another, ensuring that the migrated images conform to our institutional profile, that no image data is lost and that the migration is cost effective (saving storage for example)||BL|
|Policy-Driven Identification of Preservation Risks in Electronic Document Formats||As a Digital Library holding a large number of electronic documents from various sources, I need a digital preservation system that can help me to identify preservation risks within these files to ensure that my institution is aware of preservation risks and can sustainably manage the content||BL|
| Quality Assurance of Digitized Books
||As a cultural heritage institution, we need a digital preservation system that can identify books within a large digital book collection that contain duplicated book pages and inform us of the pages within those books that are duplicate images||ONB|
|Validation of Archival Content Against an Institutional Policy||As a memory institution, I need to be able to ensure that content in our repositories conforms both to its file format specification and (where appropriate) the profile of that format as specified by the institutional policies. This is to ensure that our content conforms to existing preservation policies and also that content we ingest is acceptable within the bounds of those policies.||SB|
|Migration from Local Format to Domain Standard Format||As the content holder/manager of scientific data held in a local format, I wish to migrate this data into a domain standard format to reduce the risks of losing the ability to read/use and reuse the data contained within the file format||STFC|
|Normalise to Disparate Tabular Data Sources [Stopped]||In order to ensure the long-term survival of a research dataset we need to ensure that the copy we hold is manageable, contains the relevant data and is in a format that promotes digital preservation. To this end we need a digital preservation system that can extract data from disparate tabular data sources, compile that data into a single preservation format output file and verify that the relevant data is present.||BL|
|Preserving the Context and Links to Research Data or Preserving Research Objects||As a content holder/manager of research data I wish to collect together objects or links to other objects relating to the data object which will enable the data object to remain useable/reusable over the long term||STFC|
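Several of the user stories above, in particular validation against an institutional policy and large scale image migration, involve comparing extracted file properties with a profile the institution has defined. A minimal sketch of such a check is given here, assuming the properties are already available as a dictionary; the property names and policy values are hypothetical and do not come from any SCAPE policy document.

```python
def check_against_policy(properties: dict, policy: dict) -> list:
    """Return (name, found_value) pairs for every property that breaks the policy.

    `policy` maps a property name to the set of values the institution accepts.
    A property missing from `properties` is reported with the value None.
    """
    violations = []
    for name, allowed in policy.items():
        value = properties.get(name)
        if value not in allowed:
            violations.append((name, value))
    return violations


if __name__ == "__main__":
    # Hypothetical institutional profile for migrated images.
    policy = {"isValid": {True}, "compression": {"lossless"}}
    # Hypothetical properties extracted from one migrated file.
    properties = {"isValid": True, "compression": "lossy"}
    print(check_against_policy(properties, policy))
```

An empty result means the file conforms to the profile; any entries returned can feed a report of non-conforming content.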
| British Library (BL)
||William Palmer||[email protected]|
| Internet Memory Foundation (IM)
||Leila Medjkoune||[email protected]|
|Microsoft Research (MSR)||Ivan Vujic||[email protected]|
| Austrian National Library (ONB)
||Sven Schlarb||[email protected]|
| Statsbiblioteket (SB)
||Rune Ferneke-Nielsen||[email protected]|
| Science and Technology Facilities Council (STFC)
||Catherine Jones||[email protected]|