
h2. SCAPE: Big Data meets Digital Preservation

Ross King, Rainer Schmidt, AIT Austrian Institute of Technology GmbH

Christoph Becker, Technical University of Vienna

Sven Schlarb, Austrian National Library


h3. Introduction

The fact that the volume of digital content worldwide is increasing geometrically demands that preservation activities become more scalable. The economics of long-term storage and access demand that they become more automated. The present state of the art fails to address the need for scalable, automated solutions for tasks like the characterization or migration of very large collections. Standard tools break down when faced with very large or complex digital objects; standard workflows break down when faced with very large numbers of objects or with heterogeneous collections. Yet the collections at European memory institutions grow larger every day, and many are already in the Petabyte range. In short, digital preservation is becoming an application area for Big Data.


The EU FP7 ICT project SCAPE (Scalable Preservation Environments), running since February 2011, was initiated to address some of these Big Data problems. Data analysis and scientific workflow management play a particularly important role in this work.

h3. Testbeds

The Testbeds validate the results of the SCAPE project in three application areas: Digital Repositories from the library community, Web Content from the web archiving community, and Research Data Sets from the scientific community. The Testbeds describe issues that internal and external institutions currently face, with a special focus on large-scale data sets that pose a real scalability challenge. The solutions to be evaluated range from single-tool solutions, such as tools for data format migration, analysis and identification, or quality assurance, to complex solutions like the SCAPE Platform or the Planning and Watch services. The Testbeds implement these scenarios and evaluate the results by creating executable, software-based workflows, built with the workflow design and execution workbench Taverna, that make use of the different kinds of tools and services created by the project. Both single-tool and complex solutions are evaluated against defined institutional data sets in order to learn how they perform in real-life institutional scenarios, such as repository ingest workflows or data archive maintenance.
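
As an illustration, the following is a minimal sketch of the kind of migration-plus-quality-assurance workflow a Testbed might wrap: each image in a collection is migrated from TIFF to JPEG 2000, and the result is validated before the migration is accepted. It is written in plain Python rather than as an actual Taverna workflow, and the tool choices (ImageMagick's convert, jpylyzer for JP2 validation) are illustrative assumptions, not tools mandated by SCAPE.

{code:language=python}
# Sketch of a migration + quality-assurance workflow of the kind the
# Testbeds express in Taverna. Tool choices (ImageMagick "convert",
# "jpylyzer" for JP2 validation) are illustrative assumptions.
import subprocess
import sys
from pathlib import Path

def migrate(tiff_path: Path, jp2_path: Path) -> None:
    """Migrate a TIFF image to JPEG 2000 using ImageMagick."""
    subprocess.run(["convert", str(tiff_path), str(jp2_path)], check=True)

def validate(jp2_path: Path) -> bool:
    """Quality-assurance step: accept the migration only if the
    validator exits cleanly and its XML report declares the file valid
    (a crude substring check on the report, for illustration only)."""
    result = subprocess.run(["jpylyzer", str(jp2_path)],
                            capture_output=True)
    return result.returncode == 0 and b"True</isValid" in result.stdout

def run_workflow(collection_dir: str) -> None:
    for tiff in Path(collection_dir).glob("*.tif"):
        jp2 = tiff.with_suffix(".jp2")
        migrate(tiff, jp2)
        status = "OK" if validate(jp2) else "FAILED"
        print(f"{tiff.name} -> {jp2.name}: {status}")

if __name__ == "__main__":
    run_workflow(sys.argv[1])
{code}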

h3. Data Analysis and Preservation Platform

The _SCAPE Platform_ provides an extensible infrastructure for the execution of digital preservation workflows on large volumes of data. It is designed as an integrated system for content holders, employing a scale-out architecture to execute digital preservation processes on archived content. The system builds on distributed, data-centric systems and programming models such as Hadoop, HDFS, and MapReduce. A suitable storage abstraction will integrate content repositories at the storage level, allowing fast data exchange between the repository and the execution system. Many data sets in SCAPE consist of large binary objects that must be pre-processed before they can be expressed using a structured data model. The Platform will therefore implement a storage hierarchy for processing and archiving content, relying on a combination of distributed database and file system storage. Workflows may be created on desktop computers by end users (such as data curators) by assembling components in a graphical workbench. The Platform's execution system transforms these workflows into programs that can execute on a distributed data processing environment, such as that provided by MapReduce implementations. Moreover, the Platform is in charge of shipping and deploying, on demand, the required preservation tools to the cluster nodes that hold the data; this is facilitated by a packaging model and a corresponding software repository. For interacting with the system, we aim to support a variety of task, workflow, and query description languages, based on existing techniques such as the Taverna Workbench, Unix pipes, the JAQL query language, and Apache Pig.
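
To make the execution model concrete, the sketch below shows how a simple characterization task could be phrased in the MapReduce style the Platform targets, here as a Hadoop Streaming job: the mapper shells out to a characterization tool for each file path it receives and emits (format, 1) pairs, and the reducer aggregates them into a format profile of the collection. The use of the Unix "file" utility as the characterization tool is an assumption for illustration; a SCAPE identification tool would take its place.

{code:language=python}
# Sketch of a collection characterization job for Hadoop Streaming.
# The mapper receives one file path per input line; "file --mime-type"
# stands in for a real format identification tool.
import subprocess
import sys

def mapper() -> None:
    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        result = subprocess.run(
            ["file", "--brief", "--mime-type", path],
            capture_output=True, text=True)
        mime = result.stdout.strip() or "unknown"
        print(f"{mime}\t1")

def reducer() -> None:
    # Hadoop Streaming delivers mapper output sorted by key, so equal
    # MIME types arrive in contiguous runs.
    current, count = None, 0
    for line in sys.stdin:
        mime, _, n = line.rstrip("\n").partition("\t")
        if mime != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = mime, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
{code}

For local testing the same script can be run as a Unix pipeline, `cat paths.txt | python characterize.py map | sort | python characterize.py reduce`; under Hadoop Streaming the two commands would be passed as the -mapper and -reducer arguments of a streaming job.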


h3. Scalable Planning and Monitoring

Through its data-centric execution platform, SCAPE will substantially improve scalability for handling massive amounts of data and for securing quality assurance without human intervention. But for a system to be truly operational at large scale, all of its components need to scale up: only scalable monitoring and decision making enable automated, large-scale operation, by scaling up the control structures, policies, and processes for monitoring and action. SCAPE will therefore address the bottleneck of decision processes and of processing the information required for decision making. Based on well-established principles and methods \[1\], the project will automate currently manual aspects such as constraint modelling, requirements reuse, measurement, and continuous monitoring by integrating existing and evolving information sources and measurements.
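
As a rough sketch of what automated monitoring against modelled constraints could look like, the snippet below checks a collection's format profile (such as the one produced by the characterization job sketched above) against a small policy and flags the collection for re-planning when a constraint is violated. The constraint model, the thresholds, and the profile data are invented for illustration; SCAPE's actual policy and watch components are considerably richer.

{code:language=python}
# Sketch of automated "watch": evaluating a collection's format profile
# against machine-readable policy constraints. The constraints and the
# sample profile below are hypothetical.

# Format profile: MIME type -> object count.
profile = {
    "image/tiff": 120_000,
    "image/jp2": 45_000,
    "unknown": 7_500,
}

def share(p: dict, mime: str) -> float:
    """Fraction of the collection with the given MIME type."""
    total = sum(p.values())
    return p.get(mime, 0) / total if total else 0.0

# Policy constraints expressed as (description, predicate) pairs.
constraints = [
    ("unidentified content below 2%", lambda p: share(p, "unknown") < 0.02),
    ("collection not empty", lambda p: sum(p.values()) > 0),
]

def monitor(p: dict) -> bool:
    """Return True if all constraints hold; report violations otherwise.
    In a full watch component, a violation would trigger re-planning."""
    ok = True
    for description, predicate in constraints:
        if not predicate(p):
            print(f"VIOLATED: {description}")
            ok = False
    return ok

if __name__ == "__main__":
    if not monitor(profile):
        print("Collection flagged for preservation re-planning.")
{code}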

h3. Conclusions

The SCAPE project has been running for one year and will continue for another two and a half years. Initial results are publicly available on the project website: www.scape-project.eu


h3. References

\[1\] Christoph Becker, Hannes Kulovits and Andreas Rauber: "Trustworthy Preservation Planning with Plato", ERCIM News 80, January 2010.