Learning Outcomes (by the end of the session the attendees will be able to):

  1. Understand scalable platforms and evaluate the situations in which such environments are required. 
  2. Apply knowledge of existing tools to solve migration and quality control problems. 
  3. Combine and modify tool chains in order to create automated workflows for migration and quality control. 
  4. Implement best practice for discovering and sharing workflows for use and re-use. 
  5. Make use of a scalable environment and apply a number of workflows to automatically perform migration and quality assurance checks on a large number of objects.
  6. Identify a number of potential problems when working in a scalable environment and propose solutions.
  7. Understand the potential to use scalable platforms in digital preservation and synthesise new opportunities within your own environments.

Event Evaluation Survey:


Monday 16 September

Time Session Facilitator Learning outcomes
09.15 - 09.45 Registration and coffee    
09.45 - 09.50 Welcome and housekeeping BL  
09.50 - 10.20 Building Scalable Environments 
Understanding the fundamentals of scalability and why it is important:
  • What are scalable platforms?
  • Why use scalable platforms? 
  • What are the key considerations?
  • What is the SCAPE platform?

Rainer Schmidt, AIT
10.20 - 10.35 Use case
Migrating TIFF to JPEG 2000 at the British Library
Peter May / Will Palmer, BL 1
10.35 - 11.15 Practical exercise
Experiment with a pre-built environment to migrate TIFF to JPEG 2000.
Delegates can bring their own images or use sample files.
Rainer Schmidt (AIT),
Roman Graf (AIT),
Matthias Rella (AIT)
Dave Tarrant, OPF
11.15 - 11.30 Coffee break    
11.30 - 12.45 Migration and Quality Assurance 
Exploring migration and quality control tools for images and understanding
how these are invoked on a single machine instance.
Demonstration and practical exercise:
How ImageMagick and Jpylyzer are run on a single TIFF to JPEG 2000
conversion. This exercise will be carried out in your own local
instance and not built to scale.
Sven Schlarb (ONB)
Carl Wilson (OPF)
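
As a rough illustration of the single-machine exercise above, the sketch below runs one TIFF to JPEG 2000 conversion with ImageMagick and then checks the result with Jpylyzer. It is not the official exercise material. It assumes that the legacy `convert` command (ImageMagick 7 calls it `magick`) and `jpylyzer` are on the PATH, that ImageMagick was built with JPEG 2000 support, and that the Jpylyzer report carries its verdict in an `isValidJP2` or `isValid` element, which depends on the Jpylyzer release. The function names are hypothetical.

```python
import subprocess
import xml.etree.ElementTree as ET


def convert_command(tiff_path, jp2_path):
    # ImageMagick picks the output format from the file extension.
    return ["convert", tiff_path, jp2_path]


def jpylyzer_command(jp2_path):
    # Jpylyzer writes an XML validation report to standard output.
    return ["jpylyzer", jp2_path]


def is_valid_jp2(report_xml):
    """Return True if the Jpylyzer report marks the file as valid.

    Assumption: the verdict sits in an element named isValidJP2
    (older releases) or isValid (newer releases). Namespaces are ignored.
    """
    root = ET.fromstring(report_xml)
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in ("isValidJP2", "isValid"):
            return (element.text or "").strip() == "True"
    return False


def migrate_and_check(tiff_path, jp2_path):
    # Requires ImageMagick and Jpylyzer to be installed locally.
    subprocess.run(convert_command(tiff_path, jp2_path), check=True)
    report = subprocess.run(
        jpylyzer_command(jp2_path), check=True, capture_output=True, text=True
    )
    return is_valid_jp2(report.stdout)
```

Calling `migrate_and_check("page.tif", "page.jp2")` on a sample file would convert it and return whether Jpylyzer considers the result valid.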

12.45 - 13.30 Lunch    
13.30 - 15.00 Workflows 
With the tools explored, we will introduce workflows and look at how these
can be used to invoke multiple operations to both migrate  
content and run quality control checks on the results. 

Demonstration and practical exercise
In this exercise we will create a simple quality-assured image file format
migration workflow. Before starting the migration, we check whether the TIFF
input images are valid instances of the format using FITS (which uses JHOVE
under the hood). If the images are valid TIFF images, we migrate them to the
JPEG 2000 (JP2) image file format using OpenJPEG 2.0 and check that the
migrated images are valid JP2 images using Jpylyzer. Finally, we verify that
the migrated JP2 images are valid surrogates of the original TIFF images by
restoring a TIFF image from each converted JP2 image and checking that the
original and restored images are identical.
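
The control flow of the workflow described above can be sketched in a few lines of Python. This is only an illustration of the order of the checks, not the Taverna workflow used in the exercise. The step names in the `steps` dictionary are hypothetical, and in practice each step would wrap one of the tools named above (FITS, OpenJPEG, Jpylyzer and an image comparison tool).

```python
def run_migration_workflow(tiff_path, steps):
    """Run the quality-assured migration for one image.

    `steps` is a dict of callables (hypothetical names):
      validate_tiff(path) -> bool
      migrate(tiff_path) -> jp2_path
      validate_jp2(path) -> bool
      restore(jp2_path) -> restored_tiff_path
      identical(original_path, restored_path) -> bool
    Returns a (status, detail) tuple.
    """
    # 1. Only valid TIFF inputs are migrated.
    if not steps["validate_tiff"](tiff_path):
        return ("skipped", "input is not a valid TIFF")
    # 2. Migrate to JP2 and validate the result.
    jp2_path = steps["migrate"](tiff_path)
    if not steps["validate_jp2"](jp2_path):
        return ("failed", "migrated file is not a valid JP2")
    # 3. Restore a TIFF from the JP2 and compare it with the original.
    restored_path = steps["restore"](jp2_path)
    if not steps["identical"](tiff_path, restored_path):
        return ("failed", "restored image differs from the original")
    return ("ok", jp2_path)


# Stand-in steps so the sketch runs without any tools installed.
demo_steps = {
    "validate_tiff": lambda path: path.endswith(".tif"),
    "migrate": lambda path: path[:-4] + ".jp2",
    "validate_jp2": lambda path: True,
    "restore": lambda path: path[:-4] + ".restored.tif",
    "identical": lambda original, restored: True,
}

print(run_migration_workflow("page001.tif", demo_steps))
```

Passing the steps in as callables keeps the order of the checks separate from the tools that implement them, which is roughly what a workflow system does when it wires tool invocations together.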

Again this exercise will be carried out in your own local instance and not built to scale.
Exercise: Follow the Workflows Exercise Worksheet.

Sven Schlarb (ONB)

15.00 - 15.15 Coffee
15.15 - 16.30 How to share your workflow 
Having built a workflow we look at how to share and discover  
other workflows.

Practical exercise:
Describe and upload workflows

Donal Fellows (UNIMAN)  
16.30 - 17.00 Wrap up Dave Tarrant, OPF
Rainer Schmidt, AIT
17.00 Close    
19.30 Event dinner at The Betjeman Arms    

Tuesday 17 September

Time Session Facilitator Learning outcomes
09.00 - 09.15 Coffee, welcome back and overview of agenda for the day Dave Tarrant, OPF  
09.15 - 10.15 Introduction to preservation at scale and the SCAPE Platform
This session introduces the Hadoop platform and its
application for executing preservation workflows in a distributed environment.

More than just "getting the job done" we look at the tools for  
monitoring and controlling complex operations at scale and  
look at how these can be used to identify potential problems.
Sven Schlarb, ONB
Rainer Schmidt, AIT
11.00 - 11.15 Coffee    
11.15 - 12.30 Building Scalable Environments continued

Practical exercises
Set up the Hadoop test installation and run an example that compares local execution with (pseudo-distributed) cluster execution.
Use Hadoop MapReduce for statistical result analysis.
Exercise: Follow the File Format Identification and Result Analysis using Hadoop worksheet.
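
To give a feel for the kind of statistical result analysis MapReduce is used for, here is a minimal sketch that counts identified file formats. It simulates the map, group and reduce phases locally in plain Python rather than running on Hadoop. The input layout (one tab-separated "path, MIME type" record per line) and all function names are assumptions made for illustration, not the format used in the worksheet.

```python
from collections import defaultdict


def map_record(line):
    # Assumed input: one "path<TAB>mime_type" record per line.
    _path, mime_type = line.rstrip("\n").rsplit("\t", 1)
    # Emit one count per identified file, keyed by its format.
    yield mime_type, 1


def reduce_counts(key, values):
    # Sum all counts emitted for the same format.
    return key, sum(values)


def run_local(lines):
    # Group mapper output by key, as the shuffle phase would on a cluster.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in sorted(grouped.items()))


records = [
    "scans/a.tif\timage/tiff\n",
    "scans/b.tif\timage/tiff\n",
    "scans/c.jp2\timage/jp2\n",
]
print(run_local(records))
```

On a cluster the same mapper and reducer logic would run in parallel over many input splits, and only the grouping step would be handled by the framework.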
Schmidt, Graf, Rella (AIT)
Sven Schlarb (ONB)

12.30 - 13.30 Lunch    
13.30 - 14.30 Matchbox: Quality Control for digital collections

Invited talk: Introduction to the SCAPE repository reference implementation
This talk will introduce the SCAPE repository reference implementation
as a guide to get you started. It will discuss future opportunities for
scalability in digital object management systems such as Fedora 4.
Roman Graf (AIT)

Matthias Hahn (FIZ)


14.45 - 15.00 Coffee break
15.00 - 16.00 Integrating Taverna and Hadoop 
This final session recaps the work that has been done to this point and
allows attendees to fully integrate a number of workflows (both of their
own making as well as existing ones) into a scalable preservation
platform on-site.
Exercise: Use Taverna to combine the Hadoop jobs introduced in the File Format Identification and Result Analysis using Hadoop worksheet.

Sven Schlarb (ONB)
Rella, Schmidt, Graf (AIT)

16.00 - 17.00 Panel and wrap up Rainer Schmidt, AIT
Dave Tarrant, OPF
17.00 Close    

Useful Links

Note that while the last link is specific to a network of Raspberry Pis, parts of the same guide can be used to set up your own Hadoop cluster without requiring Cloudera.

