Learning Outcomes (by the end of the session attendees will be able to):
- Understand scalable platforms and evaluate the situations in which such environments are required.
- Apply knowledge of existing tools to solve migration and quality control problems.
- Combine and modify tool chains in order to create automated workflows for migration and quality control.
- Implement best practice for discovering and sharing workflows for use and re-use.
- Make use of a scalable environment and apply a number of workflows to automatically perform migration and quality assurance checks on a large number of objects.
- Identify a number of potential problems when working in a scalable environment and propose solutions.
- Understand the potential for using scalable platforms in digital preservation and synthesise new opportunities within their own environments.
Event Evaluation Survey: http://www.surveymonkey.com/s/CPWN6H5
Agenda:
Monday 16 September
Time | Session | Facilitator | Learning outcomes |
---|---|---|---|
09.15 - 09.45 | Registration and coffee | ||
09.45 - 09.50 | Welcome and housekeeping | BL | |
09.50 - 10.20 | Building Scalable Environments: understanding the fundamentals of scalability and why it is important. | Rainer Schmidt, AIT | 1 |
10.20 - 10.35 | Use case: Migrating TIFF to JPEG 2000 at the British Library | Peter May / Will Palmer, BL | 1 |
10.35 - 11.15 | Practical exercise: experiment with a pre-built environment to migrate TIFF to JPEG 2000. Delegates can bring their own images or use sample files. | Rainer Schmidt, Roman Graf, Matthias Rella (AIT); Dave Tarrant (OPF) | 2 |
11.15 - 11.30 | Coffee break | ||
11.30 - 12.45 | Migration and Quality Assurance: exploring migration and quality control tools for images and understanding how they are invoked on a single machine instance. Demonstration and practical exercise: how tools such as ImageMagick can be used for conversion (a short sketch follows the Monday agenda below). This exercise is carried out in your own local instance and is not built to scale. | Sven Schlarb (ONB), Carl Wilson (OPF) | 2 |
12.45 - 13.30 | Lunch | ||
13.30 - 15.00 | Workflows: with the tools explored, we introduce workflows and look at how they can be used to invoke multiple operations to both migrate content and run quality control checks on the results. Demonstration and practical exercise: create a simple quality-assured image file format migration workflow. Before starting the actual migration, we check that the TIFF input images are valid file format instances using FITS (which uses validation tools such as JHOVE under the hood). If the images are valid TIFFs, we migrate them to the JPEG 2000 (JP2) image file format using OpenJPEG 2.0, check that the results are valid JP2 images using Jpylyzer, and verify that the JP2 images are valid surrogates of the original TIFFs by restoring the original TIFF from the converted JP2 and comparing whether the original and restored images are identical (a sketch of these steps follows the Monday agenda below). Again, this exercise is carried out in your own local instance and is not built to scale. Exercise: follow the Workflows Exercise worksheet. | Sven Schlarb (ONB) | 3 |
15.00 - 15.15 | Coffee | | |
15.15 - 16.30 | How to share your workflow: having built a workflow, we look at how to share it and discover other workflows. Practical exercise: describe and upload workflows. Presentation. | Donal Fellows (UNIMAN) | 4 |
16.30 - 17.00 | Wrap up | Dave Tarrant (OPF), Rainer Schmidt (AIT) | |
17.00 | Close | ||
19.30 | Event dinner at The Betjeman Arms | | |
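
The Migration and Quality Assurance session works with plain command-line conversion on a single machine. The sketch below is illustrative only and assumes ImageMagick is installed with JPEG 2000 support; the `input`/`output` directory names are placeholders, not part of the workshop materials.

```python
import subprocess
from pathlib import Path

def migrate_tiff_to_jp2(tiff_path: Path, jp2_path: Path) -> None:
    """Convert one TIFF to JPEG 2000 using ImageMagick's convert tool."""
    subprocess.run(["convert", str(tiff_path), str(jp2_path)], check=True)

if __name__ == "__main__":
    out_dir = Path("output")          # placeholder directory names
    out_dir.mkdir(exist_ok=True)
    for tiff in Path("input").glob("*.tif"):
        migrate_tiff_to_jp2(tiff, out_dir / (tiff.stem + ".jp2"))
```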
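The Workflows session chains validation, migration and verification. Below is a rough Python sketch of those steps, assuming the OpenJPEG 2.0 tools (`opj_compress`, `opj_decompress`), Jpylyzer and ImageMagick's `compare` are on the PATH; the initial FITS/JHOVE validation step is omitted, and the Jpylyzer check is deliberately simplified rather than a proper XML inspection.

```python
import subprocess
from pathlib import Path

def looks_valid_jp2(jp2: Path) -> bool:
    """Crude check: scan Jpylyzer's XML report for a positive validity flag."""
    xml = subprocess.run(["jpylyzer", str(jp2)],
                         check=True, capture_output=True, text=True).stdout
    return "True" in xml  # simplified; a real workflow would parse the XML

def migrate_with_qa(tiff: Path, workdir: Path) -> bool:
    """Migrate one TIFF to JP2, then verify it can be restored to an identical image."""
    jp2 = workdir / (tiff.stem + ".jp2")
    restored = workdir / (tiff.stem + ".restored.tif")

    # 1. Migrate TIFF -> JP2 with OpenJPEG 2.0 (default settings, assumed lossless here).
    subprocess.run(["opj_compress", "-i", str(tiff), "-o", str(jp2)], check=True)

    # 2. Check that the migrated file is a valid JP2 with Jpylyzer.
    if not looks_valid_jp2(jp2):
        return False

    # 3. Restore a TIFF from the JP2 and compare pixels with ImageMagick's compare.
    subprocess.run(["opj_decompress", "-i", str(jp2), "-o", str(restored)], check=True)
    diff = subprocess.run(["compare", "-metric", "AE", str(tiff), str(restored), "null:"],
                          capture_output=True, text=True)
    return diff.stderr.strip() == "0"  # AE metric: number of differing pixels
```
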
Tuesday 17 September
Time | Session | Facilitator | Learning outcomes |
---|---|---|---|
09.00 - 09.15 | Coffee, welcome back and overview of agenda for the day | Dave Tarrant, OPF | |
09.15 - 10.15 | Introduction to preservation at scale. This session introduces the Hadoop platform and its application for executing preservation workflows in a distributed environment. More than just "getting the job done", we look at the tools for monitoring and controlling complex operations at scale and at how these can be used to identify potential problems. | Sven Schlarb (ONB), Rainer Schmidt (AIT) | 5 |
11.00 - 11.15 | Coffee | ||
11.15 - 12.30 | Building Scalable Environments (continued). Practical exercises: set up the Hadoop test installation and run an example that allows comparing local execution with (pseudo-distributed) cluster execution; use Hadoop MapReduce for statistical analysis of the results. Exercise: follow the File Format Identification and Result Analysis using Hadoop worksheet (a MapReduce sketch follows the Tuesday agenda below). | Rainer Schmidt, Roman Graf, Matthias Rella (AIT); Sven Schlarb (ONB) | 5, 6 |
12.30 - 13.30 | Lunch | ||
13.30 - 14.30 | Matchbox: quality control for digital collections. Invited talk: introduction to the SCAPE repository reference implementation. This talk will introduce the SCAPE repository reference implementation as a guide to getting started, and will discuss the opportunities and potential for future scalability with respect to digital object management systems such as Fedora 4. | Roman Graf (AIT), Matthias Hahn (FIZ) | 2, 7 |
14.45 - 15.00 | Coffee break | | |
15.00 - 16.00 | Integrating Taverna and Hadoop. This final session recaps the work done to this point and allows attendees to integrate a number of workflows (both of their own making and existing ones) into the scalable preservation platform on-site. Exercise: combine the Hadoop jobs introduced in the File Format Identification and Result Analysis using Hadoop worksheet using Taverna (a sketch of submitting such a job from outside Taverna follows the Tuesday agenda below). | Sven Schlarb (ONB); Matthias Rella, Rainer Schmidt, Roman Graf (AIT) | 5 |
16.00 - 17.00 | Panel and wrap up | Rainer Schmidt (AIT), Dave Tarrant (OPF) | 7 |
17.00 | Close | ||
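
The File Format Identification and Result Analysis worksheet runs identification across many files and aggregates the results with MapReduce. Below is a minimal Hadoop Streaming sketch in Python, not the worksheet's own code: it assumes the job input is a text file listing one file path per line, that those paths are readable from the worker nodes, and that the Unix `file` utility is available.

```python
#!/usr/bin/env python3
# mapper.py: read file paths from stdin, emit "<mime-type>\t1" per file.
import subprocess
import sys

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    mime = subprocess.run(["file", "--brief", "--mime-type", path],
                          capture_output=True, text=True).stdout.strip()
    print(f"{mime}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: sum counts per MIME type (streaming delivers lines sorted by key).
import sys

current, count = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = key, 0
    count += int(value or 1)
if current is not None:
    print(f"{current}\t{count}")
```

Running both scripts locally (`cat filelist.txt | ./mapper.py | sort | ./reducer.py`) gives the same counts as a pseudo-distributed run, which is a handy way to compare local and cluster execution as in the morning exercise.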
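For the Integrating Taverna and Hadoop session, the glue between the two is typically just a command-line job submission, which Taverna can wrap as a Tool service. Below is a hedged sketch of submitting the streaming job from the previous block programmatically; the jar location and the input/output paths are placeholders that will differ on your cluster.

```python
import subprocess

# Placeholder path; adjust to wherever your Hadoop distribution keeps the streaming jar.
STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"

def run_identification_job(input_path: str, output_path: str) -> None:
    """Submit the mapper.py/reducer.py streaming job sketched above."""
    subprocess.run([
        "hadoop", "jar", STREAMING_JAR,
        "-input", input_path,
        "-output", output_path,
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-file", "mapper.py",
        "-file", "reducer.py",
    ], check=True)

if __name__ == "__main__":
    # In Taverna the equivalent command line would sit inside a Tool service,
    # with the input and output paths exposed as workflow ports.
    run_identification_job("filelist.txt", "format-counts")
```
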
Useful Links
- Amazon Map Reduce Services - http://aws.amazon.com/elasticmapreduce/mapr/
- Cloudera Manager, graphical management of your Hadoop cluster - http://www.cloudera.com/content/cloudera/en/products/cloudera-manager.html
- Amazon + Cloudera - Up and running in 20 minutes - http://www.thecloudavenue.com/2013/04/setting-up-cdh-cluster-on-amazon-ec2-in.html
- Building a Hadoop cluster on Raspberry Pis - http://blog.ittoby.com/2013/08/starting-small-set-up-hadoop-compute.html
Note that while the last link is specific to a network of Raspberry Pis, parts of the same guide can be used to set up your own Hadoop cluster without requiring Cloudera.