h4. Overview

This first SCAPE training course focuses on one of the biggest initial challenges in digital preservation, *file format identification*. While there has been a lot of work in this area, the ever-changing nature of digital formats realistically means the problem will never be "solved". This training course will give you the knowledge and experience to confidently choose file format identification and basic characterisation tools.

In addition to introducing the tools, this training session will draw on the expertise of the SCAPE project in *integrating tools* into your own scalable systems and workflows.

With more businesses and organisations actively deploying preservation services, there is a *critical need* for more knowledge to be shared and services to be developed in order to *inform of change*. Panels and open discussion sessions will provide a valuable *space for voices* to discuss the latest preservation services to *monitor change*.

With regard to file format identification and characterisation, there are a number of recent conference papers and blogs outlining the *ongoing challenge*, and many of the writers of these resources will be at the event for open face-to-face discussion.

(becky list the iPres scape papers and blogs on file format issues from me/maurice etc etc)


h4. Who should attend?   

Digital preservation practitioners: digital librarians and archivists, digital curators, repository managers, or anyone with a responsibility to manage digital collections. To get the most out of this training course, attendees should ideally have some experience of command line interfaces and system architecture.

Developers who actively deploy such systems are also encouraged to attend. 

h4. Learning Outcomes (by the end of the session the attendees will be able to):
{panel}
# Distinguish between different file types and identify the requirements for characterising each of them.
# Carry out a number of identification, characterisation, and duplication detection experiments on example files.
# Critically evaluate characterisation and identification tools and assess their advantages and disadvantages when used in different scenarios.
# Compare and contrast the differences in running characterisation and identification tools both stand-alone and within workflows.
# Envisage a system that combines workflows with identification, characterisation and validation tools to suit a variety of scenarios.
# Conduct an in-depth analysis of large volumes of identification and characterisation data and find representative sample records suitable for preservation planning experiments.
{panel}


h4. Agenda: 

h4. Thursday 6 December 



|| Time || Session || Facilitator || Learning outcomes ||
| 09.30 - 10.00 | Registration | | |
| 10.00 - 10.15 | Welcome and housekeeping | Miguel Ferreira, KEEPS | |
| 10.15 - 11.15 | *Introduction to file formats* \\
Understanding the different requirements for identification and characterisation experiments \\
\\
*File format identification and characterisation tools: file, DROID, Tika, ExifTool* \\
What can they do? \\
File Format Identification, File Format Characterisation, \\
File Format Validation, File Format Signature Files \\ | \\
Carl Wilson, OPF \\
Asger Blekinge, SB \\
Dave Tarrant, OPF | 1 |
| 11.15 - 11.30 | Coffee | | |
| 11.30 - 12.45 | *Applying file format tools to different scenarios* (demonstrations) \\
How do they compare? | Carl Wilson, OPF  \\
Asger Blekinge, SB  \\
Dave Tarrant, OPF | 1 |
| 12.45 - 13.45 | Lunch | | |
| 13.45 - 15.15 | *Break out groups: practical exercises* \\
Creating file format profiles with an example dataset \\
Command line processing \\
\\
Evaluation of the results \\ | Carl Wilson, OPF \\
Asger Blekinge, SB \\
Dave Tarrant, OPF \\
\\ | 2 |
| 15.15 - 15.30 | Coffee | | |
| 15.30 - 16.30 | *Wrapping tools for identification and characterisation* \\
FITS (File Information Tool Set)  \\
\\
*Panel session: advantages and disadvantages of wrapping tools* \\
Q&A | Petar Petrov, TUWIEN \\
\\
\\
All | 3 \\ |
| 16.30 - 17.00 | Wrap up | Dave Tarrant, OPF | |
| 17.00 | Close | | |
| 20.00 | Event dinner | | |

h4. Friday 7 December

|| Time || Session || Facilitator || Learning outcomes ||
| 09.00 - 09.10 | Welcome back, overview of agenda for the day | Dave Tarrant, OPF | |
| 09.10 - 10.15 | *Using file format identification tools as part of a workflow* \\
Introduction to Taverna workflows \\
Demonstration: Web archive content identification over ARC files \\
using Tika in a Taverna workflow \\ | Sven Schlarb, ONB | 4 |
| 10.15 - 10.30 | Coffee | | |
| 10.30 - 11.45 | *Comparing the Taverna workflow with a DROID version of the workflow* \\
Introduction to file format identification using a Hadoop cluster (demonstration) \\
Understanding the implementation differences \\ | Sven Schlarb, ONB \\ | 4 \\ |
| 11.45 - 12.15 | Comparison of results | Sven Schlarb, ONB \\ | |
| 12.15 - 13.15 | Lunch | | |
| 13.15 - 13.45 | *Content profiling and planning* \\
Introduction and motivation of large-scale content profiling for preservation analysis \\ | Petar Petrov, TUWIEN \\ | 5 \\ |
| 13.45 - 14.15 | *Practical exercise:* analysing an example scenario file set without a content profiler \\
Discussion of results | Petar Petrov, TUWIEN \\ | 6 |
| 14.15 - 14.45 | *c3po* (a content profiling prototype): demonstration of the tool and its capabilities \\ | | |
| 14.45 - 15.15 | *Quality control for digital collections: the matchbox tool* \\
Identifying duplicate images in digital collections | Roman Graf, AIT | 4  \\ |
| 15.15 - 15.30 | Coffee | | |
| 15.30 - 16.30 | *Practical exercise:* analysing the scenario file set using c3po \\
Comparing the results and lessons learned | Petar Petrov, TUWIEN \\ | 6 \\ |
| 16.30 - 17.00 | Wrap up discussion and event evaluation | Dave Tarrant, OPF \\ | |
| 17.00 | Close | | |