
Thursday 6 December - Friday 7 December

See the session plan example and agenda structure for guidance.

Learning Outcomes (by the end of the session the attendees will be able to):

  1. Distinguish between different file types and identify the requirements for characterising each of them.
  2. Carry out a number of identification, characterisation, and duplication detection experiments on example files.
  3. Critically evaluate characterisation and identification tools and assess their advantages and disadvantages when used in different scenarios.
  4. Compare and contrast the differences in running characterisation and identification tools both stand-alone and within workflows.
  5. Envisage a system that combines workflows with identification, characterisation and validation tools to suit a variety of scenarios.
  6. Conduct an in-depth analysis of large volumes of identification and characterisation data and find representative sample records suitable for preservation planning experiments.

Session Plans:

6 December 

Session One: 

Learning outcomes:

Distinguish between different file types and identify the requirements for characterising each.

Carry out identification and characterisation experiments on a set of example files.

Carry out duplicate detection experiments on a small set of sample files.

Time | Outline Plan/Teacher Activity | Attendee Activity | Resources | Speakers/Trainers
09.30 - 11.30


Set up the environment to run the tools
- check whether attendees can use the environment/tools

Introduce file formats

Introduce file format tools: file, DROID, Tika, ExifTool
- what can they do?

File Format Identification
File Format Characterisation
File Format Validation
File Format Signature Files

Basic identification experiments 

Running tools on sample files

Adding signatures to various tools


Access to a machine running all required tools.
VM, Taverna
CW, RG, AB
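The basic identification experiments above can be previewed with a minimal sketch of what signature-based tools such as file and DROID do internally: match a file's leading magic bytes against a signature table. The table below is a tiny illustrative subset invented for this sketch, not a real DROID signature file.

```python
# Minimal sketch of signature-based file identification, in the spirit of
# tools like file and DROID. The signature table is a tiny illustrative
# subset, not a real signature file.
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"PK\x03\x04": "application/zip",
}

def identify(path):
    """Return a MIME type guess based on leading magic bytes, or None."""
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, mime in SIGNATURES.items():
        if header.startswith(magic):
            return mime
    return None
```

Adding a signature to a real tool works analogously: extend the tool's signature file with a new magic sequence and the format it denotes.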
11.45 - 12.45
Demonstrations of the tools

Scenario - file set?
Command Line processing?
Presentation of the matchbox tool (10–15 min)
  Beamer
CW, RG, AB 
13.45 - 15.15
Introduce practical exercises

Create File Format profiles of a dataset using the various tools previously introduced.

Command Line processing - prepared scripts?

Complexity of processing files using tools. Consistency (or lack thereof) in tool output

Demonstration of the matchbox tool with practical exercises (10–15 min: analysis of the tool results for further processing or decision making)
Practical exercises / group work using the tools
Matchbox: complete workflows for
a) image duplicate search,
b) content-based image comparison,
c) customisation of the duplicate search workflow,
d) understanding and describing the outputs of different commands
Beamer
CW, RG, AB 
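As a warm-up for the duplicate search exercises, exact byte-level duplicates can be found by content hashing. This is a much simpler technique than matchbox's content-based image comparison (which also finds near-duplicates), and is sketched here only to make the baseline concrete.

```python
import hashlib
from collections import defaultdict

def find_exact_duplicates(paths):
    """Group files whose bytes are identical, keyed by SHA-256 digest.

    Only byte-identical duplicates are found this way; matchbox compares
    image *content*, so near-duplicates need other techniques.
    """
    groups = defaultdict(list)
    for path in paths:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        groups[h.hexdigest()].append(path)
    # Keep only groups that actually contain duplicates.
    return [g for g in groups.values() if len(g) > 1]
```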

Session Two:

Learning outcomes:

Critically evaluate a number of characterisation tools and assess their advantages and disadvantages in different scenarios.

Time | Outline Plan/Teacher Activity | Attendee Activity | Resources | Speakers/Trainers
15.30 - 17.30 Panel session: Introduce FITS and tool wrapping.

Discussion: What are the advantages and disadvantages of wrapping tools?

 
PP, CW, DT
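The wrapping idea behind FITS can be sketched in three steps: invoke a tool, capture its free-text output, and normalise it into a common schema. The sketch below wraps the Unix `file` command; the dict schema is made up for illustration and is not FITS's actual XML output format.

```python
import subprocess

def run_file_tool(path):
    """Invoke the Unix `file` tool in MIME mode and return its raw stdout."""
    out = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

def normalise(tool, path, raw_output):
    """Map a tool's raw output into a common schema, the core idea of
    wrappers like FITS (which additionally reconciles several tools)."""
    return {"tool": tool, "path": path, "mimetype": raw_output.strip()}

def wrap_file_tool(path):
    return normalise("file", path, run_file_tool(path))
```

A discussion point this makes visible: the wrapper hides tool invocation details, but every output format change in the wrapped tool breaks the parsing step.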

7 December

Session Three: Workflows

Learning outcomes:

Compare and contrast the differences in running a number of characterisation tools both stand-alone and within Taverna workflows.

Envisage a system that combines workflows with identification, characterisation and validation tools to suit a number of scenarios.

Time | Outline Plan/Teacher Activity | Attendee Activity | Resources | Speakers/Trainers
09.15 - 11.00 Abstract:
Attendees will gain an insight into how to do basic web-archive content identification using file identification tools such as Apache Tika, embedded in a Taverna workflow on the one hand and run on a Hadoop cluster on the other.

Topics:
  • Introduction & demo: running Tika over ARC files using a Taverna workflow; show a DROID version of the workflow; compare results; show implementation details and differences
  • Short intro: Hadoop; map/reduce
  • Introduction: wrapping Tika in a map/reduce application using the Tika API
  • Demo: run the map/reduce application over ARC files
  • Compare performance using the results from the ONB experimental cluster
  • Show implementation details (code, tools, infrastructure) if needed and as time permits
  Required skills:
  • No Taverna / Hadoop knowledge is needed to follow and understand the concepts of both approaches.
  • Moderate Java knowledge is needed to follow the implementation details.
  Resources:
  • Beamer
  • VM image containing the Taverna workbench; Tika / DROID workflows; ARC.GZ sample files
  • Optional (if you want to let attendees play around with prepared Hadoop applications or try writing their own programs): VM image containing a Hadoop cluster installation in pseudo-distributed mode and e.g. the NetBeans IDE.
SS
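The map/reduce pattern introduced above can be sketched without Hadoop or the Tika API. Assuming identification has already produced (path, MIME type) records (an assumption for this sketch, not the session's actual Java code), the map step emits per-type counts and the reduce step aggregates them after grouping by key, mirroring what the Hadoop shuffle phase provides.

```python
from itertools import groupby
from operator import itemgetter

def mapper(records):
    """Map step: emit (mime_type, 1) for every (path, mime_type) record,
    analogous to identifying one file per map invocation."""
    for _path, mime in records:
        yield (mime, 1)

def reducer(pairs):
    """Reduce step: sort/group by key (the shuffle), then sum the counts
    for each MIME type."""
    for mime, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (mime, sum(count for _mime, count in group))
```

In Hadoop the same two functions run distributed over the cluster, which is where the performance comparison in the demo comes from.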
11.15 - 12.30 Practical session?      

Session Four: Content profiling and planning

Learning outcomes:

Conduct an in-depth analysis of large volumes of identification and characterisation data and find representative sample records suitable for preservation planning experiments.

Time | Outline Plan/Teacher Activity | Attendee Activity | Resources | Speakers/Trainers
13.30 - 14.00 Introduction to and motivation for large-scale content profiling for preservation analysis. Beamer PP/CB
14.00 - 14.15 Presentation of a scenario containing a (small) set of heterogeneous files (identification + characterisation data may be included as well). Attendees familiarise themselves with the content set.
No required skills
VM with files (e.g. part of govdocs) + characterisation tools + FITS ~
14.15 - 14.45 Analysis by the attendees of the given set without a content profiler. Any tools can be used to obtain an overview of the content at hand. This may include any of the identification and characterisation tools presented so far, as well as any other tool or combination of tools.
Obtain an overview of the content and
find representative samples
Knowledge of simple CLI tools may help, but is not necessary.
VM with files (e.g. part of govdocs) + characterisation tools + FITS ~
14.45 - 15.00 Discussion of the results and problems that occurred. What went well, and what did not? Discussion/Presentation of results Beamer, Flipchart/Whiteboard? ~
15.00 - 15.30 Presentation of c3po (a content profiling prototype) and demonstration of the tool and its capabilities   Beamer ~
15.30 - 16.00 Analysis by the attendees of the same data as before (with c3po). Obtain an overview of the content and find representative samples, as well as interesting facts about the content
VM with files (e.g. part of govdocs) + characterisation tools + FITS + c3po? ~
16.00 - 16.30 Discussion and comparison of the results with the previous iteration + Lessons Learned Discussion/Presentation of results Beamer ~
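One way to think about finding representative samples (a sketch of the general idea, not c3po's actual algorithm): partition the characterisation records by a few properties and take one record per partition. The field names and the format/validity partition below are assumptions for illustration; a real profile would consider many more properties.

```python
from collections import defaultdict

def representative_samples(records, key_fields=("format", "valid")):
    """Pick one record per distinct combination of the given properties.

    `records` is a list of dicts of characterisation output (e.g. derived
    from FITS results). Partitioning by format and validity is one
    plausible choice, made up for this sketch.
    """
    partitions = defaultdict(list)
    for rec in records:
        partitions[tuple(rec.get(k) for k in key_fields)].append(rec)
    # The first record of each partition serves as its representative.
    return {key: group[0] for key, group in partitions.items()}
```

Each representative can then feed a preservation planning experiment, standing in for all records in its partition.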