Thursday 6 December - Friday 7 December
See the session plan example and agenda structure for guidance.
Learning Outcomes (by the end of the session the attendees will be able to):
- Distinguish between different file types and identify the requirements for characterising each of them.
- Carry out a number of identification, characterisation, and duplication detection experiments on example files.
- Critically evaluate characterisation and identification tools and assess their advantages and disadvantages when used in different scenarios.
- Compare and contrast the differences in running characterisation and identification tools both stand-alone and within workflows.
- Envisage a system that combines workflows with identification, characterisation and validation tools to suit a variety of scenarios.
- Conduct an in-depth analysis of large volumes of identification and characterisation data and find representative sample records suitable for preservation planning experiments.
Session Plans:
6 December
Session One:
Learning outcomes:
Distinguish between different file types and identify the requirements for characterising each.
Carry out a number of identification and characterisation experiments on example files.
Carry out a number of duplicate detection experiments on a small set of sample data.
Time | Outline Plan/Teacher Activity | Attendee Activity | Resources | Speakers/Trainers |
---|---|---|---|---|
09.30 - 11.30 | Set up the environment to run the tools (can attendees use the environment and tools?). Introduce file formats. Introduce file format tools (file, DROID, Tika, ExifTool): what can they do? File format identification; file format characterisation; file format validation; file format signature files. | Basic identification experiments; running tools on sample files; adding signatures to various tools. | Access to a machine running all required tools (VM, Taverna). | CW, RG, AB |
11.45 - 12.45 | Demonstrations of the tools. Scenario: file set? Command-line processing? Presentation of the matchbox tool (10–15 min). | | Beamer | CW, RG, AB |
13.45 - 15.15 | Introduce practical exercises. Create file format profiles of a dataset using the tools previously introduced. Command-line processing: prepared scripts? Complexity of processing files using tools. Consistency (or lack of it) in tool output. Demonstration of the matchbox tool with practical exercises (10–15 min; analysis of the tool results for further processing or decision making). | Practical exercises / group work using the tools. Matchbox: complete some workflows for a) image duplicate search, b) content-based image comparison, c) customising the duplicate search workflow, d) understanding and describing the outputs of different commands. | Beamer | CW, RG, AB |
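
The basic identification experiments above rest on signature matching: a tool compares the leading bytes of a file against known byte patterns. The following is a minimal sketch in Python of that idea, not the method used by file, DROID or Tika. The `SIGNATURES` table and the function names `identify_bytes` and `identify_file` are hypothetical and deliberately tiny; real signature files cover far more formats and also match at offsets other than the start of the file.

```python
# Hypothetical, deliberately small signature table: leading "magic" bytes -> label.
# Real tools ship much larger signature files; this is for illustration only.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_bytes(data: bytes) -> str:
    """Return the label of the first signature that the data starts with."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first bytes of a file and identify them."""
    with open(path, "rb") as handle:
        return identify_bytes(handle.read(16))
```

Calling `identify_file` on each sample file and comparing the answers with the output of the tools introduced in the session is one way to see where simple prefix matching falls short, for example with container formats that all begin with the ZIP signature.
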
Session Two:
Learning outcomes:
Critically evaluate a number of characterisation tools and their advantages and disadvantages in different scenarios.
Time | Outline Plan/Teacher Activity | Attendee Activity | Resources | Speakers/Trainers |
---|---|---|---|---|
15.30 - 17.30 | Panel session: Introduce FITS and tool wrapping. Discussion: What are the advantages and disadvantages of wrapping tools? | | | PP, CW, DT |
7 December
Session Three: Workflows
Learning outcomes:
Compare and contrast the differences in running a number of characterisation tools both stand-alone and within Taverna workflows.
Envisage a system that combines workflows with identification, characterisation and validation tools to suit a number of scenarios.
Time | Outline Plan/Teacher Activity | Attendee Activity | Resources | Speakers/Trainers |
---|---|---|---|---|
09.15 - 11.00 | Abstract: Attendees will gain insight into basic web archive content identification using file identification tools such as Apache Tika, embedded in a Taverna workflow on the one hand and running on a Hadoop cluster on the other. Topics: | Required skills: | | SS |
11.15 - 12.30 | Practical session? |
Session Four: Content profiling and planning
Learning outcomes:
Conduct in-depth analysis over large volumes of identification & characterisation data and find representative sample records suitable for preservation planning experiments
Time | Outline Plan/Teacher Activity | Attendee Activity | Resources | Speakers/Trainers |
---|---|---|---|---|
13.30 - 14.00 | Introduction and motivation of large-scale content profiling for preservation analysis. | | Beamer | PP/CB |
14.00 - 14.15 | Presentation of a scenario containing a (small) set of heterogeneous files (identification and characterisation data may be included as well). | Become familiar with the content set. No skills required. | VM with files (e.g. part of govdocs) + characterisation tools + FITS | ~ |
14.15 - 14.45 | Analysis of the given set by the attendees without a content profiler. Any tools can be used to obtain an overview of the content at hand. This may include any of the identification and characterisation tools presented so far, but also any other tool or combination of tools. | Obtain an overview of the content and find representative samples. Knowledge of simple CLI tools may help but is not necessary. | VM with files (e.g. part of govdocs) + characterisation tools + FITS | ~ |
14.45 - 15.00 | Discussion of the results and the problems that occurred. What went well? What did not go so well? | Discussion/Presentation of results | Beamer, Flipchart/Whiteboard? | ~ |
15.00 - 15.30 | Presentation of c3po (a content profiling prototype) and demonstration of the tool and its capabilities. | | Beamer | ~ |
15.30 - 16.00 | Analysis by the attendees of the same data as before (with c3po). | Obtain an overview of the content and find representative samples, as well as interesting facts about the content. | VM with files (e.g. part of govdocs) + characterisation tools + FITS + c3po? | ~ |
16.00 - 16.30 | Discussion and comparison of the results with the previous iteration + Lessons Learned | Discussion/Presentation of results | Beamer | ~ |
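
For the manual analysis step (without a content profiler), attendees may find it helpful to see how little code a first overview needs. The sketch below is an assumption-laden illustration in Python, not how c3po works: it takes hypothetical `(name, format, size)` records, such as might be extracted from identification output, counts the files per format, and picks the file of median size in each format as a crude "representative" sample. The function name `profile` and the median-size heuristic are choices made for this example only.

```python
from collections import defaultdict


def profile(records):
    """Summarise (name, format, size_in_bytes) records per format.

    Returns a dict mapping each format to its file count and to one
    "representative" file, chosen here as the file of median size.
    This heuristic is an assumption for illustration only.
    """
    groups = defaultdict(list)
    for name, fmt, size in records:
        groups[fmt].append((size, name))

    summary = {}
    for fmt, items in groups.items():
        items.sort()  # order by size, then by name
        _, median_name = items[len(items) // 2]
        summary[fmt] = {"count": len(items), "representative": median_name}
    return summary


if __name__ == "__main__":
    # Made-up example records, not taken from any real data set.
    sample = [
        ("a.pdf", "PDF", 100),
        ("b.pdf", "PDF", 300),
        ("c.pdf", "PDF", 200),
        ("d.png", "PNG", 50),
    ]
    for fmt, info in sorted(profile(sample).items()):
        print(fmt, info["count"], info["representative"])
```

Comparing such a hand-made summary with the results obtained from c3po in the second iteration gives a concrete basis for the closing discussion.
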