h3. Event Evaluation Survey

Please complete the event evaluation survey at: [http://www.surveymonkey.com/s/TCZHBTD|http://www.surveymonkey.com/s/TCZHBTD]

h4. Aim

This training course covers the scalable identification, characterisation and validation of large collections of varying file types. Users will be introduced to a number of tools designed for each of these purposes and will take part in problem-solving scenarios. Further, users will be required to evaluate the use of scalable and cloud-based technologies in developing solutions for given scenarios.

h4. Learning outcomes (by the end of the training event attendees will be able to):

{panel}
# Distinguish between different file types and identify the requirements for characterising each of them.
# Carry out a number of identification, characterisation, and duplication detection experiments on example files.
# Critically evaluate characterisation and identification tools and assess their advantages and disadvantages when used in different scenarios.
# Compare and contrast the differences in running characterisation and identification tools both stand-alone and within workflows.
# Envisage a system that combines workflows with identification, characterisation and validation tools to suit a variety of scenarios.
# Conduct an in-depth analysis of large volumes of identification and characterisation data and find representative sample records suitable for preservation planning experiments.{panel}


h4. Thursday 6 December (the agenda is subject to change)



|| Time || Session || Facilitator || Learning outcomes ||
| 09.30 - 10.00 | Registration | | |
| 10.00 - 10.15 | Welcome and housekeeping | Miguel Ferreira, KEEPS | |
| 10.15 - 11.15 | *Introduction to file formats* \\
Understanding the different requirements for identification and characterisation experiments \\
\\
*File format identification and characterisation tools: file, fido, tika, exiftool* \\
What can they do? \\
File Format Identification, File Format Characterisation, \\
File Format Validation, File Format Signature Files \\ | \\
Carl Wilson, OPF \\
Dave Tarrant, OPF | 1 |
| 11.15 - 11.30 | Coffee | | |
| 11.30 - 12.45 | *Applying file format tools to different scenarios* (demonstrations) \\
How do they compare? | Carl Wilson, OPF  \\
Dave Tarrant, OPF | 1 |
| 12.45 - 13.45 | Lunch | | |
| 13.45 - 15.15 | *Break out groups: practical exercises* \\
Creating file format profiles with an example dataset \\
Command line processing \\
\\
Evaluation of the results \\ | Carl Wilson, OPF \\
Dave Tarrant, OPF \\
\\ | 2 |
| 15.15 - 15.30 | Coffee | | |
| 15.30 - 16.30 | *Wrapping tools for identification and characterisation* \\
FITS (File Information Tool Set)  \\
\\
*Panel session: advantages and disadvantages of wrapping tools* \\
Q&A | Petar Petrov, TUWIEN \\
\\
\\
All | 3 \\ |
| 16.30 - 17.00 | Wrap up | Dave Tarrant, OPF | |
| 17.00 | Close | | |
| 20.00 | Event dinner | | |

h4. Open Feedback - Day 1

* Need to provide cheat sheets for each tool so people can choose the options and experiments they wish to run.
* The datasets need to be unzipped in the virtual machines, ready for instant use.
* Need to focus a little more closely on what data the different tools output for different formats.
* Need to get to the quickscripts that use some of the tools in more complex ways to produce summaries in Excel.
* Need to put FITS on the machines and allow people to look at a FITS profile.
* The discussion at the end of the day was effective and a good result.


h4. Friday 7 December

|| Time || Session || Facilitator || Learning outcomes ||
| 09.15 - 09.30 | Welcome back, overview of agenda for the day | Dave Tarrant, OPF | |
| 09.30 - 10.15 | *Content profiling and planning*   \\
Introduction and motivation of large-scale content profiling for preservation analysis | Petar Petrov, TUWIEN \\ | 5 |
| 10.15 - 10.45 | *Practical exercise:* analysing an example scenario file set without a content profiler   \\
Discussion of results | Petar Petrov, TUWIEN \\ | 6 |
| 10.45 - 11.00 | Coffee | | |
| 11.00 - 11.30 | *c3po* (A content profiling prototype) demonstration of the tool and its capabilities \\ | Petar Petrov, TUWIEN | 6 |
| 11.30 - 12.00 | *Practical exercise:* analysing the scenario file set using c3po   \\
Comparing the results and lessons learned | Petar Petrov, TUWIEN | 6 |
| 12.00 - 12.30 | *Quality control for digital collections: the matchbox tool*   \\
Identifying duplicate images in digital collections | Roman Graf, AIT | 4 |
| 12.30 - 13.30 | Lunch and presentation of certificates | | |
| 13.45 - 15.15 | *Using file format identification tools as part of a workflow* \\
Introduction to Taverna workflows \\
Demonstration: Web archive content identification over ARC files \\
using tika in a Taverna workflow \\ | Sven Schlarb, ONB | 4 |
| 15.15 - 15.30 | Coffee | | |
| 15.30 - 16.30 | *Comparing the Taverna workflow with a DROID version of the workflow* \\
Introduction to file format identification using a Hadoop cluster (demonstration) \\
Understanding the implementation differences \\ | Sven Schlarb, ONB \\ | 4 \\ |
| 16.00 - 16.30 | Comparison of results | Sven Schlarb, ONB \\ | |
| 16.30 - 17.00 | Wrap up discussion and event evaluation | Dave Tarrant, OPF \\ | |
| 17.00 | Close | | |

h4. Open Feedback - Day 2

* Need to make sure data is pre-loaded into the VMs. Also, the contents of data.zip (e.g. documents, images, etc.) didn't seem to hold any significance.
* A recap was good, but needs to be more in line with day 1.
* c3po needs a bit of work to ensure it is running and it is clear how to find a collection. 
* The graphs and discussion on why the system is modular worked well. 
* Perhaps matchbox does have a practical exercise, maybe something to add to the machine images for next time. 
