2011-08-31 OPF Cologne Hackathon

To sign up for a wiki account visit: http://jira.opf-labs.org/secure/Signup!default.jspa

Below is our preliminary agenda for the OPF Cologne Hackathon. In order to meet our members' needs, we invite you to comment and provide feedback on the topics that you would like to be addressed at this event.

For information about getting there, hotels and the event dinner visit: http://www.openplanetsfoundation.org/events/2011-08-31-opf-hackathon

Register at: http://opf-hackathon-cologne.eventbrite.com/

AGENDA

31 August: Coffee and registration from 9.30am. Sessions 10am - 4pm (Event dinner 7.30pm)

Morning

Welcome from Manfred Thaller

Introduction from Bram van der Werf

Demo of remote emulation and automation - Dirk von Suchodoletz and Isgandar Valizada, University of Freiburg

11.15 - 11.30 Coffee break

Emulation continued. Discussion and outlining use cases

13.00 - 14.00 Lunch

Afternoon

File identification tools - Maurice de Rooij

The session this afternoon will look at many of the available file identification tools, their features and usage. The main differences between the tools will be outlined, including performance and scalability aspects. The main outcome of this session will be a better understanding of the different tools and of why each one differs even though the core functionality remains the same.

The second session will begin with a recap of the tools' current support level and emphasise the importance of this in a changing environment of file types. This session will be more open, with people investigating the tools themselves and then discussing the relative merits of each. Finally, a discussion will be held around the requirements and shortcomings of each tool, which should lead to short-term actions for the OPF community.

15.15 - 15.30 Coffee break 

The OPF-REF

This session will build on the preceding file identification sessions and introduce the OPF Results Evaluation Framework (OPF-REF). The OPF-REF has been designed to objectively compare preservation tools against a large corpus of test data. All results will also be published in raw form according to Linked Data guidelines, allowing easy re-processing and further experiments to be carried out.

Late Afternoon Session

This is an open session focussing on the subjects of the first day. There will be opportunities here for ad-hoc focus groups to form and chat/hack on the REF and file evaluation tools. 

16.00 Close

19.30 Event dinner

1 September: 9am - 5pm

09.00 - 09.15 Coffee

09.15 Morning session

Testbed Emulation demo - Isgandar Valizada, University of Freiburg

KEEP presentation - Antonio Ciuffreda, University of Portsmouth

11.15 - 11.30 Coffee break

Developers - Hacking (OPF REF) facilitated by David Tarrant, or Emulation facilitated by Dirk/Isgandar

Practitioners - Use cases and requirements session facilitated by Bram van der Werf

13.00 - 14.00 Lunch

14.00 Afternoon session

Database Archiving

Demo of the Danish State Archive of a SIARD implementation

(15.00 - 15.15 Coffee break)

15:15-17:00 Break-out sessions hacking (REF & Emulation Service), requirement discussions

17.00 Close

2 September: 9am - 4pm

09.00 - 09.15 Coffee

09.15 Morning session

Plato Demonstration - Carl Wilson

This morning's sessions will focus on preservation planning, starting with a demonstration of the Plato tool. Over the years the Plato tool has been developed substantially as part of many research projects and, as such, has become the most advanced, and perhaps the most complex, tool available for preservation planning. The idea of this first session is to introduce the key stages of the Plato tool to give people a grasp of the important aspects of preservation planning.

11.00 - 11.15 Coffee break

Discussion and hacking on XML schemas

12.30 - 13.30 Lunch

Afternoon

Session 2 will introduce the complexity of the Plato tool and address some of the maintainability issues based upon experience within the OPF project. This will lead on to a discussion focussing on how Plato does not have to be used for every stage of the planning process. The XML preservation plans mean that Plato can inter-operate with other tools, which can be developed on a smaller scale in order to allow easier maintenance.

14.45 - 15.00 Coffee break

Late afternoon

Closing thoughts and Panel

Subject to time, this last session will allow more open discussion/hacking followed by a short closing session. 

16.00 Close

Requirements & Hack Session: Results Evaluation Framework.

  • DROID
  • FIDO
  • File Utility
  • Automated execution of format identification tools against corpora
  • Storage of results in a consistent and suitable form
  • A way of comparing results; to start with, this can just be comparing a tool against itself (e.g. DROID 6 with DROID 4).
  • A corpus of old results, as large as possible; e.g. during development we might start to gather results for old versions of the software and their supporting signature / magic files.
  • Results identifiable not just by tool and version, but also by supporting data, e.g. not "DROID v4" but "DROID v4 SigFile v24".

The hacking challenge is to first put a framework in place that can achieve the above.
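
As a starting point for discussion, the requirements above could be sketched roughly as follows. This is a minimal illustration, not the OPF-REF design: the names ResultSet, identify_corpus, save, load and compare are hypothetical, the JSON layout is an assumption, and the raw standard output of each tool is stored as the result without any parsing.

```python
import json
import subprocess
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ResultSet:
    """Identification results for one tool configuration (hypothetical layout)."""
    tool: str                # e.g. "DROID"
    tool_version: str        # e.g. "4"
    signature_version: str   # version of the supporting signature / magic file
    results: dict = field(default_factory=dict)  # relative path -> tool output

    @property
    def key(self):
        # Results are identified by tool, version AND supporting data.
        return f"{self.tool} v{self.tool_version} SigFile v{self.signature_version}"


def identify_corpus(corpus_dir, command, result_set):
    """Run `command + [path]` for every file under corpus_dir and record stdout."""
    root = Path(corpus_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        proc = subprocess.run(command + [str(path)], capture_output=True, text=True)
        result_set.results[path.relative_to(root).as_posix()] = proc.stdout.strip()
    return result_set


def save(result_set, out_path):
    """Store a result set as JSON so that old runs can be kept and re-compared."""
    payload = {
        "tool": result_set.tool,
        "tool_version": result_set.tool_version,
        "signature_version": result_set.signature_version,
        "results": result_set.results,
    }
    Path(out_path).write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")


def load(in_path):
    """Read back a result set written by save()."""
    return ResultSet(**json.loads(Path(in_path).read_text(encoding="utf-8")))


def compare(old, new):
    """Return {path: (old_result, new_result)} for every file whose result differs."""
    paths = set(old.results) | set(new.results)
    return {
        p: (old.results.get(p), new.results.get(p))
        for p in sorted(paths)
        if old.results.get(p) != new.results.get(p)
    }
```

For example, identify_corpus("corpus", ["file", "--brief"], ResultSet("file", "<version>", "<magic version>")) would record the output of the file utility for every file in a directory named corpus; the version strings have to be supplied by whoever runs the tool. Comparing a tool against itself then amounts to calling compare on two stored result sets.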

Further requirements will come from the techies doing the hacking and from the content owners.

Requirements & Roadmap for Characterisation Tools

Broadly, where should we put the available effort when it comes to characterising content?

  • Should we extend XCL, or focus on JHOVE2?
    • Which platform do we have the most expertise with (i.e. is the Qt/C++ that XCL depends on a problem?)
    • Can we merge the two efforts?
  • Where does FITS fit in? Should we even be writing our own tools when we have alternatives that others use, e.g. Apache Tika?

Tools to be aware of:

  • Developed by the preservation community:
    • DROID
    • Fido
    • FITS
    • JHOVE
    • JHOVE2
    • XCL Tools (from Planets)
    • NZ Metadata Extractor
  • Developed by others
    • file
    • JMimeMagic (a Java implementation covering a good chunk of file's functionality)
    • Apache Tika
    • And many more, for specific media types, e.g. Apache PDFBox.

Some proposed requirements

  • Identify single bitstreams
  • Identify container bitstreams and go inside.
  • Identify aggregate objects (n bitstreams in an arrangement)
  • Identify non-bitstream objects (folders, URIs, NULL?)
  • Identify XML, its root schema, namespace, encoding, and encodings on the inside.
  • Combine identification tools and resolve the differences (FITS?) 
  • Extract summary properties that describe the object encoding (FITS?) 
  • Perform deep analysis of significant characteristics (JHOVE2)
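
To make the XML requirement above more concrete, here is a minimal sketch using only the Python standard library. The function name describe_xml and the shape of the returned dictionary are hypothetical; the declared encoding is read with a simple regular expression that only works for ASCII-compatible encodings, and "encodings on the inside" (embedded content) are not handled.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of the XML Schema instance attributes (xsi:schemaLocation).
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
_DECL = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def describe_xml(data: bytes) -> dict:
    """Return declared encoding, root element, namespace and schema location."""
    # Strip a UTF-8 byte order mark, if present, before looking for the declaration.
    match = _DECL.match(data.lstrip(b"\xef\xbb\xbf"))
    encoding = match.group(1).decode("ascii") if match else None

    root = ET.fromstring(data)
    # ElementTree writes namespaced tags as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, local_name = root.tag[1:].split("}", 1)
    else:
        namespace, local_name = None, root.tag

    return {
        "encoding": encoding,          # None if no encoding is declared
        "root_element": local_name,
        "namespace": namespace,
        "schema_location": root.get(f"{{{XSI}}}schemaLocation"),
    }
```

A fuller tool would also need to detect UTF-16 input and resolve the schema itself; this sketch only reports what the document declares about itself.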

Database Archiving Requirements

PLATO demo

A demonstration of the Plato tool. Over the years the Plato tool has been developed substantially as part of many research projects and, as such, has become the most advanced, and perhaps the most complex, tool available for preservation planning. The idea of this first session is to introduce the key stages of the Plato tool to give people a grasp of the important aspects of preservation planning.

Format identification tools (DROID and file) training

Does this match with the expected audience?

  • History of the tools and where to obtain them
  • Installation
  • How to run the tools
  • Command line tools training, piping of input, redirection of error and output, using batch files
  • What to do with the results
  • Where the REF fits in

Emulation Demo

Demo of remote emulation and automation (University of Freiburg)

The first three demos are the vncrecord, vncreplay and qemu web services deployed within Planets. They implement the CreateView interface. In each case they will be consumed through JSP web pages, which we will access through a web browser.

  1. Using vncrecord, we will demonstrate the ability to record user input events performed on an emulated OS image. For example, we will record a user session that accesses an injected obsolete digital object (DO) using the corresponding application pre-installed in the chosen OS image.
  2. In the second step, the vncreplay service will reproduce these input events (under the same conditions, but with different DOs) unattended, as if they were performed by the user.
  3. The qemu service only demonstrates the possibility of remote emulation (similar to our earlier GRATE, but now Planets-compatible).

The last demo is GRATE-R, a service implementing the Migrate interface, also deployed within Planets. This service accepts a DO and uses a prerecorded user session to convert the DO to a format of interest. Currently we support SAM > PDF + TXT and WPD > RTF format migration. Testing will be done through the Testbed by assigning the corresponding experiment and connecting through the RFB protocol in order to see the conversion in real time.
