2011-02-17 Dev8D Open Planets Foundation Challenge

Introduction

The summary of this event is available here: http://dev8d.org/challenges.html#openplanets

See also

Description

To make digital media management and preservation easier, it is very useful to have tools that can identify file formats quickly. To this end, The National Archives created the DROID tool, which uses a list of format 'signatures' (e.g. magic numbers) from the PRONOM database to identify files. More recently, this has led to the development of a more command-line-friendly version of the same tool called Fido, which is written in Python and uses regular expressions (derived from the DROID signatures). These tools can identify particular versions of formats and report the results in a machine-interpretable way, using the unique identifiers that have been assigned to the different format types.
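
To illustrate the idea of signature-based identification with regular expressions, here is a minimal sketch. It is not Fido's actual code or data model: the SIGNATURES table, the labels and the function names are assumptions made up for this example, and the labels are not PRONOM identifiers. Only the magic numbers themselves (PNG, PDF, ZIP) are the well-known ones.

```python
import re
from typing import List

# Hypothetical signature table: label -> byte regex anchored at the file start.
# The labels are illustrative only and are NOT PRONOM identifiers.
SIGNATURES = {
    "png": re.compile(rb"\A\x89PNG\r\n\x1a\n"),
    "pdf": re.compile(rb"\A%PDF-\d\.\d"),
    "zip": re.compile(rb"\APK\x03\x04"),
}


def identify(data: bytes) -> List[str]:
    """Return the labels of every signature matching the start of data."""
    return [label for label, pattern in SIGNATURES.items() if pattern.search(data)]


def identify_file(path: str, header_size: int = 64) -> List[str]:
    """Read only the first header_size bytes of a file and identify them."""
    with open(path, "rb") as handle:
        return identify(handle.read(header_size))
```

Because the patterns are ordinary regular expressions over bytes, a signature can express more than a fixed magic number, for example the version digits in a PDF header.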

The main challenge is to improve the coverage and quality of the signature files for Fido by creating new signatures for formats that the current set does not cover, or by adding more information to the existing set. To ensure the quality of the results, we will specify the required information (see below) and will rate the submissions according to how complete they are. At the end of the day we should have a much better set of signatures and format information that we can submit to PRONOM or elsewhere. To improve the existing records, you will just need a basic understanding of XML, but to hack signatures, you will need to be comfortable with regular expressions and the command line (for testing).

An alternative, parallel challenge is to improve Fido itself (this would require Python expertise), or to improve other identification tools in cunning or impressive ways.

Further details are supplied below, along with a few ideas to get you started.

Prizes

The prizes are as follows:

First Prize, Samsung Galaxy Tablet
Second Prize, Samsung Galaxy Player 50

The prizes will be awarded to the developers who create, update and/or review the most new signatures, or who make the most significant contribution to the development of format identification tools.

How To Take Part

To take part in the format record challenge, use the format record template from here:

https://github.com/openplanets/fido/blob/master/data/anjackson/format_extension_template.xml

or start with the existing, incomplete records found here:

https://github.com/openplanets/fido/blob/master/data/pronom/formats.xml

I've uploaded some examples (e.g. adding a reference document to the PNG 1.0 spec.) here:

https://github.com/openplanets/fido/tree/master/data/anjackson

https://github.com/openplanets/fido/blob/master/data/anjackson/format.png-1.0.xml

Having created or updated a record, the format signature can be tested using the Fido code. See here (https://github.com/openplanets/fido) for installation instructions, which also tell you how to use the '-loadformats' flag to test additional format identification records.

The new or updated format records will be ranked and scored as follows:

  • INCOMPLETE
    • SCORE 0: Information is insufficient for identification purposes.
  • STUB
    • SCORE 5: Suitable for identification, but containing no further descriptive information.
  • ADEQUATE
    • SCORE 7: Also contains basic descriptive information and at least one reference to relevant documentation or further information.
  • COMPLETE
    • SCORE 10: Also contains one or more example files that can be used for testing, under suitable license terms (ideally CC0).

The Fido repository will also include the script used to score the entries.
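
As a rough illustration of how the ranking above could be applied, here is a hypothetical sketch. It is not the scoring script mentioned above: the FormatRecord class and its field names are assumptions invented for this example and do not reflect the Fido or PRONOM XML schema. It also does not check license terms, which a real scorer would need to do for the COMPLETE rank.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FormatRecord:
    """Hypothetical, simplified view of a format record (not the real schema)."""
    signatures: List[str] = field(default_factory=list)
    description: str = ""
    references: List[str] = field(default_factory=list)
    example_files: List[str] = field(default_factory=list)


def score(record: FormatRecord) -> Tuple[str, int]:
    """Map a record onto the INCOMPLETE/STUB/ADEQUATE/COMPLETE ranking."""
    if not record.signatures:
        # Nothing to identify with.
        return ("INCOMPLETE", 0)
    if not (record.description and record.references):
        # Usable for identification, but no descriptive information.
        return ("STUB", 5)
    if not record.example_files:
        # Described and referenced, but no example files for testing.
        return ("ADEQUATE", 7)
    return ("COMPLETE", 10)
```

For example, a record with a signature, a description and one reference, but no example files, would be ranked ADEQUATE with a score of 7.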

...TBA...

python fido/fido.py -checkformats -loadformats data/anjackson/format_extension_template.xml -useformats fido-fmt/189.word

Submission

To submit your entries, either

To be made a member of the GitHub repository (for the duration of the challenge), or if you have any further questions, contact Andrew.Jackson [at] bl.uk.

Timeline

Ideas

Example ideas for new format records

  • Of the 736 PRONOM records, 591 of them are marked as 'outline' or 'in preparation'.
    • And 477 are incomplete, and can't even be used for identification (see figure on the right).
    • See this Google Spreadsheet for a breakdown.
  • We could look at the large file/libmagic signature set and look for important gaps, and port over the information.
  • Known gaps include:
    • Some Microsoft Office documents only identified as 'OLE Container' or similar, poor OOXML support (Transitional/Strict)
    • JAR files only recognised as containers (ZIP)
    • Plain text encodings, UTF-8 etc
    • Comma-Separated Values (CSV)
    • Database files.
    • See this comment on the OPF blog.
    • ...

Example ideas for tool improvement:

  • Port Fido to Jython, and perhaps integrate with the Planets tool suite.
  • Add Fido identification to JHOVE2.
  • Update DROID for speed, compatibility, utility, etc.
  • Make the 'file' command (http://darwinsys.com/file/) more useful for digital preservation, e.g. by making it support PRONOM IDs.
  • Improve Percipio, which started life at the previous OPF Hackathon and can be used to generate signatures from files of known type, and integrate it with Fido.
  • Improve the data model for format information.
    • The current model is closely based on what is defined by PRONOM, but perhaps that's not the right approach.
    • Do we need Format Families?
    • Do we need conformsTo, isMemberOf, isChildOf, or other relationships between formats?
    • Can we make clear distinctions between families, formats, encodings and character sets? MIME uses "text/plain;charset=utf-8" - where does that fit?
    • How can we be clear about the distinction between format conformance hierarchies (an ODF is also a ZIP), perhaps a.k.a. profiles (a PDF/A is a PDF 1.4), and non-conformance hierarchies (a PNG 1.0 is also a PNG)?
    • How much information do we really need to model formally - would a wiki be enough?
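
On the MIME question above, this small sketch shows how a MIME type string already separates the format ("text/plain") from the character set parameter ("utf-8"). It uses Python's standard email.message module. The split_mime function name is made up for this example, and the sketch says nothing about how the format data model should represent the distinction.

```python
from email.message import Message
from typing import Optional, Tuple


def split_mime(value: str) -> Tuple[str, Optional[str]]:
    """Split a MIME type string into (type/subtype, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_type() drops the parameters; get_param() reads one of them.
    return msg.get_content_type(), msg.get_param("charset")
```

Calling split_mime("text/plain;charset=utf-8") separates the two parts, which suggests that an encoding could be modelled as a parameter of a format rather than as a format in its own right.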