

The aim is to make some of the most commonly used digital preservation format identification and characterisation tools Hadoop-friendly. The intent is not to reinvent the wheel: existing code will be used where possible. Some of these existing projects would, however, benefit from better documentation and working use cases on a public continuous integration service.

The starting tool list is:

  • File
  • Tika
  • ExifTool
  • FITS

One of the more general problems the group aims to address is the inefficiency of spawning native shell processes from Java, which has traditionally been the means of calling tools like file. This does not scale well for Hadoop processing, and existing mitigation strategies, such as batching shell calls so that one call processes many files, are not entirely satisfactory. The group will look at making JNA calls instead to make the process more efficient.
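For reference, the batching mitigation mentioned above can be sketched roughly as follows. This is only an illustration of the shell-based approach, not the JNA approach the group intends to investigate (which would need the third-party JNA library and a native libmagic). The class name BatchedFileCall and its methods are hypothetical; the sketch assumes a file binary on the PATH that accepts --brief and --mime-type and prints one line per input path, and it needs Java 11 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: identifies many files with a single `file` process.
class BatchedFileCall {

    // Build one command line covering every path, so only one process is spawned.
    static List<String> buildCommand(List<String> paths) {
        List<String> cmd = new ArrayList<>();
        cmd.add("file");
        cmd.add("--brief");      // print only the result, no file name prefix
        cmd.add("--mime-type");  // print a MIME type rather than a description
        cmd.addAll(paths);
        return cmd;
    }

    // Pair each input path with the output line at the same position.
    static Map<String, String> pair(List<String> paths, List<String> lines) {
        if (paths.size() != lines.size()) {
            throw new IllegalArgumentException(
                "expected " + paths.size() + " lines, got " + lines.size());
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < paths.size(); i++) {
            result.put(paths.get(i), lines.get(i).trim());
        }
        return result;
    }

    // Spawn the process once and collect one result per path.
    static Map<String, String> identify(List<String> paths)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(paths))
            .redirectError(ProcessBuilder.Redirect.DISCARD)
            .start();
        String out = new String(p.getInputStream().readAllBytes(),
                                StandardCharsets.UTF_8);
        p.waitFor();
        List<String> lines = out.lines().collect(Collectors.toList());
        return pair(paths, lines);
    }
}
```

Even in this form the process start-up cost is only amortised, not removed, and every input must still exist as a file on the local file system, which is part of why batching is not entirely satisfactory for Hadoop jobs.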

Another general issue is the lack of stream-based processing in some tools. DROID, file, and ExifTool in particular require a file on which to operate rather than a stream. Again, this is inefficient in the many characterisation situations where the content to be characterised is held within some kind of container format, e.g. zip or WARC files. In these cases it is more efficient to read the stream from the container and characterise it directly, rather than having to serialise it to a new file.
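As a rough illustration of the stream-based approach, the sketch below walks the entries of a zip container and identifies each one directly from the container stream, without writing any entry to a temporary file. It uses only the Java standard library (Java 9 or later). The class StreamIdentifier and its three-entry signature table are hypothetical stand-ins for a real identification tool, which would apply far richer rules.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical stand-in for a stream-capable identification tool.
class StreamIdentifier {

    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] PDF = "%PDF".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GIF = "GIF8".getBytes(StandardCharsets.US_ASCII);

    // Identify content from its leading bytes only; the stream is not closed.
    static String identify(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = in.readNBytes(head, 0, head.length);
        if (startsWith(head, n, PNG)) return "image/png";
        if (startsWith(head, n, PDF)) return "application/pdf";
        if (startsWith(head, n, GIF)) return "image/gif";
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, int len, byte[] magic) {
        if (len < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    // Characterise every entry straight from the zip stream, in entry order.
    static Map<String, String> identifyZipEntries(InputStream zipStream)
            throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                // zin now reads the current entry's bytes; getNextEntry()
                // skips whatever identify() leaves unread.
                result.put(entry.getName(), identify(zin));
            }
        }
        return result;
    }
}
```

Because identify only takes an InputStream, the same method could in principle be fed records read from a WARC file or from HDFS, provided a suitable reader supplies the stream.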

A list of specific goals for each tool is presented below:





Working plan
