
Goals

To make some of the most commonly used digital preservation format identification and characterisation tools Hadoop friendly. The aim is not to reinvent the wheel: existing code will be used where possible, but some of these existing projects would benefit from better documentation and working use cases on a public continuous integration service.

The starting tool list is:

  • DROID
  • File
  • Tika
  • ExifTool
  • FITS

One of the more general problems the group aims to address is the inefficiency of spawning native shell processes from Java. This has traditionally been the means of calling tools like file. It doesn't scale well for Hadoop processing, and existing mitigation strategies, such as batching the shell processing so that one call handles many files, aren't entirely satisfactory. The group will look at making JNA calls instead to make the process more efficient.
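As a rough illustration of the JNA approach, the sketch below maps a small slice of the libmagic C API (the library behind file) so identification happens in-process instead of via a spawned file command. The class, interface and method names are illustrative assumptions rather than existing project code; it assumes JNA 5.x (Native.load) and an installed libmagic.

    import com.sun.jna.Library;
    import com.sun.jna.Native;
    import com.sun.jna.NativeLong;
    import com.sun.jna.Pointer;

    /** Sketch: in-process identification via libmagic and JNA, no shell spawn. */
    public class MagicDetector {

        /** JNA mapping of the small part of the libmagic C API used here. */
        public interface LibMagic extends Library {
            LibMagic INSTANCE = Native.load("magic", LibMagic.class);

            int MAGIC_MIME_TYPE = 0x000010; // report MIME type, not full description

            Pointer magic_open(int flags);
            int magic_load(Pointer cookie, String magicFile); // null = default magic database
            // size_t mapped as NativeLong; matches size_t on common Linux platforms
            String magic_buffer(Pointer cookie, byte[] buffer, NativeLong length);
            void magic_close(Pointer cookie);
        }

        /** Identify a byte buffer (e.g. the head of a record read from HDFS). */
        public static String detect(byte[] bytes) {
            Pointer cookie = LibMagic.INSTANCE.magic_open(LibMagic.MAGIC_MIME_TYPE);
            try {
                if (LibMagic.INSTANCE.magic_load(cookie, null) != 0) {
                    throw new IllegalStateException("Could not load magic database");
                }
                return LibMagic.INSTANCE.magic_buffer(cookie, bytes, new NativeLong(bytes.length));
            } finally {
                LibMagic.INSTANCE.magic_close(cookie);
            }
        }
    }

In a long-running Hadoop task the magic_open/magic_load step could be done once per JVM and reused, rather than repeated per record as in this simplified sketch.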

Another general issue is the lack of stream-based processing in some tools. DROID, file, and ExifTool in particular require a file on which to operate rather than a stream. Again, this is not efficient for many characterisation workflows on Hadoop; see the stream-based sketch below for contrast.
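For contrast, Tika already supports detection directly from an input stream via its facade class; a minimal sketch is below (the wrapper class and method names are illustrative, the Tika API calls are real).

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.Tika;

    /**
     * Sketch: stream-friendly identification with Tika's facade API. The
     * detector reads only as much of the stream as it needs, so the payload
     * never has to be written out to a temporary file first.
     */
    public class StreamDetection {

        private static final Tika TIKA = new Tika();

        /** Detect the MIME type of a record streamed out of HDFS, a web archive, etc. */
        public static String detect(InputStream stream, String fileNameHint) throws IOException {
            // The file name is only a hint; detection is driven by the stream content.
            return TIKA.detect(stream, fileNameHint);
        }
    }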
A list of specific goals for each tool is presented below:

DROID

File

Tika

ExifTool

Working plan
