
h1. Goals

To make some of the most commonly used digital preservation format identification and characterisation tools Hadoop-friendly. The aim is not to reinvent the wheel: existing code will be used where possible, but some of these existing projects would benefit from better documentation and working use cases on a public continuous integration service.

The starting tool list is:

* File
* Tika
* ExifTool

One of the more general problems the group aims to address is the inefficiency of spawning native shell processes from Java. This has traditionally been the means of calling tools like file, but it doesn't scale well for Hadoop processing, and existing mitigation strategies, such as batching shell invocations so that one call processes many files, aren't entirely satisfactory. The group will look at making JNA calls instead to make the process more efficient.
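As a sketch of the JNA approach, a minimal binding to libmagic (the library behind file) might look like the code below. This is illustrative rather than the group's actual code: the class name and `identify` helper are hypothetical, and it assumes JNA and a system libmagic are available.

```java
import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;

// Sketch of calling libmagic directly via JNA instead of spawning
// the `file` command as a sub-process for every identification.
public class MagicJna {

    // Hypothetical minimal binding; function names follow magic.h.
    public interface LibMagic extends Library {
        LibMagic INSTANCE = Native.load("magic", LibMagic.class);
        int MAGIC_MIME_TYPE = 0x000010; // report MIME type, not description

        Pointer magic_open(int flags);
        int magic_load(Pointer cookie, String magicFile);
        String magic_file(Pointer cookie, String path);
        void magic_close(Pointer cookie);
    }

    public static String identify(String path) {
        Pointer cookie = LibMagic.INSTANCE.magic_open(LibMagic.MAGIC_MIME_TYPE);
        try {
            // A null path loads the default system magic database.
            if (LibMagic.INSTANCE.magic_load(cookie, null) != 0) {
                throw new IllegalStateException("could not load magic database");
            }
            return LibMagic.INSTANCE.magic_file(cookie, path);
        } finally {
            LibMagic.INSTANCE.magic_close(cookie);
        }
    }

    public static void main(String[] args) {
        System.out.println(identify(args[0]));
    }
}
```

Because the library stays loaded in-process, a long-running Hadoop task pays the start-up cost once rather than per file.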

Another general issue is the lack of stream-based processing in some tools. DROID, file, and ExifTool in particular require a file to operate on rather than a stream. Again, this is inefficient in situations where the content to be characterised is held within some kind of container format, e.g. zip or WARC files. In these cases it is more efficient to read each stream from the container and characterise it directly, rather than having to serialise it to a new file first.
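As a minimal sketch of what stream-based container processing looks like with a tool that does support it, the following walks a zip file and identifies each entry from its leading bytes using Tika's detection facade, with no temporary files. It assumes tika-core is on the classpath; the class name is illustrative.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import org.apache.tika.Tika;

// Sketch: identify every entry of a zip container straight from the
// stream, without serialising entries out to temporary files.
public class ZipStreamIdentify {
    public static void main(String[] args) throws IOException {
        Tika tika = new Tika();
        try (ZipInputStream zip = new ZipInputStream(
                new BufferedInputStream(new FileInputStream(args[0])))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                // Read at most 64 KB of the current entry (Java 11+);
                // plenty for magic-number-based detection.
                byte[] head = zip.readNBytes(64 * 1024);
                System.out.println(entry.getName() + "\t"
                        + tika.detect(head, entry.getName()));
            }
        }
    }
}
```

The same pattern applies to WARC records: read the record payload as a stream, pass its leading bytes to the identifier, and move on to the next record.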

A list of specific goals for each tool is presented below:

h2. DROID
Performing DROID format identification on single streams is not always easy. Look at the nanite code, which addresses some of these difficulties; document the use of nanite and provide an exemplar. The nanite code cannot currently perform container characterisation, and the group would like to see whether this is possible.

h2. File
File suffers from two inefficiency issues: the need to create a shell sub-process, and the requirement to operate on a file instance.
h2. Tika
Tika is pure Java and provides a stream-based API, so it shouldn't require additional work.
h2. Exiftool
ExifTool is invoked on the command line (this is how FITS calls it) and requires a file instance.
h2. FITS
FITS is a Java application, but it makes multiple command-line calls on each individual file. It could be made more efficient by patching in fixes to the tools above.

h1. Results

Rather than working directly on Hadoop tasks, the group has mainly worked on enhancing the nanite and FITS code bases.



h2. File
We borrowed the JHOVE2 JNA wrapper for file, removed the JHOVE dependencies, and added one or two convenience methods. The initial code has been placed in a GitHub repository. For Java developers this offers the following advantages:

* A direct call to the libmagic library, avoiding the inefficiency of spawning a sub-process for magic identification.
* The possibility of performing stream based magic identification, meaning streams from container formats do not need serialising to temporary files.
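The second point can be sketched as follows: read the leading bytes of a stream (e.g. a zip or WARC entry) and hand them to libmagic's magic_buffer, so nothing is written to a temporary file. The binding and class below are hypothetical minimal examples (function names follow magic.h), assuming JNA and a system libmagic.

```java
import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;
import java.io.IOException;
import java.io.InputStream;

// Sketch of stream-based magic identification: identify content from
// a buffer of leading bytes rather than from a file on disk.
public class MagicStream {

    // Hypothetical minimal JNA binding; names follow magic.h.
    public interface LibMagic extends Library {
        LibMagic INSTANCE = Native.load("magic", LibMagic.class);
        int MAGIC_MIME_TYPE = 0x000010;

        Pointer magic_open(int flags);
        int magic_load(Pointer cookie, String magicFile);
        String magic_buffer(Pointer cookie, byte[] buffer, long length);
        void magic_close(Pointer cookie);
    }

    // Identify a stream from its first bytes; most magic rules only
    // need the start of the content anyway.
    public static String identify(InputStream in) throws IOException {
        byte[] head = in.readNBytes(8192); // Java 11+
        Pointer cookie = LibMagic.INSTANCE.magic_open(LibMagic.MAGIC_MIME_TYPE);
        try {
            if (LibMagic.INSTANCE.magic_load(cookie, null) != 0) {
                throw new IllegalStateException("could not load magic database");
            }
            return LibMagic.INSTANCE.magic_buffer(cookie, head, head.length);
        } finally {
            LibMagic.INSTANCE.magic_close(cookie);
        }
    }
}
```

The stream here can come straight from a container reader, so entries never need serialising to temporary files before identification.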