Version 2 by Carl Wilson
on Dec 03, 2013 09:07.

compared with
Version 3 by Carl Wilson
on Dec 03, 2013 09:09.

One of the more general problems the group aims to address is the inefficiency of spawning native shell processes from Java, which has traditionally been the means of calling tools such as file. This approach doesn't scale well for Hadoop processing, and existing mitigation strategies, such as batching shell calls so that one call processes many files, aren't entirely satisfactory. The group will look at making JNA (Java Native Access) calls instead to make the process more efficient.
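
To make the batching mitigation concrete, here is a minimal sketch of what it might look like. It is an illustration only, not the group's code: the class name BatchFileIdentifier, its methods, and the batch size are all hypothetical, and it assumes a file binary on the PATH that supports the --mime-type option. It still spawns one process per batch, which is why batching reduces the overhead rather than removing it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of any existing tool): amortises process
// start-up cost by passing many paths to a single `file` invocation.
final class BatchFileIdentifier {

    // Splits the paths into consecutive batches of at most batchSize entries.
    static List<List<String>> partition(List<String> paths, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < paths.size(); i += batchSize) {
            int end = Math.min(i + batchSize, paths.size());
            batches.add(new ArrayList<>(paths.subList(i, end)));
        }
        return batches;
    }

    // Builds the argument vector for one batch: file --mime-type -- p1 p2 ...
    // The "--" stops option parsing so paths starting with '-' are safe.
    static List<String> buildCommand(List<String> batch) {
        List<String> command = new ArrayList<>();
        command.add("file");
        command.add("--mime-type");
        command.add("--");
        command.addAll(batch);
        return command;
    }

    // Runs one process per batch and returns the raw output lines,
    // which `file` prints in the form "path: mime/type".
    static List<String> identify(List<String> paths, int batchSize)
            throws IOException, InterruptedException {
        List<String> lines = new ArrayList<>();
        for (List<String> batch : partition(paths, batchSize)) {
            Process process = new ProcessBuilder(buildCommand(batch))
                    .redirectErrorStream(true)
                    .start();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            process.waitFor();
        }
        return lines;
    }
}
```

Calling BatchFileIdentifier.identify(paths, 500) would start one process per 500 files instead of one per file. The batch size is a tuning assumption, bounded in practice by the operating system's limit on command-line length.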

Another general issue is the lack of stream-based processing in some tools. DROID, file, and ExifTool in particular require a file on which to operate rather than a stream. Again, this is inefficient in the many characterisation situations where the content to be characterised is held within some kind of container format, e.g. zip or WARC files. In these cases it is more efficient to read the stream from the container and characterise it directly, rather than having to serialise it to a new file.
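
The stream-based approach can be sketched using only the Java standard library. The example below walks the entries of a zip container and passes each entry's stream straight to a characterisation routine, with no temporary files written. The class ContainerStreamSniffer and its sniff method are hypothetical: the magic-byte check is a toy stand-in for a real tool such as DROID or file, not a description of how those tools work.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch: characterises zip entries directly from the
// container stream instead of extracting each one to a new file.
final class ContainerStreamSniffer {

    // Toy stand-in for a real characterisation tool: inspects only the
    // leading bytes of the stream and maps a few known signatures.
    static String sniff(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int filled = 0;
        while (filled < head.length) {
            int read = in.read(head, filled, head.length - filled);
            if (read < 0) {
                break; // end of this entry's data
            }
            filled += read;
        }
        if (startsWith(head, filled, new byte[] {'%', 'P', 'D', 'F'})) {
            return "application/pdf";
        }
        if (startsWith(head, filled, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        if (startsWith(head, filled, new byte[] {'G', 'I', 'F', '8'})) {
            return "image/gif";
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, int length, byte[] magic) {
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns entry name -> detected type, in container order.
    static Map<String, String> characterise(InputStream zipBytes) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(zipBytes)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // The ZipInputStream itself serves as the entry's stream.
                    result.put(entry.getName(), sniff(zip));
                }
                zip.closeEntry(); // skip any unread remainder of the entry
            }
        }
        return result;
    }
}
```

The same pattern would apply to WARC records given a suitable reader. The point of the design is that the characterisation routine accepts an InputStream, so the container reader never needs to serialise its content to disk.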

A list of specific goals for each tool is presented below: