Tika Batch File Identification

Title Tika Batch File Identification
Detailed description Overview:  
A group of issues surrounding batch processing of large numbers of files to identify their formats, and so hint at which applications may be useful for rendering them.

Various identified requirements:
- Processing of files within disk images, e.g. ISO images
- Recursive file identification through a directory
- Ability to handle renamed files, e.g. files with odd file extensions (.doc renamed to .tree)
- Extract metadata information from file (e.g. creation date, authors, etc.)
- Summarise file information within directory (e.g. number of files of each format, creation date ranges, etc.)
- Ease of use  

Solution:
The modular solution consists of 3* primary Python scripts, which break the problem down into 3 blocks:

1. Recursively run Apache Tika over all files in a specified directory and save output metadata (TikaRunner.py)
2. Process output metadata and collate into a single CSV file (CSVFormatter.py)
3. Summarise the data in the CSV file (Summariser.py)

TikaRunner.py stores the output for each file in a file of the same name (with a ".txt" extension appended), containing the JSON-formatted metadata extracted by Tika.  These output files are stored under a (temporary) top-level output directory, which maintains the same directory structure as the original input directory.
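The directory mirroring and per-file Tika call described above can be sketched as follows. This is a minimal illustration rather than the actual TikaRunner.py: the function names and the TIKA_JAR value are hypothetical, the code is written for Python 3 rather than the Python 2.7.3 the original scripts target, and it assumes the Tika command-line app is invoked with its --json option and that java is on the PATH.

```python
import os
import subprocess

# Assumed location of the Tika application JAR; adjust for your installation.
TIKA_JAR = "tika-app.jar"


def output_path_for(input_root, output_root, file_path):
    """Map an input file to its mirrored output file, with ".txt" appended."""
    relative = os.path.relpath(file_path, input_root)
    return os.path.join(output_root, relative + ".txt")


def iter_files(input_root):
    """Yield the path of every file beneath input_root, recursively."""
    for dirpath, _dirnames, filenames in os.walk(input_root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def run_tika(file_path, out_path, tika_jar=TIKA_JAR):
    """Run Tika on one file and save its JSON metadata to out_path."""
    out_dir = os.path.dirname(out_path)
    if out_dir and not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    with open(out_path, "wb") as out:
        # "--json" asks the Tika command-line app to print metadata as JSON.
        subprocess.check_call(
            ["java", "-jar", tika_jar, "--json", file_path], stdout=out
        )


def run_over_directory(input_root, output_root, runner=run_tika):
    """Process every file under input_root, mirroring the tree in output_root."""
    for file_path in iter_files(input_root):
        runner(file_path, output_path_for(input_root, output_root, file_path))
```

Passing the runner as a parameter keeps the directory walk separate from the Tika call, so the path mapping can be checked without Java installed.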
CSVFormatter.py iterates over each of these JSON output files and aggregates all the data into a single CSV file, which can easily be opened in Excel.
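One way to do this aggregation is sketched below. It is not the actual CSVFormatter.py: the function names and the source_file column are hypothetical, the code assumes Python 3 and UTF-8 encoded output files, and it assumes each output file holds a single JSON object of metadata keys and values.

```python
import csv
import json
import os


def load_records(output_root):
    """Read each JSON ".txt" file under output_root into a dict."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(output_root):
        for name in sorted(filenames):
            if not name.endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as handle:
                metadata = json.load(handle)
            # Hypothetical extra column: the original file's relative path.
            relative = os.path.relpath(path, output_root)
            metadata["source_file"] = relative[:-len(".txt")]
            records.append(metadata)
    return records


def write_csv(records, csv_path):
    """Write one CSV whose columns are the union of all metadata keys."""
    columns = sorted({key for record in records for key in record})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        for record in records:
            # Multi-valued metadata arrives as a JSON list; join it into one cell.
            row = {
                key: "; ".join(map(str, value)) if isinstance(value, list) else value
                for key, value in record.items()
            }
            writer.writerow(row)
    return columns
```

Because different formats yield different metadata keys, the column set is taken as the union across all files, and cells are left empty where a file has no value for a key.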
Summariser.py then summarises the aggregated results held in the CSV file.
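A summary of the kind listed in the requirements (number of files of each format, creation date range) might be computed along these lines. This is a sketch and not the actual Summariser.py: the function name is hypothetical, Python 3 is assumed, and the column names "Content-Type" and "Creation-Date" are assumptions about which Tika metadata keys appear in the CSV.

```python
import csv
from collections import Counter


def summarise(csv_path, type_column="Content-Type", date_column="Creation-Date"):
    """Count files per format and find the earliest and latest creation dates."""
    format_counts = Counter()
    dates = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Rows with no detected type are grouped under "unknown".
            format_counts[row.get(type_column) or "unknown"] += 1
            if row.get(date_column):
                dates.append(row[date_column])
    # Assumes uniformly formatted ISO 8601 timestamps, which sort
    # chronologically when compared as plain strings.
    date_range = (min(dates), max(dates)) if dates else None
    return format_counts, date_range
```

Calling format_counts.most_common() on the result lists formats from most to least frequent, which is a convenient order for a summary table.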


* A 4th python script provides configuration of the various components, e.g. path to the Tika JAR file.

Extended Solution (Processing ISO files):
One requirement not addressed (automatically, at least) is the processing of ISO file contents.  Since the event, the above solution has been extended with a 5th script (ISORunner.py), which iterates through a directory of ISO files, mounting each one using WinCDEmu and then processing its contents using TikaRunner and CSVFormatter.  The aggregated results from all ISO files are then summarised using a modified Summariser script.

Overall, the workflow looks something like:


ISO images:
The solution (currently) does not explicitly handle extraction of ISO image files using the scripts above. However, if an ISO is mounted, exposing the contained file system, then the developed scripts are able to operate over all the contained files, aggregating and summarising the results into the associated CSV files.

Mounting an ISO on Windows is not as trivial as on a Linux machine, nor can it be accomplished with Cygwin running on Windows. The latest version of FTK Imager (> ver 3) supposedly can mount ISO images to a drive letter, however I could not get this to work on the sample ISO images available.

I did find that the open-source WinCDEmu program was able to successfully mount the ISO samples. It also provides a command-line interface, which allows a script to automate mounting and processing of an ISO image. With an ISO image mounted to a drive letter (e.g. V:/), the developed scripts worked as expected.
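The mount, process, unmount loop that such automation needs can be sketched as below. This is an illustration only, written for Python 3: the function names are hypothetical, and the BATCHMNT path and the batchmnt.exe arguments are assumptions about WinCDEmu's command-line tool that should be checked against the WinCDEmu documentation for the installed version.

```python
import glob
import os
import subprocess

# Assumed path to WinCDEmu's command-line tool; verify for your installation.
BATCHMNT = r"C:\Program Files (x86)\WinCDEmu\batchmnt.exe"


def mount_iso(iso_path, drive_letter):
    """Mount an ISO to a drive letter (argument syntax is an assumption)."""
    subprocess.check_call([BATCHMNT, iso_path, drive_letter, "/wait"])


def unmount_iso(iso_path):
    """Unmount a previously mounted ISO (argument syntax is an assumption)."""
    subprocess.check_call([BATCHMNT, "/unmount", iso_path])


def process_isos(iso_dir, process, drive_letter="V",
                 mount=mount_iso, unmount=unmount_iso):
    """Mount each ISO in iso_dir in turn, process its contents, then unmount."""
    processed = []
    for iso_path in sorted(glob.glob(os.path.join(iso_dir, "*.iso"))):
        mount(iso_path, drive_letter)
        try:
            # The mounted file system is handed over as e.g. "V:/".
            process(drive_letter + ":/", iso_path)
            processed.append(iso_path)
        finally:
            # Always release the drive letter, even if processing fails.
            unmount(iso_path)
    return processed
```

Here process stands in for whatever runs the identification and CSV steps over the mounted drive. Unmounting in a finally block means one problematic ISO does not leave the drive letter occupied for the next image.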

Dependencies:
Requires installation of Java JDK 6 and Python 2.7.3 (the .3 is important - we've encountered problems with the scripts running on earlier Python versions) 

Installation and Running:
See README
Further information
See this blog post by Peter May that provides further detail and some context to this solution.
Solution Champion Peter May
Corresponding Issue(s)
Tool/code link https://github.com/openplanets/SPRUCE/tree/master/TikaFileIdentifier
Tool Registry Link
Evaluation Issues and areas for Improvement (from Solution Champion):
- Script to handle automated batch processing of ISO files
- It would be useful to have a mapping between file formats and applications suitable for opening the files
- Processing speeds could be improved through parallel processing of files and only instantiating Tika once, rather than once per file.
- Some files cause Tika to crash whilst parsing them.  Needs further investigation and feedback to Apache Tika.
- Some files are only identified as application/octet-stream (Tika default).  Needs further investigation and feedback to Tika.
- Some problems with character encoding of metadata returned by Tika causing issues when trying to load JSON output files.
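Of the improvements listed above, parallel processing combined with tolerance of per-file Tika failures could be approached as in the following sketch. It is not part of the existing scripts: the function name is hypothetical, Python 3 is assumed, and process_one stands for any callable that handles a single file (for example, one that launches Tika as a subprocess). It does not address instantiating Tika only once.

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_parallel(file_paths, process_one, max_workers=4):
    """Run process_one over file_paths concurrently, isolating failures."""

    def safe(path):
        try:
            return path, process_one(path), None
        except Exception as error:  # e.g. Tika exiting abnormally on a file
            return path, None, error

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(safe, file_paths))

    succeeded = {path: result for path, result, error in outcomes if error is None}
    failed = {path: error for path, result, error in outcomes if error is not None}
    return succeeded, failed
```

Threads are a reasonable fit when each task mostly waits on an external Java process. The failed mapping gives a list of problem files that could be fed back to Apache Tika for investigation.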

From Paul and reporting back session:
  • Script wraps Tika in order to recurse through a directory of files
  • Modularised approach: ID files, then visualise the results for each file and then a third script to provide summaries
  • CSV file output easy to review
  • Provides summary information
  • A few issues to resolve with odd files, but deals successfully with problematic .psd
  • Issue owner: Saves lots of manual effort. Deals with unusual file extensions well. Additional metadata is added bonus!
Labels: identification, spruce, spruce_glasgow, solution