| *Title* | _Tika Batch File Identification_ \\ |
| *Detailed description* | *{_}Overview:_*   \\
A group of issues surrounding the batch processing of large numbers of files to identify their formats, and thereby hint at which applications may be useful for rendering them.   \\ \\
Various identified requirements: \\
\- Processing of files within disk images, e.g. ISO images \\
\- Recursive file identification through a directory \\
\- Ability to handle renamed files, e.g. files with odd file extensions (.doc renamed to .tree) \\
\- Extract metadata information from file (e.g. creation date, authors, etc.) \\
\- Summarise file information within directory (e.g. number of files of each format, creation date ranges, etc.) \\
\- Ease of use   \\
\\
*{_}Solution:_* \\
The modular solution consists of 3\* primary Python scripts, which break the problem down into 3 blocks: \\
\\
1. Recursively run [Apache Tika|http://wiki.opf-labs.org/display/TR/Tika] over all files in a specified directory and save output metadata (TikaRunner.py) \\
2. Process output metadata and collate into a single CSV file (CSVFormatter.py) \\
3. Summarise the data in the CSV file (Summariser.py) \\
\\
The TikaRunner.py output for each file is stored in a file of the same name (with a ".txt" extension appended) containing the JSON-formatted metadata extracted by Tika.  These output files are stored under a (temporary) top-level output directory that maintains the same directory structure as the original input directory. \\
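The output layout described above can be sketched as follows. This is a minimal Python 3 illustration, not the TikaRunner.py code itself (which targets Python 2.7.3); the function names are hypothetical, and it assumes the Tika app JAR's {{--json}} metadata option. \\
```python
import os
import subprocess
# Map an input file to its metadata file: same relative path, ".txt" appended.
def mirrored_output_path(input_root, output_root, input_file):
    relative = os.path.relpath(input_file, input_root)
    return os.path.join(output_root, relative + ".txt")
# Build the command line asking the Tika app for JSON metadata on one file.
def tika_command(tika_jar, input_file):
    return ["java", "-jar", tika_jar, "--json", input_file]
# Walk the input tree and write one metadata file per input file.
def run_tika_over_tree(tika_jar, input_root, output_root):
    for directory, _subdirs, names in os.walk(input_root):
        for name in names:
            source = os.path.join(directory, name)
            target = mirrored_output_path(input_root, output_root, source)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as out:
                # call() does not raise if Tika exits with an error, so one
                # problematic file does not stop the whole batch.
                subprocess.call(tika_command(tika_jar, source), stdout=out)
```
For example, with an input root of {{in}} and an output root of {{out}}, the file {{in/a/b.doc}} would map to {{out/a/b.doc.txt}}. \\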
CSVFormatter.py iterates over each of these JSON output files and aggregates all the data into one CSV file which can easily be opened in Excel. \\
Summariser.py summarises the aggregated results in the CSV file. \\
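The collation and summary steps could look roughly like the sketch below. It is a Python 3 illustration under stated assumptions rather than the actual CSVFormatter.py or Summariser.py: the function names and the added "file" column are hypothetical, and each metadata file is assumed to hold a single JSON object with a "Content-Type" key. \\
```python
import csv
import json
import os
from collections import Counter
# Collect every readable JSON metadata file under output_root.
def load_metadata(output_root):
    records = []
    for directory, _subdirs, names in os.walk(output_root):
        for name in sorted(names):
            if not name.endswith(".txt"):
                continue
            path = os.path.join(directory, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    record = json.load(handle)
            except ValueError:
                # Empty, truncated or badly encoded output: skip this file.
                continue
            # Record which input file the metadata belongs to.
            record["file"] = os.path.relpath(path, output_root)[:-len(".txt")]
            records.append(record)
    return records
# Write all records to one CSV, using the union of metadata keys as columns.
def write_csv(records, csv_path):
    columns = sorted({key for record in records for key in record})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(records)
    return columns
# Count how many files were identified as each format.
def count_formats(records):
    return Counter(str(record.get("Content-Type", "unknown")) for record in records)
```
Files whose metadata lacks a given key simply get an empty cell in that column, which keeps the CSV rectangular for spreadsheet tools. \\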
\\ !BasicWorkflow.jpg|align=center!\\
\* A 4th python script provides configuration of the various components, e.g. path to the Tika JAR file. \\
\\
*{_}Extended Solution (Processing ISO files):_* \\
One requirement not addressed (automatically, at least) is the processing of ISO file contents.  Since the event, the above solution has been extended with a 5th script (ISORunner.py) that iterates through a directory of ISO files, mounting each one using [WinCDEmu|http://wincdemu.sysprogs.org/], then processing the contents using TikaRunner and CSVFormatter.  The aggregated results from all ISO files are then summarised using a modified Summariser script. \\
\\
Overall, the workflow looks something like: \\
\\ !ISOWorkFlow.jpg|align=center!\\
*{_}ISO images:_* \\
The solution (currently) does not explicitly handle extraction of ISO image files using the scripts above. However, if an ISO is mounted, exposing the contained file system, then the developed scripts are able to operate over all the contained files, aggregating and summarising the results into the associated CSV files. \\
\\
Mounting an ISO on Windows is not as trivial as on a Linux machine, nor can it be accomplished with Cygwin running on Windows. The latest version of FTK Imager (>ver 3) supposedly can mount ISO images to a drive letter; however, I could not get this to work on the sample ISO images available. \\
\\
I did find that the open-source [WinCDEmu|http://wincdemu.sysprogs.org/] program was able to mount the ISO samples successfully. It also provides a command-line interface, which allows a script to automate the mounting and processing of an ISO image. With an ISO image mounted to a drive letter (e.g. V:/), the developed scripts worked as expected.  \\
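A script automating this mount, process, unmount cycle might be structured as below. This is only a sketch: the function names are hypothetical, and the mount and unmount command lines are deliberately left as caller-supplied placeholders, since the exact arguments should be taken from the mounting tool's own documentation. \\
```python
import subprocess
from contextlib import contextmanager
# Run a mount command, yield to the caller, then always run the unmount
# command. Both commands are argument lists supplied by the caller.
@contextmanager
def mounted(mount_cmd, unmount_cmd, run=subprocess.check_call):
    run(mount_cmd)
    try:
        yield
    finally:
        # Unmount even if processing the mounted contents failed.
        run(unmount_cmd)
# Mount each ISO in turn and hand the mounted drive to a processing callback.
# make_mount_cmd and make_unmount_cmd build the tool-specific command lines.
def process_isos(iso_paths, drive, make_mount_cmd, make_unmount_cmd, process,
                 run=subprocess.check_call):
    for iso in iso_paths:
        mount_cmd = make_mount_cmd(iso, drive)
        unmount_cmd = make_unmount_cmd(iso, drive)
        with mounted(mount_cmd, unmount_cmd, run):
            process(iso, drive)
```
Passing the command runner in as a parameter means the loop can be exercised with a stub before being pointed at real ISO images. \\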
\\
*{_}Dependencies:_* \\
Requires installation of Java JDK 6 and Python 2.7.3 (the .3 is important - we've encountered problems with the scripts running on earlier Python versions)  \\
\\
*{_}Installation and Running:_* \\
See [README|https://github.com/openplanets/SPRUCE/tree/master/TikaFileIdentifier]\\
*{_}Further information{_}* \\
See this [blog post|http://openplanetsfoundation.org/blogs/2012-04-19-spruce-mashup-batch-file-identification-using-apache-tika] by [~pmay] that provides further detail and some context to this solution. |
| *Solution Champion* | _[~pmay]_ \\ |
| *Corresponding Issue(s)* | * [http://wiki.opf-labs.org/display/SPR/Identification+of+file+format+and+last+modified+or+created+dates+of+files+within+a+disk+image]
* [http://wiki.opf-labs.org/display/SPR/Identification+of+file+formats+with+incorrect+file+extensions]
* [http://wiki.opf-labs.org/display/SPR/Classification+of+files+within+a+disk+image] |
| *Tool/code link* | _[https://github.com/openplanets/SPRUCE/tree/master/TikaFileIdentifier]_ \\ |
| *[Tool Registry Link|http://wiki.opf-labs.org/display/TR/Home]* | * [http://wiki.opf-labs.org/display/TR/Tika] |
| *Evaluation* | *{_}Issues and areas for Improvement (from Solution Champion):_* \\
\- Script to handle automated batch processing of ISO files \\
\- It would be useful to have a mapping between file formats and applications suitable for opening them \\
\- Processing speeds could be improved through parallel processing of files and only instantiating Tika once, rather than once per file. \\
\- Some files cause Tika to crash whilst parsing them.  Needs further investigation and feedback to Apache Tika. \\
\- Some files are only identified as application/octet-stream (Tika default).  Needs further investigation and feedback to Tika. \\
\- Some problems with character encoding of metadata returned by Tika causing issues when trying to load JSON output files. \\
\\
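As one possible direction for the parallel-processing point above, the sketch below spreads per-file identification across a thread pool. It is an assumption-laden illustration, not part of the existing scripts: {{identify}} stands for any hypothetical callable that takes a file path and returns its metadata (for example, one that launches Tika for that file), and it does not address instantiating Tika only once. \\
```python
from concurrent.futures import ThreadPoolExecutor
# Call identify(path), capturing any failure instead of raising, so that one
# problematic file does not abort the whole batch.
def safe_identify(identify, path):
    try:
        return identify(path), None
    except Exception as error:
        return None, str(error)
# Identify many files concurrently. Returns a dict mapping each path to a
# (metadata, error) pair, where exactly one of the two is None.
def identify_in_parallel(paths, identify, workers=4):
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(lambda p: safe_identify(identify, p), paths))
    return dict(zip(paths, outcomes))
```
Threads are a reasonable fit when each call mostly waits on an external process; recording errors per file also gives a list of the files that need further investigation. \\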
From Paul and reporting back session: \\
* Script wraps Tika in order to recurse through a directory of files
* Modularised approach: ID files, then visualise the results for each file and then a third script to provide summaries
* CSV file output easy to review
* Provides summary information
* A few issues to resolve with odd files, but deals successfully with problematic .psd
* Issue owner: Saves lots of manual effort. Deals with unusual file extensions well. Additional metadata is an added bonus\! |