Tika Batch File Identification

Current version by Peter May on Jun 13, 2012 09:19.


Summariser.py summarises the aggregated results in the CSV file. \\
\\ !BasicWorkflow.jpg|align=center!\\
\* A 4th Python script provides configuration of the various components, e.g. the path to the Tika JAR file. \\
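As a rough illustration of the summarising step, the sketch below counts how often each value appears in one column of an aggregated CSV. The function name and the column name mime_type are assumptions made for this example; they are not taken from Summariser.py, whose actual column layout is not described here.

```python
import csv
from collections import Counter


def summarise(csv_lines, column="mime_type"):
    """Count occurrences of each value in `column`.

    `csv_lines` is any iterable of CSV text lines with a header row
    (for example an open file). The default column name is hypothetical.
    """
    reader = csv.DictReader(csv_lines)
    return Counter(row[column] for row in reader)
```

Passing an open file object as csv_lines works the same way as passing a list of lines, so the same function can be tried on a small hand-written sample before running it over a full results file.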
*{_}Extended Solution (Processing ISO files):_* \\
One requirement the basic solution does not address (automatically, at least) is the processing of ISO file contents. Since the event, the above solution has been extended with a 5th script (ISORunner.py) that iterates through a directory of ISO files, mounts each one using [WinCDEmu|http://wincdemu.sysprogs.org/], and then processes the contents using TikaRunner and CSVFormatter. The aggregated results from all ISO files are then summarised using a modified Summariser script. \\
Overall, the workflow looks something like: \\
\\ !ISOWorkFlow.jpg|align=center!\\
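The mount, process, unmount loop described above can be sketched roughly as follows. This is not the ISORunner.py source: the function name is made up for illustration, and the mount, unmount and process steps are passed in as hypothetical caller-supplied callables, so no particular WinCDEmu command line or TikaRunner interface is assumed.

```python
import os


def process_iso_directory(iso_dir, mount, unmount, process):
    """Mount each .iso file in iso_dir, process its contents, then unmount.

    mount(iso_path) must return the root path of the mounted file system,
    unmount(iso_path) releases it, and process(root) handles the contents.
    All three callables are hypothetical placeholders supplied by the caller.
    Returns a list of (iso_path, result) pairs.
    """
    results = []
    for name in sorted(os.listdir(iso_dir)):
        # Skip anything that is not an ISO image (case-insensitive match).
        if not name.lower().endswith(".iso"):
            continue
        iso_path = os.path.join(iso_dir, name)
        root = mount(iso_path)
        try:
            results.append((iso_path, process(root)))
        finally:
            # Always unmount, even if processing fails.
            unmount(iso_path)
    return results
```

Wrapping the processing step in try/finally means a failure on one image does not leave it mounted, which matters when many ISO files are handled in a single run.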
*{_}ISO images:_* \\
The solution (currently) does not explicitly handle extraction of ISO image files using the scripts above. However, if an ISO is mounted, exposing the contained file system, then the developed scripts are able to operate over all the contained files, aggregating and summarising the results into the associated CSV files. \\
*{_}Dependencies:_* \\
Requires installation of Java JDK 6 and Python 2.7.3 (the .3 is important - we've encountered problems with the scripts running on earlier Python versions)  \\
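Because the patch level matters, a script could check the interpreter version before doing any work. The following is only a sketch of such a check; the names MINIMUM_PYTHON and meets_minimum are hypothetical and do not come from the scripts themselves. It tests the lower bound only and says nothing about whether the scripts run on Python 3.

```python
import sys

# Lowest Python version the page states as working (assumed lower bound).
MINIMUM_PYTHON = (2, 7, 3)


def meets_minimum(version=None, minimum=MINIMUM_PYTHON):
    """Return True if `version` is at least `minimum`.

    `version` defaults to the running interpreter's version and may be any
    sequence whose first three items are (major, minor, micro).
    """
    if version is None:
        version = sys.version_info
    return tuple(version[:3]) >= minimum
```

A script might call meets_minimum() at start-up and exit with a clear message when it returns False, rather than failing part-way through a batch.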
*{_}Installation and Running:_* \\