|Title||The Tika Characterisation Tool|
|Detailed description||Tika has been chosen as a useful tool for the PC CC workpackage. This page details the information we have acquired about the tool.|
| Solution Champion
||Asger Askov Blekinge, Markus Raditsch?, Peter May (BL)|
| Corresponding Issue(s)
|| A bulleted list of links to Issues to which this provides a Solution
|myExperiment Link|| Basic Apache TIKA 1.0 workflow
Simple Tika RESTful web service workflow
| Tool Registry Link
|| Any notes or links on how the solution performed. This will be developed and formalised by the Testbed SP.
Some initial evaluation of identification tools can be found here: SO3 Comparing identification tools
Tika can process multiple input files in a single process. Given the start-up cost of the JVM, this looks like an important performance gain. To be usable, however, Tika needs to be stable enough to work through the entire list, even if some of the files cannot be parsed.
This is not the case at present. TODO: finish this.
Experiments with large TIFF files showed Tika crashing during parsing due to lack of heap space. Adding -Xmx1024m (or even -Xmx512m) to the Java command line when running Tika solves this problem. Identification alone should not be affected, as it reads only a small (8K) buffer from the file.
This does not rule out that other files may trigger other issues (possibly in parsing) that crash Tika.
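One way to keep a long run going despite such crashes is to start one JVM per file with a larger heap and record failures instead of aborting. This gives up the single-process start-up saving in exchange for robustness. A minimal sketch, assuming a Tika application jar at the hypothetical path tika-app.jar and its --detect command-line option; the helper names are illustrative only:

```python
import subprocess

TIKA_JAR = "tika-app.jar"  # hypothetical path to a local Tika application jar


def build_command(path, heap="1024m", jar=TIKA_JAR):
    """Build a java command line that runs Tika detection with a larger heap."""
    return ["java", "-Xmx" + heap, "-jar", jar, "--detect", path]


def identify_all(paths, run=subprocess.run):
    """Run Tika once per file; collect failures instead of stopping the run."""
    results, failures = {}, {}
    for path in paths:
        try:
            proc = run(build_command(path), capture_output=True, text=True)
        except OSError as exc:  # e.g. the java executable was not found
            failures[path] = str(exc)
            continue
        if proc.returncode == 0:
            results[path] = proc.stdout.strip()
        else:
            # A crash (such as running out of heap) ends up here
            failures[path] = proc.stderr.strip()
    return results, failures
```

The `run` parameter exists only so that the process call can be swapped out when trying the sketch without a Java installation.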
The SCAPE GitHub repository provides a toolspec for creating a Tika web service (https://github.com/openplanets/scape/tree/master/xa-toolwrapper/examples). This service can currently be used with Taverna workflows; however, Tika is controlled through the command line interface (CLI), which requires a local Tika installation. Instead, a RESTful web service could be hosted externally (away from the user's local machine, e.g. on the web) to provide an accessible and scalable service for use in Taverna workflows, something like:
- GET Tika/rest/mime?file=<path_to_file>
- GET Tika/rest/parse?file=<path_to_file>
Source code can currently be found at https://github.com/pmay/SCAPE-Tika-REST. It wraps the OpenPlanets Tika fork (https://github.com/openplanets/tika) in a RESTful web service built using Jersey (http://jersey.java.net/).
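A client would need to URL-encode the file path before placing it in the query string of the two GET requests proposed above. A minimal sketch of building such request URLs; the base address is a hypothetical deployment location, and the API itself has not yet been finalised:

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/Tika/rest"  # hypothetical deployment address


def endpoint_url(operation, file_path, base_url=BASE_URL):
    """Return the GET URL for the 'mime' or 'parse' operation on file_path."""
    if operation not in ("mime", "parse"):
        raise ValueError("unknown operation: " + operation)
    # urlencode escapes characters such as '/' and spaces in the path
    return base_url + "/" + operation + "?" + urlencode({"file": file_path})
```

The resulting URL could then be fetched with any HTTP client; no response format is assumed here.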
- Formulate and document API
The Tika signature file was updated with additional mime-type XML fragments that specify the PDF version in the ‘type’ attribute and the version number in the magic match value. Each new fragment was also specified as a sub-class-of the generic PDF:
Minor updates were also required to the unit tests to account for the version being reported in the mime-type.
Currently only PDF versioning is supported, so further work is needed to add versions to other mime-types.
A prototype of Tika (https://github.com/openplanets/tika) that uses regular expressions for identification is up and running. The <match> element for PDF-1.4 was modified to use the regular expression from Fido’s signature file, although this required a slight modification (each ‘\’ has to be escaped):
The Tika signature file is based on the Freedesktop MIME-info format (http://standards.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-latest.html). The set of values accepted by the <match> element’s ‘type’ attribute was extended to include “regex”.
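The idea of regex-based identification can be sketched independently of Tika: apply a regular expression to the leading bytes of a file, limited to the small buffer that identification reads. The pattern below is an illustrative stand-in, not the expression taken from Fido’s signature file, and the function name is hypothetical:

```python
import re

BUFFER_SIZE = 8 * 1024  # identification only looks at a small leading buffer

# Illustrative pattern only, not the expression from Fido's signature file.
# The raw bytes literal avoids having to double the backslash.
PDF_1_4 = re.compile(rb"\A%PDF-1\.4")


def matches_signature(data, pattern=PDF_1_4):
    """Return True if the pattern matches within the first 8K of data."""
    return pattern.search(data[:BUFFER_SIZE]) is not None
```

For example, `matches_signature(b"%PDF-1.4\n")` is true, while a buffer starting with `%PDF-1.5` does not match.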
Magic patterns are stored in a TreeSet, sorted in descending order according to the following four rules (in precedence order):
- Magic Priority
- Maximum total length of the Magic pattern value (and all nested patterns)
- String comparison of MIME type
- String comparison of the constructed string "[" + priority + "/" + String_representation_of_the_matching_clauses + "]"
String_representation_of_the_matching_clauses is a String made up of the MIME type, pattern length, pattern and any mask:
(nested clauses are AND’d and OR’d together where nested patterns are involved)
- Ordering of XML mime-type fragments in the signature file has no effect (because the TreeSet orders the patterns)
- Higher priority fragments are matched first
- Equal priority fragments match against the longer "match value" first (e.g. against “%PDF-1.4” match value first rather than “%PDF-”)
- Equal priority, equal “match value” length fragments match against the lexicographically greater MIME type first (e.g. “application/rtf” before “application/pdf”)
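The ordering rules above can be sketched as a descending sort on a four-part key. This is a simplified model rather than Tika's own code: the `Magic` tuple, its field names, the priority values, and the `application/x-...` type names are all illustrative assumptions, and the clause string is reduced to the bare pattern.

```python
from collections import namedtuple

# Simplified stand-in for a magic pattern; field names are illustrative.
Magic = namedtuple("Magic", "priority length mime_type clauses")


def match_order(magics):
    """Sort descending by priority, length, MIME type, then clause string."""
    return sorted(
        magics,
        key=lambda m: (m.priority, m.length, m.mime_type, m.clauses),
        reverse=True,
    )


magics = [
    Magic(50, len("%PDF-"), "application/pdf", "%PDF-"),
    Magic(50, len("{\\rtf"), "application/rtf", "{\\rtf"),
    Magic(50, len("%PDF-1.4"), "application/x-pdf-1.4", "%PDF-1.4"),
    Magic(60, 4, "application/x-example", "ABCD"),
]
for m in match_order(magics):
    print(m.priority, m.length, m.mime_type)
```

The input order of the list plays no role in the result, which mirrors the observation that the order of fragments in the signature file has no effect.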