
The SCAPE GitHub repository provides a toolspec for creating a Tika web service ([https://github.com/openplanets/scape/tree/master/xa-toolwrapper/examples]). This service can currently be used with Taverna workflows; however, Tika is controlled via the Command Line Interface (CLI), which requires a local Tika installation. Instead, a RESTful web service could be hosted externally (i.e. away from the user's local machine, e.g. on the web) to provide an accessible and scalable service for use in Taverna workflows, something like:
* GET Tika/rest/mime?file=<path_to_file>
* GET Tika/rest/parse?file=<path_to_file>
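As a rough illustration of how a client might address these two endpoints, the sketch below builds the request URLs with the file path percent-encoded as a query value. It is not taken from the SCAPE-Tika-REST code: the class name, the method name and the {{http://example.org}} host are hypothetical, and only the {{Tika/rest/mime}} and {{Tika/rest/parse}} paths come from the list above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical client-side helper; class, method and host names are illustrative only.
class TikaRestUrlSketch {

    // Builds <baseUrl>/Tika/rest/<operation>?file=<encoded path>,
    // where operation is assumed to be "mime" or "parse".
    static String endpointUrl(String baseUrl, String operation, String filePath) {
        try {
            // URLEncoder applies form encoding: '/' becomes %2F and a space becomes '+',
            // so the whole path travels as a single query value.
            String encoded = URLEncoder.encode(filePath, "UTF-8");
            return baseUrl + "/Tika/rest/" + operation + "?file=" + encoded;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(endpointUrl("http://example.org", "mime", "/data/my report.pdf"));
        System.out.println(endpointUrl("http://example.org", "parse", "/data/my report.pdf"));
    }
}
```

Whether the service expects a path on the server, a URL, or an uploaded file is still to be settled as part of the API design.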
Source code can currently be found at [https://github.com/pmay/SCAPE-Tika-REST]. This wraps the OpenPlanets Tika fork ([https://github.com/openplanets/tika]) in a RESTful web service built using Jersey ([http://jersey.java.net/]).
TODOs:
# Formulate and document API
h4. Can Tika support extended/fine-grained mime types?
The Tika signature file was updated with additional mime-type XML fragments that specify the PDF version in the ‘type’ attribute and the version number in the magic match value. Each new fragment is also declared as a sub-class-of the generic PDF type:
{code}
<mime-type type="application/pdf; version=1.0">
<acronym>PDF 1.0</acronym>
<_comment>Portable Document Format - Version 1.0</_comment>
<sub-class-of type="application/pdf"/>
<magic priority="50">
<match value="%PDF-1.0" type="string" offset="0"/>
</magic>
</mime-type>
{code}
Minor updates were also required to the unit tests to account for the version being reported in the mime-type.
Currently only PDF versioning is supported, so further work is needed to add versions to other mime-types.
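Since the per-version fragments differ only in the version number, they could be generated rather than written by hand. The following is a minimal sketch of that idea, not part of the Tika fork: the class and method names are hypothetical, and the choice of versions 1.0 to 1.7 and the fixed priority of 50 are assumptions carried over from the single example above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical generator for versioned PDF mime-type fragments.
class PdfVersionFragmentSketch {

    // Returns one <mime-type> fragment in the same shape as the PDF 1.0 example.
    static String fragment(String version) {
        return "<mime-type type=\"application/pdf; version=" + version + "\">\n"
             + "<acronym>PDF " + version + "</acronym>\n"
             + "<_comment>Portable Document Format - Version " + version + "</_comment>\n"
             + "<sub-class-of type=\"application/pdf\"/>\n"
             + "<magic priority=\"50\">\n"
             + "<match value=\"%PDF-" + version + "\" type=\"string\" offset=\"0\"/>\n"
             + "</magic>\n"
             + "</mime-type>\n";
    }

    public static void main(String[] args) {
        // Assumed set of versions to cover; adjust as needed.
        List<String> versions = Arrays.asList("1.0", "1.1", "1.2", "1.3", "1.4", "1.5", "1.6", "1.7");
        for (String version : versions) {
            System.out.print(fragment(version));
        }
    }
}
```

The same approach would only carry over to other formats whose version appears as plain text at a fixed offset.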
h4. Can Tika be extended to support regexp like Fido?
A prototype of Tika ([https://github.com/openplanets/tika]) is up and running that uses regular expressions for identification. The <match> element for PDF-1.4 was modified to use the regular expression from Fido’s signature file, although this required a slight modification (each ‘\’ has to be escaped):
{code}
<mime-type type="application/pdf; version=1.0">
<acronym>PDF 1.0</acronym>
<_comment>Portable Document Format - Version 1.0</_comment>
<sub-class-of type="application/pdf"/>
<magic priority="55">
<match value="(?s)\\A.{0,144}%PDF-1\\.0" type="regex" offset="0"/>
</magic>
</mime-type>
{code}
The Tika signature file is based on the Freedesktop MIME-info format ([http://standards.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-latest.html]). For this prototype, the set of values accepted by the <match> element’s ‘type’ attribute was extended to include “regex”.
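To show what the Fido-style expression accepts, the sketch below applies it with {{java.util.regex}} to a few byte arrays. It is an illustration only and does not reproduce the prototype's code: the class and method names are hypothetical, and it assumes that the doubled backslashes in the signature file collapse to single ones before the pattern is compiled and that the bytes are decoded as ISO-8859-1 before matching.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex-based magic match.
class RegexMagicSketch {

    // Java source needs its own escaping; the compiled pattern is (?s)\A.{0,144}%PDF-1\.0
    static final Pattern PDF_1_0 = Pattern.compile("(?s)\\A.{0,144}%PDF-1\\.0");

    static boolean matches(byte[] header) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is preserved.
        String text = new String(header, StandardCharsets.ISO_8859_1);
        // \A pins the match to the start of the input, so find() cannot drift forward.
        return PDF_1_0.matcher(text).find();
    }

    // Builds test input: junkBytes newline bytes followed by the marker.
    static byte[] withLeadingJunk(int junkBytes, String marker) {
        byte[] markerBytes = marker.getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = new byte[junkBytes + markerBytes.length];
        // Newlines are used as junk to show why the (?s) (DOTALL) flag is needed.
        Arrays.fill(out, 0, junkBytes, (byte) '\n');
        System.arraycopy(markerBytes, 0, out, junkBytes, markerBytes.length);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matches(withLeadingJunk(0, "%PDF-1.0")));
        System.out.println(matches(withLeadingJunk(144, "%PDF-1.0")));
        System.out.println(matches(withLeadingJunk(145, "%PDF-1.0")));
        System.out.println(matches(withLeadingJunk(0, "%PDF-1.4")));
    }
}
```

Unlike the fixed-offset string match, this expression tolerates up to 144 bytes of leading data before the PDF header.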
h4. Processing order of Mime-types in the Tika Signature file
Magic patterns are stored in a [TreeSet|http://docs.oracle.com/javase/6/docs/api/java/util/TreeSet.html], sorted in *descending* order according to the following four rules (in precedence order):
# Magic Priority
# Maximum total length of the Magic pattern value (and all nested patterns)
# String comparison of MIME type
# String comparison based on the constructed string "\[" + priority + "/" + _String_representation_of_the_matching_clauses_ + "\]"
_String_representation_of_the_matching_clauses_ is a String made up of the MIME type, pattern length, pattern and any mask:
{code}
"Magic Detection for " + type.toString() +
" looking for " + pattern.length +
" bytes = " + this.pattern +
" mask = " + this.mask;
{code}
(Where nested patterns are involved, the nested clauses are AND’d and OR’d together.)
h6. Conclusions:
* Ordering of XML mime-type fragments in the signature file has no effect (because the TreeSet orders the patterns)
* Higher priority fragments are matched first
* Equal priority fragments match against the longer "match value" first (e.g. against “%PDF-1.4” match value first rather than “%PDF-”)
* Equal priority, equal “match value” length fragments match against the lexicographically *greater* MIME type first (e.g. “application/rtf” before “application/pdf”)
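The conclusions above can be reproduced with a small stand-alone sketch of the first three ordering rules. This is not Tika's own class: {{MagicStub}}, its fields and the “application/x-example” type are hypothetical, the lengths are simply the lengths of the match values, and the fourth rule (the constructed-string tie-break) is left out, so two stubs that agree on all three fields would collapse into one TreeSet element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model of the magic ordering; covers rules 1 to 3 only.
class MagicOrderSketch {

    static final class MagicStub {
        final String type;   // MIME type
        final int priority;  // magic priority
        final int length;    // total length of the match value(s)

        MagicStub(String type, int priority, int length) {
            this.type = type;
            this.priority = priority;
            this.length = length;
        }
    }

    // Ascending by priority, then length, then MIME type; reversed() makes the whole chain descending.
    static final Comparator<MagicStub> DESCENDING =
            Comparator.comparingInt((MagicStub m) -> m.priority)
                      .thenComparingInt(m -> m.length)
                      .thenComparing(m -> m.type)
                      .reversed();

    // Returns the MIME types in the order in which their patterns would be tried.
    static List<String> matchOrder(List<MagicStub> stubs) {
        TreeSet<MagicStub> sorted = new TreeSet<>(DESCENDING);
        sorted.addAll(stubs);
        List<String> order = new ArrayList<>();
        for (MagicStub m : sorted) {
            order.add(m.type);
        }
        return order;
    }

    // Deliberately listed in an arbitrary order to show that file order does not matter.
    static List<MagicStub> sample() {
        return Arrays.asList(
                new MagicStub("application/pdf", 50, 5),               // "%PDF-"
                new MagicStub("application/rtf", 50, 5),               // "{\rtf"
                new MagicStub("application/x-example", 60, 4),         // hypothetical type
                new MagicStub("application/pdf; version=1.4", 50, 8)); // "%PDF-1.4"
    }

    public static void main(String[] args) {
        for (String type : matchOrder(sample())) {
            System.out.println(type);
        }
    }
}
```

With the sample data, the priority-60 stub comes first, followed by the longer “%PDF-1.4” pattern, then “application/rtf” ahead of “application/pdf”.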