Title The Tika Characterisation Tool
Detailed description Tika has been chosen as a useful tool for the PC CC workpackage. This page details the information we have acquired about the tool.
Solution Champion
Asger Askov Blekinge, Markus Raditsch?, Peter May (BL)
Corresponding Issue(s)
A bulleted list of links to Issues to which this provides a Solution
myExperiment Link
http://www.myexperiment.org/workflows/2583.html
Tool Registry Link
http://wiki.opf-labs.org/display/TR/Tika
Evaluation
Any notes or links on how the solution performed. This will be developed and formalised by the Testbed SP.

Status

Instability, and its effect on performance

Tika can process multiple input files in a single process. Given the startup cost of the JVM, this looks like an important performance boost. To be usable, however, Tika must be stable enough to run through the entire list, even if some of the files are unparsable.

This is not the case at present. TODO: finish this section.
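
The behaviour a batch run needs, where an unparsable file is recorded and the run carries on with the next one, can be sketched as follows. This is not Tika code: `parse_one` is a hypothetical stand-in for whatever per-file call is used, and the sketch only covers failures that surface as exceptions. A hang or a crash of the whole process would need a timeout or a separate worker process instead.

```python
from typing import Callable, Dict, Iterable, Tuple


def run_batch(
    paths: Iterable[str],
    parse_one: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Apply parse_one to every path; one failure must not abort the run."""
    results: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = parse_one(path)
        except Exception as exc:  # isolate the failure to this one file
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_parser(path: str) -> str:
    """Hypothetical parser used only to exercise run_batch."""
    if path.endswith(".broken"):
        raise ValueError("unparsable file")
    return "application/pdf"


results, failures = run_batch(["a.pdf", "b.broken", "c.pdf"], fake_parser)
```

With the fake parser above, `a.pdf` and `c.pdf` end up in `results` and `b.broken` in `failures`, so the whole list is processed in one pass.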

Tika RESTful service and Taverna Workflow

The SCAPE github repository provides a toolspec for creating a Tika web service (https://github.com/openplanets/scape/tree/master/xa-toolwrapper/examples).  This service can currently be used with Taverna workflows; however, Tika is controlled via the Command Line Interface (CLI), which requires a local Tika installation.  Instead, a RESTful web service could be hosted externally (away from the user's local machine, e.g. on the web) to provide an accessible and scalable service for use in Taverna workflows, something like:

  • GET Tika/rest/mime?file=<path_to_file>
  • GET Tika/rest/parse?file=<path_to_file>
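
A client for the two calls listed above might build its request URLs as in the sketch below. The host name is made up, and the API itself is still to be formulated, so the `file` query parameter and the `mime`/`parse` operation names are taken from the list above and nothing more is assumed about the service.

```python
from urllib.parse import urlencode

# Operations named in the proposed API; anything else is rejected.
OPERATIONS = ("mime", "parse")


def tika_rest_url(base_url: str, operation: str, file_path: str) -> str:
    """Build the GET URL for one of the proposed Tika REST calls."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    # urlencode percent-encodes the path so it is safe in a query string.
    query = urlencode({"file": file_path})
    return f"{base_url.rstrip('/')}/Tika/rest/{operation}?{query}"


# Hypothetical host, for illustration only.
example_url = tika_rest_url("http://tika.example.org", "mime", "/data/a b.pdf")
```

The resulting URL could then be fetched with any HTTP client, for example `urllib.request.urlopen(example_url)`.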

Source code can currently be found at https://github.com/pmay/SCAPE-Tika-REST.  It wraps the OpenPlanets Tika fork (https://github.com/openplanets/tika) in a RESTful web service built using Jersey (http://jersey.java.net/).

TODOs:

  1. Formulate and document API

Can Tika support extended/fine-grained mime types?

The Tika signature file was updated with additional mime-type XML fragments that specify the PDF version in the ‘type’ attribute and the version number in the magic match value.  Each new fragment was also specified as a sub-class-of the generic PDF type.

Minor updates were also required to the unit tests to account for the version being reported in the mime-type. 

Only PDF versioning is currently supported, so further work is needed to add versions to other mime-types.
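
To illustrate the kind of fragment described in this section, the sketch below parses a hypothetical versioned PDF entry and pulls out the three pieces mentioned: the versioned type, the generic parent type and the magic match value. The fragment is an assumption written for this example, not a copy of the one added to the signature file; in particular the `application/pdf; version=1.4` spelling and the priority value are guesses.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the type spelling and priority are assumptions.
FRAGMENT = """
<mime-type type="application/pdf; version=1.4">
  <sub-class-of type="application/pdf"/>
  <magic priority="50">
    <match value="%PDF-1.4" type="string" offset="0"/>
  </magic>
</mime-type>
"""


def describe(fragment_xml: str) -> dict:
    """Return the versioned type, its parent type and its magic value."""
    node = ET.fromstring(fragment_xml.strip())
    parent = node.find("sub-class-of")
    match = node.find("magic/match")
    return {
        "type": node.get("type"),
        "parent": parent.get("type") if parent is not None else None,
        "magic": match.get("value") if match is not None else None,
    }


summary = describe(FRAGMENT)
```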

Can Tika be extended to support regexp like Fido?

A prototype of Tika (https://github.com/openplanets/tika) that uses regular expressions for identification is up and running.  The <match> element for PDF-1.4 was modified to use the regular expression from Fido’s signature file, with one slight modification: each ‘\’ has to be escaped.

The Tika signature file is based on the Freedesktop MIME-info format (http://standards.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-latest.html).  The <match> element’s ‘type’ attribute was extended to accept the value “regex”.
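
The idea of regular-expression identification can be sketched independently of Tika as below. The patterns are simplified stand-ins written for this example, not the expressions from Fido’s signature file, and the versioned type name is an assumption.

```python
import re

# (mime type, compiled pattern) pairs, most specific first.
# Patterns are illustrative only; they are not Fido's expressions.
SIGNATURES = [
    ("application/pdf; version=1.4", re.compile(rb"%PDF-1\.4")),
    ("application/pdf", re.compile(rb"%PDF-")),
]


def identify(header: bytes) -> str:
    """Return the first type whose pattern matches at the start of header."""
    for mime_type, pattern in SIGNATURES:
        if pattern.match(header):  # match() anchors at offset 0
            return mime_type
    return "application/octet-stream"
```

For example, `identify(b"%PDF-1.4\n")` gives the versioned type, while a header starting with `%PDF-1.7` falls through to the generic `application/pdf` entry.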

Processing order of Mime-types in the Tika Signature file

Magic patterns are stored in a TreeSet, sorted in descending order according to the following four rules (in precedence order):

  1. Magic Priority
  2. Maximum total length of the Magic pattern value (and all nested patterns)
  3. String comparison of MIME type
  4. String comparison based on the constructed string "[" + priority + "/" + String_representation_of_the_matching_clauses + "]"

String_representation_of_the_matching_clauses is a String made up of the MIME type, pattern length, pattern and any mask. Where nested patterns are involved, the nested clauses are AND’d and OR’d together.

Conclusions:
  • Ordering of XML mime-type fragments in the signature file has no effect (because the TreeSet orders the patterns)
  • Higher priority fragments are matched first
  • Equal priority fragments match against the longer "match value" first (e.g. against “%PDF-1.4” match value first rather than “%PDF-”)
  • Equal priority, equal “match value” length fragments match against the lexicographically greater MIME type first (e.g. “application/rtf” before “application/pdf”)
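
The four rules and the conclusions above can be checked with a small sketch. It is a model of the ordering as described on this page, not Tika's own comparator; the `MagicEntry` class, the clause strings and the priority-60 entry are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MagicEntry:
    priority: int   # rule 1: magic priority
    length: int     # rule 2: maximum total pattern length
    mime_type: str  # rule 3: MIME type string
    clause: str     # used to build the rule 4 string

    def sort_key(self) -> Tuple[int, int, str, str]:
        return (self.priority, self.length, self.mime_type,
                f"[{self.priority}/{self.clause}]")


def matching_order(entries: List[MagicEntry]) -> List[MagicEntry]:
    """Sort in descending order, so the first entry is tried first."""
    return sorted(entries, key=MagicEntry.sort_key, reverse=True)


# Deliberately listed in an arbitrary order: file order should not matter.
ENTRIES = [
    MagicEntry(50, 5, "application/pdf", "%PDF-"),
    MagicEntry(50, 5, "application/rtf", "{\\rtf"),
    MagicEntry(60, 4, "application/x-hypothetical", "HYPO"),
    MagicEntry(50, 8, "application/pdf; version=1.4", "%PDF-1.4"),
]

ordered_types = [entry.mime_type for entry in matching_order(ENTRIES)]
```

Here the priority-60 entry sorts first, then the longer “%PDF-1.4” pattern, then “application/rtf” ahead of “application/pdf”, whatever order the entries were listed in.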