Extracting and aggregating metadata with Apache Tika


At the Glasgow Mashup, Peter May created a Python wrapper for Apache Tika. Carl Wilson extended this work, creating a Java utility class that wrapped Tika and provided simple configuration, two types of call to Tika (simple media-type identification, and full-parse metadata and text extraction), hashing, and two output formats: JSON and a simple XML format.
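
Carl's wrapper itself is a Java utility class, but the two styles of call are easy to illustrate from Python against the standard tika-app command-line jar. This is only a sketch: the jar path is an assumption and the functions are illustrative, not the wrapper's actual API.

import subprocess

TIKA_JAR = "tika-app.jar"  # assumed location of the Tika application jar

def detect_media_type(path):
    # Simple media-type identification (tika-app --detect).
    result = subprocess.run(["java", "-jar", TIKA_JAR, "--detect", path],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

def full_parse(path):
    # Full parse: JSON metadata plus extracted plain text.
    metadata = subprocess.run(["java", "-jar", TIKA_JAR, "--json", path],
                              capture_output=True, text=True, check=True).stdout
    text = subprocess.run(["java", "-jar", TIKA_JAR, "--text", path],
                          capture_output=True, text=True, check=True).stdout
    return metadata, text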

We decided to use Peter and Carl's work as the starting point for this solution, as we felt that extracting as much metadata as possible, alongside full text and some textual analysis, might lead to a good first-pass automated appraisal tool for the two collections. Tika was also a good choice because the archivists indicated that these collections were predominantly text-based documents in relatively recent formats - mostly MS Office and PDF - formats that Tika handles very well for both metadata and text extraction.

The solution was constructed in stages, starting with a first-pass metadata/text extraction of the files. Here the source files were read and a mirror directory structure created (e.g. if a file was in the original collection at /MyDocuments/InterestingWordFile.docx, then the output from the tool would be placed in a directory /META/MyDocuments/InterestingWordFile.docx/). With hindsight, differentiating the folder with a .meta extension or similar would have been sensible. This was the approach taken to metadata extraction at the Bodleian on the futureArch project, and it allows a quick mapping from metadata object to source object without any processing. It also means the source object does not have to move from its repository location, nor does the metadata extractor require write access to it.
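
A minimal sketch of that mapping, assuming the /META output root from the example path above (the function name is illustrative):

import os

META_ROOT = "/META"  # assumed output root from the example above

def meta_dir_for(source_path):
    # Echo the source path under the metadata root, keeping the file name
    # as a directory name so the mapping needs no processing to reverse.
    return os.path.join(META_ROOT, source_path.lstrip("/")) + "/"

# e.g. meta_dir_for("/MyDocuments/InterestingWordFile.docx")
#      returns "/META/MyDocuments/InterestingWordFile.docx/"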

To perform this first stage, we constructed a script that walked the directory tree and invoked the Tika wrapper on each document, saving the output as a number of files: the XML, the JSON, a text representation of the content (which worked well for all MS Office formats and PDFs, but not so well for the occasional image!), and the SHA256 hash.
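
The script itself isn't reproduced here, but its shape is roughly as follows, assuming the tika-app jar and the /META layout sketched above. The output file names are invented for illustration, and the wrapper's simple XML format is not recreated.

import hashlib
import os
import subprocess

TIKA_JAR = "tika-app.jar"   # assumed path to the Tika application jar
META_ROOT = "/META"         # assumed output root

def tika(flag, path):
    # Run tika-app with a single output flag and return what it prints.
    return subprocess.run(["java", "-jar", TIKA_JAR, flag, path],
                          capture_output=True, text=True, check=True).stdout

def sha256_of(path):
    # Hash the raw bytes of the source file.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def walk_collection(source_root):
    for dirpath, _, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            # Mirror the source path under /META and write one file per output.
            out_dir = os.path.join(META_ROOT, src.lstrip("/"))
            os.makedirs(out_dir, exist_ok=True)
            outputs = {"metadata.json": tika("--json", src),
                       "content.txt": tika("--text", src),
                       "hash.sha256": sha256_of(src) + "\n"}
            for filename, content in outputs.items():
                with open(os.path.join(out_dir, filename), "w") as out:
                    out.write(content)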

This directory structure then formed the basis for several strands of work, with Rob working on some nice Perl scripts that provided simple command line tools to enable the archivists to search for words and duplicates across all of the source files/metadata.
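
Rob's tools were written in Perl; a rough Python equivalent of the duplicate check, grouping source files by the SHA256 values recorded in the /META mirror (the hash.sha256 file name is the same assumption as in the sketch above), might look like this:

import os
from collections import defaultdict

META_ROOT = "/META"  # assumed output root

def find_duplicates():
    # Group mirrored paths by their recorded SHA256 digest.
    by_hash = defaultdict(list)
    for dirpath, _, filenames in os.walk(META_ROOT):
        if "hash.sha256" in filenames:
            with open(os.path.join(dirpath, "hash.sha256")) as f:
                digest = f.read().strip()
            # dirpath mirrors the original file's path under /META.
            by_hash[digest].append(os.path.relpath(dirpath, META_ROOT))
    # Keep only digests shared by more than one source file.
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}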

Carl took the metadata output and fed it all into Lucene, then demonstrated how the indexes could be queried using Luke to provide similar information. Lucene can also provide some very useful analysis of the collections - popular terms, for example - and it would be very interesting to investigate further what additional data could be extracted from the Lucene indexes to aid appraisal.

Pete created a further three scripts that augmented the base metadata. The first generated n-gram word clouds showing the 30 most popular terms and pairs of terms for each document and for the collection as a whole. The second created normalized full-text extractions that stripped white space, case information, etc., together with a SHA256 hash of the normalized text, to see if we could identify content that was identical even where the file hashes differed. The third generated an aggregate report of the collection, much like Peter May's solution but using a slightly different approach.
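
A sketch of the normalization and n-gram counting, assuming plain extracted text as input; the function names are illustrative rather than Pete's actual scripts:

import hashlib
import re
from collections import Counter

def normalize(text):
    # Lower-case the text and collapse all runs of white space.
    return re.sub(r"\s+", " ", text.lower()).strip()

def content_hash(text):
    # SHA256 of the normalized text, so identical content matches even
    # when the file-level hashes differ.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def word_cloud(text, top=30):
    # The `top` most common single terms and pairs of terms.
    words = normalize(text).split()
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    return unigrams.most_common(top), bigrams.most_common(top)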

Extracting and aggregating metadata with Tika

Solution Champion: Thom Carter, Rebecca Webster

Corresponding Issue(s):
Produce a report summarising collection metadata and content
Sorting, appraising and metadata creation for deposited personal collections

Tool/code link: [Link to Pete's code]

Tool Registry Link: Apache Tika

Evaluation
