Extracting and aggregating metadata with Apache Tika

Version 8 by Thom Carter
on Sep 20, 2012 13:44.

compared with
Version 9 by Peter Cliff
on Sep 21, 2012 16:45.

*Extracting and aggregating metadata with Tika*

Apache Tika was used with a custom wrapper to extract metadata (e.g. author, title, extent, dates and file formats) and content (text) from files in two large digital archive collections. A script (written in Java) was then used to produce an HTML report summarising the metadata and content across the collection. This information will be used to inform collection management decisions and identify potential preservation issues.
At the Glasgow Mashup, Peter May created a Python wrapper for Apache Tika. Carl Wilson extended this work, creating a Java utility class that wrapped Tika and provided simple configuration, two types of call to Tika (simple media-type identification, and a full parse with metadata and text extraction), hashing and two output formats: JSON and a simple XML format.

We decided to use Peter and Carl's work as the starting point for this solution because we felt that extracting as much metadata as possible, alongside full text and some textual analysis, might lead to a good first-pass automated appraisal tool for the two collections. Tika was also a good choice because the archivists indicated that these collections were predominantly text-based documents in relatively recent formats, mostly MS Office and PDF, both of which Tika handles very well for metadata and text extraction.

The solution was constructed in stages, starting with a first-pass metadata/text extraction of the files. Here the source files were read and a mirror directory structure created: e.g. if a file was in the original collection at /MyDocuments/InterestingWordFile.docx, then the output from the tool would be placed in a directory /META/MyDocuments/InterestingWordFile.docx/ (with hindsight, differentiating the folder with a .meta extension or similar would have been sensible). This was the approach taken to metadata extraction at the Bodleian on the futureArch project, and it allows a quick mapping from metadata object to source object without any processing. It also means the source object does not have to move from its repository location, nor does the metadata extractor require write access to it.
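
The mirror mapping described above can be sketched in a few lines. This is only an illustration in Python, not the code used in the project; the function names and the default roots ("/" for the collection, "/META" for the output) are assumptions taken from the example paths in the paragraph.

```python
from pathlib import PurePosixPath


def mirror_path(source_file, source_root="/", meta_root="/META"):
    """Return the output directory that mirrors source_file under meta_root.

    Hypothetical helper: the names and default roots are illustrative only.
    """
    relative = PurePosixPath(source_file).relative_to(source_root)
    return PurePosixPath(meta_root) / relative


def source_path(meta_dir, source_root="/", meta_root="/META"):
    """Map an output directory back to the source file it describes."""
    relative = PurePosixPath(meta_dir).relative_to(meta_root)
    return PurePosixPath(source_root) / relative


print(mirror_path("/MyDocuments/InterestingWordFile.docx"))
```

Because the mapping is purely a path rewrite, it works in both directions without consulting an index or opening either file.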

To perform this first stage we constructed a script that walked the directory tree and invoked the Tika wrapper on each document, saving the output as a number of files: the XML, the JSON, a text representation of the content (which worked well for all MS Office formats and PDFs but not so well for the occasional image!) and the hash (SHA256).
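
A rough sketch of such a tree walk is given below, in Python rather than the language of the original script. The Tika wrapper is not reproduced here: the `extract` argument is a stand-in for it, and `placeholder_extract`, the output file names (`content.txt`, `sha256.txt`) and the function names are all hypothetical. The sketch also assumes the output root lies outside the source tree.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def placeholder_extract(path):
    """Hypothetical stand-in for the Tika wrapper: returns {file name: text}."""
    return {"content.txt": path.read_text(encoding="utf-8", errors="replace")}


def extract_tree(source_root, meta_root, extract=placeholder_extract):
    """Walk source_root and write one mirror directory per source file.

    Only meta_root is written to; the source files are opened read-only.
    Assumes meta_root is not inside source_root.
    """
    count = 0
    for source in sorted(Path(source_root).rglob("*")):
        if not source.is_file():
            continue
        out_dir = Path(meta_root) / source.relative_to(source_root)
        out_dir.mkdir(parents=True, exist_ok=True)
        for name, body in extract(source).items():
            (out_dir / name).write_text(body, encoding="utf-8")
        (out_dir / "sha256.txt").write_text(sha256_of(source) + "\n", encoding="utf-8")
        count += 1
    return count
```

Passing the extractor in as a function keeps the walk independent of how the metadata is produced, so a real wrapper could be substituted for the placeholder.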

This directory structure then formed the basis for several strands of work, with Rob working on some nice Perl scripts that provided simple command line tools to enable the archivists to search for words and duplicates across all of the source files/metadata.
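
Rob's Perl scripts are not reproduced here. As a rough illustration of the kind of search they enabled, the following Python sketch finds extracted-text files containing a word and groups files with identical SHA256 hashes. The function names and the `content.txt` file name are assumptions, not details of the original tools.

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path


def find_word(meta_root, word, pattern="content.txt"):
    """Return extracted-text files under meta_root that contain word.

    Matching is case-insensitive and on whole words; the file name pattern
    is a hypothetical convention for where the extracted text is stored.
    """
    matcher = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for text_file in sorted(Path(meta_root).rglob(pattern)):
        if not text_file.is_file():
            continue
        if matcher.search(text_file.read_text(encoding="utf-8", errors="replace")):
            hits.append(text_file)
    return hits


def find_duplicates(root):
    """Group files under root by SHA256 and return groups with more than one file."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[hashlib.sha256(path.read_bytes()).hexdigest()].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```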

Carl took the metadata output and fed it all into Lucene and demonstrated how the indexes could be queried using Luke to provide similar information. Lucene can also provide some very useful analysis of the collections - popular terms for example - and it would be very interesting to investigate further what additional data could be extracted from the Lucene indexes to aid appraisal.

Pete created a further three scripts that augmented the base metadata. The first generated n-gram word clouds showing the 30 most popular terms and pairs of terms for each document and for the collection as a whole. The second created normalized full-text extractions that stripped white space, case information, etc., each with a SHA256 hash, to see if we could identify files where the content was identical but the file hashes differed. The third generated an aggregate report of the collection much like Peter May's solution, but using a slightly different approach.
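
The normalization and term-counting ideas can be illustrated with a short Python sketch. It is not Pete's code: the function names are hypothetical, and the exact normalization rules (here, lower-casing and collapsing runs of white space) and the tokenisation are assumptions, since the originals are not described in detail.

```python
import hashlib
import re
from collections import Counter


def normalise(text):
    """Lower-case the text and collapse every run of white space to one space."""
    return " ".join(text.lower().split())


def normalised_sha256(text):
    """SHA256 of the normalised text, so layout and case differences are ignored."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()


def top_ngrams(text, n=1, limit=30):
    """Return the most common n-word sequences as (term, count) pairs."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams).most_common(limit)


# Same words, different layout and case: the normalised hashes agree.
print(normalised_sha256("Hello   World\n") == normalised_sha256("hello world"))
```

Calling `top_ngrams` with `n=1` and `n=2` gives the single terms and pairs of terms respectively; running it over the concatenated text of all documents would give collection-level counts.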

*Solution Champion*

[~tcarter], [~wrrlibya]