View Source

*Extracting and aggregating metadata with Tika*

At the Glasgow Mashup Peter May created a Python wrapper for Apache Tika. Carl Wilson extended this work, creating a Java utility class that wrapped Tika, providing simple configuration, two types of call to Tika (simple media-type identification and full parse metadata and text extraction), hashing and two output formats - JSON and a simple XML format.

We decided to use Peter and Carl's work as the starting point for this solution as we felt that extracting as much metadata as possible, alongside full text and some textual analysis may lead to a good first pass automated appraisal tool for the two collections. Tika was also a good choice as the archivists indicated that these collections were predominantly text-based documents in relatively recent formats - mostly MS Office and PDF - both formats that Tika handles very well for both metadata and text extraction.

The solution was constructed in stages starting with a first pass metadata/text extraction of the files. Here the source files were read and a mirror directory format created (eg. if a file was in the original collection at /MyDocuments/InterestingWordFile.docx, then the output from the tool would be placed in a directory /META/MyDocuments/InterestingWordFile.docx/ - with hindsight differentiating the folder with a .meta extension or similar would have been sensible. This was the approach taken to metadata extraction at the Bodleian on the futureArch project and allows a quick mapping from metadata object to source object without any processing. It also means the source object does not have to move from its repository location nor does the metadata extractor require write access to it.

To perform this first stage we constructed a script that walked the directory tree and invoked the Tika wrapper on each document, saving the output as a number of files - the XML, the JSON, a text representation of the content (which worked well for all MS Office formats and PDFs but not so well for the occasional image\!), and the hash (SHA256).

This directory structure then formed the basis for several strands of work, with Rob working on some nice Perl scripts that provided simple command line tools to enable the archivists to search for words and duplicates across all of the source files/metadata.

Carl took the metadata output and fed it all into Lucene and demonstrated how the indexes could be queried using Luke to provide similar information. Lucene can also provide some very useful analysis of the collections - popular terms for example - and it would be very interesting to investigate further what additional data could be extracted from the Lucene indexes to aid appraisal.

Pete created a further three scripts that augmented the base metadata with n-gram word clouds (showing the 30 most popular terms and pairs of terms for each document and for the collection as a whole), created normalized full-text extractions that stripped white space, case information, etc. and a SHA256 (to see if we could identify content where the content was identical but the file hashes differed) and finally to generate an aggregate report of the collection much like Peter May's solution, but using a slightly different approach.

*Extracting and aggregating metadata with Tika*

*Solution Champion*

[~tcarter], [~wrrlibya]

*Corresponding Issue(s)*
[Produce a report summarising collection metadata and content|]
[Sorting, appraising and metadata creation for deposited personal collections|]

*Tool/code link*
[Link to Pete's code]

*[Tool Registry Link|]*
[Apache Tika|]