Extracting and aggregating metadata with Tika
At the Glasgow Mashup Peter May created a Python wrapper for Apache Tika. Carl Wilson extended this work, creating a Java utility class that wrapped Tika, providing simple configuration, two types of call to Tika (simple media-type identification and full parse metadata and text extraction), hashing and two output formats - JSON and a simple XML format.
We decided to use Peter and Carl's work as the starting point for this solution, as we felt that extracting as much metadata as possible, alongside full text and some textual analysis, might lead to a good first pass automated appraisal tool for the two collections. Tika was also a good choice as the archivists indicated that these collections were predominantly text-based documents in relatively recent formats - mostly MS Office and PDF - both formats that Tika handles very well for both metadata and text extraction.
The solution was constructed in stages, starting with a first pass metadata/text extraction of the files. Here the source files were read and a mirror directory structure created (e.g. if a file was in the original collection at /MyDocuments/InterestingWordFile.docx, then the output from the tool would be placed in a directory /META/MyDocuments/InterestingWordFile.docx/). With hindsight, differentiating the folder with a .meta extension or similar would have been sensible. This was the approach taken to metadata extraction at the Bodleian on the futureArch project and allows a quick mapping from metadata object to source object without any processing. It also means the source object does not have to move from its repository location, nor does the metadata extractor require write access to it.
To perform this first stage we constructed a script that walked the directory tree and invoked the Tika wrapper on each document, saving the output as a number of files - the XML, the JSON, a text representation of the content (which worked well for all MS Office formats and PDFs but not so well for the occasional image!), and the hash (SHA256).
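The mirror-directory walk can be sketched in a few lines of Python. This is an illustration rather than the script we actually ran: the function names and the sha256.txt file name are made up for the example, the metadata root is assumed to sit outside the source tree, and the Tika wrapper call is left as a comment.

```python
import hashlib
from pathlib import Path


def mirror_dir(source_file: Path, source_root: Path, meta_root: Path) -> Path:
    """Return the output directory that mirrors source_file under meta_root."""
    return meta_root / source_file.relative_to(source_root)


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_pass(source_root: Path, meta_root: Path) -> int:
    """Walk source_root, write one hash file per source file, return the file count."""
    count = 0
    for source_file in sorted(source_root.rglob("*")):
        if not source_file.is_file():
            continue
        out_dir = mirror_dir(source_file, source_root, meta_root)
        out_dir.mkdir(parents=True, exist_ok=True)
        # "sha256.txt" is a hypothetical file name chosen for this sketch.
        (out_dir / "sha256.txt").write_text(sha256_of(source_file) + "\n")
        # The Tika wrapper would be invoked here, saving its XML, JSON and
        # text output alongside the hash in out_dir.
        count += 1
    return count
```

Because the source files are only ever opened for reading, the extractor needs write access to the metadata root alone.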
This directory structure then formed the basis for several strands of work, with Rob working on some nice Perl scripts that provided simple command line tools to enable the archivists to search for words and duplicates across all of the source files/metadata. These usefully allowed the archivist to identify a subset of documents by file path prior to processing for duplicates or identifying files where terms occurred. These tools were great as they provided insight without complexity - for example, when searching for the term "inadequate", the report produced showed the archivist that this term predominantly appeared in the /rejected submissions/ folder, immediately confirming a suspicion the archivist had. This also showed how useful it is when the archivist (with the domain knowledge) works alongside the technical expert (with the data mining expertise), and it was good to see the "Great! Can we try...?" moments happen.
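Rob's tools were written in Perl and are not reproduced here, but the kind of report described above can be approximated with a short Python sketch. It assumes each mirrored directory holds the extracted text in a file called content.txt; that name, and the function name, are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path


def term_by_folder(meta_root: Path, term: str, text_name: str = "content.txt") -> Counter:
    """Count, per source folder, the documents whose extracted text contains term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = Counter()
    for text_file in meta_root.rglob(text_name):
        text = text_file.read_text(encoding="utf-8", errors="replace")
        if pattern.search(text):
            # text_file.parent mirrors the source file, so its parent
            # mirrors the folder that the source file lives in.
            folder = text_file.parent.parent.relative_to(meta_root)
            hits[folder.as_posix()] += 1
    return hits
```

Calling term_by_folder(meta_root, "inadequate").most_common() would list the folders with the most matching documents first, which is the sort of view that pointed at the /rejected submissions/ folder.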
Carl took the metadata output and fed it all into Lucene and demonstrated how the indexes could be queried using Luke to provide similar information. Lucene can also provide some very useful analysis of the collections - popular terms for example - and it would be very interesting to investigate further what additional data could be extracted from the Lucene indexes to aid appraisal.
Pete created a further three scripts that augmented the base metadata. The first generated n-gram word clouds (showing the 30 most popular terms and pairs of terms for each document and for the collection as a whole) using some code from the futureArch project. The second created normalized full-text extractions that stripped white space, case information, etc., along with a SHA256 of the normalized text (to see if we could identify files where the content was identical but the file hashes differed). The third generated an aggregate report of the collection, much like Peter May's solution but using a slightly different approach.
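To give a flavour of the first two scripts, here is a minimal Python sketch of n-gram frequency counting and of hashing normalized text. It is not the futureArch code: the tokenizing regular expression and the exact normalization (lower-casing and removing all white space) are assumptions made for the example.

```python
import hashlib
import re
from collections import Counter


def top_ngrams(text: str, n: int, limit: int = 30) -> list:
    """Return the `limit` most frequent n-word sequences in text, with their counts."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Zipping the word list against shifted copies of itself yields each run of n words.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams).most_common(limit)


def normalized_sha256(text: str) -> str:
    """Hash text after lower-casing it and removing all white space."""
    normalized = "".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Running top_ngrams with n=1 and n=2 gives the single terms and pairs of terms behind a word cloud, and n=3 gives the tri-grams.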
The idea behind the strands was to show how preservation/descriptive metadata could be built over time by a number of different (or new versions of the same) tools with minimum effort/protocol stack. I think we proved the point because Rob's scripts were very successful at eliciting smiles from the archivists, and Carl took the metadata and ran with it into Lucene without prior knowledge, and he did this in a few hours on the final day. He was also able to extract interesting information from the collections without access to the original documents.
Another nice thing about using a directory structure like this is that it becomes very simple to publish your collection (perhaps add an index page to each item) and can form the basis for a set of linked documents (RDF can be used to connect each resource with others in the same series/sub-series for example). Again, this was the approach taken by the futureArch project to build the collections interface.
The normalized text and checksums were an interesting experiment. I (like others) had expected this to fail to identify duplicate documents. However, at least a few duplicates were identified this way even when the file SHA256 had changed. In one case they were copies of the same file with the same file name in different directories. Another had a different file name entirely but the content was largely the same - perhaps a case of copying a document to start a different one and then not bothering?
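The comparison itself is simple enough to sketch. The Python below groups documents by the hash of their normalized text and keeps only the groups whose file hashes differ. The record layout and function name are invented for the example, and the normalization (lower-casing and removing all white space) is again an assumption.

```python
import hashlib
from collections import defaultdict


def content_duplicates(records):
    """Group paths by normalized-text hash and keep groups whose file hashes differ.

    records is an iterable of (path, file_sha256, extracted_text) tuples.
    """
    groups = defaultdict(list)
    for path, file_hash, text in records:
        normalized = "".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[key].append((path, file_hash))
    return [
        sorted(path for path, _ in members)
        for members in groups.values()
        # More than one distinct file hash means the bytes differ but the text matches.
        if len({file_hash for _, file_hash in members}) > 1
    ]
```

Note that an exact hash only matches text that is identical after normalization, so catching documents that are merely similar would need a fuzzier comparison.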
On the futureArch project we asked the archivists to verify the n-gram word clouds against a selection of documents as part of our UI user testing. The archivists suggested the results were appropriate (omitting the obviously less useful terms like "et al."!). Using the same code here on these two collections we discovered much the same thing - the unintelligent frequency counting of terms is a very effective way of identifying useful terms in a disparate collection like this. Bi-grams often result in names. Tri-grams get sketchy but can return interesting terms on occasion - "Society of Archivists" for example. In addition the terms gave archivists a better idea of the collection, revealing unexpected terms or confirming suspicions. Where terms were unexpected we were able to use Carl's Lucene index to verify that those terms were indeed popular in the documents.
Ideas for the future? It would be interesting to take this work further to apply clustering/classification to see what else we can learn about the collection. I think the aggregation report could be nicer (for example, currently it shows the files where a given author was found, but the file paths are not links). It would also be interesting to try things like passing all the authors found to a name authority web service. Never did get the chance to send items to OpenCalais but that would be a good thing to try too if the collection permitted it.
[Link to Pete's code]