
| *One line summary* | Tool to create a manifest of digital content, including format and SHA-256 digest, and index content where possible |
| *Detailed description* | Java code that currently runs as a command-line application. It uses Apache Tika to obtain the content type (MIME type) of each file. \\
Tika also gathers other metadata, depending on the file format, such as word count, page count, and authors. \\
Tika is also used to extract the text from the files where possible; see here for a list of supported Tika formats: [|]\\
The extracted metadata and text content are then used to create a document for Apache Lucene. The file name, relative path, and SHA-256 digest are also added to the Lucene document. \\
The utility then outputs a CSV file containing: \\
* a running number
* the relative path to the file (from the collection root)
* the file name
* the size of the file in bytes
* the modified date
* the SHA-256 digest
* the MIME type
* a flag indicating the status of the Tika parse (true if there was a Tika exception). \\
Finally, a brief summary is output showing the frequency of each file type within the collection. \\
An analysis of word frequency in the generated Lucene index was carried out by Andrew Jackson and is detailed here: [AQuA:Analysis of Lucene Index Word Frequency]. \\ |
| *Solution champion* | [~carlwilson-bl]\\ |
| *Git link* | [|]\\ |
| *Evaluation* | *  Winner of the second AQuA Mashup solution prize, as voted by the event participants |
| *Tool* | [Apache Tika |]\\
Apache Lucene |
| *Issue* \\ | [Unknown born-digital file history|AQuA:Unknown born-digital file history]\\ |
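
The digest and CSV steps described above can be sketched in plain Java. This is only an illustration, not the tool's actual code. The class name {{ManifestSketch}}, the method names, the CSV quoting, and the choice to write the full relative path (including the file name) in the second column are all assumptions. The MIME type and the Tika failure flag are passed in as parameters, because the real tool obtains them from Apache Tika, which is not used here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: names and column layout are assumptions, not the tool's API.
class ManifestSketch {

    /** Streams the file through SHA-256 and returns the digest as lowercase hex. */
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Wraps a value in double quotes, doubling any embedded quotes. */
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /**
     * Builds one manifest row in the column order listed above.
     * mimeType and tikaFailed are supplied by the caller (assumed to come from Tika).
     */
    static String manifestRow(int number, Path root, Path file,
                              String mimeType, boolean tikaFailed)
            throws IOException, NoSuchAlgorithmException {
        return String.join(",",
                Integer.toString(number),
                quote(root.relativize(file).toString()),
                quote(file.getFileName().toString()),
                Long.toString(Files.size(file)),
                Files.getLastModifiedTime(file).toString(),
                sha256Hex(file),
                quote(mimeType),
                Boolean.toString(tikaFailed));
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: ManifestSketch <collection-root> <file>");
            return;
        }
        // The MIME type is a placeholder here; the real tool would ask Tika.
        System.out.println(manifestRow(1, Paths.get(args[0]), Paths.get(args[1]),
                "application/octet-stream", false));
    }
}
```

Streaming the file through the digest in fixed-size chunks keeps memory use constant, which matters when a collection contains large files.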