One line summary | Tool to create a manifest of digital content, including format and SHA-256 digest, and index content where possible |
Detailed description | Java code that currently runs as a command-line application. It uses Apache Tika to obtain the content type (MIME type) of each file. Tika also gathers other metadata, depending on the file format, such as word count, page count, and authors. Where possible, Tika is also used to extract the text from the files; see http://tika.apache.org/0.9/formats.html for a list of supported Tika formats. The extracted metadata and text content are then used to create a document for Apache Lucene; the file name, relative path, and SHA-256 digest are also added to the Lucene document. The utility then outputs a CSV file containing:
|
Solution champion | |
Git link | https://github.com/openplanets/AQuA/tree/master/kcl-content-apraisal |
Evaluation |
|
Tool | Apache Tika, Apache Lucene |
Issue |
Unknown born-digital file history |
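
The manifest step from the detailed description can be illustrated with a minimal sketch that uses only the JDK. It is not the tool's actual code: the class name `ManifestSketch`, its helper methods, and the CSV column names (`relative_path`, `content_type`, `sha256`) are all assumptions made for illustration, since the real column list is not given on this page. `Files.probeContentType` stands in for Apache Tika's content type detection and is much less capable; Lucene indexing is left out entirely.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: walks a directory and prints one CSV row per file.
class ManifestSketch {

    // Lower-case hex encoding of a byte array.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // SHA-256 is required on every Java platform, so failure is unexpected.
    static MessageDigest newSha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Digest of an in-memory byte array.
    static String sha256Hex(byte[] data) {
        return toHex(newSha256().digest(data));
    }

    // Digest of a file, streamed so large files are not loaded into memory.
    static String sha256Hex(Path file) throws IOException {
        MessageDigest md = newSha256();
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading drives the digest; the bytes themselves are discarded.
            }
        }
        return toHex(md.digest());
    }

    // Quotes a field if it contains a comma, quote, or line break; null becomes empty.
    static String csvField(String value) {
        if (value == null) {
            return "";
        }
        if (value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    static String csvRow(String... fields) {
        return Stream.of(fields).map(ManifestSketch::csvField).collect(Collectors.joining(","));
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        // Assumed column names; the real tool's columns are not listed on this page.
        System.out.println(csvRow("relative_path", "content_type", "sha256"));
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        for (Path file : files) {
            // Stand-in for Tika detection; may return null when the type is unknown.
            String contentType = Files.probeContentType(file);
            System.out.println(csvRow(root.relativize(file).toString(), contentType, sha256Hex(file)));
        }
    }
}
```

Running `java ManifestSketch.java some-directory` (Java 11 or later) prints the manifest to standard output. Recording the relative path rather than the absolute one keeps the manifest valid if the content is moved as a whole.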