Characterising Externally Generated Content

One line summary: Tool to create a manifest of digital content, including format and SHA-256 digest, and to index the content where possible.
Detailed description: Java code that currently runs as a command-line application. It uses Apache Tika to obtain the content type (MIME type) of each file.
Tika also gathers other metadata, depending on the file format, such as word count, page count, and authors.
Tika is used to extract the text from the files where possible; a list of the formats Tika supports is available at: http://tika.apache.org/0.9/formats.html
The extracted metadata and text content are then used to create a document for Apache Lucene; the file name, relative path, and SHA-256 digest are also added to the Lucene document.
The utility then outputs a CSV file containing, for each file:
  • a running number
  • the relative path to the file (from the collection root)
  • a file name
  • size of the file in bytes
  • the modified date
  • the SHA-256 digest
  • the mime type
  • a flag indicating the status of the Tika parse (true if there was a Tika exception).
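The manifest columns listed above can be illustrated with a minimal sketch that uses only the Java standard library. It is not the tool's actual code: the class and method names (`ManifestSketch`, `sha256Hex`, `csvField`, `manifestRow`) are hypothetical, the MIME type and parse flag are passed in rather than obtained from Tika, and the date format and CSV quoting rules are assumptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, for illustration only.
class ManifestSketch {

    /** Streams the file through SHA-256 and returns the digest as lowercase hex. */
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading through the DigestInputStream updates the digest.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Quotes a value if it contains a comma, quote, or newline (assumed CSV convention). */
    static String csvField(String value) {
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    /**
     * Builds one manifest row in the column order listed above. The MIME type
     * and the parse flag are supplied by the caller; in the real tool they
     * would come from Tika.
     */
    static String manifestRow(int number, Path root, Path file, String mimeType,
                              boolean parseFailed)
            throws IOException, NoSuchAlgorithmException {
        return String.join(",",
                Integer.toString(number),
                csvField(root.relativize(file).toString()),   // path relative to the collection root
                csvField(file.getFileName().toString()),
                Long.toString(Files.size(file)),               // size in bytes
                Files.getLastModifiedTime(file).toString(),    // ISO 8601 timestamp (assumed format)
                sha256Hex(file),
                csvField(mimeType),
                Boolean.toString(parseFailed));                // true if the parse threw an exception
    }
}
```

Streaming the file through a `DigestInputStream` avoids loading large files into memory, which matters when a collection contains big binary objects.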
Finally, a quick summary is output showing the frequency of each file type within the collection.
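A file-type frequency summary of this kind can be sketched as a simple tally of MIME types. This is an assumption about how such a summary might be built, not the tool's own code; `TypeSummarySketch` and `countByType` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
class TypeSummarySketch {

    /** Counts how often each MIME type occurs; a TreeMap keeps the types sorted by name. */
    static Map<String, Integer> countByType(List<String> mimeTypes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String type : mimeTypes) {
            counts.merge(type, 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling `countByType` with the MIME type column of the manifest gives one entry per type, which can then be printed as the summary.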
Andrew Jackson analysed the word frequency of the generated Lucene index; the analysis is detailed in: Analysis of Lucene Index Word Frequency.
Solution champion: Carl Wilson
Git link: https://github.com/openplanets/AQuA/tree/master/kcl-content-apraisal
Evaluation
  •  Winner of the second AQuA Mashup solution prize, as voted by the event participants
Tool: Apache Tika, Apache Lucene
Issue: Unknown born-digital file history
Labels: solution, aqua, characterisation, fixity, appraisal_assessment