
h4. Distinguishing Files with Descriptive Metadata
This solution is a Java program that uses a custom Apache Tika wrapper to extract file format identification and metadata from a directory of files, then presents aggregated data showing which files have full descriptive metadata and which do not. The output is an HTML page listing each file, its format type, the descriptive keywords present in its metadata and, for text documents, a simple word cloud indicating the file's contents. Summary information is also given: the number of files per format and the number of files with full, some and no descriptive metadata.
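The summary counts could be produced along the following lines. This is only a sketch under assumptions: the CollectionSummary and FileResult names, and the use of plain strings for the format and the rating, are invented for illustration and are not taken from the tool's source.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class CollectionSummary {

    // Hypothetical per-file result: file name, detected MIME type and colour rating.
    static final class FileResult {
        final String name;
        final String mimeType;
        final String rating;

        FileResult(String name, String mimeType, String rating) {
            this.name = name;
            this.mimeType = mimeType;
            this.rating = rating;
        }
    }

    // Number of files per detected format, sorted by format name.
    static Map<String, Long> perFormat(List<FileResult> results) {
        return results.stream()
                .collect(Collectors.groupingBy(r -> r.mimeType, TreeMap::new, Collectors.counting()));
    }

    // Number of files per rating (for example "green", "orange", "red").
    static Map<String, Long> perRating(List<FileResult> results) {
        return results.stream()
                .collect(Collectors.groupingBy(r -> r.rating, TreeMap::new, Collectors.counting()));
    }
}
```

The two maps correspond to the two summaries described above: files per format and files per level of descriptive metadata.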
h5. *{_}Descriptive Metadata{_}*
Two lists of descriptive metadata keywords are maintained: one for image MIME types, the other for text and application MIME types.
For each file, based on its identified format, the appropriate "descriptive" keywords are searched for in that file's metadata output. If the metadata contains all the required keywords then the file is marked as green (for containing full descriptive metadata). If the metadata contains some of the keywords the file is marked orange (for containing some descriptive metadata). If the metadata contains none of the keywords, then the file is marked red (for containing no descriptive metadata).
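The colour rules above can be sketched roughly as follows. This is a minimal illustration rather than the tool's actual code: the class name DescriptiveMetadataRating, the method names and the placeholder keyword lists are assumptions made for the example; the real lists are the ones in the repository.

```java
import java.util.Set;

class DescriptiveMetadataRating {

    enum Rating { GREEN, ORANGE, RED }

    // Placeholder keyword lists (assumed for this example only).
    static final Set<String> IMAGE_KEYWORDS = Set.of("title", "description");
    static final Set<String> TEXT_KEYWORDS = Set.of("title", "Author");

    // Pick the keyword list from the top-level MIME type; other types get no list.
    static Set<String> keywordsFor(String mimeType) {
        if (mimeType.startsWith("image/")) {
            return IMAGE_KEYWORDS;
        }
        if (mimeType.startsWith("text/") || mimeType.startsWith("application/")) {
            return TEXT_KEYWORDS;
        }
        return Set.of();
    }

    // GREEN if every required keyword is present, ORANGE if only some are, RED if none.
    static Rating rate(String mimeType, Set<String> metadataKeys) {
        Set<String> required = keywordsFor(mimeType);
        long found = required.stream().filter(metadataKeys::contains).count();
        if (!required.isEmpty() && found == required.size()) {
            return Rating.GREEN;
        }
        return found > 0 ? Rating.ORANGE : Rating.RED;
    }
}
```

Here metadataKeys stands for the set of metadata field names extracted for one file. A format with no keyword list always rates as RED in this sketch, because there is nothing to match against.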
Instructions can be found in the README.md file in the GitHub repository.
h4. *Solution Champion*
[~pmay]
h4. *Corresponding Issue(s)*
[SPR:Metadata extraction]
h4. *Tool/code link*
[https://github.com/openplanets/SPRUCE/tree/master/TikaDescMdataAnalyser]
h4. *[Tool Registry Link|http://wiki.opf-labs.org/display/TR/Home]*
[TR:Tika]
h4. *Evaluation*
_Any notes or links on how the solution performed._
h5. *Solution Champion Evaluation*
Ultimately the tool works well at summarising the file set provided, enabling a user to see at a glance the state of descriptive metadata within a collection. Such an overview could be useful for analysing existing collections, particularly where effort estimates are needed to prepare files. It could also be useful for analysing SIPs to ensure that file providers are supplying sufficient and required metadata; if not, it may be possible to offload metadata preparation work to the submitters.
The choice of keywords for descriptive metadata is important. A file is only marked green if it contains every keyword in the keyword list for its format. In many cases, however, there were similar keywords, for example "Author" and "meta:author". Consideration should be given to whether one specific keyword must be present, whether a choice is allowable (either Author or meta:author), or whether such metadata is not required at all. This is an application-, workflow- and institution-specific requirement.
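One way to allow a choice between similar keywords is to treat each requirement as a group of acceptable alternatives. The sketch below is a suggestion only and is not how the tool currently behaves; the KeywordAlternatives class and its method names are hypothetical.

```java
import java.util.List;
import java.util.Set;

class KeywordAlternatives {

    // Each inner set is one requirement; any keyword in the set satisfies it.
    static long satisfied(List<Set<String>> requirements, Set<String> metadataKeys) {
        return requirements.stream()
                .filter(alternatives -> alternatives.stream().anyMatch(metadataKeys::contains))
                .count();
    }

    // "green" if all requirements are met, "orange" if only some are, "red" if none.
    static String rate(List<Set<String>> requirements, Set<String> metadataKeys) {
        long met = satisfied(requirements, metadataKeys);
        if (!requirements.isEmpty() && met == requirements.size()) {
            return "green";
        }
        return met > 0 ? "orange" : "red";
    }
}
```

With the requirements (Author or meta:author) and (title), a file carrying meta:author and title would rate green, whereas under the every-keyword rule it would only rate orange.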
The set of descriptive keywords, and the range of file formats they are applied to, is currently very limited. This needs deeper investigation to get the best output from the tool. At the moment, file formats beyond application/\*, text/\* and image/\* are not tested against any keyword list and will therefore appear as red; this does not necessarily mean they don't contain descriptive metadata.
Relatedly, the descriptive keywords are hard-coded into the program; ideally these should be moved out to a separate configuration file so that those with limited IT experience can use the program.
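A minimal sketch of such a configuration file, assuming a standard Java properties format with one comma-separated keyword list per top-level MIME type. The file layout, the KeywordConfig class and the key names are all assumptions for illustration; nothing like this exists in the tool yet.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;

class KeywordConfig {

    // Expected content (hypothetical), one line per top-level MIME type:
    //   image = title, description
    //   text = title, Author, meta:author
    static Map<String, Set<String>> load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Set<String>> lists = new LinkedHashMap<>();
        for (String type : props.stringPropertyNames()) {
            // Split the value on commas, trimming whitespace and dropping empty entries.
            Set<String> keywords = Arrays.stream(props.getProperty(type).split(","))
                    .map(String::trim)
                    .filter(k -> !k.isEmpty())
                    .collect(Collectors.toCollection(LinkedHashSet::new));
            lists.put(type, keywords);
        }
        return lists;
    }
}
```

The program could pass a FileReader for a file placed next to the JAR, so changing the keyword lists would only require editing a text file.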
Tika fails to identify some Word documents, returning an octet-stream format type (incidentally, the Unix file utility v5.11 returns "data"\!). This affects which keyword list is applied to those files and therefore how their metadata is assessed.
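One possible mitigation, sketched below, is to fall back to the file extension when detection only yields application/octet-stream. This is an assumption-laden illustration, not part of the tool: the FormatFallback class and its hint table are hypothetical, and an extension is a weak signal that can be wrong.

```java
import java.util.Locale;
import java.util.Map;

class FormatFallback {

    static final String OCTET_STREAM = "application/octet-stream";

    // Extension hints consulted only when detection was inconclusive (illustrative subset).
    static final Map<String, String> EXTENSION_HINTS = Map.of(
            "doc", "application/msword",
            "docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");

    // Keep the detected type unless it is octet-stream and the extension offers a hint.
    static String resolve(String detectedType, String fileName) {
        if (!OCTET_STREAM.equals(detectedType)) {
            return detectedType;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return detectedType;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSION_HINTS.getOrDefault(extension, detectedType);
    }
}
```

Files resolved this way would probably be worth flagging in the report, since the format was inferred from the name rather than identified from the content.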