Distinguishing Files with Descriptive Metadata
A Java program that uses a custom Apache Tika wrapper to extract file format identification and metadata from a directory of files, and presents aggregated data showing which files have full descriptive metadata and which do not. The output is an HTML page listing each file, its format type, the descriptive keywords present in its metadata and, for text documents, a simple word cloud giving an indication of the file's contents. Summary information is also given: the number of files per format and the number of files with full, some and no descriptive metadata.
Descriptive Metadata
Two lists of descriptive metadata keywords are maintained, one for image mime formats, the other for text or application mime formats.
For each file, based on its identified format, the appropriate "descriptive" keywords are searched for in that file's metadata output. If the metadata contains all the required keywords then the file is marked as green (for containing full descriptive metadata). If the metadata contains some of the keywords the file is marked orange (for containing some descriptive metadata). If the metadata contains none of the keywords, then the file is marked red (for containing no descriptive metadata).
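The traffic-light rule described above can be sketched in a few lines of Java. This is a minimal illustration, not the tool's actual code: the class name, the enum and the keyword lists are all hypothetical, and the real lists in the repository may differ.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the green/orange/red rule; names and keywords are assumptions.
class DescriptiveMetadataRating {

    enum Rating { GREEN, ORANGE, RED }

    // Assumed example keyword lists; the tool's real lists may differ.
    static final List<String> IMAGE_KEYWORDS = List.of("Author", "title", "description");
    static final List<String> TEXT_KEYWORDS = List.of("Author", "title", "subject", "Keywords");

    // Pick the keyword list from the identified mime type.
    static List<String> keywordsFor(String mimeType) {
        if (mimeType.startsWith("image/")) {
            return IMAGE_KEYWORDS;
        }
        if (mimeType.startsWith("text/") || mimeType.startsWith("application/")) {
            return TEXT_KEYWORDS;
        }
        return List.of(); // no list for other formats
    }

    // GREEN if every keyword is present, ORANGE if some are, RED if none are.
    static Rating rate(String mimeType, Set<String> metadataKeys) {
        List<String> required = keywordsFor(mimeType);
        long found = required.stream().filter(metadataKeys::contains).count();
        if (found == 0) {
            return Rating.RED; // also the result when no keyword list applies
        }
        return found == required.size() ? Rating.GREEN : Rating.ORANGE;
    }
}
```

For example, under these assumed lists, `DescriptiveMetadataRating.rate("image/jpeg", Set.of("Author"))` would give ORANGE, because only one of the three image keywords is present.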
Instructions can be found in the README.md file available in the GitHub repository.
Solution Champion
Corresponding Issue(s)
Tool/code link
https://github.com/openplanets/SPRUCE/tree/master/TikaDescMdataAnalyser
Tool Registry Link
Tika
Evaluation
Solution Champion Evaluation
Ultimately the tool works well at summarising the file set provided, enabling a user to see at a glance the state of descriptive metadata within a collection. Such an overview could be useful for analysing existing collections, particularly where effort estimations are needed to prepare files. It could also be useful for analysing SIPs to ensure that file providers are supplying sufficient and required metadata; if not, it may be possible to offload metadata preparation work to the submitters.
The choice of keywords for descriptive metadata is important. A file is only marked green if it contains every keyword in its format's keyword list. In many cases, however, there were similar keywords, for example "Author" and "meta:author". Consideration should be given to whether one specific keyword must be present, whether a choice is allowable (either Author or meta:author), or whether such metadata is not required. This requirement is specific to the application, workflow and institution.
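One way to allow a choice between similar keywords is to treat each requirement as a set of acceptable alternatives. The sketch below is hypothetical: the class name and the grouping are assumptions for illustration, and the "title" / "dc:title" pair is an invented example rather than something taken from the tool.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: each requirement is satisfied by any one of its alternatives.
class KeywordAlternatives {

    // Assumed example groups; an institution would define its own.
    static final List<Set<String>> REQUIREMENTS = List.of(
            Set.of("Author", "meta:author"),
            Set.of("title", "dc:title"));

    // Count how many requirements have at least one alternative present in the metadata.
    static long satisfied(List<Set<String>> requirements, Set<String> metadataKeys) {
        return requirements.stream()
                .filter(alternatives -> alternatives.stream().anyMatch(metadataKeys::contains))
                .count();
    }
}
```

A file would then be green when the count equals the number of requirements, orange when it is lower but above zero, and red when it is zero.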
The set of descriptive keywords, and the file formats they are applied to, is currently very limited. This needs deeper investigation to get the best output from the tool. At the moment, file formats beyond application/*, text/* and image/* are not tested against any keyword list and will therefore appear as red; this does not necessarily mean they don't contain descriptive metadata.
Relatedly, the descriptive keywords are hard-coded into the program. Ideally these should be moved out to a separate configuration file so that those with limited IT experience can adjust them without changing the code.
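As a rough sketch of how the keywords could be externalised, the standard java.util.Properties format would be one option. Everything here is an assumption for illustration: the class name, the file layout (one line per mime prefix, with comma-separated keywords) and the keywords themselves.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical sketch: reads lines such as "image = Author, title" into prefix -> keywords.
class KeywordConfig {

    static Map<String, List<String>> parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (String prefix : props.stringPropertyNames()) {
            List<String> keywords = new ArrayList<>();
            for (String keyword : props.getProperty(prefix).split(",")) {
                String trimmed = keyword.trim();
                if (!trimmed.isEmpty()) {
                    keywords.add(trimmed); // skip blanks left by stray commas
                }
            }
            result.put(prefix, keywords);
        }
        return result;
    }
}
```

With a hypothetical keywords.properties file, the lists could then be loaded with something like `KeywordConfig.parse(Files.readString(Path.of("keywords.properties")))` (Java 11 or later).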
Tika fails to identify some Word documents, returning an octet-stream format type (incidentally, the Unix file utility v5.11 returns "data"!). This affects which keyword list is applied to the file and therefore which metadata is selected.
For approximately 150 files (Word, PDF, text, images) the program executes in 20-30 seconds and produces an output file of about 300 KB. Consideration should be given to execution times and output file sizes for large collections. Simplistically, the tool could be run on subsections of a repository by specifying sub-directories to execute over.
The Word Cloud is passed the entire file, not just the text content of a document, resulting in strange "word" selections in the cloud. This could be fixed by generating a cloud during the initial Tika parsing of the file and storing this in the FileMetaInformation object for each file. The Word Cloud also suffers from odd character encodings within documents, resulting in odd words being displayed to the user. There is also strange word tokenisation, for example splitting words on apostrophes resulting in terms like "don" (don't) being returned.
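The tokenisation problem could be addressed with a word pattern that keeps apostrophes inside words. The following is a minimal sketch under the assumption that the input is plain text already extracted from the document (for example during the Tika parse); the class and method names are hypothetical and this is not the tool's word cloud code.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: counts word frequencies without splitting on apostrophes.
class WordCounter {

    // One or more letters, optionally joined by a straight or curly apostrophe.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:['\\u2019]\\p{L}+)*");

    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            // Lower-case so "Don't" and "don't" are counted together.
            counts.merge(matcher.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }
}
```

Because only letters and apostrophes are matched, digits and punctuation are dropped, which may or may not be desirable for a given collection.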
Images appear to contain thumbnail image data which could be used to present a thumbnail image in the HTML output.
Suggestions
Looking forward, whilst this HTML approach provided quick output for a hackathon, a fully formed (Java?) application might be more beneficial. Data could be stored in a database rather than in a single HTML file.
Alternatively, a server-side application could perform the actual work and provide a web front end for user interaction; this could work well in a cloud-based infrastructure.
One could also imagine further workflow integration, for example by allowing files to be selected from the list, with the user presented with a descriptive metadata entry page. Alternatively, multiple files could be selected and the same metadata applied to all (e.g. copyright metadata may be consistent across all). Retention of the original files should be considered in this context.
The descriptive metadata values could also be analysed to ensure that any necessary format or structure is followed, for example author names may be required to be listed in "Surname, given name" format.
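A format check of this kind could be sketched with a regular expression. The rule below is only an assumed reading of "Surname, given name" (letters, apostrophes, hyphens and spaces, with a comma and a space between the two parts); the class name is hypothetical and real cataloguing rules are likely to need more cases.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: checks an author value against an assumed "Surname, Given name" rule.
class AuthorNameCheck {

    // Surname part, then ", ", then given name(s); initials with full stops are allowed.
    private static final Pattern SURNAME_FIRST =
            Pattern.compile("\\p{L}[\\p{L}'\\- ]*, \\p{L}[\\p{L}'\\-. ]*");

    static boolean isSurnameFirst(String value) {
        return value != null && SURNAME_FIRST.matcher(value.trim()).matches();
    }
}
```

Under this rule "Smith, John" would pass and "John Smith" would be flagged for review.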