freqy - word clouds for directories
Detailed description
A recurring issue in SPRUCE mashups is how to go about cataloging a large set of unknown files. freqy is one way to help. It started life as a word cloud library (based simply on word frequencies in documents) at the Bodleian Library and was developed further at the last London event; see:
Extracting and aggregating metadata with Apache Tika
The problem arose again this time around. After discussion with the practitioner, and picking up on a general desire to simplify tools, freqy was born. It is a simple tool that, given a directory, uses Tika to extract text from every file it finds in that directory or any of its sub-directories (so the supported formats are those that Tika understands). It then counts n-grams (1-, 2- or 3-grams) and creates a report of the 30 most commonly occurring words, pairs or triplets.
Subsequent development has also added an easy-to-use GUI.
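The n-gram counting step described above can be illustrated with a short sketch. This is not freqy's own code (freqy uses Tika for text extraction). It is a minimal Python illustration that assumes the text has already been extracted from each file and is passed in as plain strings. The function names `tokenize`, `ngrams` and `top_ngrams`, and the simple tokenization rule, are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a token is a run of letters, digits or apostrophes.
    freqy's real tokenization may differ.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as one space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def top_ngrams(texts, n=1, limit=30):
    """Count n-grams across all texts and return the `limit` most common.

    n-grams are counted per text, so they never span two documents.
    """
    counts = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(limit)


if __name__ == "__main__":
    # Stand-in for text extracted from two files in a directory.
    sample = ["the cat sat on the mat", "the cat ran"]
    for gram, count in top_ngrams(sample, n=2, limit=3):
        print(count, gram)
```

Calling `top_ngrams(texts, n=1)` gives the single-word report, while `n=2` and `n=3` give the pair and triplet reports. The default `limit=30` mirrors the report size mentioned above.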
Solution Champion
Corresponding Issue(s)
Simple preservation actions with few IT resources
Tool/code link
https://github.com/petecliff/freqy
Binary here: http://wiki.opf-labs.org/pages/viewpageattachments.action?pageId=28868640&sortBy=date&highlight=freqy.zip&
Tool Registry Link
Add an entry to the OPF Tool Registry, and provide a link to it here.
Evaluation
Any notes or links on how the solution performed.