freqy - word clouds for directories

Detailed description
A recurring issue in SPRUCE mashups is how anyone, when presented with a load of unknown files, should go about cataloging them. freqy is one way to help. It started life as a word cloud library (based simply on word frequencies in documents) at the Bodleian Library, and was developed further at the last London event; see:

Extracting and aggregating metadata with Apache Tika

The problem arose again this time around. From discussion with the practitioner, and picking up on a general desire to simplify tools, freqy was born. It is a simple tool that, given a directory, uses Tika to extract text from every file it finds in that directory and its sub-directories (so the supported formats are those that Tika understands). It then counts n-grams (1-, 2- or 3-grams) and creates a report of the 30 most commonly occurring words, pairs or triplets.
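The counting step can be illustrated with a minimal Python sketch. This is not freqy's own code: the function names (`tokenize`, `count_ngrams`, `top_ngrams`) and the tokenization rule are assumptions made for illustration, and the sketch takes already-extracted text as input rather than calling Tika.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a "word" is a lowercase run of letters, digits or apostrophes.
    # freqy's real tokenization rule is not described in the source.
    return re.findall(r"[a-z0-9']+", text.lower())


def count_ngrams(texts, n):
    # Count n-grams across a collection of extracted document texts.
    # n-grams never span two documents, because each text is tokenized separately.
    counts = Counter()
    for text in texts:
        words = tokenize(text)
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def top_ngrams(texts, n=1, limit=30):
    # Return the `limit` most common n-grams as (ngram, count) pairs.
    if n not in (1, 2, 3):
        raise ValueError("n must be 1, 2 or 3")
    return count_ngrams(texts, n).most_common(limit)


if __name__ == "__main__":
    # Hypothetical stand-ins for text that Tika would extract from files.
    docs = ["the cat sat on the mat", "the cat ran away"]
    for ngram, count in top_ngrams(docs, n=2, limit=5):
        print(count, ngram)
```

In a real run, the `docs` list would be filled by walking the directory tree and passing each file through Tika's text extraction; the report is then simply the list returned by `top_ngrams`.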

Subsequent development has also added an easy-to-use GUI.

Solution Champion

Peter Cliff

Corresponding Issue(s)

Simple preservation actions with few IT resources

Tool/code link

