freqy - word clouds for directories

Detailed description
A recurring issue in SPRUCE mashups is this: when presented with a large set of unknown files, how does anyone go about cataloguing them? freqy is one way to help. It started life as a word cloud library (using only word frequencies in documents) at the Bodleian Library. It was developed further at the last London event; see:

Extracting and aggregating metadata with Apache Tika

The problem arose again this time around. After discussion with the practitioner, and picking up on a general desire to simplify tools, freqy was born. It is a simple tool that, given a directory, uses Tika to extract text from every file it finds in that directory or any of its sub-directories (so the supported formats are those that Tika understands). It then counts n-grams (1-, 2- or 3-grams) and creates a report of the 30 most commonly occurring words, pairs or triplets.
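The counting step described above can be illustrated with a minimal Python sketch. This is not freqy's own code: it assumes the text has already been extracted (the Tika step is left out), and the function names, the tokenisation rule and the choice not to let n-grams span two documents are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into tokens of letters, digits and apostrophes.

    This tokenisation rule is an assumption; freqy may split words differently.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def count_ngrams(texts, n=1):
    """Count n-grams across an iterable of already-extracted text strings."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Slide a window of width n over the tokens of one document at a time,
        # so an n-gram never spans two documents.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def top_ngrams(texts, n=1, limit=30):
    """Return the `limit` most frequent n-grams as (ngram, count) pairs."""
    if n not in (1, 2, 3):
        raise ValueError("n must be 1, 2 or 3")
    return count_ngrams(texts, n).most_common(limit)
```

Calling `top_ngrams(texts, n=2)` on a list of extracted document texts would give the 30 most frequent word pairs, which is the kind of data the report summarises.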

Subsequent development has also added an easy-to-use GUI.

Solution Champion

Peter Cliff

Corresponding Issue(s)

Simple preservation actions with few IT resources

Tool/code link

https://github.com/petecliff/freqy

Binary here: http://wiki.opf-labs.org/pages/viewpageattachments.action?pageId=28868640&sortBy=date&highlight=freqy.zip&

Tool Registry Link
Add an entry to the OPF Tool Registry, and provide a link to it here.

Evaluation
Any notes or links on how the solution performed.

Labels: spruce_london_2, solution, keyword, frequency, characterisation, cataloguing