One line summary | Create a word frequency list from a Lucene index and try to ascertain the subject matter of the collection that the index was created against. |
Detailed description | The solution for Characterising Externally Generated Content produced a Lucene index of the collection content. A small piece of Java code was developed to scan the terms in the text content field of the Lucene documents (the metadata was not trawled) and to list each term with its frequency (the number of times it occurred in the index). The initial results were disappointing: Lucene indexed every word, so the most frequent terms were simply those that occur commonly in plain English. The General Service List http://jbauman.com/aboutgsl.html |
Solution champion | |
Git link | The analysis results (spreadsheets, CSV files, Lucene index, etc.) have been checked into Git here: https://github.com/openplanets/AQuA/tree/master/word-freq-compare |
Evaluation | |
Tool | Excel |
Issue |
Unknown born-digital file history |
Word frequency clouds
Absolute word frequencies |
Word frequencies relative to common English usage |
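
The counting and filtering idea in the detailed description can be sketched roughly as follows. This is a minimal illustration, not the code checked into the repository. It assumes the text has already been extracted as plain strings, so it does not use the Lucene API at all. The class name `WordFrequencySketch`, the methods `countTerms` and `topTerms`, and the tiny `commonWords` set are all hypothetical. In practice the common-word set would be loaded from a list such as the General Service List.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: counts words in plain strings rather than reading a Lucene index.
class WordFrequencySketch {

    // Count how often each lower-cased word occurs across all texts.
    static Map<String, Integer> countTerms(List<String> texts) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : texts) {
            // Split on any run of non-letter characters.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Drop words found in commonWords, then return up to `limit` entries
    // sorted by descending count (ties broken alphabetically).
    static List<Map.Entry<String, Integer>> topTerms(
            Map<String, Integer> counts, Set<String> commonWords, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (!commonWords.contains(e.getKey())) {
                entries.add(e);
            }
        }
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(limit, entries.size())));
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the indexed content field.
        List<String> texts = Arrays.asList(
                "The archive holds the scanned parish records",
                "The parish records were scanned in the archive");
        // Assumption: a tiny stand-in for a real common-English word list.
        Set<String> commonWords = new HashSet<>(Arrays.asList("the", "in", "were"));

        for (Map.Entry<String, Integer> e : topTerms(countTerms(texts), commonWords, 3)) {
            System.out.println(e.getKey() + "," + e.getValue());
        }
    }
}
```

Printing the terms as `term,count` lines keeps the output easy to open in a spreadsheet such as Excel, which is the tool listed above.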