| *One line summary* | Create a word frequency list from a Lucene index and use it to try to ascertain the subject matter of the collection from which the index was built. |
| *Detailed description* | The solution for [AQuA:Characterising Externally Generated Content] generated a Lucene index of the collection content. A small piece of Java code was developed to scan through the terms in the text content field of the Lucene documents (the metadata fields were not scanned). It produced a list of the terms in the index together with the frequency of each term (the number of times it occurred in the index). \\
The initial results were disappointing: Lucene had indexed every word, so the most frequent terms were simply the words that occur most commonly in plain English. \\
The [General Service List|http://jbauman.com/aboutgsl.html] (GSL) is a list of commonly occurring words deemed most useful to people learning English, together with their frequencies. Andrew Jackson used this list to determine how much more frequently each word was used in the Lucene index than in "common English", as defined by the GSL. Words that are heavily over-represented relative to the GSL are the ones most likely to hint at the subject matter of the collection. |
| *Solution champion* | [~anjackson] |
| *Git link* | The analysis results (spreadsheets, CSV files, Lucene index, etc.) have been checked into Git here: [https://github.com/openplanets/AQuA/tree/master/word-freq-compare|https://github.com/openplanets/AQuA/tree/master/word-freq-compare] |
| *Evaluation* | |
| *Tool* | Excel |
| *Issue* | [Unknown born-digital file history|AQuA:Unknown born-digital file history] |
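
The comparison against the General Service List described in the table above can be illustrated with a short sketch. This is not the code or spreadsheet used for the analysis (the original comparison was done in Excel), and the exact formula used there is not recorded on this page. The sketch assumes that each word's share of the collection is divided by its share of the reference list, and that words missing from the reference list are given a floor count of 1 so that the ratio stays finite. The function names `relative_frequencies` and `top_terms`, the `min_reference` parameter, and the default of 50 terms are all hypothetical.

```python
def relative_frequencies(sample_counts, reference_counts, min_reference=1):
    """Return {term: ratio} comparing a collection with a reference word list.

    ratio = (term's share of the collection) / (term's share of the reference).
    Assumption: terms absent from the reference list are treated as if they
    occurred `min_reference` times, which avoids division by zero.
    """
    sample_total = sum(sample_counts.values())
    reference_total = sum(reference_counts.values())
    ratios = {}
    for term, count in sample_counts.items():
        sample_share = count / sample_total
        reference_count = max(reference_counts.get(term, 0), min_reference)
        reference_share = reference_count / reference_total
        ratios[term] = sample_share / reference_share
    return ratios


def top_terms(ratios, n=50):
    """Return the n terms with the highest ratio, most over-represented first."""
    return sorted(ratios, key=ratios.get, reverse=True)[:n]


# Invented toy counts, for illustration only (not real GSL or index data).
index_counts = {"the": 500, "of": 300, "archive": 120, "metadata": 80}
common_english = {"the": 7000, "of": 3500, "archive": 5}

ranked = top_terms(relative_frequencies(index_counts, common_english), n=2)
print(ranked)
```

With these invented counts, words such as "the" and "of" end up with ratios near 1, while words that are rare or absent in the reference list rise to the top of the ranking.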

h2. Word frequency clouds

{section}
{column}

h3. Absolute word frequencies

!sampleTop50sampleWeights.png|align=center,border=1,width=300!
{column}
{column}

h3. Word frequencies relative to common English usage

!sampleTop50relativeWeights.png|align=center,border=1,width=300!
{column}
{section}

Of course, a lovely next step would be to link each word in the clouds to the corresponding search results, so that the context in which the word is used can be explored.
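
As a rough illustration of that next step, the sketch below pulls out short snippets of text around each occurrence of a word. It works on a plain string rather than on the Lucene index, so it only stands in for a real search against the index. The function name `keyword_in_context` and the `width` parameter are hypothetical.

```python
import re


def keyword_in_context(text, word, width=30):
    """Return a snippet of surrounding text for each whole-word match of `word`.

    Matching is case-insensitive; `width` is the number of characters kept
    on each side of the match.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end].strip())
    return snippets


# Invented sample text, for illustration only.
sample_text = "The archive holds records. Each archive record has metadata."
for snippet in keyword_in_context(sample_text, "archive", width=10):
    print(snippet)
```

Each word in a frequency cloud could then link to a page listing snippets like these, or to the equivalent query against the index.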