For approximately 150 files (Word, PDF, text, and images), the program runs in 20-30 seconds and produces a ~300 KB output file. Execution time and output file size should be considered for large collections. As a simple mitigation, the tool could be run on subsections of a repository by specifying the sub-directories to process.
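As an illustration of running over subsections, the sketch below splits a repository into one batch per top-level sub-directory, so that each batch could be handed to the tool separately. The function name `subdirectory_batches` is hypothetical, and how the tool actually accepts a sub-directory is not shown here; this only demonstrates the partitioning step.

```python
from pathlib import Path


def subdirectory_batches(root):
    """Yield (directory, files) pairs, one per top-level sub-directory.

    Hypothetical helper: each batch is a candidate for a separate run
    of the tool, which keeps run time and output size per run smaller.
    """
    root = Path(root)

    # Files sitting directly in the root form their own batch.
    top_level = sorted(p for p in root.iterdir() if p.is_file())
    if top_level:
        yield root, top_level

    # Every top-level sub-directory is walked recursively.
    for child in sorted(p for p in root.iterdir() if p.is_dir()):
        yield child, sorted(p for p in child.rglob("*") if p.is_file())
```

Each yielded directory could then be passed to the tool in turn, producing one output file per sub-directory rather than a single large one.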
The Word Cloud is passed the entire file rather than just the text content of the document, which results in strange "word" selections in the cloud. This could be fixed by generating the cloud during the initial Tika parsing of the file and storing it in the FileMetaInformation object for each file. The Word Cloud also suffers from odd character encodings within documents, so odd words are displayed to the user. Word tokenisation is also strange: for example, words are split on apostrophes, so terms like "don" (from "don't") are returned.
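The tokenisation problem can be illustrated with a minimal sketch, assuming the word counts are built from the extracted text rather than the raw file. This is not the tool's actual tokeniser; the names `tokenize` and `word_counts` and the `min_length` parameter are hypothetical. It keeps apostrophes inside words and applies Unicode normalisation to reduce some encoding oddities.

```python
import re
import unicodedata
from collections import Counter

# Runs of letters, optionally joined by a straight or curly apostrophe,
# so that "don't" stays one token instead of becoming "don" and "t".
_WORD = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")


def tokenize(text):
    """Split extracted text into lower-case word tokens."""
    # NFKC folds compatibility forms (for example ligatures) to plain letters.
    text = unicodedata.normalize("NFKC", text)
    return [
        m.group(0).replace("\u2019", "'").lower()
        for m in _WORD.finditer(text)
    ]


def word_counts(text, min_length=3):
    """Count tokens, ignoring very short ones (threshold is an assumption)."""
    return Counter(t for t in tokenize(text) if len(t) >= min_length)
```

The same regular expression idea would carry over to a tokeniser in the tool's own language; the key design choice is treating an apostrophe between letters as part of the word.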
Images appear to contain embedded thumbnail data, which could be used to present a thumbnail image in the HTML output.
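One way the thumbnail could be presented is as an inline data URI, which avoids writing separate image files next to the HTML report. The sketch below assumes the thumbnail bytes have already been extracted from the image metadata and are JPEG-encoded; both are assumptions, and the function name `thumbnail_img_tag` is hypothetical.

```python
import base64
import html


def thumbnail_img_tag(thumbnail_bytes, alt_text, mime_type="image/jpeg"):
    """Build an <img> element that embeds the thumbnail as a data URI.

    Assumes thumbnail_bytes holds already-extracted thumbnail data and
    that mime_type matches its real encoding.
    """
    encoded = base64.b64encode(thumbnail_bytes).decode("ascii")
    # Escape the alt text so file names cannot break the markup.
    safe_alt = html.escape(alt_text, quote=True)
    return f'<img src="data:{mime_type};base64,{encoded}" alt="{safe_alt}">'
```

Embedding thumbnails inline would increase the size of the HTML output, so this trades file size for a self-contained report.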