AQDC - Document Compare

Version 1 by Peter Cliff
on Jun 17, 2011 12:47.

compared with
Current by Paul Wheatley
on May 16, 2012 16:07.

| *One line summary* | A tool that uses Apache Tika to parse and compare documents. \\ |
| *Detailed description* | AQDC is a Spring MVC-based Web application that wraps Apache Tika to provide a quick analysis of two documents (typically the original and its migrated version). \\
\\
A simple Web form contains two file fields (original and migrated file). On submission, both files are uploaded to the Web app and parsed by Tika's "AutoDetectParser". \\
There is no error checking, so don't be surprised to see 500 errors from time to time\! \:-) \\
\\
The parser is hooked into a couple of content handlers - notably text and XHTML generators - along with some basic (and usually wrong\! \:-)) language identification. \\
\\
Armed with the extracted text, the Web app then performs a few checks: \\
* Normalise the full text of each document and see if the results match
* Run the text through a word-frequency analyser and see if the most popular words match
* Generate a tag cloud for each document
* Check word counts\\
Hopefully, on using this, it'll become clear why just checking for document characteristics does not give a clear indication as to the success or otherwise of the migration. \\ |
| *Solution champion* | Pete Cliff |
| *Solution champion* | [~pxuxp]\\ |
| *Git link* | [https://github.com/openplanets/AQuA/tree/master/aqDocCompare] |
| *Evaluation* | * Interesting comparative results of extraction using different tools (please document above\!) - different tools extract metadata fields differently ... More detail to follow
* Visualisation approach, excellent\!
** Don't need to set spec in advance and then see what conformance is - much more flexible/agile
** Works for a non specialist |
| *Tool* (link) | There is a self-contained (Jetty) server that can be downloaded from GitHub and run on your own machine. \\ |
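
The text checks listed in the detailed description can be sketched in plain Java. This is a minimal sketch under stated assumptions, not AQDC's actual code: the class name {{TextChecks}} and all of its methods are hypothetical, the normalisation rule (lower-casing and collapsing whitespace) is an assumed one, and the input is taken to be text that Tika has already extracted from each document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only; not taken from the AQDC source.
final class TextChecks {

    // Assumed normalisation: lower-case, trim, collapse whitespace runs to one space.
    static String normalise(String text) {
        return text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
    }

    // Split the normalised text into words on anything that is not a letter or digit.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : normalise(text).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Word -> number of occurrences; the same map could feed a tag cloud.
    static Map<String, Integer> frequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words(text)) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // The n most frequent words; ties are broken alphabetically so the result is stable.
    static List<String> topWords(String text, int n) {
        Map<String, Integer> counts = frequencies(text);
        List<String> keys = new ArrayList<>(counts.keySet());
        keys.sort(Comparator.comparing((String k) -> -counts.get(k))
                .thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(keys.subList(0, Math.min(n, keys.size())));
    }

    public static void main(String[] args) {
        // Stand-ins for the text extracted from the original and migrated files.
        String original = "The quick brown fox.\nThe lazy dog.";
        String migrated = "The quick  brown fox. The lazy dog!";

        System.out.println("Full text matches:  "
                + normalise(original).equals(normalise(migrated)));
        System.out.println("Top words match:    "
                + topWords(original, 3).equals(topWords(migrated, 3)));
        System.out.println("Word counts match:  "
                + (words(original).size() == words(migrated).size()));
    }
}
```

In this sample the two texts differ only in whitespace and one punctuation mark, so the word-level checks agree while the full-text comparison does not. That is the kind of disagreement between checks that the description above warns about.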