AQDC - Document Compare

One line summary Tool that uses Apache Tika to parse and compare documents.
Detailed description AQDC is a Spring MVC Framework based Web application that wraps Apache Tika to provide a quick analysis of two documents (typically the original and its migration).

There is a simple Web form containing two fields (original and migrated file). On submit these files are uploaded to the Web app and parsed by Tika's "AutoDetectParser".
There is no error checking so don't be surprised to see 500 errors from time to time! :-)

The parser is hooked into a couple of content handlers - notably text and XHTML generators - along with some basic (and usually wrong! :-)) language identification.

Armed with the extracted text, the Web app then performs a few checks:
  • Normalise the full text of each document and see if the two match
  • Run the text through a word-frequency analyser and see if the most popular words match
  • Generate a tag cloud for each document
  • Check word counts
    Hopefully, on using this, it'll become clear why just checking document characteristics does not give a clear indication as to the success or otherwise of the migration.
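The text-based checks above can be sketched in plain Java. This is a minimal illustration, not the AQDC code: the class and method names (DocCompareSketch, normalise, words, wordFrequencies, topWords) are hypothetical, and the normalisation rule (lower-case, keep only letters and digits) is an assumption, since the page does not say how AQDC normalises text. The tag cloud is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; names and rules are assumptions, not taken from AQDC.
class DocCompareSketch {

    // Lower-case the text and replace each run of characters that are not
    // letters or digits with a single space (assumed normalisation rule).
    static String normalise(String text) {
        return text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
    }

    // Split the normalised text into words.
    static List<String> words(String text) {
        String n = normalise(text);
        return n.isEmpty() ? Collections.<String>emptyList() : Arrays.asList(n.split(" "));
    }

    // Count how often each word occurs.
    static Map<String, Integer> wordFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words(text)) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // The n most frequent words; ties are broken alphabetically.
    static List<String> topWords(String text, int n) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(wordFrequencies(text).entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Stand-ins for the text extracted from the original and migrated files.
        String original = "The quick brown fox. The lazy dog.";
        String migrated = "the quick  brown fox the lazy dog";

        System.out.println("Normalised text matches: "
                + normalise(original).equals(normalise(migrated)));
        System.out.println("Top words match: "
                + topWords(original, 3).equals(topWords(migrated, 3)));
        System.out.println("Word counts match: "
                + (words(original).size() == words(migrated).size()));
    }
}
```

All three checks pass for this pair even though punctuation, case and spacing differ, which is the sort of result that shows why matching document characteristics alone says little about whether a migration succeeded.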
Solution champion Peter Cliff
Git link https://github.com/openplanets/AQuA/tree/master/aqDocCompare
Evaluation
  • Interesting comparative results of extraction using different tools (please document above!) - different tools extract metadata fields differently ... More detail to follow
  • Visualisation approach, excellent!
  • Collection owner likes ability to quickly check consistency across a large collection
    • Don't need to set spec in advance and then see what conformance is - much more flexible/agile
    • Works for a non specialist
Tool (link) There is a self-contained (Jetty) server that can be downloaded from GitHub and run on your own machine.
Labels: aqua, solution, quality_assurance