Born-digital - migration success

One line summary Checking whether an automated normalisation produces a surrogate of sufficient quality ...
Detailed description "Sufficient" obviously needs to be defined in terms of the significant properties relevant to the context, but are there checks that can be run to determine whether the migration was successful in terms of some minimum, generally accepted quality criteria?

Initial criteria based on workflow tool suggestions: http://archivematica.org/wiki/index.php?title=Significant_characteristics_of_word_processing_files
  • page count
  • word count
  • character count
  • paragraph count
  • line count
  • table count
  • graphics count
  • language
  • fonts
  • features


    Post-solution note:

    Some of these characteristics (success metrics, if you like) are easy to test and can be extracted from document files quite readily, for example by using Apache POI to interrogate MS Office documents. The difficulty comes when attempting to compare characteristics that are readily available from structured formats but quickly disappear in unstructured formats, such as the PDF migrations used for access.

    In some cases simply counting words is insufficient too, because migration tools may (quite deliberately) add headers or footers noting that this is a migration, giving a copyright statement, etc. Here a migration could fail to match on characteristics and yet remain a good migration for access. (There is always this underlying tension between migration for preservation and migration for access. You have to be strict with the former and can worry less with the latter, and yet it would be nice to use the same tools for both!)
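
    One way to soften the word-count check so that a deliberately added header or footer does not count as a failure is to ask only whether any of the original's words have gone missing. The sketch below is illustrative rather than something we built: the function names are hypothetical, and it assumes plain text has already been extracted from both files by some other tool.

```python
import re
from collections import Counter


def word_counts(text):
    """Return a multiset of the lower-cased words found in text."""
    return Counter(re.findall(r"[A-Za-z0-9']+", text.lower()))


def missing_words(original_text, migrated_text):
    """Words (with counts) that the original has and the migration lacks.

    Counter subtraction keeps only positive counts, so extra words in the
    migration (an added footer, say) do not show up here.
    """
    return word_counts(original_text) - word_counts(migrated_text)


def added_words(original_text, migrated_text):
    """Words (with counts) that appear only, or more often, in the migration."""
    return word_counts(migrated_text) - word_counts(original_text)


if __name__ == "__main__":
    original = "The quick brown fox"
    migrated = "The quick brown fox. Migrated copy for access"
    print("missing:", dict(missing_words(original, migrated)))
    print("added:", dict(added_words(original, migrated)))
```

    An empty "missing" result with a small "added" result would suggest an acceptable access copy. For a preservation migration you would probably want both to be empty.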

    In my experience converting born-digital materials for access, the issue is not whether all the words are there, but whether they are in the right place. That is to say, have the headings moved, have the tables become too wide, etc.? We thought we could perhaps assess migrations this way by converting the original and the migration to an image format and using something like PerceptualDiff to check whether the two documents have text, images, etc. in the same places. (With hindsight I wonder now if OCR software could also be used for this?)
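
    To make the "same places" idea concrete, here is a very crude stand-in for a perceptual comparison. It is not how PerceptualDiff works; it simply divides each page into a grid and compares how much "ink" falls in each cell. It assumes both pages have already been rasterised into rows of greyscale values (0 = black, 255 = white), and the function names, grid size and tolerance are all illustrative choices.

```python
def ink_grid(page, rows=4, cols=4, threshold=128):
    """Fraction of dark pixels in each cell of a rows x cols grid.

    page is a list of rows, each a list of greyscale values (0-255).
    """
    height, width = len(page), len(page[0])
    grid = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            total = (y1 - y0) * (x1 - x0)
            dark = sum(
                1
                for y in range(y0, y1)
                for x in range(x0, x1)
                if page[y][x] < threshold
            )
            grid[r][c] = dark / total if total else 0.0
    return grid


def moved_cells(original_page, migrated_page, rows=4, cols=4, tolerance=0.1):
    """Grid cells whose ink fraction differs by more than tolerance."""
    a = ink_grid(original_page, rows, cols)
    b = ink_grid(migrated_page, rows, cols)
    return [
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if abs(a[r][c] - b[r][c]) > tolerance
    ]


if __name__ == "__main__":
    blank = [[255] * 8 for _ in range(8)]
    original = [row[:] for row in blank]
    migrated = [row[:] for row in blank]
    for y in (0, 1):
        for x in (0, 1):
            original[y][x] = 0          # dark block top-left
            migrated[y + 6][x + 6] = 0  # same block, now bottom-right
    print(moved_cells(original, migrated))
```

    Because each cell is reduced to a fraction, the two renderings do not need to have identical pixel dimensions, although very different page sizes or margins would still show up as differences.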

    However, we ran into problems with this approach because, to get the Word document and its PDF migration into a comparable format, it was often necessary to convert both to a jpeg or similar. There is (commercial) jpeg printer software and there are also Word to image converters (also proprietary). I could not use these because I didn't want to buy a licence just for a mashup! :-) So how then could I get both a Word document and a PDF into a jpeg to compare them?

    A Google search suggested that to get Word to jpeg I had a couple of choices. Firstly I discovered one person suggesting doing a Print-Screen of the document when open in Word (no, really!). The second resource suggested creating a PDF from the Word document and converting that to a jpeg!

    This gets to the crux of the problem. Once a document has been migrated to a new format, there is very little to use as a success metric. It may be possible to test a Word 98 to Word 2010 migration (a preservation action) by counting the tables, but it would not be possible to test a Word 98 to PDF/A migration in the same way, as the PDF has discarded that information.

    Using an intermediary format is also unsatisfactory: if you take the original and the migration and convert both to a further format, all you are testing is the closeness (or otherwise) of two further migrations to each other, not the closeness of the original to the migration.


    Tricky issue and we didn't find a resolution to it. This led us to consider images instead! :-)
Issue champion Pete Cliff
Possible approaches
  • convert original and migrated file to text and compare word counts, indentations (space counts) - run statistical analysis for similarity
  • render pages to images and compare using Perceptual Image Diff?
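
The first approach above could be sketched roughly as follows. This is only an illustration, assuming text has already been extracted from both files with layout (leading spaces) preserved; the function names are hypothetical, and the "statistical analysis" is stood in for by the similarity ratio from Python's difflib.

```python
import difflib


def indent_profile(text):
    """Leading-space count of every non-blank line."""
    return [
        len(line) - len(line.lstrip(" "))
        for line in text.splitlines()
        if line.strip()
    ]


def similarity(a, b):
    """Similarity ratio between two sequences, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def compare_text(original_text, migrated_text):
    """Compare word counts, word order and indentation of two texts."""
    original_words = original_text.split()
    migrated_words = migrated_text.split()
    return {
        "original_words": len(original_words),
        "migrated_words": len(migrated_words),
        "word_similarity": similarity(original_words, migrated_words),
        "indent_similarity": similarity(
            indent_profile(original_text), indent_profile(migrated_text)
        ),
    }


if __name__ == "__main__":
    original = "Title\n    An indented line\nBody text"
    migrated = "Title\nAn indented line\nBody text"
    for name, value in compare_text(original, migrated).items():
        print(name, value)
```

What counts as "similar enough" is a threshold that would have to be chosen per collection, and probably set more strictly for preservation than for access.
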
Context  
AQuA Solutions http://wiki.opf-labs.org/display/AQuA/AQDC+-+Document+Compare
Collections Outputs from born-digital ingest workflow
Labels: qa, comparison, characterise, office, pdf, issue