Checking that significant properties are preserved after migration

Title
Checking that significant properties are preserved after migration
Detailed description After a file migration (from pdf to pdf/a, doc to docx, doc to pdf/a) we should check that the conversion has been successful and that the significant properties of the object are maintained. We do not do this consistently at present. We may check a handful of files after a batch process, but that means we are likely to miss the one conversion that has not been successful. It would be great to have a tool that could open the two documents (the original and the migrated file), compare several quantifiable metrics (for example word count, page count, number of images, paragraph count, or anything else measurable) and report on those conversions where the numbers don't match up. These could then be assessed by eye individually and re-migrated if necessary.
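The comparison and reporting step described above could be sketched roughly as follows. This is only an illustration in Python, not an existing tool: the function names `compare_metrics` and `report_mismatches` and the metric keys are hypothetical, and the sketch assumes the metrics for each file have already been extracted by some other means.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Metrics = Dict[str, int]
Mismatch = Tuple[Optional[int], Optional[int]]


def compare_metrics(original: Metrics, migrated: Metrics) -> Dict[str, Mismatch]:
    """Return the metrics whose values differ between the two files.

    A metric that is present for only one file is reported with None
    standing in for the missing value.
    """
    mismatches = {}
    for name in sorted(set(original) | set(migrated)):
        before = original.get(name)
        after = migrated.get(name)
        if before != after:
            mismatches[name] = (before, after)
    return mismatches


def report_mismatches(batch: Iterable[Tuple[str, Metrics, Metrics]]) -> List[str]:
    """Build one report line per mismatching metric.

    Each item in batch is (file name, original metrics, migrated metrics).
    Files whose metrics all match produce no lines.
    """
    lines = []
    for file_name, original, migrated in batch:
        for metric, (before, after) in compare_metrics(original, migrated).items():
            lines.append(f"{file_name}: {metric} was {before}, now {after}")
    return lines
```

Only the files that appear in the report would then need to be checked by eye. An exact-match rule is assumed here; some metrics (word count, for instance) may need a tolerance, since different tools can count slightly differently.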
Issue champion Jenny Mitcham
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets
Possible Solution approaches
  • Apache POI for looking inside MS Word docs - see what metrics can be extracted
  • Other technology for PDF files
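To give a feel for what "metrics that can be extracted" might look like, here is a rough sketch for the migrated .docx side only. It does not use Apache POI (a Java library); instead it relies on the fact that a .docx file is a ZIP package whose main text normally lives in `word/document.xml`, and reads it with the Python standard library. The function name `docx_metrics` is hypothetical, the counts are approximate, and page count is left out because it depends on layout rather than on the stored text. Legacy .doc files and PDFs would need other tooling.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used for paragraph (w:p) and text (w:t) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_metrics(path):
    """Return rough counts for a .docx file.

    Counts non-empty paragraphs, whitespace-separated words, and files stored
    under word/media/ (assumed here to be the embedded images). Text inside
    text boxes may be counted more than once, so treat the numbers as
    indicative rather than exact.
    """
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read("word/document.xml"))
        media = [n for n in package.namelist() if n.startswith("word/media/")]

    paragraphs = 0
    words = 0
    for para in root.iter(W_NS + "p"):
        # Join the text runs of the paragraph before splitting into words,
        # because a single word can be split across several runs.
        text = "".join(t.text or "" for t in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs += 1
            words += len(text.split())

    return {"paragraphs": paragraphs, "words": words, "images": len(media)}
```

Empty paragraphs are skipped on the assumption that converters may add or drop blank lines without any real loss of content.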
Context Our Ingest manual states that we carry out checks after migration, but we don't always do so. In the past we have discovered some conversions that haven't worked properly. For example, doc to odf used to cause problems with pages being slightly out. We often find out about issues and problems on an ad hoc basis, though, and it would be better if we had a more foolproof method of assessing the success of file conversions.
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Datasets ADS Grey Literature Library
Solutions Reference to the appropriate Solution page(s), by hyperlink
Labels: issue, york_hackathon, qa