Detailed description At the Archaeology Data Service we follow a migration-based preservation strategy, so documents are routinely migrated into new formats for either preservation or dissemination. We can batch process these file migrations, but what we cannot automate at the moment is a check that the significant properties of the files have been retained from one version of a file to the next. We often check a couple of randomly chosen files within the batch, but this is not foolproof, so a proper check and comparison of a few key quantifiable properties would be really useful:
  • font - may be most important as different fonts can push the page numbering out
  • number of pages
  • number of words
  • number of characters
  • number of images/graphics
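The comparison itself can be sketched in a few lines of Python, once some characterisation tool has produced the properties above for both versions of a file. The function name `compare_properties`, the dictionary keys and the sample values below are illustrative assumptions, not the output of any existing tool.

```python
# Hypothetical sketch: compare the quantifiable properties of a file
# before and after migration. Property names and values are illustrative.

PROPERTY_KEYS = ("fonts", "pages", "words", "characters", "images")


def compare_properties(original, migrated, keys=PROPERTY_KEYS):
    """Return {property: (original_value, migrated_value)} for each property that differs."""
    differences = {}
    for key in keys:
        before = original.get(key)
        after = migrated.get(key)
        if before != after:
            differences[key] = (before, after)
    return differences


# Made-up example values for one document and its migrated version.
original = {"fonts": ["Arial", "Times New Roman"], "pages": 12,
            "words": 4310, "characters": 25102, "images": 3}
migrated = {"fonts": ["Arial", "Liberation Serif"], "pages": 13,
            "words": 4310, "characters": 25102, "images": 3}

for name, (before, after) in compare_properties(original, migrated).items():
    print(f"{name}: {before} -> {after}")
```

An empty result means all compared properties match. Any entry in the result flags a file for manual inspection, which would replace the random spot check across the batch.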
Issue champion Jenny Mitcham
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets
Possible Solution approaches Some tools from previous events already characterise Word documents:
  • Apache POI tool on AQUA
  • Office Analyser tool by Andy Jackson
  • PLANETS tool by Maurice?
We need to see which of these we can use, and then create similar ways of characterising PDF, PDF/A and OpenOffice files (odt/sxw).
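For OpenDocument text files, one possible starting point is the statistics that the authoring application records in `meta.xml` inside the ODT zip package. The `meta:document-statistic` element there can carry page, word, character and image counts. The sketch below uses only the Python standard library; the function name `odt_statistics` is hypothetical. It assumes the statistics are present and current, which is not guaranteed because they are written by the application that last saved the file. Older sxw files use a different metadata namespace and are not handled here.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the metadata elements in OpenDocument (ODF) files.
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"


def odt_statistics(source):
    """Return the counts recorded in an ODT file's meta.xml as a dict of ints.

    `source` is a path or a file-like object. The result is empty when the
    file carries no meta:document-statistic element.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
    stat = root.find(f".//{{{META_NS}}}document-statistic")
    if stat is None:
        return {}
    prefix = f"{{{META_NS}}}"
    # Attribute names look like "{namespace}page-count"; strip the namespace.
    return {
        name[len(prefix):]: int(value)
        for name, value in stat.attrib.items()
        if name.startswith(prefix)
    }
```

Calling `odt_statistics("report.odt")` (a hypothetical file name) would return a dictionary with keys such as `page-count`, `word-count`, `character-count` and `image-count` where the file provides them. Fonts are not part of these statistics and would need a separate check.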
Context It is essential that we can demonstrate the authenticity of the files that we are preserving. Checking files after a migration should be a part of this.
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Datasets Archaeology Data Service archive
Solutions Reference to the appropriate Solution page(s), by hyperlink
