Which aspects of our digital collections do we want to validate automatically? Many problems may be lurking inside our digital objects, and without automatic methods of detecting them we cannot be sure that our collections are problem-free and ready to withstand changes in technology. Here are some examples of what we would like to validate and how we might implement that validation in a friendly, cost-effective and automated manner (candidate tools are given in brackets):
- Validating the construction and integrity of files where bit rot may have occurred. This may include comparing replicated collections (e.g. fuzzy matching of master and service images to identify image decay) or checking files against their format specifications. This is particularly important where content has been stored on handheld media without checksums and may have decayed (JHOVE, Planets tools, ImageMagick).
- Validating the structural integrity of a collection: ensuring that expected files are present (e.g. each issue has the expected number of pages) and that content, filename metadata and accompanying structural metadata are consistent. This may include novel approaches such as looking for zero-length or uncharacteristically short files. Again, this is crucial where bit rot may have occurred (basic file system operations, metadata parsing tools, e.g. XMLStarlet).
- Validating key content characteristics, such as file format, colour depth and compression. This is useful when assessing content digitised by distributed, on-demand services, where basic characteristics are unclear and file extensions and accompanying metadata cannot be trusted (JHOVE, DROID, Planets tools).
- Identifying duplicate pages, whether created by accidentally scanning the same page twice or introduced when collections are consolidated from a number of on-demand digitisation sources. Fuzzy matching via image processing techniques or fuzzy OCR text matching may be investigated (ImageMagick, XMLStarlet).
- Identifying missing pages, caused by file loss or digitisation error, through analysis of OCR text (Functional Extension Parser developed by the IMPACT project).
- Identifying digitisation quality issues by creating a profile of a typical page or recording and flagging files that deviate substantially from that profile for further manual investigation. The profile may include image processing characteristics such as peak signal-to-noise ratio (PSNR), or metadata such as OCR success rates (ImageMagick, XMLStarlet).
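
As an illustration of the "basic file system operations" mentioned for structural integrity above, the following is a minimal sketch in Python using only the standard library. It is not taken from any of the tools named in the list. The `check_collection` function, the `<issue>_<page>.<ext>` filename pattern, the `expected_pages` mapping and the 10% size threshold are all assumptions made for the example and would need adapting to a real collection.

```python
import re
from collections import defaultdict
from pathlib import Path
from statistics import median

# Hypothetical naming scheme: <issue-id>_<page-number>.<ext>, e.g. "issue01_0003.tif".
PAGE_NAME = re.compile(r"^(?P<issue>.+)_(?P<page>\d+)\.\w+$")


def check_collection(root, expected_pages=None, short_ratio=0.1):
    """Return a list of (path_or_issue, problem) tuples for the files under root.

    expected_pages maps an issue id to the number of pages it should have
    (an assumed input, e.g. derived from structural metadata).
    short_ratio is an arbitrary threshold: files smaller than this fraction
    of the median file size are flagged for manual inspection.
    """
    expected_pages = expected_pages or {}
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    sizes = {p: p.stat().st_size for p in files}
    problems = []

    # 1. Zero-length files.
    for path, size in sizes.items():
        if size == 0:
            problems.append((str(path), "zero-length file"))

    # 2. Uncharacteristically short files, relative to the median non-empty size.
    non_empty = [s for s in sizes.values() if s > 0]
    if non_empty:
        typical = median(non_empty)
        for path, size in sizes.items():
            if 0 < size < short_ratio * typical:
                problems.append(
                    (str(path), f"unusually short ({size} bytes, median {typical:.0f})")
                )

    # 3. Expected number of distinct pages per issue, based on filenames.
    pages = defaultdict(set)
    for path in files:
        match = PAGE_NAME.match(path.name)
        if match:
            pages[match["issue"]].add(int(match["page"]))
    for issue, count in expected_pages.items():
        found = len(pages.get(issue, ()))
        if found != count:
            problems.append((issue, f"expected {count} pages, found {found}"))

    return problems
```

Calling `check_collection("/path/to/collection", expected_pages={"issue01": 12})` would return a list of flagged files and issues for manual follow-up. The sketch compares all files against a single median size; in a collection that mixes images with small metadata files, the files would need to be grouped by type first to avoid false alarms.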