Identifying missed or duplicated pages



One line summary: Identifying missed or duplicated pages in books, archives and manuscripts
Detailed description: Multi-page items form the vast bulk of digitisation projects. There is always a risk that pages will be missed or double-shot during high-throughput digitisation (even when "high throughput" means relatively slow special collections photography, the work is still large-scale and target-driven). Even if 100% post-digitisation QA is employed, some errors are still likely to occur, and it is far more time-consuming to fix them once images have been committed to a DAM, OCR'ed, etc. than directly after capture.

OCR can be used to identify page numbers in printed works and therefore to analyse the page order. This works for printed items, but even then it comes rather late in the workflow.
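As a rough illustration of that kind of page-number analysis (not part of the original issue), here is a minimal sketch assuming Python with pytesseract and Pillow, TIFF captures whose filenames sort into shooting order, and a hypothetical folder name; pages whose numbers OCR badly will produce false alarms.

```python
import re
from pathlib import Path

from PIL import Image
import pytesseract  # requires the Tesseract binary to be installed


def check_page_sequence(image_dir: str) -> None:
    """Flag captures whose printed page number breaks the expected sequence."""
    previous = None
    for path in sorted(Path(image_dir).glob("*.tif")):
        # Digits-only recognition: page numbers are usually the numerals we
        # care about, though dates and footnotes can still add noise.
        text = pytesseract.image_to_string(
            Image.open(path),
            config="--psm 6 -c tessedit_char_whitelist=0123456789",
        )
        numbers = [int(n) for n in re.findall(r"\d{1,4}", text)]
        if not numbers:
            print(f"{path.name}: no page number read (plate, blank page, or OCR failure)")
            continue
        current = numbers[0]  # crude heuristic: take the first numeral found
        if previous is not None and current != previous + 1:
            print(f"{path.name}: read {current}, expected {previous + 1} -- possible skipped or duplicated page")
        previous = current


if __name__ == "__main__":
    check_page_sequence("capture_batch_001")  # hypothetical capture folder
```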

It would be useful if there were a way to work out, before OCR or for images that cannot be OCR'ed, whether these errors exist.
Issue champion:
Possible approaches: For printed items, a lightweight OCR pass targeting page numbers? May not be worth the extra OCR time.
For duplicated images, see the related issue Identifying rotated, duplicated images using pHash; a minimal sketch of that approach follows below.
No approach has yet been suggested for pages skipped in non-printed textual works.
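To illustrate the pHash idea referenced above, the following sketch assumes Python with the ImageHash library, captures that sort into shooting order, and a hypothetical distance threshold. Note that plain pHash comparison will not catch rotated duplicates unless rotated variants are hashed as well, as the related issue discusses.

```python
from pathlib import Path

from PIL import Image
import imagehash  # pip install ImageHash


def flag_double_shots(image_dir: str, threshold: int = 6) -> None:
    """Flag consecutive captures whose perceptual hashes are suspiciously close."""
    paths = sorted(Path(image_dir).glob("*.tif"))
    hashes = [(path, imagehash.phash(Image.open(path))) for path in paths]
    for (prev_path, prev_hash), (path, cur_hash) in zip(hashes, hashes[1:]):
        # Subtracting two ImageHash objects gives the Hamming distance between
        # the 64-bit hashes; small distances suggest the same page was shot twice.
        distance = prev_hash - cur_hash
        if distance <= threshold:
            print(f"{prev_path.name} ~ {path.name}: pHash distance {distance} -- possible double shot")


if __name__ == "__main__":
    flag_double_shots("capture_batch_001")  # hypothetical capture folder
```

Only consecutive pairs are compared here, on the assumption that double shots sit next to each other in the capture sequence; an all-pairs comparison would catch more cases at quadratic cost. The threshold of 6 bits is a guess and would need tuning against real captures.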


Context: Wellcome Library
AQuA Solutions
Collections: Wellcome Library Digitisation
Wellcome Library digitisation EAP (difficulty applying this to manuscript material and varied pagination/images)
Labels: qa, image, ocr, issue, duplication