Note that this is a blank proforma. Please make a copy of it before filling out the form!
|One line summary|| Identifying missed or duplicated pages in books, archives and manuscripts
|Detailed description|| Multi-page items form the vast bulk of digitisation projects. In high-throughput digitisation there is always a risk that pages will be missed or double-shot (even when "high throughput" means relatively slow special-collections photography, the work is still large-scale and target-driven). Even if 100% post-digitisation QA is employed, errors are likely to slip through, and it is far more time-consuming to fix them once images have been committed to a DAM, OCR'ed, etc. than directly after capture.
OCR can be used to identify page numbers in printed works and therefore to analyse the page order. This works for printed items, but even then it comes rather late in the workflow.
It would be useful if there were a way to detect these errors before OCR, or for images that cannot be OCR'ed.
|Possible approaches|| For printed items, a lightweight OCR pass targeting page numbers only? May not be worth the extra OCR time.
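If a lightweight OCR pass can extract page numbers (e.g. via Tesseract on a cropped corner of each image, not shown here), checking the resulting sequence for gaps and repeats is straightforward. The sketch below is a hypothetical helper, not part of any existing library: it takes one entry per captured image (the page number read, or `None` where no number was found, such as plates or blanks) and flags likely double shots, missed pages, and out-of-order captures.

```python
from typing import List, Optional

def check_page_sequence(pages: List[Optional[int]]) -> List[str]:
    """Flag likely missed or double-shot pages from OCR'd page numbers.

    `pages` holds one entry per captured image: the page number OCR read,
    or None where no number was found (plates, blanks, OCR failures).
    Unreadable pages are skipped rather than flagged.
    """
    issues = []
    prev = None  # (image index, page number) of the last readable page
    for idx, num in enumerate(pages):
        if num is None:
            continue
        if prev is not None:
            prev_idx, prev_num = prev
            if num == prev_num:
                issues.append(
                    f"image {idx}: repeats page {prev_num} "
                    f"(image {prev_idx}) -- possible double shot")
            elif num > prev_num + (idx - prev_idx):
                # allow one page per intervening unnumbered image
                issues.append(
                    f"image {idx}: page {prev_num} -> {num} "
                    f"-- possible missed page(s)")
            elif num < prev_num:
                issues.append(
                    f"image {idx}: page {num} after page {prev_num} "
                    f"-- out of order")
        prev = (idx, num)
    return issues
```

This only catches errors between two readable page numbers; runs of unnumbered manuscript leaves remain invisible to it, which is the unsolved case noted below.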
For duplicated images, see the related issue "Identifying rotated, duplicated images using pHash".
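To illustrate the perceptual-hashing idea behind that related issue: in practice one would use a library such as `imagehash` on images downscaled to 8x8, but the toy pure-Python average hash below shows the principle, with duplicates detected as hash pairs within a small Hamming distance. All function names here are illustrative, not an existing API; catching rotated duplicates would additionally require hashing rotated variants of each image.

```python
def average_hash(pixels):
    """Toy average hash over a 2-D list of grayscale values (0-255).

    One bit per pixel: 1 if the pixel is brighter than the image mean.
    Real pipelines downscale to 8x8 first, giving a 64-bit hash.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    """Number of bit positions in which two hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))

def find_near_duplicates(hashes, threshold=2):
    """Return index pairs whose hashes differ in at most `threshold` bits."""
    return [(i, j)
            for i in range(len(hashes))
            for j in range(i + 1, len(hashes))
            if hamming(hashes[i], hashes[j]) <= threshold]
```

Because the hash is perceptual rather than exact, this survives the small exposure and crop differences between two shots of the same page, which a byte-level checksum would not.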
No approach has yet been suggested for pages skipped in non-printed textual works.
|Context|| Wellcome Library
|Collections|| Wellcome Library Digitisation
Wellcome Library digitisation EAP (difficulty applying this to manuscript material and varied pages/images)