|One line summary||Working out what can, and what cannot be OCR'ed in mixed archival content|
|Detailed description|| Many archival collections contain mixed content that is not filed according to material type. So a "deliverable unit" - such as a file of correspondence, a notebook, etc. - may contain items that are perfectly suitable for OCR, items that are borderline, and items that are completely unsuitable (handwritten documents, photographs, etc.).
In the case of the Wellcome Library digitisation project, the archives are relatively recent (mainly post-war), and up to 90% of the material is OCR'able or borderline. This rough percentage is recorded for each "folder" (deliverable unit), but not on an item-by-item basis (which would be too time-consuming).
The issue was whether it was possible to programmatically determine which images in a given set of images would be OCR'able.
|Issue champion|| Christy Henshaw
|Possible approaches|| OCR a small portion of each image, or multiple sampled spots, as an advance test of recognition confidence (which may be quicker than full OCR). Compare the result against a threshold, and remove images scoring below it from the set that gets full OCR.
No solutions were created or proposed. It was suggested instead to fully OCR everything, and then simply discard the OCR results that fall below par once the raw text is produced. This carries a cost risk (images that should not be OCR'ed will get OCR'ed anyway), but it is probably not worthwhile to OCR images twice (i.e. once as a sample, to weed out the non-OCR'able material, and again as "full" OCR).
This may not apply to other types of collections, where, say, the OCR'able content is in the minority.
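Both approaches above reduce to the same thresholding logic: score an image (or its full OCR output) by word-level confidence, then keep or drop it against a cut-off. A minimal Python sketch of that logic follows; the function names, the 0-100 confidence scale (as Tesseract reports it), and the threshold of 60 are illustrative assumptions, not part of any proposed solution.

```python
from statistics import mean

def spot_check_passes(spot_confidences, threshold=60.0):
    """Advance-test variant: decide whether an image is worth full OCR,
    given word-level confidence scores (0-100) from OCR'ing a few
    sampled spots. An empty sample (no text detected) fails the check.
    The 60.0 threshold is an arbitrary illustrative value."""
    if not spot_confidences:
        return False
    return mean(spot_confidences) >= threshold

def filter_results(results, threshold=60.0):
    """Post-hoc variant: fully OCR everything, then drop the results
    whose mean confidence falls below par.
    `results` maps an image identifier to its list of word confidences."""
    return {
        image_id: confs
        for image_id, confs in results.items()
        if spot_check_passes(confs, threshold)
    }
```

In practice the confidence lists would come from an OCR engine's per-word output (e.g. Tesseract's TSV data); only the filtering step is shown here.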
|Context|| Wellcome Library
|Collections|| Wellcome Library Digitisation EAP (difficulty applying this to mss material and variated pages/images)