Use of OCR metadata

One line summary How can we use OCR metadata to identify pages for human QC investigation?
Detailed description The ABBYY FineReader 9 engine outputs various OCR stats, and these are expressed in the ALTO files. For each page there is a predicted word accuracy percentage, a suspicious character count, a word count, and a suspicious word count. For each OCRed word, there is also a word confidence (0 to 1, where 1 is good) and a character confidence (0 to 9, where 0 is good).
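As a rough illustration of reading the per-word values, the sketch below pulls the word confidence and character confidence from the `String` elements of an ALTO file. It assumes the standard ALTO attribute names `CONTENT`, `WC` and `CC`; the function names and the 0.5 cut-off are hypothetical, and the namespace is ignored because it differs between ALTO versions.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def word_confidences(alto_xml):
    """Return one dict per ALTO String element found in the XML text.

    Assumes WC is a float from 0 to 1 (1 is good) and CC holds one digit
    from 0 to 9 per character (0 is good), as described above.
    """
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        if local_name(el.tag) != "String":
            continue
        wc = el.get("WC")
        cc = el.get("CC") or ""
        words.append({
            "content": el.get("CONTENT", ""),
            "wc": float(wc) if wc is not None else None,
            # Keep only the digits so both "012" and "0 1 2" are accepted.
            "cc": [int(ch) for ch in cc if ch.isdigit()],
        })
    return words


def low_confidence_words(words, min_wc=0.5):
    """Words whose word confidence is below min_wc (an illustrative cut-off)."""
    return [w for w in words if w["wc"] is not None and w["wc"] < min_wc]
```

Called on the text of one ALTO page, `word_confidences` gives a list that can be summarised per page, for example by counting the entries returned by `low_confidence_words`.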
Issue champion Toby Atkin-Wright
Possible approaches Currently the Brightsolid project takes the predicted word accuracy (PWA) for each page, calculates the mean PWA across each year of each newspaper, and calculates the median absolute deviation (MAD). It then marks for manual investigation every page whose PWA is less than (mean - 3 x MAD). However, most of the pages flagged this way are fine, and the variations in OCR quality can be explained by content changes or physical page damage. Is there a better way to use the OCR metadata to find pages that may have questionable scan quality?
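The current rule can be sketched as follows. This is a minimal illustration and not the Brightsolid implementation: the record layout `(title, year, page_id, pwa)` and the function name are hypothetical, and the MAD is assumed to be the median of absolute deviations from the group median, since the description does not say which centre is used.

```python
from collections import defaultdict
from statistics import mean, median


def flag_low_pwa_pages(records, k=3.0):
    """Return page ids whose PWA is below (mean - k * MAD) for their group.

    records: iterable of (title, year, page_id, pwa) tuples (assumed layout).
    Pages are grouped by (title, year), i.e. one year of one newspaper.
    """
    groups = defaultdict(list)
    for title, year, page_id, pwa in records:
        groups[(title, year)].append((page_id, pwa))

    flagged = []
    for pages in groups.values():
        values = [pwa for _, pwa in pages]
        centre = median(values)
        # Assumption: MAD is taken around the median, not the mean.
        mad = median(abs(v - centre) for v in values)
        threshold = mean(values) - k * mad
        flagged.extend(page_id for page_id, pwa in pages if pwa < threshold)
    return flagged
```

Because the threshold is set per newspaper and year, a year whose pages are uniformly poor (for example, from damaged originals) shifts its own mean down, which may be one reason the rule flags pages that turn out to be fine while missing others.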
Context  
AQuA Solutions  
Collections Brightsolid digitisation of British Library newspapers
Labels: qa, ocr, issue