| *One line summary* | How can we use OCR metadata to identify pages for human QC investigation? |
| *Detailed description* | The ABBYY FineReader 9 engine outputs various OCR statistics, which are expressed in the ALTO files. For each page there is a predicted word accuracy percentage, a suspicious character count, a word count, and a suspicious word count. For each OCRed word there is also a word confidence (0 to 1, where 1 is good) and a per-character confidence (0 to 9, where 0 is good); see the parsing sketch after this table. |
| *Issue champion* | Toby Atkin-Wright |
| *Possible approaches* | Currently the Brightsolid project uses the predicted word accuracy (PWA) for each page: it calculates the mean PWA across each year of each newspaper and the median absolute deviation (MAD), then marks for manual investigation all pages with a PWA below mean − 3×MAD (see the flagging sketch after this table). However, most of the flagged pages are fine, and the variation in OCR quality can be explained by content changes or physical page damage. Is there a better way to use the OCR metadata to find pages with questionable scan quality? |
| *Context* | |
| *AQuA Solutions* | |
| *Collections* | [Brightsolid digitisation of British Library newspapers] |
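
A minimal sketch of pulling the per-word confidences described above out of an ALTO file. It assumes the standard ALTO v2 layout in which each `String` element carries a `WC` attribute (word confidence, 0 to 1) and a `CC` attribute (one digit per character, 0 to 9); the exact namespace and attribute names in the Brightsolid output are an assumption, not confirmed by this issue.

```python
# Sketch only: extract word and character confidences from one ALTO page.
# Assumptions: String/@WC holds word confidence (0-1, 1 good) and
# String/@CC holds one confidence digit (0-9, 0 good) per character.
import xml.etree.ElementTree as ET

def page_word_stats(alto_path):
    """Return (word_confidences, char_confidences) for a single ALTO file."""
    word_confs, char_confs = [], []
    for elem in ET.parse(alto_path).iter():
        # Match <String> regardless of which ALTO namespace is in use
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        wc = elem.get("WC")
        if wc is not None:
            word_confs.append(float(wc))
        cc = elem.get("CC")
        if cc is not None:
            char_confs.extend(int(d) for d in cc if d.isdigit())
    return word_confs, char_confs
```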
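And a minimal sketch of the current flagging rule as described in the possible approaches: group pages by newspaper and year, take the mean PWA and the MAD, and flag any page whose PWA falls below mean − 3×MAD. The tuple-based input format here is illustrative; the real pipeline's data structures are not specified in the issue.

```python
# Sketch of the described rule: flag pages with PWA below mean - 3 * MAD,
# computed per (newspaper, year) group. Input format is an assumption.
from collections import defaultdict
from statistics import mean, median

def mad(values):
    """Median absolute deviation: median of |x - median(values)|."""
    med = median(values)
    return median(abs(v - med) for v in values)

def flag_pages(pages):
    """pages: iterable of (newspaper, year, page_id, pwa) tuples.
    Returns the page_ids marked for manual QC investigation."""
    groups = defaultdict(list)
    for newspaper, year, page_id, pwa in pages:
        groups[(newspaper, year)].append((page_id, pwa))
    flagged = []
    for members in groups.values():
        pwas = [pwa for _, pwa in members]
        threshold = mean(pwas) - 3 * mad(pwas)
        flagged.extend(pid for pid, pwa in members if pwa < threshold)
    return flagged
```

Note that because MAD is median-based it is robust to a handful of very bad pages, but a threshold of mean − 3×MAD will still flag the low tail of any skewed PWA distribution, which is consistent with the observation above that most flagged pages turn out to be fine.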