Use of OCR metadata

Current version by Toby Atkin-Wright, Jun 15, 2011 09:17.


| *Detailed description* | The ABBYY FineReader 9 engine outputs various OCR statistics, which are expressed in the ALTO files. For each page there is a predicted word accuracy percentage, a suspicious character count, a word count, and a suspicious word count. For each OCRed word there is also a word confidence (0 to 1, where 1 is best) and a per-character confidence (0 to 9, where 0 is best). |
| *Issue champion* | Toby Atkin-Wright |
| *Possible approaches* | Currently the Brightsolid project takes the predicted word accuracy (PWA) for each page, calculates the mean PWA across each year of each newspaper, and calculates the median absolute deviation (MAD). It then marks for manual investigation all pages whose PWA is less than (mean - 3 x MAD). However, most of these pages are fine, and the variations in OCR quality can be explained by content changes or physical page damage. Is there a better way to use the OCR metadata to find pages that may have questionable scan quality? |
| *Context* | |
| *AQuA Solutions* | |
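The per-word metadata described above lives on ALTO `String` elements: `WC` holds the word confidence and `CC` holds one confidence digit per character. A minimal sketch of extracting these with the standard library (the XML fragment and its values are invented for illustration; real files may use a different ALTO namespace version):

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO v2 fragment illustrating per-word metadata:
# WC = word confidence (0 to 1, 1 best), CC = per-character digits (0 to 9, 0 best).
alto_xml = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Times" WC="0.87" CC="001203"/>
    <String CONTENT="1851" WC="0.42" CC="7851"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

ns = {"a": "http://www.loc.gov/standards/alto/ns-v2#"}
root = ET.fromstring(alto_xml)

# Collect (word, word confidence, character confidence string) per String element.
words = [
    (s.get("CONTENT"), float(s.get("WC")), s.get("CC"))
    for s in root.iterfind(".//a:String", ns)
]
print(words)
```

Low-`WC` words (or words containing high `CC` digits) are the natural starting point for any finer-grained quality measure than the page-level PWA.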
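The thresholding described under *Possible approaches* can be sketched as follows. The page identifiers and PWA values are invented sample data; the function simply mirrors the stated rule of flagging pages whose PWA falls below mean - 3 x MAD:

```python
import statistics

# Hypothetical per-page predicted word accuracy (PWA) percentages
# for one newspaper year, as would be gathered from the ALTO files.
pwa_by_page = {
    "page_001": 92.4, "page_002": 90.1, "page_003": 91.8,
    "page_004": 89.9, "page_005": 55.0,  # e.g. a damaged page
    "page_006": 92.0, "page_007": 91.1,
}

def flag_suspect_pages(pwa, k=3.0):
    """Flag pages whose PWA is below mean - k * MAD."""
    values = list(pwa.values())
    mean = statistics.mean(values)
    med = statistics.median(values)
    # Median absolute deviation: median of |x - median|.
    mad = statistics.median(abs(v - med) for v in values)
    threshold = mean - k * mad
    return [page for page, v in pwa.items() if v < threshold]

print(flag_suspect_pages(pwa_by_page))  # -> ['page_005']
```

Note that on this sample only the genuinely damaged page is flagged, but as the text observes, on real data content changes alone (e.g. a page of tabular shipping lists) can push a perfectly good page below the threshold.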