OCR Comparison

One line summary: Compare two different OCR results. If the results are not sufficiently close, the source pages may differ, indicating possible issues.
Detailed description: See the detailed scenario descriptions below.
Solution champion: Georg Petz & Sven Schlarb
Group Evaluation Notes
  • Applied the OCR evaluation/assessment approach to both ONB and BL collection samples
  • OCR can take a long time (minutes) per file
  • OCR matching approach seems to work well in principle, but lengthy OCR may be a problem. Potential to exploit parallel computing approach in SCAPE to enable application of this approach to a large dataset in a reasonable length of time?
  • Would be useful to test approach with edge cases of partially corrupt files - BL can make this available for next event
  • Could apply this approach to evaluate use of lossy compression in preserving master images, and verify that we are not impacting on future OCR quality by throwing away some detail in lossy compression
  • Technique could easily be applied to other collections
Detailed Evaluation
  1. How well does the solution meet your issue?
      1. The solution provides a useful generic approach to paired image and/or OCR validation. Further testing is required to examine edge cases with (for example) partially corrupt files.
  2. What more would you like the solution to do?
  3. Do you think you can implement the solution in your organisation?
      1. Good potential to embed the solution, although scalability is probably an issue. As noted above, a parallel computing approach may mitigate the long OCR times.
    1. What further investigation/development/testing would be required before implementation at your organisation?
    2. Are there any process, workflow or technical obstacles to implementation?
  4. Summarise the benefits to your organisation that the solution could provide?
      1. Thorough QA of file migration would be particularly valuable, especially where the resulting files are close to, but not exactly the same as, the original. Without this additional QA, errors can pass unnoticed.
  5. What potential exists to apply the solution elsewhere? eg. with other collections, in other organisations, or to meet other issues?
      1. Excellent potential to apply to other cases with minimal work. Verification of level of lossy compression (noted above) would be an interesting application of the workflow.
  6. What more would you like to understand and/or document about the issue and/or the solution?
      1. Further testing is probably the highest priority, with perhaps some further work to assess the comparison between the OCR results.
Issue: Quality issues may be present in digitised pages

Scenario 1: Compare hOCR instances with Tesseract OCR results

In the first scenario we used a list of 11 JPEG2000 input files. To be able to apply OCR, the JPEG2000 images had to be converted to TIFF before being passed to the OCR engine.

In the diagram below, the "jp2_list" port takes the URL of a text file that contains a list of URLs of JPEG2000 images, for example:

http://<someserver>/tmp/00000001.jp2
http://<someserver>/tmp/00000002.jp2
http://<someserver>/tmp/00000003.jp2
http://<someserver>/tmp/00000004.jp2
...

and the "hocr_list" port takes the URL of a text file that contains a list of URLs of hOCR instances, for example:

http://<someserver>/tmp/00000001.txt
http://<someserver>/tmp/00000002.txt
http://<someserver>/tmp/00000003.txt
http://<someserver>/tmp/00000004.txt
...

and "tess_langmod" is simply a string that indicates the language module that Tesseract Version 3 should use, for example:

"eng"

for the English language module of Tesseract Version 3.
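As a rough illustration of the list inputs described above, the following Python sketch parses such a list file into URLs. It is not part of the Taverna workflow; the function names and the UTF-8 encoding of the list files are assumptions made for this example.

```python
from urllib.request import urlopen


def parse_url_list(text):
    """Return the non-empty lines of a list file, stripped of whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_url_list(list_url):
    """Download a list file and parse it (hypothetical helper).

    Assumes the list file is UTF-8 encoded plain text with one URL per line.
    """
    with urlopen(list_url) as response:
        return parse_url_list(response.read().decode("utf-8"))
```

The same helper would serve both the "jp2_list" and the "hocr_list" input, since both are plain lists of URLs.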

First, the workflow reads the content of the text files above and creates a list of URLs from each. The JPEG2000 URLs are passed to a converter that produces TIFF files, and the hOCR instances are passed to a service that converts them to plain text.
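The hOCR to plain text step is performed by a service whose implementation is not described here. A minimal sketch of what such a conversion could look like, assuming the hOCR marks each text line with an element of class "ocr_line" (as the hOCR format defines), is shown below. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class HocrTextExtractor(HTMLParser):
    """Collect the words of each ocr_line element, ignoring the <head>."""

    def __init__(self):
        super().__init__()
        self.lines = []          # one list of words per ocr_line element
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])  # start a new text line

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False

    def handle_data(self, data):
        # Text in the head or before the first ocr_line is dropped.
        if self._in_head or not self.lines:
            return
        self.lines[-1].extend(data.split())


def hocr_to_text(hocr):
    """Return the recognised text of an hOCR document, one line per ocr_line."""
    parser = HocrTextExtractor()
    parser.feed(hocr)
    parser.close()
    return "\n".join(" ".join(words) for words in parser.lines if words)
```

This simplification attaches any text that follows the last "ocr_line" element to that line, which is acceptable for a comparison based on the sequence of words.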

Tesseract is then run on the TIFF files, and its output is compared to the plain text of the hOCR instances.
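How the match measurement is computed is not specified on this page. One possible approach, given here purely as an assumption, is a word-level similarity ratio from Python's difflib, scaled to a percentage. The function name is hypothetical, and the values it produces need not agree with the numbers reported by the workflow.

```python
from difflib import SequenceMatcher


def match_percentage(text_a, text_b):
    """Word-level similarity of two OCR texts as a percentage (0 to 100).

    This is an assumed metric, not necessarily the one used by the workflow.
    Whitespace differences are ignored by comparing sequences of words.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    if not words_a and not words_b:
        return 100.0  # two empty texts are treated as identical
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return 100.0 * matcher.ratio()
```

Comparing words rather than characters keeps the measure insensitive to line breaks and spacing, which typically differ between OCR engines.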

The following values give the match measurement between the Tesseract OCR result and the hOCR plain text for each of the 11 pages:

57.51%
36.39%
69.40%
65.07%
66.49%
71.01%
70.38%
64.89%
65.33%
56.65%
64.86%

For the sake of simplicity, let us assume that these numbers correctly indicate the match measurement.

Using these match measurements, we could now create a rule:

If for at least 5 book pages the match measurement falls below 30%, then the book is flagged as potentially problematic.

where the 30% value could be a configurable threshold.
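The rule above can be sketched in a few lines of Python. The function and parameter names are illustrative only; both the threshold and the minimum number of pages are exposed as parameters, although the text above only names the threshold as configurable.

```python
def is_book_suspicious(match_percentages, threshold=30.0, min_pages=5):
    """Flag a book if at least min_pages pages fall below the threshold."""
    low_pages = [m for m in match_percentages if m < threshold]
    return len(low_pages) >= min_pages


# Match measurements of the 11 pages from Scenario 1.
scenario_1_matches = [57.51, 36.39, 69.40, 65.07, 66.49, 71.01,
                      70.38, 64.89, 65.33, 56.65, 64.86]

print(is_book_suspicious(scenario_1_matches))  # False
```

With the values from Scenario 1 the book is not flagged, because the lowest match measurement is 36.39% and no page falls below 30%.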

Scenario 2: Compare the OCR of migrated JPEG2000 images with the original TIFF images in order to detect corruption caused by the conversion

In the second scenario we want to find out whether OCR can be used to detect possible data corruption resulting from a file format migration from TIFF to JPEG2000.

The following picture shows the original TIFF image on the left, and the corrupted image on the right side.

Using the following Taverna workflow, we applied Tesseract Version 3 OCR to both images and then compared the OCR results.
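Outside of Taverna, the OCR step could be approximated by calling the Tesseract command line tool directly. The sketch below assumes that a "tesseract" binary is available on the PATH and that it writes its result to a UTF-8 text file named after the output base; the function names and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the command line: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognised text.

    Tesseract writes the text to '<output_base>.txt'. Raises
    CalledProcessError if the tesseract process fails.
    """
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Applying ocr_image to the original and to the migrated image yields the two texts that are then compared.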

The quality of the OCR result for the original TIFF image varies considerably. Because of quality deficiencies in the scanned newspaper page, some parts contain more errors, as in the following excerpt:

gf 1_»u:L.41vD. ,
.4-
DUBLIN, Sxrunnn Momuso.
mu ova owv coanmsronnsxm] _
S-WF OF zised to be sold yesterdal, one of which
THE MORNING CHRONICLE, MONDAY, NOVEMBER 17, 1851. _
reigns in the heuu of his taunt luyd
, bg.; "cond to none in I I d _ Y!-I lord, If not the
hum; long and happy lif:e[r:ud'ol?e;¥a]?°d me" md ‘nm
Lord Eau: than role and uid Mr Sh , I thought thu
““‘°' "‘ ‘"“°** YW hm pwposod my health, .ua hkemn
[F Ewcuunsnsr Esrxrns.-There were ten z youu
‘ 3 ‘3'"°° “d I "Nik you for the kmd and tlattenng I

and other parts are of higher quality, as in the following excerpt:

Imsu .wxs or Issvn.-The following return shows an
increase in the amount of note circulation for the four weeks
of £496,984, as compared with the previous retum. During
the four weeks embraced in the return, the bullion in the
banks decreased to the umount. of £36,374 :-
An Account pursuant to the Act. 8 and 9 Vict. cap. 37, of the
Amount ol Bank Notes amhoriaed by luv to be issued by
the several Banks of Issue in Irehmd, and the Avenee
Amount of Bank Notes in circulation, md of Coin he d
during the Four Weeks ending Saturday, the lst dsy of
November, 1851 :-

But if we compare this with the OCR result of the corrupted image, the difference becomes evident, as in the following excerpt:

-w.       »:-:,'§=s;>{}¢_?".;,~»- *ze   ° V- .  ,»
4.,   ‘ FU: ' 'Jr 1   __»ir;';g‘   'F'   ° ( 'W os", _tru 'W 'ai' ’\  `¢
1 _,     »4<. _,: " _ '' ' . ._ _ _ » _ _
Ulhvr. 9 ` W. . A | it ' _ _   'N-`€f';fQ   g‘f _ °   il: * Q 4 ‘ ¢'|
  Q( ' bf,    ' ` 1 ,‘ 1'i‘\;.»_:i}¢'('o:"Si‘* iw 5| _ . 4   \ ._ ~ -g _ _ x ' _& » -» 4* :
'F ®   {‘|] '1 1 . ¢°‘#\Q 'A:?[, . »r(:   ' Pla’ .Vi °‘ ‘ ° /1 ‘_   'C' , 3|
°" * N* '~ “ -  F2 '*%=‘=?2»:f¢‘ ~=<¢m=» »;   » »  ff; fa ~»-
fb |'»'~|'AH*  S 9   ‘°"°»; pf i~T‘3‘ ra,-gg; " #U  `° J  3

Tesseract still tries to recognize text in the corrupted image but produces only noise, so the comparison of the original TIFF file with the JPEG 2000 derivative returns no match at all.

In conclusion, the difference between the OCR results can be used as an indicator of possible data corruption during migration.
