One line summary | Compare two different OCR results. If the results are not sufficiently close, the source pages may differ, which indicates possible quality issues. |
Detailed description | See detailed scenario descriptions below. |
Solution champion | Georg Petz & Sven Schlarb |
Group Evaluation Notes |
|
Detailed Evaluation |
|
Issue | Quality issues may be present in digitised pages |
Scenario 1: Compare hOCR instances with Tesseract OCR results
In the first scenario we used a list of 11 JPEG2000 input files. Before OCR could be applied, each JPEG2000 file had to be converted to TIFF.
In the diagram below, the "jp2_list" port takes the URL of a text file that contains a list of URLs of JPEG2000 images, for example:
http://<someserver>/tmp/00000001.jp2
http://<someserver>/tmp/00000002.jp2
http://<someserver>/tmp/00000003.jp2
http://<someserver>/tmp/00000004.jp2
...
The "hocr_list" port takes the URL of a text file that contains a list of URLs of hOCR instances, for example:
http://<someserver>/tmp/00000001.txt
http://<someserver>/tmp/00000002.txt
http://<someserver>/tmp/00000003.txt
http://<someserver>/tmp/00000004.txt
...
The "tess_langmod" port takes a string that names the language module that Tesseract Version 3 should use, for example:
"eng"
for the English language module of Tesseract Version 3.
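As an illustration of how the "tess_langmod" value could be used, the following sketch builds a Tesseract 3 command line of the form `tesseract <image> <outputbase> -l <lang>`. The function name and the file names are hypothetical, and the actual workflow may invoke Tesseract differently.

```python
def build_tesseract_command(image_path, output_base, langmod="eng"):
    """Return the argument list for a Tesseract 3 command-line call.

    Tesseract writes the recognised text to "<output_base>.txt".
    """
    return ["tesseract", image_path, output_base, "-l", langmod]


# Hypothetical file names, for illustration only.
command = build_tesseract_command("00000001.tif", "00000001_tess", "eng")
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run` on a machine where Tesseract is installed.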
First, the workflow reads the content of the two text files above and creates a list of URLs from each. The list of JPEG2000 URLs is passed to a converter that converts the images to TIFF files, which are then processed by Tesseract. The list of hOCR instances is passed to a service that converts them to plain text.
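The list-reading and hOCR-to-text steps can be sketched roughly as follows. This is a minimal sketch using only the Python standard library, not the service used in the workflow. The function and class names are hypothetical, and the hOCR conversion simply drops all markup and keeps the text nodes, which is a simplification.

```python
from html.parser import HTMLParser


def parse_url_list(list_text):
    """Return the non-empty lines of a URL list file as a list of URLs."""
    return [line.strip() for line in list_text.splitlines() if line.strip()]


class TextCollector(HTMLParser):
    """Collects the text nodes of an (X)HTML document and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def hocr_to_plain_text(hocr_markup):
    """Strip the hOCR markup and return whitespace-normalised plain text."""
    collector = TextCollector()
    collector.feed(hocr_markup)
    collector.close()
    return " ".join(" ".join(collector.chunks).split())


# Hypothetical hOCR fragment, for illustration only.
sample = ('<div class="ocr_page"><span class="ocr_line">'
          '<span class="ocrx_word">THE</span> '
          '<span class="ocrx_word">MORNING</span></span></div>')
print(hocr_to_plain_text(sample))  # THE MORNING
```

Fetching the list files and the hOCR instances over HTTP is left out here so that the sketch stays self-contained.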
The result of the Tesseract OCR is then compared to the plain text of the hOCR instances.
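The page does not state how the match measurement is computed. As one possible stand-in, the sketch below uses the character-level similarity ratio from Python's `difflib` and scales it to a percentage. The function name is hypothetical, and the values it produces are not expected to reproduce the numbers reported on this page.

```python
from difflib import SequenceMatcher


def match_percentage(text_a, text_b):
    """Return a similarity score between 0 and 100 for two OCR texts."""
    # Normalise whitespace so that differing line breaks do not count.
    a = " ".join(text_a.split())
    b = " ".join(text_b.split())
    # autojunk=False disables difflib's heuristic that treats very
    # frequent characters as junk in long sequences.
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


# Identical texts match fully; texts with no character in common do not match.
print(match_percentage("the bullion in the banks", "the bullion\nin the banks"))
print(match_percentage("abc", "xyz"))
```

A word-level or edit-distance measure would be an equally plausible choice. The point is only that the comparison yields one number per page.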
The following values give the match measurement between the Tesseract OCR result and the hOCR plain text, one value for each of the 11 input files:
57.51%
36.39%
69.40%
65.07%
66.49%
71.01%
70.38%
64.89%
65.33%
56.65%
64.86%
For the sake of simplicity, let us assume that these numbers correctly indicate the match measurement.
Using these match measurements, we could now define a rule such as:
If the match measurement falls below 30% for at least 5 pages of a book, then the book is flagged as potentially problematic.
The 30% value could be a configurable threshold.
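The rule can be written down as a small check. This is a sketch only: the function and parameter names are hypothetical, and making the page count configurable as well is an assumption that goes beyond the text, which only mentions the 30% threshold.

```python
def is_potentially_problematic(match_percentages, threshold=30.0, min_pages=5):
    """Flag a book if at least `min_pages` pages match below `threshold` percent."""
    low_pages = [m for m in match_percentages if m < threshold]
    return len(low_pages) >= min_pages


# The 11 match measurements reported for Scenario 1.
scenario_1 = [57.51, 36.39, 69.40, 65.07, 66.49, 71.01,
              70.38, 64.89, 65.33, 56.65, 64.86]

print(is_potentially_problematic(scenario_1))  # False
```

With the values from this scenario the book is not flagged, because even the lowest measurement (36.39%) lies above the 30% threshold.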
Scenario 2: Compare the OCR of migrated JPEG2000 images with the original TIFF images in order to detect corruption caused by the conversion
In the second scenario we want to find out whether OCR can be used to detect possible data corruption resulting from a file format migration from TIFF to JPEG2000.
The following picture shows the original TIFF image on the left, and the corrupted image on the right side.
Using the following Taverna workflow, we applied Tesseract Version 3 OCR to both images and then compared the OCR results.
The quality of the OCR result for the original TIFF image varies considerably. Because of quality deficiencies in the scanned newspaper page, some parts contain many errors, as in the following excerpt:
gf 1_»u:L.41vD. ,
.4-
DUBLIN, Sxrunnn Momuso.
mu ova owv coanmsronnsxm] _
S-WF OF zised to be sold yesterdal, one of which
THE MORNING CHRONICLE, MONDAY, NOVEMBER 17, 1851. _
reigns in the heuu of his taunt luyd
, bg.; "cond to none in I I d _ Y!-I lord, If not the
hum; long and happy lif:e[r:ud'ol?e;¥a]?°d me" md ‘nm
Lord Eau: than role and uid Mr Sh , I thought thu
““‘°' "‘ ‘"“°** YW hm pwposod my health, .ua hkemn
[F Ewcuunsnsr Esrxrns.-There were ten z youu
‘ 3 ‘3'"°° “d I "Nik you for the kmd and tlattenng I
Other parts are of higher quality, as in the following excerpt:
Imsu .wxs or Issvn.-The following return shows an
increase in the amount of note circulation for the four weeks
of £496,984, as compared with the previous retum. During
the four weeks embraced in the return, the bullion in the
banks decreased to the umount. of £36,374 :-
An Account pursuant to the Act. 8 and 9 Vict. cap. 37, of the
Amount ol Bank Notes amhoriaed by luv to be issued by
the several Banks of Issue in Irehmd, and the Avenee
Amount of Bank Notes in circulation, md of Coin he d
during the Four Weeks ending Saturday, the lst dsy of
November, 1851 :-
If we compare this with the OCR result for the corrupted image, however, the difference becomes evident, as in the following excerpt:
-w. »:-:,'§=s;>{}¢_?".;,~»- *ze ° V- . ,»
4., ‘ FU: ' 'Jr 1 __»ir;';g‘ 'F' ° ( 'W os", _tru 'W 'ai' ’\ `¢
1 _, »4<. _,: " _ '' ' . ._ _ _ » _ _
Ulhvr. 9 ` W. . A | it ' _ _ 'N-`€f';fQ g‘f _ ° il: * Q 4 ‘ ¢'|
Q( ' bf, ' ` 1 ,‘ 1'i‘\;.»_:i}¢'('o:"Si‘* iw 5| _ . 4 \ ._ ~ -g _ _ x ' _& » -» 4* :
'F ® {‘|] '1 1 . ¢°‘#\Q 'A:?[, . »r(: ' Pla’ .Vi °‘ ‘ ° /1 ‘_ 'C' , 3|
°" * N* '~ “ - F2 '*%=‘=?2»:f¢‘ ~=<¢m=» »; » » ff; fa ~»-
fb |'»'~|'AH* S 9 ‘°"°»; pf i~T‘3‘ ra,-gg; " #U `° J 3
Tesseract still tries to recognize text in the corrupted image, but the output is unusable, so the comparison between the original TIFF file and the JPEG2000 derivative returns no match at all.
In conclusion, the difference between the OCR results can be used as an indicator of possible data corruption during migration.