Compare OCR results of the same source material in different formats (TIFF, JP2)

One line summary This solution compares the OCR results obtained from two formats of the same image: the original TIFF file and a JP2 (JPEG 2000) representation of that TIFF file. The goal was to find out whether the format conversion influences the OCR result.
Detailed description The diagram below gives an overview of the workflow that was developed for this purpose:



The input parameters text1_url and text2_url are URL references to the images to be OCRed, in this case the original TIFF image and the JP2 derivative of the same TIFF image. The parameter langcs indicates the dictionaries that the OCR engine should use, outputFormat is set to Text (UTF-8 plain text), and inputTextType is set to Normal, which means that no special font module is used. Because the source material is English, the English dictionary of the ABBYY OCR engine was used. The workflow then calls the IMPACT_ABBYY_FineReader_10_OCR service from ABBYY twice, once for each image instance.
In order to be able to pass the JP2 file to the OCR engine, it was first decompressed to TIFF by the convert operation, which uses an OpenJPEG conversion service. Finally, a comparison component calculated an indicator for the text difference based on the Levenshtein distance measure.

For calculating the Levenshtein distance, the Apache Commons Lang Java library (commons-lang-2.4.jar) was used.
A Taverna Beanshell component, which interprets Java-syntax scripts at runtime, was used to calculate the Levenshtein distance. To use the external Apache Commons library, it had to be made available to the Taverna Workbench by dropping it into the folder that is indicated in the dependencies tab of the Beanshell configuration. The following Java snippet could then be used in the Beanshell to calculate the difference:

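import org.apache.commons.lang.StringUtils;
// text1 and text2 hold the two OCR results that are compared
double ld = StringUtils.getLevenshteinDistance(text1, text2);
// mean length of the two texts
double avglen = ((double)text1.length()+(double)text2.length())/2.0;
// 1.0 minus the distance relative to the mean length, clamped at 0.0
double m = 1.0-(ld/avglen);
double normVal = (m<0)?0.0:m;
// value returned by the beanshell
normalized_levenshtein_distance = normVal;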

Using the org.apache.commons.lang.StringUtils.getLevenshteinDistance(text1, text2) method, the snippet determines the Levenshtein distance, relates it to the mean length of the two compared texts ((double)(text1.length()+text2.length())/2.0), and calculates a normalized similarity measure where 0 means no similarity and 1 means absolute equality. Despite its name, the output normalized_levenshtein_distance is therefore a similarity, not a distance.
We assumed that the character-level distance is a sufficient indicator of the text difference in this case, because the good quality of the images led to OCR results with only very few character errors.
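
The normalization described above can also be tried outside Taverna. The following is a minimal sketch, not the workflow's actual component: it replaces the Apache Commons call with a hand-written dynamic-programming Levenshtein distance so that it runs without external jars, and the class and method names (NormalizedLevenshtein, distance, similarity) are hypothetical. It also assumes that two empty texts should count as identical; the original snippet would divide zero by zero in that case and return NaN.

```java
class NormalizedLevenshtein {

    // Levenshtein distance via dynamic programming, keeping only two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1 minus the distance relative to the mean text length.
    static double similarity(String text1, String text2) {
        double avgLen = (text1.length() + text2.length()) / 2.0;
        if (avgLen == 0.0) {
            return 1.0; // assumption: two empty texts are treated as identical
        }
        double m = 1.0 - distance(text1, text2) / avgLen;
        return Math.max(0.0, m); // clamp negative values to 0, as in the beanshell
    }

    public static void main(String[] args) {
        // Distance 3 over a mean length of 6.5 gives a similarity of roughly 0.54.
        System.out.println(similarity("kitten", "sitting"));
    }
}
```

Because the distance is divided by the mean length rather than the maximum length, the raw value can drop below 0 for texts of very different lengths, which is why the clamp is needed.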

In several tests, the OCR result for the JP2 file was slightly better than the result for the original TIFF file. The difference was only 1 to 2 characters, and the text difference measure indicated a match above 99% between the two results.

Due to license restrictions, it is not possible to make the whole workflow available because it was using the commercial ABBYY OCR engine.

The beanshell for calculating the text difference has been made available on myExperiment. It would also be easy to replace the ABBYY OCR services with a service based on the Tesseract OCR engine.
Solution champion Sven Schlarb <shsschlarb-aqua@yahoo.de>
myExperiment link http://www.myexperiment.org/workflows/2175
Evaluation
  • The workflow shows the basic concepts for setting up an OCR evaluation experiment.
  • In a real evaluation scenario, it would be important to add word-level evaluation to the character-level evaluation.
  • To evaluate the OCR results more reliably, the results from the two images should be compared against an absolutely correct, double-keyed text representation of the image instead of simply against each other.
  • A further interesting line of development would be to have not just two but several evaluation strands in which different JP2 encoding options are evaluated, in order to determine the optimal encoding parameters for OCR, e.g. the maximum JPEG 2000 compression rate that does not influence the OCR result.
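
The word-level evaluation against a correct reference text suggested above could look roughly like the following. This is a minimal sketch under stated assumptions and not part of the published workflow: words are split on whitespace only, the measure is the commonly used word error rate (word-level Levenshtein distance divided by the number of reference words), and all names (WordLevelEvaluation, tokenize, wordDistance, wordErrorRate) and the sample strings are hypothetical.

```java
class WordLevelEvaluation {

    // Assumption: whitespace tokenization is good enough for this comparison.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Levenshtein distance where the units are whole words instead of characters.
    static int wordDistance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // Word error rate of an OCR result relative to a reference (ground truth) text.
    static double wordErrorRate(String reference, String ocrText) {
        String[] ref = tokenize(reference);
        if (ref.length == 0) {
            throw new IllegalArgumentException("reference text contains no words");
        }
        return (double) wordDistance(ref, tokenize(ocrText)) / ref.length;
    }

    public static void main(String[] args) {
        // Made-up sample strings, for illustration only.
        String groundTruth = "the quick brown fox";
        String ocrFromTiff = "the quick brovvn fox";
        String ocrFromJp2 = "the quick brown fox";
        System.out.println("TIFF word error rate: " + wordErrorRate(groundTruth, ocrFromTiff));
        System.out.println("JP2 word error rate: " + wordErrorRate(groundTruth, ocrFromJp2));
    }
}
```

Scoring each OCR result against the reference separately shows which format produced fewer errors, something a direct comparison of the two OCR outputs cannot tell.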
Tool (link) Taverna Workflow Design and Execution workbench, OpenJPEG, Tesseract OCR
Labels: ocr, jp2, jpeg2000, levenshtein, solution, aqua, quality_assurance