|One line summary|| The intention of this solution was to compare two OCR results for the same image in two different formats: the original TIFF file and a JP2 (JPEG 2000) representation of that TIFF file. The goal was to find out whether the format conversion influences the OCR result.
|Detailed description|| The diagram below gives an overview of the workflow that has been developed for this purpose:
The input parameters text1_url and text2_url are URL references to the images to be OCRed: in this case the original TIFF image and the JP2 derivative of the same TIFF image. The parameter langcs indicates the dictionaries that the OCR engine should use; because the source material was English, the English dictionary of the ABBYY OCR engine was used. The outputFormat is set to Text (UTF-8 plain text), and the inputTextType is set to Normal, which means that no special font module is used. The workflow then invokes the IMPACT_ABBYY_FineReader_10_OCR service from ABBYY twice, once for each of the two image instances.
To be able to pass the JP2 file to the OCR engine, the convert operation first decompressed it to TIFF using an OpenJPEG conversion service. Finally, a comparison component calculated an indicator for the text difference based on the Levenshtein distance measure.
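The data flow described above can be sketched in plain Java. This is only an illustration of the order of the steps, not the Taverna workflow itself: the class name, the parameter names tiffUrl and jp2Url, and the three injected functions are all hypothetical stand-ins for the conversion service, the OCR service and the comparison component.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

/** Hypothetical stand-in for the workflow; none of these names come from Taverna or ABBYY. */
final class OcrComparisonWorkflow {
    // Assumed shape: JP2 image URL -> URL of the decompressed TIFF
    private final Function<String, String> jp2ToTiff;
    // Assumed shape: TIFF image URL -> recognized plain text
    private final Function<String, String> ocr;
    // Assumed shape: two texts -> similarity indicator
    private final BiFunction<String, String, Double> compare;

    OcrComparisonWorkflow(Function<String, String> jp2ToTiff,
                          Function<String, String> ocr,
                          BiFunction<String, String, Double> compare) {
        this.jp2ToTiff = jp2ToTiff;
        this.ocr = ocr;
        this.compare = compare;
    }

    /** OCRs the original TIFF and the decompressed JP2, then compares the two texts. */
    double run(String tiffUrl, String jp2Url) {
        String text1 = ocr.apply(tiffUrl);                // original TIFF goes straight to OCR
        String decompressedUrl = jp2ToTiff.apply(jp2Url); // JP2 is decompressed to TIFF first
        String text2 = ocr.apply(decompressedUrl);
        return compare.apply(text1, text2);
    }
}
```

Passing the three steps in as functions keeps the sketch independent of any particular OCR engine, which mirrors the point that the OCR service in the workflow is replaceable.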
For calculating the Levenshtein distance, the Apache Commons Lang Java library (commons-lang-2.4.jar) was used.
A Taverna beanshell, which interprets a Java-like scripting language at runtime, was used to calculate the Levenshtein distance. To use the external Apache Commons library, it had to be made available to the Taverna Workbench by placing it in the folder indicated in the dependencies tab of the beanshell configuration. The following Java snippet could then be used in the beanshell to calculate the difference:
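import org.apache.commons.lang.StringUtils;
double ld = StringUtils.getLevenshteinDistance(text1, text2);
// Mean length of the two texts, used to normalize the raw edit distance
double avglen = ((double)text1.length()+(double)text2.length())/2.0;
double m = 1.0-(ld/avglen);
// Clamp negative values to 0 so that the result stays within [0, 1]
double normVal = (m<0)?0.0:m;
normalized_levenshtein_distance = normVal;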
Using the org.apache.commons.lang.StringUtils.getLevenshteinDistance(text1, text2) method, the snippet determines the Levenshtein distance, relates it to the mean length of the two compared texts ((double)(text1.length()+text2.length())/2.0), and calculates a normalized similarity measure where 0 means no similarity and 1 means absolute equality. Despite its name, the output normalized_levenshtein_distance is therefore a similarity, not a distance.
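For readers who want to try the measure outside Taverna, the following is a minimal self-contained sketch of the same calculation. It is not the beanshell that was used: the class and method names are made up for illustration, the edit distance is a hand-written dynamic-programming version instead of the Apache Commons call, and returning 1.0 for two empty texts is an added assumption to avoid a division by zero.

```java
/** Illustrative re-implementation; class and method names are hypothetical. */
final class NormalizedLevenshtein {

    /** Edit distance where insertion, deletion and substitution each cost 1. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1 minus the distance divided by the mean text length. */
    static double similarity(String text1, String text2) {
        double avgLen = (text1.length() + text2.length()) / 2.0;
        if (avgLen == 0.0) {
            return 1.0; // assumption: two empty texts count as identical
        }
        double m = 1.0 - distance(text1, text2) / avgLen;
        return m < 0 ? 0.0 : m; // clamp, as in the beanshell snippet
    }

    public static void main(String[] args) {
        System.out.println(similarity("kitten", "sitting"));
    }
}
```

The clamp matters because the distance can exceed the mean length: comparing an empty text with a three-character text gives a distance of 3 against a mean length of 1.5, which would otherwise yield -1.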
We assumed that the character-level distance is in this case a sufficient indicator for the difference of the text material, because the good quality of the images led to OCR results with only very few character errors.
The result was that, in several tests, the OCR result for the JP2 file was slightly better than the one for the original TIFF file. However, the difference was only 1 to 2 characters, and the text difference measure indicated a match above 99% between the results.
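As a plausibility check, the similarity measure can be written out and solved for the text length. The symbols are introduced here for illustration only: d is the Levenshtein distance between the two OCR outputs t_1 and t_2, and \bar{\ell} is their mean length. The check assumes that the reported 1 to 2 characters can be read as the edit distance between the two outputs.

```latex
m = \max\!\left(0,\; 1 - \frac{d}{\bar{\ell}}\right),
\qquad
\bar{\ell} = \frac{|t_1| + |t_2|}{2}

m > 0.99
\;\Longleftrightarrow\;
\frac{d}{\bar{\ell}} < 0.01
\;\Longleftrightarrow\;
\bar{\ell} > 100\,d
```

Under that reading, a distance of d = 2 gives a match above 99% as soon as the mean text length exceeds 200 characters, so the two reported figures are consistent for any page with more than a few lines of text.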
Due to license restrictions, it is not possible to make the whole workflow available, because it uses the commercial ABBYY OCR engine.
The beanshell for calculating the text difference has been made available on myExperiment. It would nevertheless be easy to replace the ABBYY OCR services with a service based on the Tesseract OCR engine.
|Solution champion|| Sven Schlarb <[email protected]>
|Tool (link)||Taverna Workflow Design and Execution workbench, OpenJPEG, Tesseract OCR|