There are three goals for the OCR Quality working group:
- Develop a Hadoop job that assesses OCR quality at the word level
- Develop a Hadoop job that corrects common OCR errors
- Develop a Hadoop job that creates a full-text search index from the OCR results
We aim to complete at least the first one during the hackathon, as most group members have no experience with Hadoop.
For the time being, the working plan for the group is as follows:
- We will use the dataset available on the virtual machines provided by the organisers
- We will use historical dictionaries from the IMPACT Centre of Competence to assess the quality of OCR
- We will add simple heuristics for assessing OCR quality (e.g. a word containing non-alphanumeric characters is treated as wrongly recognised)
- We will report quality at file-level granularity
- Because we want to learn both Hadoop and Pig, we will develop two alternative solutions: one written in Pig and the other in Java (pure Hadoop).
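
The plan above could be sketched in plain Java roughly as follows. This is only an illustration of the per-file scoring logic, not the group's actual job: the class name `OcrQualitySketch`, the `FileScore` holder, and the rule that a word is suspicious when it contains a non-alphanumeric character or is missing from the dictionary are all assumptions made for the example. The dictionary is passed in as a simple set of lowercase words; how the IMPACT dictionaries would be loaded is left open. In a Hadoop job, logic like this would presumably be called from a mapper, with the file name as the output key.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: scores the OCR text of one file at word level.
public class OcrQualitySketch {

    // Per-file tally of words seen and words flagged as suspicious.
    public static final class FileScore {
        public final int total;
        public final int suspicious;

        FileScore(int total, int suspicious) {
            this.total = total;
            this.suspicious = suspicious;
        }

        // Share of words that passed both checks; 0.0 for an empty file.
        public double quality() {
            return total == 0 ? 0.0 : (double) (total - suspicious) / total;
        }
    }

    // Assumed rule: a word is suspicious if it still contains a
    // non-alphanumeric character after surrounding punctuation is
    // stripped, or if it is not found in the dictionary.
    // Note that this simple rule also flags hyphenated words.
    static boolean isSuspicious(String word, Set<String> dictionary) {
        boolean alphanumeric = word.codePoints().allMatch(Character::isLetterOrDigit);
        return !alphanumeric || !dictionary.contains(word.toLowerCase(Locale.ROOT));
    }

    // Splits the text on whitespace and tallies suspicious words.
    public static FileScore score(String text, Set<String> dictionary) {
        int total = 0;
        int suspicious = 0;
        for (String token : text.trim().split("\\s+")) {
            // Strip leading and trailing punctuation such as quotes and full stops.
            String word = token.replaceAll("^[^\\p{L}\\p{Nd}]+|[^\\p{L}\\p{Nd}]+$", "");
            if (word.isEmpty()) {
                continue; // token was empty or pure punctuation; not counted
            }
            total++;
            if (isSuspicious(word, dictionary)) {
                suspicious++;
            }
        }
        return new FileScore(total, suspicious);
    }
}
```

Calling `OcrQualitySketch.score(text, dictionary)` once per file gives the file-level granularity mentioned above. A Pig version would likely express the same steps as tokenising, joining against the dictionary, and grouping by file name.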