h2. Detecting duplicates on large collections of digitized book pages

* Context
** ONB is mass-digitizing books that are handwritten, multilingual, or written in archaic languages (100,000 books).
* Preservation Issues
** Sometimes pages are digitized twice because the automated process gets it wrong (repetition within a book).
** Sometimes two instances of the same book are digitized, and these duplicates have to be detected (book repetition).
** Detecting duplicated pages by hand is error-prone, and it is infeasible at this scale.
* SCAPE solution (unique selling point)
** MatchBox can detect duplicates in collections of image files.
** Humans can then make decisions on which version of the duplicate they want to keep.
* Success Story Champion
** Reinhold (AIT)



h2. Context

Europe has invested millions of euros in mass digitization projects. Such projects are error-prone: book pages are sometimes digitized more than once (due to the particularities of the digitization process), and entire books appear duplicated because they belong to multiple book collections at the same time.

Duplicates are not exact copies of each other because they result from distinct digitization processes. Images may be skewed, rotated, or cropped, or may differ in color tone.

Additionally, post-processing jobs on the digitized images may produce results that do not comply with the quality standards of the collection owner, or may eliminate images in the process.

This means that traditional duplicate image detection techniques will not work in these scenarios. OCR-based techniques will not work either, because books are sometimes handwritten or written in an archaic language.
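
To make the first point concrete, here is a minimal sketch that assumes "traditional" detection means comparing byte-level checksums (the text does not say which techniques are meant). The 4x4 grayscale "scans" and the helper name {{checksum}} are hypothetical and exist only for illustration. A re-scan that differs only slightly in tone yields a completely different checksum, so it is never flagged as a duplicate.

```python
import hashlib


def checksum(pixels):
    """Return the SHA-256 hex digest of a grayscale image given as rows of 0-255 ints."""
    data = bytes(value for row in pixels for value in row)
    return hashlib.sha256(data).hexdigest()


# Hypothetical 4x4 grayscale "scan" of a page (illustration only).
scan_a = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]

# A second scan of the same page with a slightly different tone (assumed +3 brightness).
scan_b = [[min(255, value + 3) for value in row] for row in scan_a]

# A byte-identical copy is detected by the checksum ...
exact_copy_matches = checksum(scan_a) == checksum([list(row) for row in scan_a])

# ... but the near-duplicate is not, although a human would call it the same page.
near_duplicate_matches = checksum(scan_a) == checksum(scan_b)
```

Here {{exact_copy_matches}} is true while {{near_duplicate_matches}} is false, which is the failure mode described above.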

Manual approaches tend to be imprecise and time-consuming, so a tool that automates this process accurately is necessary.

SCAPE has developed MatchBox, which uses an innovative approach to address this problem. MatchBox applies state-of-the-art image processing approaches to the domain of digital preservation and quality assurance. Using computer vision algorithms that select key characteristics of the images and extract information from their content, it can process large digitized collections in a scalable way.
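
The general idea of comparing images by their content rather than by their bytes can be sketched as follows. This is not MatchBox's algorithm, which is not described here. It is a much simpler stand-in (an average-hash style signature), and the names {{signature}} and {{hamming}} as well as the tiny 4x4 pages are hypothetical. Each image is reduced to a bit pattern that records which pixels are brighter than the image mean, and two images are compared by counting differing bits.

```python
def signature(pixels):
    """Reduce a grayscale image (rows of 0-255 ints) to one bit per pixel:
    1 if the pixel is brighter than the image mean, else 0."""
    values = [value for row in pixels for value in row]
    mean = sum(values) / len(values)
    return [1 if value > mean else 0 for value in values]


def hamming(sig_a, sig_b):
    """Count the positions at which two equal-length signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


# Hypothetical 4x4 grayscale pages (illustration only).
page = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
# Same page, re-scanned with a slightly different tone (assumed +3 brightness).
rescan = [[min(255, value + 3) for value in row] for row in page]
# A different page with a different layout.
other_page = [
    [200, 200, 200, 200],
    [10, 10, 10, 10],
    [200, 200, 200, 200],
    [10, 10, 10, 10],
]

distance_to_rescan = hamming(signature(page), signature(rescan))
distance_to_other = hamming(signature(page), signature(other_page))
```

In this toy case the re-scan has distance 0 to the original and the unrelated page has distance 8 out of 16 bits, so a small distance threshold would separate them. A signature this simple tolerates only tonal changes. Skewed, rotated, or cropped scans need more robust features, which is the kind of problem the computer vision approach mentioned above is meant to address.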







h2. Preservation issues



h2. Solution