Title |
Book page image duplicate detection within one book |
Detailed description | Cultural heritage institutions such as libraries, museums and archives have been carrying out large-scale digitisation projects during the last decade. Certain steps in digital book production (e.g. different scanning sources, various book page image versions, etc.) can introduce duplicate book page images into the compiled version of a digital book. The issue presented here is the need to identify the books within a large digital book collection that contain duplicated book pages, and to determine which book page images actually form duplicate book page image pairs. |
Scalability Challenge |
Currently there are about
|
Issue champion | Schlarb Sven |
Other interested parties |
|
Possible Solution approaches |
|
Context | ONB's ABO project (public-private partnership with Google: Google Books) |
Lessons Learned | |
Training Needs | |
Datasets | Representative Dataset: ONB's Google books test data set (Restricted access), a sample of 50 selected books with around 30,000 pages (~600 pages per book).
Large Dataset: ONB's Google books data set (Restricted access), about 50,000 books with around 28 million pages. | |
Solutions | |
Evaluation
Objectives | The objectives are to measure
|
Success criteria |
|
Automatic measures | The automatic measure considers only throughput (books/time). We compare the runtime of the data preparation and quality assurance workflows on a single machine with the runtime of a Hadoop MapReduce job running on a cluster, increasing the sample size in steps (50, 500, 5000 books) up to a very large data set (50000 books). |
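The throughput comparison described above can be sketched as follows. This is a minimal illustration only: the function names `throughput` and `speedup` are hypothetical, and the runtimes are assumed to be measured separately (in seconds) for the single-machine run and for the Hadoop cluster run on the same sample.

```python
def throughput(books: int, seconds: float) -> float:
    """Return books processed per second for one workflow run."""
    if seconds <= 0:
        raise ValueError("runtime must be positive")
    return books / seconds


def speedup(single_machine_seconds: float, cluster_seconds: float) -> float:
    """Return how many times faster the cluster run is than the
    single-machine run on the same sample (assumed definition)."""
    if single_machine_seconds <= 0 or cluster_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return single_machine_seconds / cluster_seconds
```

Computing both values for each sample size (50, 500, 5000 and 50000 books) would show whether the cluster throughput grows as the sample size increases.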
Manual assessment | A set of books is chosen, and the duplicate book page images in these books are listed as n-tuples (duplicates, triplicates, etc.). This list serves as the gold standard for determining whether the book page image duplicate detection workflow has been applied successfully.
The evaluation process first creates two lists of book page image pairs: the output of the book page image duplicate detection workflow and the ground truth. It then counts each page tuple from the matchbox output that is in the ground truth as a correctly identified tuple (true similar). Tuples that are not in the ground truth are counted as incorrectly identified tuples (false similar). Finally, tuples that are in the ground truth but not in the workflow output are counted as missed tuples (false different).
Precision is the number of true positives (i.e. the number of items correctly labeled as duplicate page pairs) divided by the total number of items labeled as duplicate page pairs (i.e. the sum of true positives and false positives, the latter being items incorrectly labeled as duplicate page pairs). Recall is the number of true positives divided by the total number of actual duplicate page pairs (i.e. the sum of true positives and false negatives, the latter being items that have not been labeled as duplicate page pairs but should have been).
The ground truth contains single page instances without duplicates and n-tuples (duplicates, triples, quadruples, etc.). n-tuples with n>2 are expanded into 2-tuples, and the resulting list is used to determine the number of missed duplicates (false negatives).
Let's assume that the book page image duplicates classifier detects 8 out of 10 duplicate tuples correctly as "similar" (true similar), and that it misses two tuples that are actually "similar" but are classified as "different" (false different). On the other hand, the classifier detects 5 tuples as "similar" while they are actually "different" (false similar).
We use precision and recall as statistical measures, where
precision = Σ true similar / (Σ true similar + Σ false similar)
and
recall = Σ true similar / (Σ true similar + Σ false different)
We then calculate the combined f-measure of precision and recall as
f-measure = 2 * (precision * recall) / (precision + recall)
This means, on the one hand, that the higher the number of correctly identified duplicate page pairs and the lower the number of pairs incorrectly identified as duplicates, the better the precision of the tool. On the other hand, the higher the number of correctly identified duplicate page pairs and the lower the number of missed duplicate page pairs, the better the recall of the tool. The f-measure expresses the balance between precision and recall.
Applied to the example above, the classification of the tool gives
precision = 8 / (8 + 5) = 0.62
and
recall = 8 / (8 + 2) = 0.80
which results in
f-measure = 2 * (0.62 * 0.80) / (0.62 + 0.80) = 0.70 |
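The manual assessment described above can be sketched in Python. This is an illustration under stated assumptions, not the project's evaluation code: the function names are hypothetical, page images are assumed to be identified by sortable ids (integers or strings), and expanding an n-tuple is assumed to mean producing all pairwise combinations of its members.

```python
from itertools import combinations


def expand_to_pairs(groups):
    """Expand n-tuples of page ids into a set of unordered 2-tuples.
    Assumption: an n-tuple yields every pairwise combination; single
    pages (n = 1) yield no pairs."""
    pairs = set()
    for group in groups:
        pairs.update(combinations(sorted(set(group)), 2))
    return pairs


def scores(true_similar, false_similar, false_different):
    """Return (precision, recall, f-measure) from the three counts."""
    labeled = true_similar + false_similar
    actual = true_similar + false_different
    precision = true_similar / labeled if labeled else 0.0
    recall = true_similar / actual if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def evaluate(detected_groups, ground_truth_groups):
    """Compare workflow output against the ground truth, pair by pair."""
    detected = expand_to_pairs(detected_groups)
    truth = expand_to_pairs(ground_truth_groups)
    return scores(
        true_similar=len(detected & truth),     # in both lists
        false_similar=len(detected - truth),    # only in the output
        false_different=len(truth - detected),  # only in the ground truth
    )
```

Calling `scores(8, 5, 2)` reproduces the worked example: precision ≈ 0.62, recall = 0.80 and f-measure ≈ 0.70.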
Actual evaluations | links to actual evaluations of this Issue/Scenario |