Dataset:
Title | Austrian National Library - Digital Book Collection |
Description | The Austrian Books Online project of the Austrian National Library is currently creating a digital book collection of about 600,000 books from the 16th to the 19th century, which will be digitised over the coming years. |
Licensing | Sample only available to SCAPE partners. |
Owner | Austrian National Library |
Dataset Location | Austrian National Library |
Issue:
Title | Book page image duplicate detection within one book |
Detailed description | Cultural heritage institutions such as libraries, museums and archives have been carrying out large-scale digitisation projects during the last decade. Due to specifics of the digital book production process (e.g. different scanning sources, various book page image versions, etc.), book page image duplicates can be introduced into the compiled version of a digital book. The issue presented here is the need to identify books within a large digital book collection that contain duplicated pages, and to determine which book page images actually form duplicate pairs. |
Scalability Challenge | Currently there are about 50,000 books (around 28 million pages) in the collection, and this will grow to about 600,000 books as digitisation proceeds. |
Issue champion | Schlarb Sven |
Other interested parties | |
Possible Solution approaches | |
Context | ONB's ABO project (public-private partnership with Google - Google books) |
Lessons Learned | |
Training Needs | |
Datasets | Representative Dataset: ONB's Google books test data set (restricted access), a sample of 50 selected books with around 30,000 pages (~600 pages per book).
Large Dataset: ONB's Google books data set (restricted access), about 50,000 books with around 28 million pages. |
Solutions | |
Evaluation:
Objectives | The objectives are to measure the throughput (books per unit of time) of the duplicate detection workflow when it is scaled out on a Hadoop cluster, and the quality of the detection in terms of precision, recall and f-measure against a manually created gold standard. |
Success criteria | |
Automatic measures | The automatic measure will only consider throughput (books/time). We compare the runtime of the data preparation and quality assurance workflows on a single machine against a Hadoop MapReduce job running on a cluster, with the sample size increasing in several steps (50, 500, 5000 books) up to a very large data set (50,000 books); a minimal throughput sketch is given after this table. |
Manual assessment | A set of books is chosen and the book page image duplicates in them are listed as n-tuples (duplicates, triplicates, etc.) of book page images. This list is the gold standard used to determine whether the book page image duplicate detection workflow was applied successfully.

The evaluation process first creates the duplicate detection workflow output and the ground truth lists of book page image pairs. Each page tuple from the Matchbox output that is also in the ground truth is counted as a correctly identified tuple (true similar). Tuples in the output that are not in the ground truth are counted as incorrectly identified tuples (false similar), and tuples that are in the ground truth but not in the workflow output are counted as missed tuples (false different). The ground truth contains single page instances without duplicates as well as n-tuples (duplicates, triplicates, quadruplicates, etc.); n-tuples with n > 2 are expanded into a list of 2-tuples, which is used to determine the number of missed duplicates (false negatives).

Precision is the number of true positives (items correctly labelled as duplicate page pairs) divided by the total number of items labelled as duplicate page pairs (the sum of true positives and false positives, the latter being items incorrectly labelled as duplicate page pairs). Recall is the number of true positives divided by the total number of actual duplicate page pairs (the sum of true positives and false negatives, the latter being items that were not labelled as duplicate page pairs but should have been). We use precision and recall as statistical measures, where precision = Σ true similar / (Σ true similar + Σ false similar) and recall = Σ true similar / (Σ true similar + Σ false different). We then calculate the combined f-measure of precision and recall as f-measure = 2 * (precision * recall) / (precision + recall). This means, on the one hand, that the higher the number of duplicate page pairs correctly identified and the lower the number of pairs incorrectly reported as duplicates, the better the precision of the tool. On the other hand, the higher the number of duplicate page pairs correctly identified and the lower the number of missed duplicate pairs, the better the recall. The f-measure expresses the balance between precision and recall.

As an example, assume that the book page image duplicates classifier detects 8 out of 10 ground truth tuples correctly as "similar" (true similar) and misses the remaining two, classifying them as "different" although they are actually "similar" (false different). In addition, the classifier reports 5 tuples as "similar" that are actually "different" (false similar). The classification of the tool would then give precision = 8 / (8 + 5) ≈ 0.62 and recall = 8 / (8 + 2) = 0.80, which results in the f-measure = 2 * (0.62 * 0.80) / (0.62 + 0.80) ≈ 0.70. A worked sketch of this computation is given after this table. |
Actual evaluations | links to actual evaluations of this Issue/Scenario |
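A minimal sketch of the throughput measure referred to above, assuming each benchmark run is recorded as the number of books processed and the measured wall-clock runtime in seconds (the function name and the example values are illustrative placeholders, not measured results):

```python
def throughput(books_processed, runtime_seconds):
    """Throughput of one benchmark run, expressed in books per hour."""
    return books_processed / (runtime_seconds / 3600.0)

# Placeholder values; the runtime is to be replaced by the measured wall-clock time of the
# single-machine and Hadoop MapReduce runs at each sample size (50, 500, 5000, 50000 books).
print(throughput(books_processed=50, runtime_seconds=1800.0))  # 100.0 books/hour
```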
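A minimal sketch of the manual assessment computation, assuming the ground truth is given as n-tuples of duplicate page images per book and the workflow output as a list of page pairs (function and variable names are illustrative, not part of the Matchbox tool):

```python
from itertools import combinations

def expand_tuples(ntuples):
    """Expand ground truth n-tuples (duplicates, triplicates, ...) into all unordered 2-tuples."""
    pairs = set()
    for group in ntuples:
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs

def evaluate(ground_truth_ntuples, detected_pairs):
    """Precision, recall and f-measure of detected duplicate page pairs against the gold standard."""
    truth_pairs = expand_tuples(ground_truth_ntuples)
    detected = {tuple(sorted(p)) for p in detected_pairs}
    true_similar = len(detected & truth_pairs)      # correctly identified duplicate pairs
    false_similar = len(detected - truth_pairs)     # reported as duplicates but not in the ground truth
    false_different = len(truth_pairs - detected)   # ground truth duplicates missed by the workflow
    precision = true_similar / (true_similar + false_similar) if detected else 0.0
    recall = true_similar / (true_similar + false_different) if truth_pairs else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure
```

With the counts from the example above (8 true similar, 5 false similar, 2 false different) this yields precision ≈ 0.62, recall = 0.80 and an f-measure of about 0.70.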
Solutions:
Title | SO9 Matchbox - Image comparison tool based on bag-of-(visual-)words matching |
Detailed description | The digital preservation QA command line tool analyses JP2K images using a bag-of-(visual-)words matching method. The tool aims at detecting geometrical distortions and double or missing pages, for duplicate detection within one book or for comparison of old and new versions of a Google book. Identification of corresponding images and duplicate/removal/addition detection is supported in this approach. The method requires a global dictionary for the whole book. The difference is measured on a [0, 1] scale, where 0 means most similar and 1 most different. An illustrative sketch of the bag-of-(visual-)words idea is given after this table. |
Solution Champion | Huber-Mörk Reinhold |
Corresponding Issue(s) | IS10 Potential bit rot in image files that were stored on CD; IS27 Quality assurance in redownload workflows of digitised books |
myExperiment Link | TBD |
Tool Registry Link | TBD |
Evaluation | TBD |
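Matchbox itself is a separate command line tool; the following is only a minimal illustrative sketch of the bag-of-(visual-)words idea it is based on, here using OpenCV SIFT features and k-means clustering (the function names, the dictionary size and the histogram-intersection distance are assumptions for illustration, not Matchbox's actual implementation):

```python
import cv2
import numpy as np

def bovw_histograms(image_paths, dict_size=500):
    """Build one global visual dictionary for the whole book and one normalised histogram per page."""
    sift = cv2.SIFT_create()
    per_page = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        per_page.append(desc if desc is not None else np.empty((0, 128), np.float32))

    # Global dictionary for the whole book: k-means over the descriptors of all pages.
    all_desc = np.vstack(per_page).astype(np.float32)
    dict_size = min(dict_size, len(all_desc))
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1e-3)
    _, _, centres = cv2.kmeans(all_desc, dict_size, None, criteria, 3, cv2.KMEANS_PP_CENTERS)

    # Assign each page's descriptors to the nearest visual word and build a normalised histogram.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    histograms = []
    for desc in per_page:
        hist = np.zeros(dict_size, np.float32)
        if len(desc):
            for m in matcher.match(desc.astype(np.float32), centres):
                hist[m.trainIdx] += 1
            hist /= hist.sum()
        histograms.append(hist)
    return histograms

def page_distance(h1, h2):
    """Difference in [0, 1]: 0 means most similar, 1 most different (1 minus histogram intersection)."""
    return 1.0 - float(np.minimum(h1, h2).sum())
```

Pages whose pairwise distance falls below a chosen threshold would then be reported as candidate duplicate pairs.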