Title
Book page image duplicate detection within one book
Detailed description Cultural heritage institutions such as libraries, museums and archives have been carrying out large-scale digitisation projects over the last decade.

Due to specifics of the digital book production process (e.g. different scanning sources, various versions of book page images, etc.), duplicate page images can be introduced into the compiled version of a digital book.

The issue presented here is the need to identify books within a large digital book collection that contain duplicated pages, and to determine which book page images actually form duplicate pairs.
Scalability Challenge
Currently there are about
  • 50.000 books (at least 320 pages each)
  • 16 million pages (one image and one hOCR file each).
    For example, a simple compression process that takes 1 second for each hOCR file would last about 185 days on one single processing node (see the sketch below).
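A minimal back-of-the-envelope sketch of this estimate, and of how it scales when the work is spread over several processing nodes, is given below. The per-file cost and the worker counts are illustrative assumptions, not measured figures.

PAGES = 16_000_000        # total number of hOCR files (one per page image)
SECONDS_PER_FILE = 1      # assumed cost of one simple compression run

total_days = PAGES * SECONDS_PER_FILE / (60 * 60 * 24)
print(f"sequential runtime: {total_days:.0f} days")   # ~185 days on one node

# The same workload divided over N workers (e.g. Hadoop map slots):
for workers in (10, 50, 100):
    print(f"{workers:>3} workers: {total_days / workers:.1f} days")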
Issue champion Sven Schlarb (ONB).
Other interested parties
 
Possible Solution approaches
  • Comparing book pages within one book.
Context ONB's ABO project (public-private partnership with Google - Google Books)
Lessons Learned
Training Needs
Datasets Representative Dataset: ONB's Google Books test data set (restricted access), a sample of 50 selected books with around 30.000 pages (~600 pages per book).
Large Dataset: ONB's Google Books data set (restricted access), about 50.000 books with around 28 million pages.
Solutions

Evaluation

Objectives The objectives are to measure
  • scalability in terms of throughput (books/time) of the defined quality assurance workflows, with increasing sample sizes (50, 500, 5000 books) in various steps up to a very large data set (50.000 books).
  • reliability in terms of error-free processing of the defined quality assurance workflows.
  • preciseness in terms of the number of pages and books correctly identified.
Success criteria
  • Hadoop data preparation (e.g. loading data into HDFS) and the quality assurance workflows must be processable in a reasonable amount of time (scalability in terms of throughput).
    "Reasonable" in this context is a flexible criterion meaning "it should take days rather than weeks" to process a large book collection such as the data set of 50.000 books. To be more concrete, loading all hOCR instances into the distributed file system (HDFS) for further analysis would normally be done once a year, depending on the amount of redownloaded items. Given that the number of books will increase over the coming years, loading the data into HDFS should not take longer than one week. The quality assurance workflow should not take longer than around three weeks to run on the complete collection.
  • The process should not fail without a reasonable cause; processing errors are explicitly handled and reported by the system (reliability).
  • The rate of book page pairs correctly detected as duplicates is high enough for the output to be used as a tool for manual quality control. "High enough" is measured in terms of sensitivity and specificity thresholds in relation to the evaluation data set, as explained below (see Manual assessment).
Automatic measures The automatic measure will only consider throughput (books/time). We compare the runtime of the data preparation and quality assurance workflows on a single machine with that of a Hadoop MapReduce job running on a cluster, with increasing sample sizes (50, 500, 5000 books) in various steps up to a very large data set (50.000 books).
Manual assessment A set of books is chosen and the book page image duplicates are listed as n-tuples (duplicates, triplicates, etc.) of book page images that have to be identified. This list is considered the gold standard used to determine the successful application of the book page image duplicate detection workflow.

The evaluation process first creates the book page image duplicate detection workflow output and the ground truth list of book page image pairs.

It then counts each page tuple from the Matchbox output that is in the ground truth as a correctly identified tuple (true similar). Tuples that are not in the ground truth are counted as incorrectly identified tuples (false similar), and finally, tuples that are in the ground truth but not in the book page image duplicate detection workflow output are counted as missed tuples (false different). Precision is then calculated as the number of true positives (i.e. the number of items correctly labeled as duplicate page pairs) divided by the total number of elements assumed to be duplicate page pairs (i.e. the sum of true positives and false positives, the latter being items incorrectly labeled as duplicate page pairs). Recall is defined as the number of true positives divided by the total number of actual duplicate page pairs (i.e. the sum of true positives and false negatives, the latter being items that have not been labeled as duplicate page pairs but actually should have been). The ground truth contains single page instances without duplicates and n-tuples (duplicates, triples, quadruples, etc.). n-tuples with n>2 are expanded; the result is a list of 2-tuples which is used to determine the number of missed duplicates (false negatives).
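A minimal sketch of how this counting and the derived measures could be implemented is shown below. It assumes that both the workflow output and the ground truth are available as lists of n-tuples of page identifiers; the helper names (expand_pairs, evaluate) are illustrative, not part of the actual workflow implementation.

from itertools import combinations

def expand_pairs(tuples):
    # Expand n-tuples (duplicates, triples, quadruples, ...) into 2-tuples of page identifiers.
    pairs = set()
    for t in tuples:
        pairs.update(combinations(sorted(t), 2))
    return pairs

def evaluate(workflow_tuples, ground_truth_tuples):
    detected = expand_pairs(workflow_tuples)
    truth = expand_pairs(ground_truth_tuples)
    true_similar = len(detected & truth)      # pairs correctly identified as duplicates
    false_similar = len(detected - truth)     # pairs reported as duplicates but not in the ground truth
    false_different = len(truth - detected)   # duplicate pairs in the ground truth that were missed
    precision = true_similar / (true_similar + false_similar)
    recall = true_similar / (true_similar + false_different)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure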

Let's assume that the book page image duplicates classifier detects 8 out of 10 tuples correctly as "similar" (true similar), and it misses two tuples that are actually "similar" but are classified as "different" (false different). On the other hand, the classifier detects 5 tuples as "similar" while they are actually "different" (false similar).

We use precision and recall as statistical measures where

precision= Σ true similar / (Σ true similar + Σ false similar)

and

recall = Σ true similar / (Σ true similar + Σ false different)
 
We then calculate the combined f-measure of precision and recall as
f-measure = 2 * (precision * recall) / (precision + recall)

This means, on the one hand, that the higher the number of duplicate page pairs correctly identified and the lower the number of pairs incorrectly identified as duplicates (which are actually different), the better the precision of the tool. On the other hand, the higher the number of duplicate page pairs correctly identified and the lower the number of missed duplicate pairs, the better the recall of the tool. The f-measure expresses the balance between precision and recall.

Related to the example above, this means that the classification of the tool would give
precision = 8 / (8 + 5) ≈ 0.62
and

recall = 8 / (8 + 2) = 0.80
which results in the

f-measure = 2 * (0.62 * 0.80) / (0.62 + 0.80) ≈ 0.70
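These figures can be reproduced directly from the counts of the example (a minimal check in Python, assuming 8 true similar, 5 false similar and 2 false different tuples):

true_similar, false_similar, false_different = 8, 5, 2
precision = true_similar / (true_similar + false_similar)    # 8/13 ≈ 0.62
recall = true_similar / (true_similar + false_different)     # 8/10 = 0.80
f_measure = 2 * precision * recall / (precision + recall)    # ≈ 0.70
print(round(precision, 2), round(recall, 2), round(f_measure, 2))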
Actual evaluations Links to actual evaluations of this Issue/Scenario.
Labels: issue, untagged