Title
Quality assurance in redownload workflows of digitised books
Detailed description
Cultural heritage institutions such as libraries, museums and archives have been carrying out large-scale digitisation projects during the last decade. The production of digitised versions of books or newspapers took place either in-house or was outsourced to an external partner. However, even if commercial partners were involved in the production of the digital masters, the results and any attached property rights were usually transferred entirely to the originating institution.

These circumstances have changed in some public-private partnerships, where the digitisation is carried out by the commercial partner, which keeps the original master copy and produces derivatives that are provided to the cultural heritage institution. As a consequence, from the point of view of the cultural heritage institution, the preservation challenges relate to the derivatives rather than the original master copies (leaving aside the very unlikely event that the commercial company disappears together with the digital master copies).

This changes an important parameter regarding the use of long-term preservation repositories. Instead of producing a master copy once that is stored "forever" in the repository and is not supposed to change in the future, new derivatives of the master copies are continuously being made available. A derivative can be downloaded and ingested into the repository as a new version that is either added alongside the existing derivative or replaces it.

In the concrete context of this issue, there are mainly three types of objects available:
  1. A METS container for each book item.
  2. A series of digital images (JPEG2000), one for each page of the book.
  3. An hOCR file containing the text and layout information from the OCR.
All three object types provide information that can be used in a quality assurance process that helps to determine whether a new derivative is better in terms of quality than previous versions. First, the images can be used for image analysis and comparison, and context information from the METS file can be used to compare images from one book (possibly duplicated pages) or from different versions against each other. Second, the hOCR files can be used for text content and layout analysis. Finally, a hybrid approach combining image comparison with text/layout analysis can be used.
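As a minimal illustration of the text-analysis side, the sketch below extracts the plain OCR text of one hOCR page so that it can be compared across derivative versions. It assumes the hOCR output marks recognised words with class="ocrx_word" spans (part of the hOCR convention, but OCR-engine dependent); the file name in the commented-out usage line is hypothetical.

# Minimal sketch: extract the plain OCR text of one hOCR page (Python).
from html.parser import HTMLParser

class HocrTextExtractor(HTMLParser):
    """Collects the text of all ocrx_word spans in reading order."""
    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False

    def handle_starttag(self, tag, attrs):
        if "ocrx_word" in (dict(attrs).get("class") or ""):
            self._in_word = True

    def handle_endtag(self, tag):
        if self._in_word and tag == "span":
            self._in_word = False

    def handle_data(self, data):
        if self._in_word and data.strip():
            self.words.append(data.strip())

def hocr_to_text(path):
    """Return the OCR text of a page as a single whitespace-normalised string."""
    parser = HocrTextExtractor()
    with open(path, encoding="utf-8") as f:
        parser.feed(f.read())
    return " ".join(parser.words)

# Hypothetical file name; the real naming scheme depends on the delivered derivatives.
# print(hocr_to_text("book_0001_page_0001.hocr")[:200])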
Scalability Challenge
Currently there are about
  • 50,000 books (at least 320 pages each)
  • 16 million pages (one image and one hOCR file each).
    For example, a simple compression process that takes 2 seconds per hOCR file would take roughly 370 days on a single processing node.
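A back-of-envelope estimate of how this workload scales with the number of processing nodes; the 2-second per-file cost is the assumption from the example above, and perfect parallelisation with zero overhead is a further simplifying assumption:

# Rough throughput estimate for the figures above (assumed values).
PAGES = 16_000_000       # one hOCR file per page
SECONDS_PER_FILE = 2     # assumed per-file processing time

def runtime_days(workers=1):
    """Wall-clock days, assuming perfect parallelisation and no overhead."""
    return PAGES * SECONDS_PER_FILE / workers / 86_400

print(f"1 node:   {runtime_days(1):.0f} days")    # ~370 days
print(f"50 nodes: {runtime_days(50):.1f} days")   # ~7.4 days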
Issue champion
Sven Schlarb (ONB)
Other interested parties
 
Possible Solution approaches
  • Text mining for detecting significant changes between the stored derivative and the new derivative.
  • Comparing old and new instances of the corresponding items (books and/or book pages); a minimal text-comparison sketch follows below.
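The following sketch illustrates the comparison idea under simple assumptions: difflib's sequence similarity stands in for whatever text-mining measure is eventually chosen, the 0.9 threshold is an arbitrary placeholder, and hocr_to_text() refers to the extraction sketch above.

import difflib

def page_similarity(old_text, new_text):
    """Similarity ratio in [0, 1] between two OCR text versions of the same page."""
    return difflib.SequenceMatcher(None, old_text, new_text).ratio()

def classify_pair(old_text, new_text, threshold=0.9):
    """Label a page pair as 'similar' or 'different' (threshold is an assumption)."""
    return "similar" if page_similarity(old_text, new_text) >= threshold else "different"

# Example: compare the stored and the redownloaded hOCR version of one page.
# old = hocr_to_text("stored/page_0001.hocr")      # hypothetical paths
# new = hocr_to_text("redownload/page_0001.hocr")
# print(classify_pair(old, new))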
Context
ONB's ABO project (public-private partnership with Google: Google Books)
Lessons Learned
Training Needs
Datasets
ONB's Google Books test data set (restricted access): a sample of 50 selected books with around 30,000 pages (~600 pages per book).
Solutions

Evaluation

Objectives
The objectives are to measure
  • scalability, in terms of throughput (books/time) of the defined quality assurance workflows, with the sample size increasing in steps (50, 500, 5,000 books) up to a very large data set (50,000 books).
  • reliability, in terms of error-free processing of the defined quality assurance workflows.
  • precision, in terms of the number of pages and books correctly identified.
Success criteria
  • Hadoop data preparation (e.g. loading data into HDFS) and the quality assurance workflows must be processable in a reasonable amount of time (scalability in terms of throughput).
    "Reasonable" is a flexible criterion in this context, meaning that processing a large book collection like the 50,000-book data set should take days rather than weeks. More concretely, loading all hOCR instances into the distributed file system (HDFS) for further analysis would normally be done once a year, depending on the number of redownloaded items. Given that the number of books will increase over the coming years, loading the data into HDFS should not take longer than one week, and the quality assurance workflow should not take longer than around three weeks to run on the complete collection.
  • The process should not fail without a reasonable cause; processing errors are explicitly handled and reported by the system (reliability).
  • The rate of book pairs detected as being significantly different is high enough for the tool to be used for manual quality control. "High enough" is measured in terms of sensitivity and specificity thresholds in relation to the evaluation data set, as explained below (see Manual assessment).
Automatic measures
The automatic measure only considers throughput (books/time). We compare the runtime of the data preparation and quality assurance workflows on a single machine against a Hadoop MapReduce job running on a cluster, with the sample size increasing in steps (50, 500, 5,000 books) up to a very large data set (50,000 books).
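As an illustration of how the comparison step could run in such a job, the following Hadoop Streaming mapper is a sketch only: the tab-separated input record layout (book id plus the paths of the stored and redownloaded hOCR files), the assumption that both files are readable from the worker nodes, and the plain-text comparison are assumptions rather than the actual setup.

#!/usr/bin/env python3
# Sketch of a Hadoop Streaming mapper for the pairwise comparison step.
# Assumed input:  <book_id> TAB <path_to_stored_hocr> TAB <path_to_redownloaded_hocr>
# Output:         <book_id> TAB <similarity ratio>
import sys
import difflib

def read_text(path):
    # Plain read; a real job would first extract the text from the hOCR markup.
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()

def main():
    for line in sys.stdin:
        book_id = "unknown"
        try:
            book_id, old_path, new_path = line.rstrip("\n").split("\t")
            ratio = difflib.SequenceMatcher(
                None, read_text(old_path), read_text(new_path)
            ).ratio()
            print(f"{book_id}\t{ratio:.4f}")
        except Exception as err:
            # Report errors explicitly instead of failing silently (reliability criterion).
            print(f"{book_id}\tERROR\t{err}", file=sys.stderr)

if __name__ == "__main__":
    main()

The same script can be fed the record list on standard input to obtain the single-machine baseline, which keeps the two runtime measurements comparable.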
Manual assessment
A set of book pairs is annotated and taken as the gold standard in order to determine the successful application of the quality assurance workflow. Each annotation manually assigns a degree of difference to a book pair consisting of an original and a redownloaded book; these annotations are used for evaluating the solution. Based on a threshold for the degree of difference, the book pairs are divided into the two classes "similar" and "different". The figure below illustrates a sample of 50 books where the first two rows represent book pairs classified as "different" (≠) and the remaining rows represent book pairs classified as "similar" (=). A possible output of the quality assurance classifier is shown by the red boxes highlighting 13 of the 50 book pairs which it flags as "different".
[Figure: 50 annotated book pairs; the first two rows are "different" pairs, the remaining rows "similar" pairs; red boxes mark the pairs the classifier flags as "different".]
On the one hand, the quality assurance classifier correctly detects 8 out of the 10 "different" books (true different) and misses two books that are "different" but are classified as "similar" (false similar). On the other hand, the classifier detects 5 book pairs as "different" although they are actually "similar" (false different), and it correctly detects 35 book pairs as "similar" (true similar) according to the gold standard.

We use precision and recall as statistical measures where

precision = Σ true different / (Σ true different + Σ false different)

and

recall = Σ true different / (Σ true different + Σ false similar)
We then calculate the combined f-measure of precision and recall as
f-measure = 2 * (precision * recall) / (precision + recall)

This means, on the one hand, that the higher the number of different book pairs correctly identified, and the lower the number of actually similar book pairs incorrectly flagged as different, the better the precision of the tool. On the other hand, the higher the number of different book pairs correctly identified, and the lower the number of different book pairs that are missed, the better its recall. The f-measure expresses the balance between precision and recall.
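A small sketch of the same calculation, using the counts from the example figure above (the function name is ours, not part of any existing tool):

def evaluation_measures(true_different, false_different, false_similar):
    """Precision, recall and f-measure for the 'different' class, as defined above."""
    precision = true_different / (true_different + false_different)
    recall = true_different / (true_different + false_similar)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Counts from the example: 8 true different, 5 false different, 2 false similar.
p, r, f = evaluation_measures(8, 5, 2)
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")  # 0.62, 0.80, 0.70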

Related to the example above, this means that the classification of the tool would give

precision = 8 / (8 + 5) ≈ 0.62

and

recall = 8 / (8 + 2) = 0.80

which results in the

f-measure = 2 * (0.62 * 0.80) / (0.62 + 0.80) ≈ 0.70
Actual evaluations
Links to actual evaluations of this Issue/Scenario.
Labels: qa, lsdr, issue, planning