Title | Quality assurance in redownload workflows of digitised books |
Detailed description | Cultural heritage institutions such as libraries, museums and archives have been carrying out large-scale digitisation projects during the last decade. The production of digitised versions of books or newspapers took place either in-house or was outsourced to an external partner. However, even if commercial partners were involved in producing the digital masters, the results and any attached property rights were usually transferred entirely to the originating institution. These circumstances have changed in some public-private partnerships, where the digitisation is carried out by the commercial partner, which keeps the original master copy and produces derivatives that are provided to the cultural heritage institution. As a consequence, from the point of view of the cultural heritage institution, the preservation challenges relate to the derivatives rather than to the original master copies (setting aside the very unlikely event that the commercial company disappears together with the digital master copies). This changes an important parameter regarding the use of long-term preservation repositories. Instead of a master copy being produced once, stored "forever" in the repository and never expected to change, new derivatives of master copies are continuously being made available. These derivatives can be downloaded and ingested into the repository as a new version which is either added alongside or replaces the original derivative. In the concrete context of this issue, there are mainly three objects available:
|
Scalability Challenge |
Currently there are about
|
Issue champion | Schlarb Sven |
Other interested parties |
|
Possible Solution approaches |
|
Context | ONB's ABO project (public-private partnership with Google - Google Books) |
Lessons Learned | |
Training Needs | |
Datasets | ONB's Google Books test data set (restricted access): a sample of 50 selected books with around 30,000 pages (~600 pages per book). |
Solutions | |
Evaluation
Objectives | The objectives are to measure
|
Success criteria |
|
Automatic measures | The automatic measure will only consider throughput (books/time). We compare the runtime of the data preparation and quality assurance workflows on a single machine against a Hadoop MapReduce job running on a cluster, with the sample size increasing in several steps (50, 500, 5000 books) up to a very large data set (50000 books). |
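As an illustration of the throughput measurement described above, the following minimal sketch times a quality assurance run over increasing sample sizes and reports books per second. The function `run_qa_workflow` and the book identifiers are hypothetical placeholders, not part of the actual workflow; only the timing logic is shown.

```python
import time

def run_qa_workflow(books):
    """Hypothetical placeholder for the data preparation and quality
    assurance workflow (run locally or submitted as a MapReduce job)."""
    return [b for b in books]

def measure_throughput(books, sample_sizes=(50, 500, 5000, 50000)):
    """Time the workflow for each sample size and return books/second."""
    throughput = {}
    for n in sample_sizes:
        sample = books[:n]
        if len(sample) < n:
            break  # not enough books available for this sample size
        start = time.time()
        run_qa_workflow(sample)
        elapsed = time.time() - start
        throughput[n] = n / elapsed if elapsed > 0 else float("inf")
    return throughput

if __name__ == "__main__":
    books = ["book_%05d" % i for i in range(50000)]
    for n, bps in sorted(measure_throughput(books).items()):
        print("%6d books: %.1f books/second" % (n, bps))
```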
Manual assessment | A set of book pairs is annotated and used as a gold standard to determine whether the quality assurance workflow was applied successfully. The annotation manually assigns a degree of difference to each book pair, consisting of an original and a redownloaded book, which is then used for evaluating the solution. Based on a threshold for the degree of difference, the book pairs are divided into two classes, "similar" and "different".

The figure below illustrates a sample of 50 book pairs where the first two rows represent book pairs classified as "different" (≠) and the remaining rows represent book pairs classified as "similar" (=). A possible output of the quality assurance classifier is shown by the red boxes highlighting 13 out of 50 book pairs that are supposed to be different.

[Figure: sample of 50 book pairs; gold-standard classes "different" (≠) and "similar" (=), classifier output marked by red boxes]

On the one hand, the quality assurance classifier correctly detects 8 out of the 10 "different" book pairs (true different) and misses two book pairs that are "different" but are classified as "similar" (false similar). On the other hand, the classifier detects 5 book pairs as "different" that are actually "similar" (false different), and it correctly detects 35 book pairs as "similar" (true similar) according to the gold standard.

We use precision and recall as statistical measures, where

precision = Σ true different / (Σ true different + Σ false different)
recall = Σ true different / (Σ true different + Σ false similar)

We then calculate the combined f-measure of precision and recall as

f-measure = 2 * (precision * recall) / (precision + recall)

This means, on the one hand, that the higher the number of different book pairs correctly identified and the lower the number of similar book pairs incorrectly identified as different, the better the precision of the tool. On the other hand, the higher the number of different book pairs correctly identified and the lower the number of missed different book pairs, the better the recall of the tool. The f-measure expresses the balance between precision and recall.

For the example above, the classification of the tool would give precision = 8 / (8 + 5) ≈ 0.62 and recall = 8 / (8 + 2) = 0.80, which results in the f-measure = 2 * (0.62 * 0.80) / (0.62 + 0.80) ≈ 0.70 |
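Precision, recall and f-measure as defined above can be computed directly from the three confusion counts. The following minimal sketch (plain Python, function name chosen for illustration only) reproduces the worked example.

```python
def precision_recall_f(true_different, false_different, false_similar):
    """Precision, recall and f-measure for the "different" class,
    computed from the confusion counts defined above."""
    precision = true_different / float(true_different + false_different)
    recall = true_different / float(true_different + false_similar)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

if __name__ == "__main__":
    # Worked example from the text: 8 true different, 5 false different,
    # 2 false similar (and 35 true similar) on a sample of 50 book pairs.
    p, r, f = precision_recall_f(8, 5, 2)
    print("precision=%.2f recall=%.2f f-measure=%.2f" % (p, r, f))
    # -> precision=0.62 recall=0.80 f-measure=0.70
```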
Actual evaluations | links to actual evaluations of this Issue/Scenario |