
Dataset:

Title
10 IDP samples from BL
Description 5 TIF images (high resolution, 5248x6300 pixels) with 5 corresponding JPG images (low resolution, approx. 1000x1500 pixels). See http://idp.bl.uk.
Licensing Sample available under a BL licence that restricts usage to research only; use is not otherwise restricted to SCAPE Project partners. See full licence
Owner British Library
Dataset Location TBD
Collection expert Maureen Pennock (BL). The BL's Newspaper curator was Ed King
Issues brainstorm
  • BL
    • TIF to JPG comparison
List of Issues IS10 Potential bit rot in image files that were stored on CD
IS27 Quality assurance in redownload workflows of digitised books

Issue:

Title
IS10 Potential bit rot in image files that were stored on CD
Detailed description Digitised master image files (TIFFs) from a legacy digitisation project were stored for a number of years on CD. Corresponding service/access images (JPEGs, at a lower resolution, cropped, scale added, colour balanced) were stored on a web server during this period. Consequently there is a higher confidence in the bit integrity of the service copies. Without checksums, the only method of checking the master images for bit rot is to open each one and visually inspect it. The screenshot below shows a master image (on the left) and the service image (on the right).
This issue also aims to support quality assurance in digital preservation. It addresses the challenges of image-based document comparison, such as detecting differences in file format, colour information, scale, rotation, resolution and cropping, as well as slight differences in content.
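The description above notes that, without checksums, master images can only be checked by eye. As a minimal sketch of how a checksum manifest could avoid this for future storage, the following records a SHA-256 digest per file and later reports files whose content has changed or disappeared. The flat directory layout, the `*.tif` pattern and the function names are assumptions made for illustration; they are not part of the BL workflow.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, pattern="*.tif"):
    """Map each matching file name to its checksum (assumes one flat directory)."""
    return {p.name: sha256_of_file(p) for p in sorted(Path(directory).glob(pattern))}


def verify_manifest(directory, manifest):
    """Return the names of files that are missing or whose checksum has changed."""
    failed = []
    for name, expected in manifest.items():
        path = Path(directory) / name
        if not path.is_file() or sha256_of_file(path) != expected:
            failed.append(name)
    return failed
```

A manifest built at digitisation time and verified after each storage migration would flag bit rot directly. It does not help with the legacy CDs described here, for which no such manifest exists.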
Scalability Challenge
The volume of the collection is _; there are approximately _ files. Manual QA and checking are not feasible at this volume, so an automated approach is required.
There are no specific performance requirements for image-based document comparison. A very large data set would nevertheless be useful for testing scalability.
Issue champion Maureen Pennock (BL), Digital Preservation Manager. Collection curator is Victoria Swift (BL)
Other interested parties
Huber-Mörk Reinhold (AIT), Schindler Alexander (AIT), Graf Roman (AIT)
Possible Solution approaches
  • If the master and service images were the same, or similar, a simple comparison between them would enable bit rot to be detected. However, the high degree of processing applied to the service images means that they are quite different in appearance from the master images. Fuzzy matching may enable parts of the images to be matched, but image-focused approaches may be extremely challenging. OCR-based comparison may be possible, although OCR engines may struggle with handwritten Chinese characters. This scenario may simply be too challenging to solve!
  • Note that several AQuA Project activities examined similar (if more straightforward) challenges here
  • BL
    • TIF to JPG comparison
  • Austrian National Library
    • Overwriting existing collection items with new items
    • Scalability issue because image pairs can be compared within a book (book pages can be duplicated or missing) and between different master scans.
    • JPEG2000 profile check
Context Details of the institutional context to the Issue. (May be expanded at a later date)
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Training Needs Is there a need or is there value in providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Datasets British Library - International Dunhuang Project Manuscripts
10 IDP samples from BL
Austrian National Library - Digital Book Collection
Solutions SO9 QA for correspondent JP2K comparison for old and new Google book versions (image comparison tool based on bag-of-(visual-)words matching)
SO10 QA for TIFF to correspondent JP2K comparison (image comparison tool based on SIFT-matching)
SO16 QA for estimation of affine transformation (image comparison tool based on SSIM algorithm)

Evaluation

Objectives Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, preciseness, automation
Success criteria Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?
Automatic measures What automated measures would you like the solution to give to evaluate the solution for this specific issue? which measures are important?
If possible specify very specific measures and your goal - e.g.
 * process 50 documents per second
 * handle 80 GB files without crashing
 * identify 99.5% of the content correctly
Manual assessment Apart from the automated measures you would like to get, do you foresee any manual assessment being necessary to evaluate the solution to this issue?
If possible specify measures and your goal - e.g.
 * Solution installable with basic linux system administration skills
 * User interface understandable by non developer curators
Actual evaluations Links to actual evaluations of this Issue/Scenario

Solutions:

Title SO9 Matchbox - Image comparison tool based on bag-of-(visual-)words matching
Detailed description The digital preservation QA command line tool analyzes JP2K images using a bag-of-(visual-)words matching method. It aims to detect geometrical distortions and duplicated or missing pages, either within a single book or when comparing old and new versions of a Google book. The approach supports identification of corresponding images and detection of duplicated, removed or added pages. The method requires a global dictionary for the whole book.
The difference is measured in [0,1], where 0 means most similar and 1 means most different.
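To illustrate how a bag-of-visual-words comparison can yield a difference in [0,1], the sketch below represents each page as a list of visual-word ids (assumed to come from quantising local features against the book-wide dictionary) and compares the normalised histograms. The total variation distance and the 0.1 duplicate threshold are assumptions chosen for illustration; the source does not state which distance or threshold Matchbox uses.

```python
from collections import Counter


def word_histogram(visual_words):
    """Normalise a list of visual-word ids into a frequency histogram."""
    counts = Counter(visual_words)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("an image needs at least one visual word")
    return {word: n / total for word, n in counts.items()}


def histogram_distance(words_a, words_b):
    """Total variation distance: 0 for identical histograms, 1 for disjoint ones."""
    hist_a = word_histogram(words_a)
    hist_b = word_histogram(words_b)
    vocabulary = set(hist_a) | set(hist_b)
    return 0.5 * sum(abs(hist_a.get(w, 0.0) - hist_b.get(w, 0.0)) for w in vocabulary)


def likely_duplicates(pages, threshold=0.1):
    """Return index pairs of pages whose distance is below an assumed threshold."""
    pairs = []
    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            if histogram_distance(pages[i], pages[j]) < threshold:
                pairs.append((i, j))
    return pairs
```

The pairwise loop grows quadratically with the number of pages, which is one reason the comparison within a book is listed as a scalability concern.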
Solution Champion
Huber-Mörk Reinhold (AIT), Schindler Alexander (AIT), Graf Roman (AIT)
Corresponding Issue(s)
IS10 Potential bit rot in image files that were stored on CD
IS27 Quality assurance in redownload workflows of digitised books
myExperiment Link
TBD
Tool Registry Link
TBD
Evaluation
TBD
Title SO10 QA for TIFF to correspondent JP2K comparison (image comparison tool based on SIFT-matching)
Detailed description The digital preservation QA command line tool checks for bit integrity using a SIFT-matching method.
This solution supports a detailed comparison of corresponding images, based on local descriptor matching.
It also detects missing, duplicated or redundant images in a dataset.
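Local descriptor matching is commonly done by nearest-neighbour search combined with a ratio test, which discards ambiguous matches. The sketch below shows that idea on toy two-dimensional descriptors; real SIFT descriptors have 128 dimensions and would come from a feature extraction library. The 0.75 ratio and the function names are assumptions for illustration, not the parameters of the SO10 tool.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_test_matches(descriptors_a, descriptors_b, ratio=0.75):
    """Match each descriptor in A to its nearest neighbour in B if it passes the ratio test."""
    matches = []
    for i, desc in enumerate(descriptors_a):
        ranked = sorted((euclidean(desc, other), j) for j, other in enumerate(descriptors_b))
        if len(ranked) < 2:
            continue  # the ratio test needs at least two candidates
        (best, j), (second, _) = ranked[0], ranked[1]
        # Accept only if the best candidate is clearly closer than the runner-up.
        if best < ratio * second:
            matches.append((i, j))
    return matches


def match_fraction(descriptors_a, descriptors_b, ratio=0.75):
    """Fraction of descriptors in A with an unambiguous match in B."""
    if not descriptors_a:
        return 0.0
    return len(ratio_test_matches(descriptors_a, descriptors_b, ratio)) / len(descriptors_a)
```

A low match fraction between a master image and its supposed service copy could serve as a signal that the pair needs manual inspection, although a suitable cut-off would have to be determined empirically.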
Solution Champion
Huber-Mörk Reinhold (AIT), Schindler Alexander (AIT), Graf Roman (AIT)
Corresponding Issue(s)
IS10 Potential bit rot in image files that were stored on CD
myExperiment Link
TBD
Tool Registry Link
TBD
Evaluation
TBD
Title SO16 QA for estimation of affine transformation (image comparison tool based on SSIM algorithm)
Detailed description Corresponding images are compared in detail in three steps: local descriptor matching with the SIFT algorithm; estimation of the affine transformation (rotation, scale, translation, shearing) between the image pair; and overlaying of the images followed by assessment with the structural similarity index (SSIM).
Once the affine transformation of the second image to the first has been estimated and the images have been overlaid, the solution provides a pixel-wise comparison based on SSIM (SSIM=1 shown as black, SSIM=0 shown as white).
The difference between images is measured in [0,1], where 1 means identical and 0 means very different.
The tool is written in C++ and provided as an executable with associated DLLs on Windows or shared objects on Linux.
The tool detects structural similarities between images in order to estimate the similarity level of an image pair.
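For readers unfamiliar with SSIM, the following minimal sketch computes a single global SSIM value for two equally sized grey-level images given as flat pixel sequences. It assumes the images have already been aligned, and it uses the commonly cited constants K1 = 0.01 and K2 = 0.03. The SO16 tool works pixel-wise, which implies local windows rather than one global value, so this is an illustration of the formula only and not the tool's implementation. Note also that the raw formula can produce negative values for inverted images, whereas the description above reports a [0,1] range.

```python
def ssim(pixels_x, pixels_y, dynamic_range=255.0):
    """Global SSIM of two equally sized grey-level pixel sequences."""
    if len(pixels_x) != len(pixels_y) or not pixels_x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pixels_x)
    mean_x = sum(pixels_x) / n
    mean_y = sum(pixels_y) / n
    var_x = sum((x - mean_x) ** 2 for x in pixels_x) / n
    var_y = sum((y - mean_y) ** 2 for y in pixels_y) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(pixels_x, pixels_y)) / n
    # Stabilising constants with the commonly used K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical inputs give a value of 1, and a small uniform brightness shift lowers the value only slightly, which matches the intent of tolerating processing differences such as colour balancing while still flagging structural damage.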
Solution Champion
Huber-Mörk Reinhold (AIT), Schindler Alexander (AIT), Graf Roman (AIT)
Corresponding Issue(s) IS10 Potential bit rot in image files that were stored on CD
IS27 Quality assurance in redownload workflows of digitised books
myExperiment Link
TBD
Tool Registry Link
TBD
Evaluation
TBD
Labels:
scenario, lsdr, lsdrscenario