Title
IS10 Potential bit rot in image files that were stored on CD
Detailed description Digitised master image files (TIFFs) from a legacy digitisation project were stored for a number of years on CD. Corresponding service/access images (JPEGs, at a lower resolution, cropped, scale added, colour balanced) were stored on a web server during this period. Consequently there is a higher confidence in the bit integrity of the service copies. Without checksums, the only method of checking the master images for bit rot is to open each one and visually inspect it. The screenshot below shows a master image (on the left) and the service image (on the right).
This issue also aims to support quality assurance in digital preservation. It covers the challenges of image-based document comparison, such as detecting differences in file format, colour information, scale, rotation, resolution and cropping, as well as slight differences in content.
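The description notes that the master images have no checksums, which is why visual inspection is currently the only check. Checksums cannot detect damage that has already happened, but recording them once the masters have been verified would let later bit rot be found automatically. The following is a minimal sketch of that idea using only the Python standard library. The function names (`sha256_of`, `build_manifest`, `changed_files`), the `*.tif` pattern and the choice of SHA-256 are assumptions made for illustration and are not part of any SCAPE tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, pattern="*.tif"):
    """Map each matching file (path relative to root) to its checksum.

    The "*.tif" pattern is an assumption; adjust it to the collection's naming.
    """
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob(pattern))
    }


def changed_files(root, manifest):
    """List manifest entries that are now missing or whose checksum differs."""
    root = Path(root)
    return [
        name
        for name, expected in manifest.items()
        if not (root / name).is_file() or sha256_of(root / name) != expected
    ]
```

A manifest built with `build_manifest` would be stored separately from the images, and `changed_files` run against it at intervals. An empty result means no recorded file has changed since the manifest was written.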
Scalability Challenge
The volume of the collection is _, there are approximately _ files. Manual QA/checking is not possible due to the volume of the collection, so an automated approach is required.
There are no specific performance requirements for image-based document comparison. A very large data set would be useful for testing scalability.
Issue champion Maureen Pennock (BL), Digital Preservation Manager. Collection curator is Victoria Swift (BL)
Other interested parties
Huber-Mörk Reinhold (AIT), Schindler Alexander (AIT), Graf Roman (AIT)
Possible Solution approaches
  • If the master and service images were the same, or similar, a simple comparison between them would enable bit rot to be detected. However, the high degree of processing applied to the service images means that they are quite different in appearance from the master images. Fuzzy matching between the images may enable parts of the images to be matched, but image-focused approaches may be extremely challenging. OCR-based comparison may be possible, although OCR engines may struggle with handwritten Chinese characters. This scenario may simply be too challenging to solve!
  • Note that several AQuA Project activities examined similar (if more straightforward) challenges here
  • BL
    • TIF to JPG comparison
  • Austrian National Library
    • Overwriting existing collection items with new items
    • Scalability issue because image pairs can be compared within a book (book pages can be duplicated or missing) and between different master scans.
    • JPEG2000 profile check
Context Details of the institutional context to the Issue. (May be expanded at a later date)
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Training Needs Is there a need or is there value in providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Datasets British Library - International Dunhuang Project Manuscripts
10 IDP samples from BL
Austrian National Library - Digital Book Collection
Solutions SO9 QA for correspondent JP2K comparison for old and new Google book versions (image comparison tool based on bag-of-(visual-)words matching)
SO10 QA for TIFF to correspondent JP2K comparison (image comparison tool based on SIFT-matching)
SO16 QA for estimation of affine transformation (image comparison tool based on SSIM algorithm)
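SO16 refers to the SSIM (structural similarity) algorithm. As a rough illustration of what that measure computes, the sketch below evaluates the SSIM formula once over a whole image, using flat lists of grayscale pixel values. This is a simplification and not the SO16 tool: standard SSIM averages the same formula over local sliding windows, and the master and service images here would first have to be aligned (cropping, scale, rotation) before any pixel-wise score is meaningful. The function name `global_ssim` and the list-based input are assumptions made for the example; the constants 0.01 and 0.03 are the values commonly used with SSIM.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM over two equally sized grayscale images.

    x and y are flat sequences of pixel intensities. Real SSIM tools
    compute this over local windows and average the results.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) * (a - mean_x) for a in x) / n
    var_y = sum((b - mean_y) * (b - mean_y) for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Small constants that stabilise the division for flat or dark images.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical inputs score 1 (up to floating-point rounding), and the score drops as luminance, contrast or structure diverge. Any threshold for flagging a pair as suspect would have to be calibrated on known-good image pairs from the collection.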

Evaluation

Objectives Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, preciseness, automation
Success criteria Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?
Automatic measures What automated measures would you like the solution to give to evaluate the solution for this specific issue? which measures are important?
If possible specify very specific measures and your goal - e.g.
 * process 50 documents per second
 * handle 80Gb files without crashing
 * identify 99.5% of the content correctly
Manual assessment Apart from the automated measures you would like to get, do you foresee any manual assessment being necessary to evaluate the solution to this issue?
If possible specify measures and your goal - e.g.
 * Solution installable with basic Linux system administration skills
 * User interface understandable by non-developer curators
Actual evaluations Links to actual evaluations of this Issue/Scenario
Labels: lsdr, qa, issue