||IS10 Potential bit rot in image files that were stored on CD|
|Detailed description|| Digitised master image files (TIFFs) from a legacy digitisation project were stored for a number of years on CD. Corresponding service/access images (JPEGs, at a lower resolution, cropped, scale added, colour balanced) were stored on a web server during this period. Consequently there is a higher confidence in the bit integrity of the service copies. Without checksums, the only method of checking the master images for bit rot is to open each one and visually inspect it. The screenshot below shows a master image (on the left) and the service image (on the right).
This issue also aims to support quality assurance in digital preservation. It covers the challenges of image-based document comparison, such as detecting differences in file format, colour information, scale, rotation, resolution and cropping, as well as slight differences in content.
| Scalability Challenge
|| The volume of the collection is _, and there are approximately _ files. Manual QA/checking is not feasible at this volume, so an automated approach is required.
There are no specific performance requirements for image-based document comparison. A very large data set would be useful for testing scalability.
|Issue champion||Maureen Pennock (BL), Digital Preservation Manager. Collection curator is Victoria Swift (BL)|
| Other interested parties
||Huber-Mörk Reinhold (AIT), Schindler Alexander (AIT), Graf Roman (AIT)|
|Possible Solution approaches||
|Context|| Details of the institutional context to the Issue. (May be expanded at a later date)
|Lessons Learned|| Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
|Training Needs|| Is there a need or is there value in providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
|Datasets|| British Library - International Dunhuang Project Manuscripts
10 IDP samples from BL
Austrian National Library - Digital Book Collection
|Solutions|| SO9 QA for correspondent JP2K comparison for old and new Google book versions (image comparison tool based on bag-of-(visual-)words matching)
SO10 QA for TIFF to correspondent JP2K comparison (image comparison tool based on SIFT-matching)
SO16 QA for estimation of affine transformation (image comparison tool based on SSIM algorithm)
|Objectives||Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, precision, automation|
|Success criteria||Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?|
|Automatic measures|| Which automated measures would you like the solution to provide so that it can be evaluated against this specific issue? Which measures are important?
If possible, specify precise measures and your goal, e.g.
* process 50 documents per second
* handle 80 GB files without crashing
* identify 99.5% of the content correctly
|Manual assessment|| Apart from the automated measures you would like to get, do you foresee any manual assessment being necessary to evaluate the solution to this issue?
If possible, specify measures and your goal, e.g.
* Solution installable with basic Linux system administration skills
* User interface understandable by non-developer curators
|Actual evaluations||Links to actual evaluations of this Issue/Scenario|
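
The detailed description notes that, without checksums, the master images can only be checked for bit rot by opening and visually inspecting each one. The following is a minimal sketch, not part of any of the listed solutions, of how a checksum manifest could be recorded for the masters once they have been verified, so that later checks can be automated. The function names (`sha256_of`, `build_manifest`, `changed_files`), the choice of SHA-256 and the `*.tif` file pattern are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, pattern="*.tif"):
    """Map each matching file (path relative to root) to its checksum.

    The "*.tif" pattern is an assumption; adjust it to the collection.
    """
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob(pattern))
    }


def changed_files(root, manifest):
    """List manifest entries that are missing or no longer match."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file() or sha256_of(path) != expected:
            bad.append(rel)
    return bad
```

A manifest built this way only detects changes that occur after it was recorded. It cannot establish whether the masters were already damaged while stored on CD, which is why comparison against the service images is still needed for the legacy files.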