
| *Title* \\ | IS10 Potential bit rot in image files that were stored on CD |
| *Detailed description* | Digitised master image files (TIFFs) from a legacy digitisation project were stored for a number of years on CD. Corresponding service/access images (JPEGs at a lower resolution, cropped, with a scale added, and colour balanced) were stored on a web server during this period, so there is higher confidence in the bit integrity of the service copies. Without checksums, the only method of checking the master images for bit rot is to open each one and visually inspect it. The screenshot below shows a master image (on the left) and the service image (on the right). \\
This issue also supports digital preservation quality assurance more broadly: it covers image-based document comparison challenges such as detecting differences in file format, colour information, scale, rotation, resolution, cropping, and slight differences in content. |
| *Scalability Challenge* \\ | The volume of the collection is \_; there are approximately _ files. Manual QA/checking is not possible due to the volume of the collection, so an automated approach is required. \\
There are no specific performance requirements for image-based document comparison. A really large data set would nevertheless be useful in order to check scalability. |
| *[Issue champion|SP:Responsibilities of the roles described on these pages]* | [Maureen Pennock|] (BL), Digital Preservation Manager. Collection curator is Victoria Swift (BL) |
| *Other interested parties* \\ | [Huber-Mörk Reinhold|] (AIT), [Schindler Alexander|] (AIT), [Graf Roman|] (AIT) |
| *Possible Solution approaches* | * If the master and service images were the same, or similar, a simple comparison between them would enable bit rot to be detected. However, the high degree of processing applied to the service images means that they are quite different in appearance from the master images. Fuzzy matching may enable parts of the images to be matched, but image-focused approaches may be extremely challenging. OCR-based comparison may be possible, although OCR engines may struggle with handwritten Chinese characters. This scenario may simply be too challenging to solve\!
* Note that several AQuA Project activities examined similar (if more straightforward) challenges [here|AQuA:Image Issues] \\
* BL
** TIF to JPG comparison
* Austrian National Library
** Overwriting existing collection items with new items
** Scalability issue, because image pairs can be compared both within a book (book pages can be duplicated or missing) and between different master scans.
** JPEG2000 profile check |
| *Context* | _Details of the institutional context to the Issue. (May be expanded at a later date)_ \\ |
| *Lessons Learned* | _Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)_ \\ |
| *Training Needs* | _Is there a need or is there value in providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP._ \\ |
| *Datasets* | [British Library - International Dunhuang Project Manuscripts|British Library - International Dunhuang Project Manuscripts]\\
[10 IDP samples from BL] \\
[Austrian National Library - Digital Book Collection|Austrian National Library - Digital Book Collection] |
| *Solutions* | [SO9 QA for correspondent JP2K comparison for old and new Google book versions (image comparison tool based on bag-of-(visual-)words matching)|SO9 Matchbox - Image comparison tool based on bag-of-(visual-)words matching] \\
[SO10 QA for TIFF to correspondent JP2K comparison (image comparison tool based on SIFT-matching)|SO10 QA for TIFF to correspondent JP2K comparison (image comparison tool based on SIFT-matching)] \\
[SO16 QA for estimation of affine transformation (image comparison tool based on SSIM algorithm)|SO16 QA for estimation of affine transformation (image comparison tool based on SSIM algorithm)] |
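The lack of checksums is what forces visual inspection of the masters in the first place. Once an authoritative copy of each master has been established, a fixity manifest can be generated so that any future bit rot is detectable automatically. The following is a minimal sketch using only the Python standard library; the manifest format and directory layout are illustrative assumptions, not part of any existing BL workflow:

```python
import hashlib
import os

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large TIFFs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map every file under `root` (relative path) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = sha256_of(path)
    return manifest
```

Re-running `build_manifest` over the same directory at a later date and comparing the two dictionaries reveals exactly which files have changed at the byte level, with no visual inspection required.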
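The solution approaches above all depend on tolerant, content-based comparison rather than byte equality. As a toy illustration of the simplest such technique, and of why it breaks down on heavily processed derivatives, the sketch below computes an average hash over an already-decoded greyscale pixel grid (decoding the actual TIFF/JPEG files would require an imaging library and is out of scope here). Identical content yields a Hamming distance of 0, while the cropping and rotation applied to the service copies flips most bits, which is why the linked Solutions rely on local-feature methods such as SIFT and bag-of-visual-words instead:

```python
def average_hash(pixels, hash_size=8):
    """Average-hash a greyscale image given as a 2-D list of intensities.

    The image is block-averaged down to hash_size x hash_size cells; each
    cell becomes one bit: 1 if above the global mean, else 0.
    """
    rows, cols = len(pixels), len(pixels[0])
    cells = []
    for i in range(hash_size):
        for j in range(hash_size):
            r0, r1 = i * rows // hash_size, (i + 1) * rows // hash_size
            c0, c1 = j * cols // hash_size, (j + 1) * cols // hash_size
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if v > mean else 0 for v in cells]

def hamming(a, b):
    """Number of differing bits; small distances suggest similar images."""
    return sum(x != y for x, y in zip(a, b))
```

For two decoded copies of the same page this gives distance 0 even if resolution differs, but a mirrored, rotated, or cropped derivative produces a large distance despite identical content, so a threshold-based check on such hashes cannot by itself verify the master/service pairs in this Issue.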


h1. Evaluation

| *Objectives* | _Which SCAPE objectives do this Issue and a future Solution relate to? e.g. scalability, robustness, reliability, coverage, preciseness, automation_ |
| *Success criteria* | _Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?_ |
| *Automatic measures* | _What automated measures would you like the solution to give to evaluate the solution for this specific issue? Which measures are important?_ \\
_If possible specify very specific measures and your goal - e.g._ \\
_ \* process 50 documents per second_ \\
_ \* handle 80Gb files without crashing_ \\
_ \* identify 99.5% of the content correctly_ \\ |
| *Manual assessment* | _Apart from automated measures that you would like to get do you foresee any necessary manual assessment to evaluate the solution of this issue?_ \\
_If possible specify measures and your goal - e.g._ \\
_ \* Solution installable with basic linux system administration skills_ \\
_ \* User interface understandable by non developer curators_ \\ |
| *Actual evaluations* | links to actual evaluations of this Issue/Scenario |