Dataset:
Title |
10 IDP samples from BL |
Description | 5 TIFF images (high resolution, 5248x6300 pixels) corresponding to 5 JPEG images (low resolution, approx. 1000x1500 pixels). See http://idp.bl.uk |
Licensing | Sample available for use under a BL licence, which restricts usage to research only but does not limit it to SCAPE Project partners. See full licence |
Owner | British Library |
Dataset Location | TBD |
Collection expert | Maureen Pennock (BL). The BL's Newspaper curator was Ed King |
Issues brainstorm |
|
List of Issues | IS10 Potential bit rot in image files that were stored on CD; IS27 Quality assurance in redownload workflows of digitised books |
Issue:
Title |
IS10 Potential bit rot in image files that were stored on CD |
Detailed description | Digitised master image files (TIFFs) from a legacy digitisation project were stored on CD for a number of years. The corresponding service/access images (JPEGs: lower resolution, cropped, scale added, colour balanced) were stored on a web server during the same period, so there is higher confidence in the bit integrity of the service copies. Without checksums, the only way to check the master images for bit rot is to open each one and inspect it visually. The screenshot below shows a master image (left) and the service image (right). This issue also aims to support quality assurance in digital preservation more generally. It covers the challenges of image-based document comparison, such as detecting differences in file format, colour information, scale, rotation, resolution, and cropping, as well as slight differences in content. |
Scalability Challenge |
The volume of the collection is _, and there are approximately _ files. Manual QA/checking is not feasible at this volume, so an automated approach is required. There are no specific performance requirements for image-based document comparison. A really large dataset would be useful for testing scalability. |
Issue champion | Maureen Pennock |
Other interested parties |
Huber-Mörk Reinhold |
Possible Solution approaches |
|
Context | Details of the institutional context to the Issue. (May be expanded at a later date) |
Lessons Learned | Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices) |
Training Needs | Is there a need or is there value in providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP. |
Datasets | British Library - International Dunhuang Project Manuscripts; 10 IDP samples from BL; Austrian National Library - Digital Book Collection |
Solutions | SO9 QA for correspondent JP2K comparison for old and new Google book versions (image comparison tool based on bag-of-(visual-)words matching); SO10 QA for TIFF to correspondent JP2K comparison (image comparison tool based on SIFT-matching); SO16 QA for estimation of affine transformation (image comparison tool based on SSIM algorithm) |
Evaluation
Objectives | Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, preciseness, automation |
Success criteria | Describe the success criteria for solving this issue - what are you able to do? - what does the world look like? |
Automatic measures | What automated measures would you like the solution to give to evaluate the solution for this specific issue? which measures are important? If possible specify very specific measures and your goal - e.g. * process 50 documents per second * handle 80Gb files without crashing * identify 99.5% of the content correctly |
Manual assessment | Apart from automated measures that you would like to get do you foresee any necessary manual assessment to evaluate the solution of this issue? If possible specify measures and your goal - e.g. * Solution installable with basic linux system administration skills * User interface understandable by non developer curators |
Actual evaluations | Links to actual evaluations of this Issue/Scenario |
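The IS10 description above notes that, without checksums, bit rot in the master images can only be found by visual inspection. As a minimal sketch of how stored checksums would allow an automated fixity check in future, the following assumes a hypothetical directory of `*.tif` masters and SHA-256 as the digest; neither the layout nor the algorithm is specified by this page.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, pattern="*.tif"):
    """Map each matching file name to its checksum (hypothetical flat layout)."""
    files = sorted(Path(directory).glob(pattern))
    return {path.name: sha256_of_file(path) for path in files}


def changed_files(directory, manifest):
    """Return names that are missing or whose checksum no longer matches."""
    changed = []
    for name, expected in manifest.items():
        path = Path(directory) / name
        if not path.is_file() or sha256_of_file(path) != expected:
            changed.append(name)
    return changed
```

A manifest built when the files are known to be good can be stored alongside the collection; running `changed_files` later flags any file whose bytes have changed. This only detects change from the point the manifest was created, so it does not replace the image comparison approaches listed under Solutions for the legacy CD copies.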
Solutions:
Title | SO9 Matchbox - Image comparison tool based on bag-of-(visual-)words matching |
Detailed description | This digital preservation QA command-line tool analyses JP2K images using a bag-of-(visual-)words matching method. It aims to detect geometric distortions and duplicated or missing pages, either for duplicate detection within one book or for comparing old and new versions of a Google book. The approach supports identifying corresponding images and detecting duplicates, removals, and additions. The method requires a global dictionary for the whole book. The difference is measured in [0,1], where 0 means most similar and 1 means most different. |
Solution Champion |
Huber-Mörk Reinhold |
Corresponding Issue(s) |
IS10 Potential bit rot in image files that were stored on CD; IS27 Quality assurance in redownload workflows of digitised books |
myExperiment Link |
TBD |
Tool Registry Link |
TBD |
Evaluation |
TBD |
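The SO9 description above says that images are compared through a global visual-word dictionary and that the difference lies in [0,1], with 0 most similar. The sketch below illustrates one way such a score can arise. It assumes each image has already been reduced to a list of visual-word ids, and it uses one minus the histogram intersection as the distance; both the function names and the choice of distance are assumptions made for illustration, not the measure the tool is documented to use.

```python
from collections import Counter


def bow_histogram(word_ids, vocabulary_size):
    """L1-normalised histogram of visual-word ids for one image."""
    if not word_ids:
        raise ValueError("image has no visual words")
    if any(w < 0 or w >= vocabulary_size for w in word_ids):
        raise ValueError("word id outside the dictionary")
    counts = Counter(word_ids)
    total = len(word_ids)
    return [counts.get(i, 0) / total for i in range(vocabulary_size)]


def bow_distance(hist_a, hist_b):
    """1 - histogram intersection: 0.0 = same distribution, 1.0 = no shared words."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must use the same dictionary")
    intersection = sum(min(a, b) for a, b in zip(hist_a, hist_b))
    # Clamp to guard against floating-point rounding just outside [0, 1].
    return max(0.0, min(1.0, 1.0 - intersection))


# Toy example with a 4-word dictionary (illustrative values only).
page_old = bow_histogram([0, 0, 1, 2], vocabulary_size=4)
page_new = bow_histogram([0, 1], vocabulary_size=4)
unrelated = bow_histogram([3, 3], vocabulary_size=4)
```

Because both histograms are built against the same dictionary, pages from one book can be compared pairwise; a pair whose distance is near 0 is a candidate duplicate or corresponding page.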
Title | SO10 QA for TIFF to correspondent JP2K comparison (image comparison tool based on SIFT-matching) |
Detailed description | This digital preservation QA command-line tool checks for bit integrity using a SIFT-matching method. The solution supports detailed comparison of corresponding images and is based on local descriptor matching. It also detects missing, duplicated, or redundant images in a dataset. |
Solution Champion |
Huber-Mörk Reinhold |
Corresponding Issue(s) |
IS10 Potential bit rot in image files that were stored on CD |
myExperiment Link |
TBD |
Tool Registry Link |
TBD |
Evaluation |
TBD |
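SO10 above is described as local descriptor matching. As a rough sketch of that idea, the following matches descriptors by nearest neighbour and keeps a match only when the nearest candidate is clearly closer than the second nearest (the ratio test commonly used with SIFT). The 2-dimensional toy descriptors, the `ratio=0.8` threshold, and the function names are assumptions for illustration; real SIFT descriptors have 128 dimensions, and this page does not document the matching rule the tool applies.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_test_matches(descriptors_a, descriptors_b, ratio=0.8):
    """Return (index_a, index_b) pairs that pass the nearest/second-nearest test."""
    matches = []
    for i, desc in enumerate(descriptors_a):
        ranked = sorted(
            (euclidean(desc, other), j) for j, other in enumerate(descriptors_b)
        )
        if len(ranked) < 2:
            continue  # the ratio test needs at least two candidates
        (best, j), (second, _) = ranked[0], ranked[1]
        if best < ratio * second:
            matches.append((i, j))
    return matches


def match_fraction(descriptors_a, descriptors_b, ratio=0.8):
    """Share of descriptors in A with an unambiguous match in B."""
    if not descriptors_a:
        return 0.0
    return len(ratio_test_matches(descriptors_a, descriptors_b, ratio)) / len(descriptors_a)
```

A high match fraction between a master and its service copy suggests the two show the same content; a master that matches no service image, or matches several, is a candidate for the missing or duplicated cases mentioned in the description.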
Title | SO16 QA for estimation of affine transformation (image comparison tool based on SSIM algorithm) |
Detailed description | Detailed comparison of corresponding images proceeds in four steps: local descriptor matching with the SIFT algorithm; estimation of the affine transformation (rotation, scale, translation, shearing) between the image pair; overlaying of the images; and assessment with the structural similarity index (SSIM). Once the affine transformation of the second image to the first has been estimated and the images overlaid, the solution provides a pixel-wise comparison based on SSIM (SSIM=1: black, SSIM=0: white). The similarity between images is measured in [0,1], where 1 means identical and 0 means very different. The tool is written in C++ and provided as an executable that uses associated DLLs on Windows or shared objects on Linux. It detects structural similarities between images in order to estimate the similarity level of an image pair. |
Solution Champion |
Huber-Mörk Reinhold |
Corresponding Issue(s) | IS10 Potential bit rot in image files that were stored on CD; IS27 Quality assurance in redownload workflows of digitised books |
myExperiment Link |
TBD |
Tool Registry Link |
TBD |
Evaluation |
TBD |
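The SO16 description above ends with an SSIM assessment of the aligned image pair. The sketch below computes the standard SSIM formula over a single global window for two greyscale images that are assumed to be already aligned and of equal size. It omits the SIFT matching and affine estimation steps, uses plain lists of rows instead of an image library, and takes the usual constants K1=0.01 and K2=0.03; these are illustrative assumptions, and the actual tool is a C++ executable that may differ in every one of them.

```python
def global_ssim(image_a, image_b, dynamic_range=255.0):
    """Single-window SSIM for two equally sized greyscale images (lists of rows)."""
    xs = [float(v) for row in image_a for v in row]
    ys = [float(v) for row in image_b for v in row]
    if not xs or len(xs) != len(ys):
        raise ValueError("images must be non-empty and the same size")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Stabilising constants from the usual SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy 2x2 images (illustrative values only).
original = [[0, 255], [255, 0]]
inverted = [[255, 0], [0, 255]]
```

Identical images score 1. Note that the raw formula can go below 0 for anti-correlated content such as `inverted`, so reporting in [0,1] as described above implies some clamping or rescaling that this sketch does not perform. A pixel-wise map like the one the description mentions would apply the same formula in a small sliding window instead of once over the whole image.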