Dataset:
Title | JISC1 19th Century Digitised Newspapers |
Description | The collection consists of 2.2 million pages of digitised 19th-century newspapers. Content includes:
|
Licensing | The collection sample is available under a BL licence that restricts usage to research only. Beyond that, use is not restricted to SCAPE Project partners. See full licence. |
Owner | British Library |
Dataset Location | TBC |
Collection expert | TBC |
Issues brainstorm | |
List of issues | IS44 QA of migrated images
IS2 Do acquired files conform to an agreed technical profile, are they valid and are they complete?
IS1 Digitised TIFFs do not meet storage and access requirements |
Issue:
Title | IS2 Do acquired files conform to an agreed technical profile, are they valid and are they complete? |
Detailed description | Some forms of content arrive at the preserving institution and will be preserved "as is" regardless of how the files have been constructed (e.g. web archived content). Other content can be acquired under a specific agreement with the creator or publisher, and the preserving institution typically expects the content in a particular form. This may go further than naming the formats used, and may describe specific technical constraints on the construction of the files. For example, the BL's Technical Guidelines for Digitisation state that digitised TIFFs should be TIFF version 6, LZW compressed, and that each TIFF should contain only one image. These technical constraints are typically described as a "format profile".
If content received from the creator or publisher does not conform to the agreed profile, the preserving institution can reject the content and request new, revised or re-scanned content. To do so, the preserving organisation must be able to verify a digital object's compliance with a profile and, if it is not compliant, identify how it fails. This check needs to be automated. The SCAPE project proposal calls this "Policy Driven Validation", although "policy" is probably not the right word; something like "profile" would be better.
Image files may be constructed imperfectly or may be damaged during storage or transfer. It would therefore be useful to be able to verify in an automated fashion that the files are complete (i.e. have not been arbitrarily truncated) and that the files are valid and/or will render in one or more common viewing applications without error. JHOVE typically reports the truncated JPEG2000s in the JISC1 dataset as valid and well-formed.
Example 1: JISC1 Newspapers Within this dataset there are a number of truncated JPEG2000 images. These should be checked for completeness, validity and renderability (i.e. whether they render in one or more typical JPEG2000 viewers).
Example 2: Brightsolid Newspapers Digitisation of this collection is ongoing. There is a need to check in and QA new JPEG2000 images. This should involve a check that each image conforms to the new BL JPEG2000 profile, as well as checking for completeness, validity and renderability. The BL profile can be found at the end of this page. |
Scalability Challenge | Large-scale digitisation projects need to check in content and verify its compliance with a profile quickly and efficiently despite the high volume of data. For example, JPEG2000s digitised for a current BL project will be received at between 0.25 and 0.5 TB per day. Checking must be performed at a sufficient rate to prevent a build-up of material and to allow timely rejection of content that does not match the profile (problem pages can be re-digitised if issues are identified in a timely manner, i.e. within days rather than weeks). |
Issue champion | Maureen Pennock |
Other interested parties | Sven Schlarb
Christy Henshaw (Wellcome Library, UK) (external)
Ross Spencer (The National Archives, UK) (external)
Bjarne Andersen |
Possible Solution approaches |
|
Context | Details of the institutional context to the Issue. (May be expanded at a later date) |
Lessons Learned | Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices) |
Training Needs | Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP. |
Datasets | JISC1 19th Century Digitised Newspapers
Brightsolid Newspapers (TBC)
Danish scanned books (TIFF format) |
Solutions |
|
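The completeness check described in this issue can be sketched as follows. This is a minimal illustration, not the method used by any SCAPE tool: it assumes the whole file fits in memory, only walks top-level boxes, and the function name `find_jp2_problems` is hypothetical. It relies on two structural facts of the JP2 format: every box declares its own length, and the codestream inside the `jp2c` box should start with the SOC marker (0xFF4F) and end with the EOC marker (0xFFD9).

```python
import struct

JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")  # JP2 signature box
SOC = b"\xff\x4f"  # start-of-codestream marker
EOC = b"\xff\xd9"  # end-of-codestream marker


def find_jp2_problems(data: bytes) -> list:
    """Return a list of structural problems; an empty list means none were found."""
    if not data.startswith(JP2_SIGNATURE):
        return ["missing JP2 signature box"]

    problems = []
    seen_codestream = False
    offset = 0
    while offset < len(data):
        if len(data) - offset < 8:
            problems.append(f"incomplete box header at offset {offset}")
            break
        length, box_type = struct.unpack(">I4s", data[offset:offset + 8])
        header = 8
        if length == 1:  # a 64-bit extended length follows the box type
            if len(data) - offset < 16:
                problems.append(f"incomplete extended box header at offset {offset}")
                break
            (length,) = struct.unpack(">Q", data[offset + 8:offset + 16])
            header = 16
        elif length == 0:  # box runs to the end of the file
            length = len(data) - offset
        if length < header:
            problems.append(f"invalid box length at offset {offset}")
            break

        end = offset + length
        if end > len(data):
            problems.append(f"box {box_type!r} at offset {offset} runs past end of file")
            end = len(data)

        if box_type == b"jp2c":
            seen_codestream = True
            payload = data[offset + header:end]
            if not payload.startswith(SOC):
                problems.append("codestream does not start with SOC marker")
            if not payload.endswith(EOC):
                problems.append("codestream does not end with EOC marker")
        offset = end

    if not seen_codestream:
        problems.append("no codestream box found")
    return problems
```

A file that passes this check is not necessarily valid or renderable; the sketch only targets arbitrary truncation, so validity and renderability would still need separate checks.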
BL JP2 Profile:
Parameter/Field | Value |
Compression | Lossy (detail TBC) |
Number of components | 3 |
Component Transform | Yes (irreversible) |
Tile size | One tile for entire image |
Wavelet Filter | 9-7 irreversible |
Number of levels | Variable; 6 used for test image |
Number of layers | Multiple |
Progression order | RPCL |
Codestream markers | Packet-length markers |
Precincts | 256x256, 256x256, 128x128 |
Codeblock size | 64x64 |
Coder Bypass | Yes |
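As an illustration of how the profile above might be checked automatically, the sketch below encodes some of its rows as predicates and compares them with properties already extracted from an image by a characterisation tool. The property names (`components`, `progression_order`, and so on) and the function `check_profile` are hypothetical and would need to be mapped to whatever the chosen tool reports. Rows whose values are marked TBC or variable (compression detail, number of levels) and the precinct sizes are left out, since the table does not state precisely what should be compared.

```python
# Checks derived from the BL JP2 profile table. Each entry maps a
# (hypothetical) property name to a predicate and a readable expectation.
BL_JP2_PROFILE = {
    "components": (lambda v: v == 3, "3"),
    "component_transform": (lambda v: v is True, "yes (irreversible)"),
    "tiles": (lambda v: v == 1, "one tile for entire image"),
    "wavelet_filter": (lambda v: v == "9-7 irreversible", "9-7 irreversible"),
    "layers": (lambda v: v > 1, "multiple"),
    "progression_order": (lambda v: v == "RPCL", "RPCL"),
    "packet_length_markers": (lambda v: v is True, "present"),
    "codeblock_size": (lambda v: tuple(v) == (64, 64), "64x64"),
    "coder_bypass": (lambda v: v is True, "yes"),
}


def check_profile(props, profile=BL_JP2_PROFILE):
    """Return one message per profile row that the image fails to meet."""
    failures = []
    for name, (is_ok, expected) in profile.items():
        if name not in props:
            failures.append(f"{name}: not reported (expected {expected})")
        elif not is_ok(props[name]):
            failures.append(f"{name}: got {props[name]!r}, expected {expected}")
    return failures


# Example: properties for an image that uses the wrong progression order.
example = {
    "components": 3, "component_transform": True, "tiles": 1,
    "wavelet_filter": "9-7 irreversible", "layers": 8,
    "progression_order": "LRCP", "packet_length_markers": True,
    "codeblock_size": (64, 64), "coder_bypass": True,
}
for message in check_profile(example):
    print(message)
```

Returning every failed row, rather than stopping at the first, matches the requirement that a non-compliant file should be reported with how it fails.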
Solutions:
Title | SO1 Simple JP2 file structure checker |
Detailed description | |
Solution Champion | Johan van der Knijff
Miguel Ferreira (KEEPS) |
Corresponding Issue(s) | |
myExperiment Link | Not yet available |
Tool Registry Link | Simple JP2 file structure checker |
Evaluation | |
Title | SO9 Matchbox - Image comparison tool based on bag-of-(visual-)words matching |
Detailed description | This digital preservation QA command line tool analyzes JP2K images using a bag-of-(visual-)words matching method. The tool aims to detect geometrical distortions and duplicated or missing pages, either for duplicate detection within one book or for comparison of old and new versions of a Google book. The approach supports identification of corresponding images and detection of duplicated, removed or added pages. The method requires a global dictionary for the whole book. The difference is measured on a scale of [0,1], where 0 means most similar and 1 means most different.
Solution Champion | Huber-Mörk Reinhold |
Corresponding Issue(s) | IS10 Potential bit rot in image files that were stored on CD
IS27 Quality assurance in redownload workflows of digitised books |
myExperiment Link | TBD |
Tool Registry Link | TBD |
Evaluation | TBD |
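To make the [0,1] difference score concrete, the sketch below compares two images that have already been reduced to lists of visual-word indices drawn from one shared dictionary. It is an assumption-laden illustration only: the page does not say which distance Matchbox uses, so half the L1 distance between normalised word histograms is swapped in here because it is bounded by 0 (identical histograms) and 1 (no words in common). The function names are hypothetical.

```python
from collections import Counter


def word_histogram(words, vocabulary_size):
    """Normalised histogram of visual-word indices (entries sum to 1)."""
    if not words:
        raise ValueError("an image needs at least one visual word")
    counts = Counter(words)
    total = len(words)
    return [counts[i] / total for i in range(vocabulary_size)]


def histogram_distance(p, q):
    """Half the L1 distance between two normalised histograms, in [0, 1].

    This is an illustrative choice, not necessarily the measure Matchbox uses.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical word assignments for three page images, vocabulary of 4 words.
page_a = word_histogram([0, 1, 0, 1], 4)
page_b = word_histogram([1, 0, 1, 0], 4)  # same words in a different order
page_c = word_histogram([2, 3, 2, 3], 4)  # no words in common with page_a

print(histogram_distance(page_a, page_b))
print(histogram_distance(page_a, page_c))
```

Because both histograms are built against the same vocabulary, the comparison only makes sense when one dictionary covers every image being compared, which is consistent with the requirement for a global dictionary per book.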
1 Comment
Oct 23, 2012
Miguel Ferreira
It is not clear whether the CC tools are to be used against TIFFs or JPEG2000s. The dataset only includes TIFF, whereas the issue is about JPEG2000 profiles.
PS: There is a broken link at the end of the page.