Title |
IS2 Do acquired files conform to an agreed technical profile, are they valid and are they complete? |
Detailed description | Some forms of content arrive at the preserving institution and will be preserved "as is" regardless of how the files have been constructed (e.g. web archived content). Other content can be acquired under a specific agreement with the creator or publisher, and the preserving institution typically expects the content in a particular form. This may go further than naming the formats used and describe specific technical constraints on the construction of the files. For example, the BL's Technical Guidelines for Digitisation state that digitised TIFFs should be TIFF version 6, LZW compressed, and that each TIFF should contain only one image. These technical constraints are typically described as a "format profile". If content received from the creator or publisher does not conform to the agreed profile, the preserving institution can reject the content and request new, revised or re-scanned content. To do so, however, the preserving organisation must be able to verify a digital object's compliance with a profile and, if it is not compliant, identify how it fails. This check needs to be automated. The SCAPE project proposal calls this "Policy Driven Validation". "Policy" is most likely not the right word; something like "profile" would be better. Image files may be constructed imperfectly or may be damaged during storage or transfer. It would therefore be useful to be able to verify in an automated fashion that the files are complete (i.e. have not been arbitrarily truncated) and that the files are valid and/or will render in one or more common viewing applications without error. Truncated JPEG2000s in the JISC1 dataset are typically reported as valid and well formed by JHOVE. Example 1: JISC1 Newspapers. Within this dataset there are a number of truncated JPEG2000 images. These should be checked for completeness, validity and renderability (i.e. that they render in one or more typical JPEG2000 viewers).
Example 2: Brightsolid Newspapers. Digitisation of this collection is ongoing. New JPEG2000 images need to be checked in and quality-assured as they arrive. This should involve a check that each image conforms to the new BL JPEG2000 profile, as well as checks for completeness, validity and renderability. The BL profile can be found at the end of this page. |
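The completeness check described above can be approximated with a very cheap structural test. The following is a minimal sketch, not a validator: it assumes the file is a JP2 whose codestream is the last box in the file (common, but not required by the format), so that the file should begin with the JP2 signature box and end with the JPEG 2000 end-of-codestream (EOC) marker `FF D9`. The function names are illustrative only.

```python
import os
from typing import List

# 12-byte JP2 signature box that opens every JP2 file.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")
# End-of-codestream marker that closes a JPEG 2000 codestream.
EOC_MARKER = b"\xff\xd9"


def completeness_problems(head: bytes, tail: bytes) -> List[str]:
    """Return a list of problems found in the first 12 and last 2 bytes."""
    problems = []
    if head[:12] != JP2_SIGNATURE:
        problems.append("missing JP2 signature box at start of file")
    if tail[-2:] != EOC_MARKER:
        problems.append("no EOC marker at end of file (possible truncation)")
    return problems


def check_file(path: str) -> List[str]:
    """Read only the start and end of the file, so the cost is independent of file size."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(12)
        f.seek(max(size - 2, 0))
        tail = f.read(2)
    return completeness_problems(head, tail)
```

An empty list only means that neither of these two symptoms was found. It says nothing about validity or renderability, which still need a full validator and a rendering test.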
Scalability Challenge |
Large-scale digitisation projects need to check in content and verify its compliance with a profile quickly and efficiently despite the high volume of data. For example, JPEG2000s digitised for a current BL project will be received at between 0.25 and 0.5 TB per day. Checking must be performed at a sufficient rate to prevent a build-up of material and allow timely rejection of content that does not match the profile (problem pages can be re-digitised if issues are identified in a timely manner, i.e. within days rather than weeks). |
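To make the ingest rate concrete, the arithmetic below converts the quoted 0.25 to 0.5 TB per day into the sustained throughput a checking workflow would need just to keep pace. It is a rough sketch that assumes decimal terabytes (10^12 bytes); the `hours_per_day` parameter is an illustrative assumption for workflows that do not run around the clock.

```python
def required_mb_per_second(tb_per_day: float, hours_per_day: float = 24.0) -> float:
    """Sustained read rate (MB/s) needed to check a day's intake within the given window."""
    bytes_per_day = tb_per_day * 10**12      # decimal terabytes (assumption)
    seconds_available = hours_per_day * 3600
    return bytes_per_day / seconds_available / 10**6


for rate in (0.25, 0.5):
    print(f"{rate} TB/day -> {required_mb_per_second(rate):.2f} MB/s sustained")

# If checks only run during an 8-hour window, the required rate triples.
print(f"0.5 TB/day in 8 h -> {required_mb_per_second(0.5, hours_per_day=8):.2f} MB/s")
```

These figures cover only reading each file once. Any validation or rendering work per file adds to the time budget, and clearing a backlog needs headroom above the break-even rate.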
Issue champion | Maureen Pennock |
Other interested parties |
Sven Schlarb; Christy Henshaw (Wellcome Library, UK) (external); Ross Spencer (The National Archives, UK) (external); Bjarne Andersen |
Possible Solution approaches |
|
Context | Details of the institutional context to the Issue. (May be expanded at a later date) |
Lessons Learned | Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices) |
Training Needs | Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP. |
Datasets | JISC1 19th Century Digitised Newspapers; Brightsolid Newspapers (TBC); Danish scanned books (TIFF format) |
Solutions |
|
BL JP2 Profile:
Parameter/Field | Value |
Compression | Lossy (detail TBC) |
Number of components | 3 |
Component Transform | Yes (irreversible) |
Tile size | One tile for entire image |
Wavelet Filter | 9-7 irreversible |
Number of levels | Variable; 6 used for test image |
Number of layers | Multiple |
Progression order | RPCL |
Codestream markers | Packet-length markers |
Precincts | 256x256, 256x256, 128x128 |
Codeblock size | 64x64 |
Coder Bypass | Yes |
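One way to verify compliance and report how a file fails is to hold the profile as data and compare it with the properties extracted from each file. The sketch below assumes some characterisation tool has already produced a dictionary of properties per image; the key names are hypothetical and would need mapping to that tool's real output. Only rows of the table with a single fixed value are encoded; the rows marked TBC, "Variable" or "Multiple" are left out.

```python
# Fixed-value rows of the BL JP2 profile above; key names are illustrative assumptions.
BL_JP2_PROFILE = {
    "number_of_components": 3,
    "component_transform": True,          # "Yes (irreversible)"
    "number_of_tiles": 1,                 # one tile for the entire image
    "wavelet_filter": "9-7 irreversible",
    "progression_order": "RPCL",
    "codeblock_size": (64, 64),
    "coder_bypass": True,
}


def profile_failures(properties: dict, profile: dict = BL_JP2_PROFILE) -> list:
    """Return one message per profile field that is missing or has the wrong value."""
    failures = []
    for field, expected in profile.items():
        if field not in properties:
            failures.append(f"{field}: not reported (expected {expected!r})")
        elif properties[field] != expected:
            failures.append(
                f"{field}: expected {expected!r}, found {properties[field]!r}"
            )
    return failures


# Example: an image that matches the profile except for its progression order.
extracted = dict(BL_JP2_PROFILE, progression_order="LRCP")
for message in profile_failures(extracted):
    print(message)
```

An empty result means the file matched every encoded field. A non-empty result lists each mismatch, which can be sent back to the supplier with the rejected content.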