Title
IS2 Do acquired files conform to an agreed technical profile, are they valid and are they complete?
Detailed description Some forms of content arrive at the preserving institution and will be preserved "as is" regardless of how the files have been constructed (e.g. web archived content). Other content can be acquired under a specific agreement with the creator or publisher, and the preserving institution typically expects the content in a particular form. The agreement may go further than naming the formats used and describe specific technical constraints on the construction of the files. For example, the BL's Technical Guidelines for Digitisation state that digitised TIFFs should be TIFF version 6, LZW compressed and each TIFF should contain only one image. These technical constraints are typically described as a "format profile".
If content received from the creator or publisher does not conform to the agreed profile, the preserving institution can reject the content and request new/revised/re-scanned content. However, the preserving organisation must have the capability to verify a digital object's compliance with a profile, and if it is not compliant, identify how it fails. It is necessary to perform this check in an automated manner.
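As an illustration of an automated profile check that reports how a file fails, the sketch below walks the IFD chain of a TIFF and tests two of the constraints quoted above: LZW compression and a single image per file. The function name `check_tiff_profile` is hypothetical, only top-level IFDs are counted as images, and the header test confirms only the TIFF magic number 42, which does not by itself distinguish revision 6.0 from earlier revisions. It is a starting point under those assumptions, not a complete validator.

```python
import struct

COMPRESSION_TAG = 259  # TIFF "Compression" tag
LZW = 5                # value of the Compression tag for LZW


def check_tiff_profile(data: bytes) -> list[str]:
    """Return a list of profile failures; an empty list means the file conforms."""
    if len(data) < 8 or data[:2] not in (b"II", b"MM"):
        return ["not a TIFF: bad byte-order mark"]
    endian = "<" if data[:2] == b"II" else ">"
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        return ["not a classic TIFF: magic number is not 42"]

    failures = []
    image_count = 0
    seen = set()
    while offset != 0:
        # Guard against loops and offsets that point past the end of the file.
        if offset in seen or offset + 2 > len(data):
            failures.append("IFD chain is broken or truncated")
            break
        seen.add(offset)
        (entry_count,) = struct.unpack_from(endian + "H", data, offset)
        end = offset + 2 + entry_count * 12
        if end + 4 > len(data):
            failures.append("IFD is truncated")
            break
        image_count += 1
        compression = 1  # TIFF default when the tag is absent: uncompressed
        for i in range(entry_count):
            entry = offset + 2 + i * 12
            (tag,) = struct.unpack_from(endian + "H", data, entry)
            if tag == COMPRESSION_TAG:
                # A SHORT value is stored left-justified in the value field.
                (compression,) = struct.unpack_from(endian + "H", data, entry + 8)
        if compression != LZW:
            failures.append(
                f"image {image_count}: compression is {compression}, expected {LZW} (LZW)"
            )
        (offset,) = struct.unpack_from(endian + "I", data, end)

    if image_count != 1:
        failures.append(f"file contains {image_count} images, expected 1")
    return failures
```

Returning the reasons rather than a bare pass/fail lets the preserving institution tell the supplier exactly what to correct when content is rejected.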

The SCAPE project proposal calls this "Policy Driven Validation". "Policy" is most likely not the right word; "profile" would be a better term.

Image files may be constructed imperfectly or may be damaged during storage or transfer. It would therefore be useful to be able to verify in an automated fashion that the files are complete (i.e. have not been arbitrarily truncated) and that the files are valid and/or will render in one or more common viewing applications without error. Existing validation does not reliably catch this: truncated JPEG2000s in the JISC1 dataset are typically reported as valid and well formed by JHOVE.
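One inexpensive completeness test for JPEG2000 is to confirm that the file still ends with the end-of-codestream (EOC) marker, 0xFFD9. The sketch below assumes that the codestream is the last thing in the file, which is common for JP2 files but not guaranteed; the function name `looks_complete` is hypothetical. A pass here says nothing about validity or renderability, so it complements those checks rather than replacing them.

```python
# JP2 signature box (12 bytes) and the codestream start/end markers.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")
SOC = b"\xff\x4f"  # start of codestream (raw .j2k / .j2c files begin with this)
EOC = b"\xff\xd9"  # end of codestream


def looks_complete(data: bytes) -> bool:
    """Heuristic: the file starts like a JPEG2000 and ends with the EOC marker.

    Assumption: the codestream is the final element of the file.
    """
    starts_ok = data.startswith(JP2_SIGNATURE) or data.startswith(SOC)
    return starts_ok and data.endswith(EOC)
```

For large images only the first 12 and last 2 bytes need to be read, so the test adds little to check-in time.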

Example 1: JISC1 Newspapers
Within this dataset there are a number of truncated JPEG2000 images. These should be checked for completeness, validity and renderability (i.e. renders in one or more typical JPEG2000 viewers).

Example 2: Brightsolid Newspapers
Digitisation of this collection is ongoing. There is a need to check in and QA new JPEG2000 images. This should involve a check that each image conforms to the new BL JPEG2000 profile, as well as checking for completeness, validity and renderability. The BL profile can be found at the end of this page.
Scalability Challenge
Large-scale digitisation projects need to check in content and verify its compliance with a profile quickly and efficiently despite the high volume of data. For example, JPEG2000s digitised for a current BL project will be received at between 0.25 and 0.5TB per day. Checking must be performed at a sufficient rate to prevent a build-up of material and allow timely rejection of content that does not match the profile (problem pages can be re-digitised if issues are identified in a timely manner, i.e. within days rather than weeks).
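To make the required checking rate concrete, the daily volumes above can be converted into a sustained throughput. The calculation below assumes decimal units (1 TB = 1,000,000 MB) and checking that runs around the clock; with a shorter processing window the required rate rises proportionally.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate_mb_per_s(tb_per_day: float) -> float:
    """Sustained rate in MB/s needed to keep pace with the incoming volume."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


for volume in (0.25, 0.5):
    print(f"{volume} TB/day -> {required_rate_mb_per_s(volume):.1f} MB/s")
# 0.25 TB/day -> 2.9 MB/s
# 0.5 TB/day -> 5.8 MB/s
```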
Issue champion Maureen Pennock (BL)
Other interested parties
Sven Schlarb (ONB)
Christy Henshaw (Wellcome Library, UK) (external)
Ross Spencer (The National Archives, UK) (external)
Bjarne Andersen (SB) - SB is interested, but we cannot work on this issue until the relevant digitisation project (Newspapers) has begun
Possible Solution approaches
  • Any developments to meet this Issue should consider the following, and ensure appropriate liaison where solutions may exist or may be under development:
    • JHOVE/JHOVE2 may provide some of the solution if developed further. Will JHOVE2 developments meet these needs?
    • Wellcome Library may develop some solutions in this area
    • Ross Spencer has done some development in this area which might work well with Johan's developments (discussions are ongoing)
    • Modification of existing rendering tools to do thorough parsing / rendering check
  • KEEPS
    • Watch may contribute to the solution with the following triggers:
      • Monitor characterization tools
      • Monitor changes in policy
  • SB
    • 1. Develop a language (XML?) to describe institutional collection profiles
    • 2. Write a comparator that compares the output of characterisation tools with the profile to judge whether files conform not only to the formal file format specification but also to the local institutional requirements
    • 3. This "judgement" could then be used in a Taverna workflow to sort large numbers of files into two piles: those that conform to the profile and those that do not.
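The comparator and sorting steps in the SB approach could look roughly like the sketch below. It assumes that a characterisation tool has already produced a flat dictionary of properties for each file; the property names (`progression_order`, `components`, `codeblock_size`) and the function names are hypothetical, and the expected values are taken from the BL JP2 profile at the end of this page. The profile is held as a plain dictionary here in place of the XML language proposed in step 1.

```python
# Hypothetical profile: keys are illustrative, not the output of any real tool.
PROFILE = {
    "progression_order": "RPCL",
    "components": 3,
    "codeblock_size": "64x64",
}


def compare(properties: dict, profile: dict) -> list[str]:
    """Return the reasons a file fails the profile; empty if it conforms."""
    failures = []
    for name, expected in profile.items():
        if name not in properties:
            failures.append(f"{name}: missing, expected {expected!r}")
        elif properties[name] != expected:
            failures.append(f"{name}: found {properties[name]!r}, expected {expected!r}")
    return failures


def sort_files(characterised: dict, profile: dict):
    """Split files into a conforming list and a dict of rejected files with reasons."""
    conforming, rejected = [], {}
    for filename, properties in characterised.items():
        failures = compare(properties, profile)
        if failures:
            rejected[filename] = failures
        else:
            conforming.append(filename)
    return conforming, rejected


# Usage with made-up characterisation output for two files.
files = {
    "page_0001.jp2": {"progression_order": "RPCL", "components": 3, "codeblock_size": "64x64"},
    "page_0002.jp2": {"progression_order": "LRCP", "components": 3},
}
conforming, rejected = sort_files(files, PROFILE)
```

Keeping the failure reasons alongside each rejected file means the second pile can be returned to the supplier with a specific explanation.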
Context Details of the institutional context to the Issue. (May be expanded at a later date)
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Training Needs Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Datasets JISC1 19th Century Digitised Newspapers
Brightsolid Newspapers (TBC)

Danish scanned books (TIFF format)
Solutions

BL JP2 Profile:

Parameter/Field        Value
Compression            Lossy (detail TBC)
Number of components   3
Component Transform    Yes (irreversible)
Tile size              One tile for entire image
Wavelet Filter         9-7 irreversible
Number of levels       Variable; 6 used for test image
Number of layers       Multiple
Progression order      RPCL
Codestream markers     Packet-length markers
Precincts              256x256, 256x256, 128x128
Codeblock size         64x64
Coder Bypass           Yes
