
Dataset:

Title
JISC1 19th Century Digitised Newspapers
Description The collection consists of 2.2 million pages of digitised 19th-century newspapers. Content includes:
  • TIFF page masters, 
  • TIFF page service copies,
  • TIFF article service copies,
  • XML METS metadata, 
  • XML. 
    Samples of migrated JPEG2000 files are also available, including some truncated JPEG2000s that were damaged during a faulty migration process.

    The complete collection in TIFF form is ~80TB. 
Licensing The collection sample is available for use under a BL licence that restricts usage to research only. Use is otherwise not restricted to SCAPE Project partners. See full licence
Owner British Library
Dataset Location
TBC
Collection expert TBC
Issues brainstorm  
List of issues IS44 QA of migrated images
IS2 Do acquired files conform to an agreed technical profile, are they valid and are they complete?
IS1 Digitised TIFFs do not meet storage and access requirements


Issue:

Title
IS2 Do acquired files conform to an agreed technical profile, are they valid and are they complete?
Detailed description Some forms of content arrive at the preserving institution and will be preserved "as is" regardless of how the files have been constructed (e.g. web archived content). Other content is acquired under a specific agreement with the creator or publisher, and the preserving institution typically expects the content in a particular form. Such an agreement may go beyond naming the formats used and set out specific technical constraints on the construction of the files. For example, the BL's Technical Guidelines for Digitisation state that digitised TIFFs should be TIFF version 6 and LZW compressed, and that each TIFF should contain only one image. These technical constraints are typically described as a "format profile".
If content received from the creator or publisher does not conform to the agreed profile, the preserving institution can reject the content and request new/revised/re-scanned content. However, the preserving organisation must have the capability to verify a digital object's compliance with a profile, and if it is not compliant, identify how it fails. It is necessary to perform this check in an automated manner.
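
A check of this kind could be automated along the following lines. This is a minimal sketch rather than the BL's own tooling: the function name check_tiff_profile is hypothetical, only the TIFF header and image file directories (IFDs) are read, and "TIFF version 6" is approximated as a classic TIFF with magic number 42, because the header does not record the revision of the specification.

```python
import struct

LZW = 5  # value of the Compression tag (259) for LZW in TIFF 6.0


def check_tiff_profile(data: bytes) -> list:
    """Return a list of profile violations; an empty list means compliant."""
    if len(data) < 8 or data[:2] not in (b"II", b"MM"):
        return ["not a TIFF: bad byte-order mark"]
    endian = "<" if data[:2] == b"II" else ">"
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        return [f"not a classic TIFF: magic number {magic}"]

    problems = []
    image_count = 0
    seen = set()
    while ifd_offset != 0:
        # Guard against loops in the IFD chain and offsets past the end.
        if ifd_offset in seen or ifd_offset + 2 > len(data):
            problems.append("IFD chain is broken or truncated")
            break
        seen.add(ifd_offset)
        image_count += 1
        (n_entries,) = struct.unpack_from(endian + "H", data, ifd_offset)
        end = ifd_offset + 2 + 12 * n_entries
        if end + 4 > len(data):
            problems.append("IFD is truncated")
            break
        compression = 1  # TIFF default when the tag is absent: uncompressed
        for i in range(n_entries):
            entry = ifd_offset + 2 + 12 * i
            (tag,) = struct.unpack_from(endian + "H", data, entry)
            if tag == 259:
                # A SHORT value is stored in the first 2 bytes of the value field.
                (compression,) = struct.unpack_from(endian + "H", data, entry + 8)
        if compression != LZW:
            problems.append(
                f"image {image_count}: compression {compression}, expected LZW (5)"
            )
        (ifd_offset,) = struct.unpack_from(endian + "I", data, end)

    if image_count != 1:
        problems.append(f"{image_count} images found, expected exactly 1")
    return problems
```

The function reports how a file fails rather than only that it fails, which matches the requirement above to identify the cause of non-compliance. It does not decode pixel data, so it cannot detect a truncated or corrupt image strip.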

The SCAPE project proposal calls this "Policy Driven Validation". "Policy" is most likely not the right word; a term such as "profile" would be better.

Image files may be constructed imperfectly or may be damaged during storage or transfer. It would therefore be useful to be able to verify in an automated fashion that the files are complete (i.e. have not been arbitrarily truncated) and that the files are valid and/or will render in one or more common viewing applications without error. Existing validation does not always catch such damage: examples of truncated JPEG2000s in the JISC1 dataset are typically reported as valid and well formed by JHOVE.

Example 1: JISC1 Newspapers
Within this dataset there are a number of truncated JPEG2000 images. These should be checked for completeness, validity and renderability (i.e. renders in one or more typical JPEG2000 viewers).

Example 2: Brightsolid Newspapers
Digitisation of this collection is ongoing. There is a need to check in and QA new JPEG2000 images. This should involve a check that each image conforms to the new BL JPEG2000 profile, as well as checking for completeness, validity and renderability. The BL profile can be found at the end of this page.
Scalability Challenge
Large scale digitisation projects need to check in content and verify its compliance to a profile quickly and efficiently despite the high volume of data. For example, JPEG2000s digitised for a current BL project will be received at between 0.25 and 0.5TB per day. Checking must be performed at a sufficient rate to prevent a build up of material and allow timely rejection of content that does not match the profile (problem pages can be re-digitised if issues are identified in a timely manner: i.e. within days rather than weeks).
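
To give a sense of scale, the daily volumes above can be converted into the sustained throughput a checking process would need. The calculation below is a back-of-envelope sketch; it assumes decimal units (1 TB = 1,000,000 MB) and checking that runs around the clock, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_mb_per_second(tb_per_day: float) -> float:
    """Sustained rate needed to keep pace, in decimal MB per second."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


for tb in (0.25, 0.5):
    print(f"{tb} TB/day -> {required_mb_per_second(tb):.1f} MB/s sustained")
```

This gives roughly 2.9 MB/s for 0.25TB per day and 5.8 MB/s for 0.5TB per day. If checking only runs for part of the day, or must catch up after a backlog, the required rate rises in proportion.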
Issue champion Maureen Pennock (BL)
Other interested parties
Sven Schlarb(ONB)
Christy Henshaw (Wellcome Library, UK) (external)
Ross Spencer (The National Archives, UK) (external)
Bjarne Andersen (SB) - SB is interested, but we cannot work on this issue until the relevant digitisation project (Newspapers) has begun
Possible Solution approaches
  • Any developments to meet this Issue should consider the following, and ensure appropriate liaison where solutions may exist or may be under development:
    • JHOVE/JHOVE2 may provide some of the solution if developed further. Will JHOVE2 developments meet these needs?
    • Wellcome Library may develop some solutions in this area
    • Ross Spencer has done some development in this area which might work well with Johan's developments (discussions are ongoing)
    • Modification of existing rendering tools to do thorough parsing / rendering check
  • KEEPS
    • Watch may contribute to the solution with the following triggers:
      • Monitor characterization tools
      • Monitor changes in policy
  • SB
    • 1. Develop a language (XML?) to describe institutional collection profiles
    • 2. Write a comparator that compares the output of characterisation tools with the profile to judge whether files conform not only to the formal file format specification but also to the local institutional requirements
    • 3. This "judgement" could then be used in a Taverna workflow to sort large numbers of files into two piles: those that conform to the profile and those that do not.
Context Details of the institutional context to the Issue. (May be expanded at a later date)
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Training Needs Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Datasets JISC1 19th Century Digitised Newspapers
Brightsolid Newspapers (TBC)

Danish scanned books (TIFF format)
Solutions

BL JP2 Profile:

Parameter/Field Value
Compression Lossy (detail TBC)
Number of components 3
Component Transform Yes (irreversible)
Tile size One tile for entire image
Wavelet Filter 9-7 irreversible
Number of levels Variable; 6 used for test image
Number of layers Multiple
Progression order RPCL
Codestream markers Packet-length markers
Precincts 256x256, 256x256, 128x128
Codeblock size 64x64
Coder Bypass Yes
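
The fixed-value rows of this profile could be held in a machine-readable form and compared against properties extracted by a characterisation tool. The sketch below is illustrative only: the key names, the BL_JP2_PROFILE dictionary and the compare_to_profile function are all hypothetical, and the input is assumed to be a dictionary of already-extracted properties. Rows without a single fixed value (compression detail, number of levels, number of layers) are left out.

```python
# Hypothetical machine-readable form of the fixed-value profile rows.
BL_JP2_PROFILE = {
    "components": 3,
    "component_transform": True,
    "tiles": 1,
    "wavelet_filter": "9-7 irreversible",
    "progression_order": "RPCL",
    "packet_length_markers": True,
    "precincts": [(256, 256), (256, 256), (128, 128)],
    "codeblock_size": (64, 64),
    "coder_bypass": True,
}


def compare_to_profile(properties, profile=BL_JP2_PROFILE):
    """Return a list of mismatches; an empty list means the file conforms."""
    mismatches = []
    for key, expected in profile.items():
        if key not in properties:
            mismatches.append(f"{key}: missing, expected {expected!r}")
        elif properties[key] != expected:
            mismatches.append(
                f"{key}: found {properties[key]!r}, expected {expected!r}"
            )
    return mismatches


def sort_into_piles(files):
    """Split a {filename: properties} mapping into conforming and failing files."""
    conforming, failing = [], {}
    for name, properties in files.items():
        mismatches = compare_to_profile(properties)
        if mismatches:
            failing[name] = mismatches
        else:
            conforming.append(name)
    return conforming, failing
```

Keeping the reasons for each failure alongside the file name means a rejected file can be returned to the supplier with a statement of exactly which profile rows it violates.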

Solutions:

Title SO1 Simple JP2 file structure checker
Detailed description
Note that this development has now been replaced by: [SO15 JP2 validator and properties extractor (jpylyzer)]



Assuming that a Preservation Plan has already been devised, a complete solution for this issue would be:

Create a migration and quality assurance workflow (based on Taverna workflows in the beginning and on Hadoop later on). The workflow should consist of the following steps:

1. The first step would be to use one of the TIFF to JPEG2000 migration services provided by the AS WP.
2. The second step would be to use an image comparison service to assess how much pixel information has changed between the original TIFF and the newly converted JP2 file. This service should be provided by the QA WP.
3. The results of the evaluation would dictate what to do next: a) archive and add preservation metadata, or b) repeat the process.

Other things that should be considered are:
 
In brief, when jp2StructCheck analyses a file, it first parses the top-level box structure, and collects the unique identifiers (or marker codes) of all boxes. If it encounters the box that contains the code stream, it checks if the code stream is terminated by a valid end-of-codestream marker. Finally, it checks if the file contains all the compulsory/required top-level boxes. These are: JPEG 2000 signature box, File Type box, JP2 Header box, Contiguous Codestream box.
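
The steps just described can be sketched as follows. This is not the jp2StructCheck source code but a minimal illustration of the same idea, with a hypothetical function name; it relies only on the JP2 box layout (a 4-byte big-endian length, a 4-byte type, and an optional 8-byte extended length) and on the end-of-codestream marker 0xFFD9.

```python
import struct

# Top-level boxes that every JP2 file must contain.
REQUIRED_BOXES = (b"jP  ", b"ftyp", b"jp2h", b"jp2c")
EOC = b"\xff\xd9"  # end-of-codestream marker


def check_jp2_structure(data: bytes) -> list:
    """Return a list of structural problems; an empty list means none found."""
    problems = []
    found = []
    pos = 0
    while pos + 8 <= len(data):
        length, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if length == 1:  # an extended 64-bit length follows the box type
            if pos + 16 > len(data):
                problems.append("truncated extended box header")
                break
            (length,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif length == 0:  # the box runs to the end of the file
            length = len(data) - pos
        found.append(box_type)
        if length < header or pos + length > len(data):
            problems.append(
                f"box {box_type!r} at offset {pos} is malformed "
                "or runs past the end of the file"
            )
            break
        # The codestream is the payload of the Contiguous Codestream box.
        if box_type == b"jp2c" and data[pos + length - 2 : pos + length] != EOC:
            problems.append("codestream does not end with an EOC marker")
        pos += length
    for required in REQUIRED_BOXES:
        if required not in found:
            problems.append(f"missing required box {required!r}")
    return problems
```

A check like this is cheap because it never decodes image data, but for the same reason it is only a first filter: a file truncated at a point where the preceding two bytes happen to be 0xFFD9 would pass, and damage inside the codestream is not detected.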

For more information see this blog post:
http://www.openplanetsfoundation.org/blogs/2011-09-01-simple-jp2-file-structure-checker

Solution Champion
Johan van der Knijff (KB)
Miguel Ferreira (KEEPS)
Corresponding Issue(s)
myExperiment Link
Not yet available
Tool Registry Link
Simple JP2 file structure checker
Evaluation
Title SO9 Matchbox - Image comparison tool based on bag-of-(visual-)words matching
Detailed description This digital preservation QA command line tool analyzes JP2K images using a bag-of-(visual-)words matching method. The tool aims to detect geometrical distortions and duplicate or missing pages, either within one book or when comparing old and new versions of a Google book. The approach supports identification of corresponding images and detection of duplicated, removed or added pages. The method requires a global dictionary for the whole book.
The difference is measured on the interval [0, 1], where 0 means most similar and 1 means most different.
Solution Champion
Huber-Mörk Reinhold (AIT), Schindler Alexander (AIT), Graf Roman (AIT)
Corresponding Issue(s)
IS10 Potential bit rot in image files that were stored on CD
IS27 Quality assurance in redownload workflows of digitised books
myExperiment Link
TBD
Tool Registry Link
TBD
Evaluation
TBD
  1. Oct 23, 2012

    It is not clear whether the CC tools are to be used against TIFFs or JPEG2000s. The dataset only includes TIFFs; however, the issue is about JPEG2000 profiles.

    PS: There is a broken link at the end of the page.