The master files from legacy digitized image collections are typically TIFF files that can be costly to store due to their size. A preservation planning exercise at the British Library indicated that migration to JPEG2000 would reduce storage size and costs while at the same time facilitating enhanced user access. Lower cost is, of course, a very important factor in preservation! The cost benefit can only be realized if we can remove the original TIFFs and this can only be done if we can provide evidence of successful migration. This scenario and scenario 3 are aimed at providing this evidence and promoting confidence in the migration process.

Here we define a successful migration as one where:

  • Significant properties of the original are not lost
    • Significant properties include:
      • Relevant embedded metadata.
      • Image properties like resolution and size.
      • Visual characteristics of the image such as colour reproduction.
  • The JPEG2000 is valid and complete.

Scenario 2 will be limited to validating the metadata, colour profile and image format of the migrated image. Scenario 3 enhances the quality assurance process by comparing the original and migrated files using image comparison techniques such as perceptual hashing.

This scenario addresses scalability and automation. The dataset is large (80TB, 2 million pages) and if there is any manual quality assurance it'll be done on a very small sample. As such any solution must:

  • Operate reliably at scale (80TB, 2 million pages)
  • Automate QA
  • Perform QA in a process that is independent of the migration process
  • Provide strong evidence of significant properties in the migration matching those in the original


JISC1 19th Century Digitised Newspapers
Description The collection consists of 2.2 million pages of digitised 19th Century Newspapers. Content includes:
  • TIFF page masters, 
  • TIFF page service copies,
  • TIFF article service copies,
  • XML METS metadata, 
  • XML. 
    Samples of migrated JPEG2000 files are also available, including some truncated JPEG2000s that were damaged during a faulty migration process.

    The complete collection in TIFF form is ~80TB. 
Licensing The collection sample is available for use under a BL licence, which restricts usage to research only. Otherwise it is not restricted to SCAPE Project partners. See full licence
Owner British Library
Dataset Location
Collection expert TBC
Issues brainstorm  
List of issues IS44 QA of migrated images
IS2 Do acquired files conform to an agreed technical profile, are they valid and are they complete?
IS1 Digitised TIFFs do not meet storage and access requirements


IS1 Digitised TIFFs do not meet storage and access requirements
Detailed description An important part of digital preservation is the willingness and financial commitment of a memory institution to preserve the data for the long term. Given the time scales in question, any cost saving is to be welcomed.

At the BL, as elsewhere, the cost of storing uncompressed TIFFs currently outweighs the risk of replacing these images with a (perhaps) compressed format.

As a side benefit, replacing the TIFF images with alternative representations will facilitate access to the materials - smaller files to manipulate and download and native tool support in browsers and standard OSs.

Access metrics also help to obtain the commitment of the memory institution to preserve data.
Scalability Challenge
The JISC1 collection is high volume (80TB). There are no specific requirements around performance of migration+QA solutions, although it would be desirable to complete processing within weeks rather than months.
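
To make "weeks rather than months" concrete, the sustained throughput a migration+QA pipeline would need can be estimated with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, decimal units (1 TB = 10^12 bytes) are assumed, and the deadlines of 2, 4 and 8 weeks are assumptions rather than stated requirements.

```python
def required_throughput_mb_per_s(total_tb: float, days: float) -> float:
    """Sustained rate in MB/s needed to process total_tb terabytes in `days` days.

    Assumes decimal units: 1 TB = 1e12 bytes, 1 MB = 1e6 bytes.
    """
    total_bytes = total_tb * 1e12
    seconds = days * 24 * 60 * 60
    return total_bytes / seconds / 1e6


if __name__ == "__main__":
    # The 80 TB figure comes from the dataset description; the deadlines are assumed.
    for weeks in (2, 4, 8):
        rate = required_throughput_mb_per_s(80, weeks * 7)
        print(f"{weeks} weeks: {rate:.1f} MB/s sustained")
```

Under these assumptions a four week run needs roughly 33 MB/s sustained across reading, migrating and checking, which suggests the work has to be parallelised rather than run on a single node.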
Issue champion Peter Cliff (BL)
Other interested parties
Schlarb Sven (ONB)
Possible approaches Migration from TIFF to JPEG2000
Lessons Learned
Training Needs  
Datasets JISC1 19th Century Digitised Newspapers
Solutions SO31 Preservation Grade TIFF to JPEG2000 Migration


IS2 Do acquired files conform to an agreed technical profile, are they valid and are they complete?
Detailed description Some forms of content arrive at the preserving institution and will be preserved "as is" regardless of how the files have been constructed (e.g. web archived content). Other content can be acquired under a specific agreement with the creator or publisher, and the preserving institution typically expects the content in a particular form. This may go further than describing the formats used, and will actually describe specific technical constraints on the construction of the files. For example, the BL's Technical Guidelines for Digitisation state that digitised TIFFs should be TIFF version 6, LZW compressed and each TIFF should contain only one image. These technical constraints are typically described as a "format profile".
If content received from the creator or publisher does not conform to the agreed profile, the preserving institution can reject the content and request new/revised/re-scanned content. However, the preserving organisation must have the capability to verify a digital object's compliance with a profile, and if it is not compliant, identify how it fails. It is necessary to perform this check in an automated manner.

The SCAPE project proposal calls this "Policy Driven Validation". Policy is most likely not the right word; it would be better to call it something like "profile".

Image files may be constructed imperfectly or may be damaged during storage or transfer. It would therefore be useful to be able to verify in an automated fashion that the files are complete (i.e. have not been arbitrarily truncated) and that the files are valid and/or will render in one or more common viewing applications without error. Examples of truncated JPEG2000s in the JISC1 dataset are typically reported as valid and well formed by JHOVE.
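
As an illustration of how cheap a first-pass truncation check can be, the sketch below tests two things: that a file starts with the 12-byte JP2 signature box and that it ends with the JPEG 2000 end-of-codestream (EOC) marker, 0xFFD9. It is a heuristic, not a validator. It assumes the codestream is the last thing in the file with no trailing bytes, which need not hold for every valid JP2, and the function names are hypothetical.

```python
import os

# First 12 bytes of every JP2 file (the JPEG 2000 signature box).
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")
# End-of-codestream marker that terminates a JPEG 2000 codestream.
EOC_MARKER = b"\xff\xd9"


def looks_complete_jp2(data: bytes) -> bool:
    """Heuristic: JP2 signature at the start and EOC marker at the very end."""
    return data.startswith(JP2_SIGNATURE) and data.endswith(EOC_MARKER)


def file_looks_complete(path: str) -> bool:
    """Same check on a file, reading only its first 12 and last 2 bytes."""
    if os.path.getsize(path) < len(JP2_SIGNATURE) + len(EOC_MARKER):
        return False
    with open(path, "rb") as handle:
        head = handle.read(len(JP2_SIGNATURE))
        handle.seek(-len(EOC_MARKER), os.SEEK_END)
        tail = handle.read(len(EOC_MARKER))
    return head == JP2_SIGNATURE and tail == EOC_MARKER
```

Because only a few bytes are read per file, a check like this could run over the whole collection before any heavier validation or rendering test is attempted. A file that passes may still be invalid, so it complements full validation rather than replacing it.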

Example 1: JISC1 Newspapers
Within this dataset there are a number of truncated JPEG2000 images. These should be checked for completeness, validity and renderability (i.e. renders in one or more typical JPEG2000 viewers).

Example 2: Brightsolid Newspapers
Digitisation of this collection is ongoing. There is a need to check in and QA new JPEG2000 images. This should involve a check that each image conforms to the new BL JPEG2000 profile, as well as checking for completeness, validity and renderability. The BL profile can be found at the end of this page.
Scalability Challenge
Large scale digitisation projects need to check in content and verify its compliance to a profile quickly and efficiently despite the high volume of data. For example, JPEG2000s digitised for a current BL project will be received at between 0.25 and 0.5TB per day. Checking must be performed at a sufficient rate to prevent a build up of material and allow timely rejection of content that does not match the profile (problem pages can be re-digitised if issues are identified in a timely manner: i.e. within days rather than weeks).
Issue champion Maureen Pennock (BL)
Other interested parties
Sven Schlarb(ONB)
Christy Henshaw (Wellcome Library, UK) (external)
Ross Spencer (The National Archives, UK) (external)
Bjarne Andersen (SB) - SB is interested, but we cannot work on this issue until the relevant digitisation project (Newspapers) has begun
Possible Solution approaches
  • Any developments to meet this Issue should consider the following, and ensure appropriate liaison where solutions may exist or may be under development:
    • JHOVE/JHOVE2 may provide some of the solution if developed further. Will JHOVE2 developments meet these needs?
    • Wellcome Library may develop some solutions in this area
    • Ross Spencer has done some development in this area which might work well with Johan's developments (discussions are ongoing)
    • Modification of existing rendering tools to do thorough parsing / rendering check
    • Watch may contribute to the solution with the following triggers:
      • Monitor characterization tools
      • Monitor changes in policy
  • SB
    • 1. Develop language (XML ?) to describe institutional collection profiles
    • 2. Write comparator that compares the output of characterisation tools with the profile to judge if files conform not only to the formal file format specification but also to the local institutional requirements
    • 3. This "judgement" could then be used in a Taverna workflow to sort large numbers of files into two piles: those that conform to the profile and those that do not.
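
The SB steps above can be sketched as follows. This is a minimal illustration, not an agreed design: the profile is a plain dictionary rather than an XML language, the property names (`format`, `version`, `compression`, `image_count`) are hypothetical stand-ins for whatever a characterisation tool reports, and the expected values are taken from the TIFF example in the detailed description.

```python
# Hypothetical profile: TIFF version 6, LZW compressed, one image per file.
EXAMPLE_PROFILE = {
    "format": "TIFF",
    "version": "6.0",
    "compression": "LZW",
    "image_count": 1,
}


def check_against_profile(properties: dict, profile: dict) -> list:
    """Return one human-readable failure per property that deviates from the profile."""
    failures = []
    for key, expected in profile.items():
        actual = properties.get(key)
        if actual != expected:
            failures.append(f"{key}: expected {expected!r}, found {actual!r}")
    return failures


def sort_into_piles(files: dict, profile: dict):
    """Split {filename: properties} into conforming names and {name: failures}."""
    conforming, nonconforming = [], {}
    for name, properties in files.items():
        failures = check_against_profile(properties, profile)
        if failures:
            nonconforming[name] = failures
        else:
            conforming.append(name)
    return conforming, nonconforming
```

Returning the list of failures, rather than a bare yes/no, addresses the requirement to identify how a non-compliant file fails, so that the supplier can be told what to correct.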
Context Details of the institutional context to the Issue. (May be expanded at a later date)
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Training Needs Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Datasets JISC1 19th Century Digitised Newspapers
Brightsolid Newspapers (TBC)

Danish scanned books (TIFF format)

BL JP2 Profile:

Parameter/Field Value
Compression Lossy (detail TBC)
Number of components 3
Component Transform Yes (irreversible)
Tile size One tile for entire image
Wavelet Filter 9-7 irreversible
Number of levels Variable; 6 used for test image
Number of layers Multiple
Progression order RPCL
Codestream markers Packet-length markers
Precincts 256x256, 256x256, 128x128
Codeblock size 64x64
Coder Bypass Yes


IS44 Migrated image metadata must map or match to those of the original
Detailed description IS2 (Do acquired files conform to an agreed technical profile, are they valid and are they complete?) deals with the format-specific significant properties of a migration and whether those properties either match or have been translated into the migration format in a way sympathetic to the needs of digital preservation.

Image files also contain other embedded metadata - EXIF data, etc. In some cases the original will contain data that does not need to be translated to the migration, or that would be inappropriate to translate (a file format value, for example), but often significant properties such as "Creator" or "Creation Date" should be preserved in the migration. There may also be times where original metadata values need to be mapped or translated into the migration (e.g. where a capture agency uses a shorthand for the organization name).
Scalability Challenge
The quantity of the images and the size of some of the TIFFs.
Issue champion Peter Cliff (BL)
Other interested parties
Schlarb Sven (ONB)
Possible Solution approaches The solution used is dependent on what we consider to be significant properties of the original and which of those properties need to be successfully captured in the migration.

Each of the originals will have embedded metadata. The first task will be to identify which, if any, of those fields should be migrated to the new format. Those field values should then be extracted from the original and the migration and compared to see if they match. It is possible that the migration process will need to know what embedded metadata to migrate, possibly as a two-step process: migrate the image and then re-insert the metadata fields. This suggests that the migration tool should allow for parameters to specify which metadata fields to keep.
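
A minimal sketch of the comparison step, assuming the metadata has already been extracted from both files into flat dictionaries. The function name, the field names and the shorthand "BL" are hypothetical; the optional `value_map` covers the case where a value is deliberately translated during migration.

```python
def compare_metadata(original: dict, migrated: dict, fields: list, value_map: dict = None) -> dict:
    """Compare only the chosen fields; return {field: (expected, actual)} for mismatches.

    value_map optionally maps {field: {original_value: expected_migrated_value}}
    for values that are deliberately translated during migration.
    """
    value_map = value_map or {}
    mismatches = {}
    for field in fields:
        expected = original.get(field)
        # Apply the deliberate translation, if one is configured for this value.
        expected = value_map.get(field, {}).get(expected, expected)
        actual = migrated.get(field)
        if expected != actual:
            mismatches[field] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    tiff_meta = {"Creator": "BL", "CreationDate": "2010-05-01", "Format": "image/tiff"}
    jp2_meta = {"Creator": "British Library", "CreationDate": "2010-05-01", "Format": "image/jp2"}
    # "Format" is left out of the field list because it is expected to change.
    print(compare_metadata(tiff_meta, jp2_meta, ["Creator", "CreationDate"],
                           {"Creator": {"BL": "British Library"}}))
```

Keeping the field list and the value mapping as parameters means that the decision about what counts as significant stays in configuration rather than in the tool itself.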

Each of the originals will have some visual properties that we may want to capture and verify on migration. This could include comparing the two images pixel by pixel, comparing their histograms, or perceptual hashing techniques.
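
To illustrate two of these options, the sketch below computes a maximum pixel difference and a simple average hash (one basic form of perceptual hash) on greyscale images held as nested lists. It is a toy under stated assumptions: decoding the TIFF and JPEG2000 into pixel grids is assumed to happen elsewhere, the function names are hypothetical, and any acceptance threshold would need to be chosen by experiment. With a lossy migration an exact pixel match cannot be expected, so a tolerance of some kind is needed.

```python
def average_hash(pixels, hash_size=8):
    """Average hash of a greyscale image given as a 2D list of ints.

    The image is reduced to hash_size x hash_size block means; each bit records
    whether a block is brighter than the overall mean.
    Height and width must both be at least hash_size.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(hash_size):
        r0, r1 = r * height // hash_size, (r + 1) * height // hash_size
        for c in range(hash_size):
            c0, c1 = c * width // hash_size, (c + 1) * width // hash_size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if value > mean else 0 for value in cells]


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hashes of equal length."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def max_pixel_difference(image_a, image_b):
    """Largest absolute difference between corresponding pixels of equal-sized images."""
    return max(abs(p - q) for row_a, row_b in zip(image_a, image_b)
               for p, q in zip(row_a, row_b))
```

A small Hamming distance between the hashes of the original and the migration suggests that the overall appearance has been preserved, while the pixel difference gives a stricter, more local measure.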

Here we may include using OCR to extract text from these images and considering that text extraction to be a significant property. This sounds like a useful approach, but is specific to migration of images of text and as such I think this validation method should be considered secondary to a more general image solution.

Finally, while each image in the collection was validated at ingest, it may be worth validating both input and output (migrated) formats meaning we will need some format validation tools. (Luckily some very good ones exist!)

Context See LSDRT2 Validating files migrated from TIFF to JPEG2000 
Lessons Learned TBD
Training Needs Should be added to the Solution
Datasets British Library 19th Century newspapers
Solutions SO32 - Metadata Extraction
SO33 - Metadata Comparison


Title SO15 JP2 validator and properties extractor
Detailed description Migration to JPEG2000 can be problematic - both because of the interpretation of the standard and also because migration tools may fail mid-process.

We need a post-migration quality assurance tool that can validate a JPEG2000 to ensure that it:

  1. Conforms to the JPEG 2000 Part 1 (JP2) specification.
  2. Either:
    1. Conforms to a consistent preservation-worthy interpretation of the specification.
    2. Conforms to an interpretation of the specification provided by the user.
  3. Is complete and capable of being rendered.

    The jpylyzer tool has capability to analyse a file and verify whether its contents qualify as valid JPEG 2000 Part 1 (JP2). 
    It also reports back its properties which can then be used to ensure conformance to an interpretation of the specification. 
    For more information see this blog post:

    Comments from TNA
    Comments from Wellcome Library
Solution Champion
Johan van der Knijff (KB)
Corresponding Issue(s)
myExperiment Link
Not yet available
Tool Registry Link


Title SO30 Automated assessment of JP2 against a technical profile
Detailed description A simple method for doing a rule-based assessment of JP2 images using Jpylyzer and the Schematron validation language. See the following OPF blog post for details:
Solution Champion
Johan van der Knijff (KB)
Corresponding Issue(s)
myExperiment Link
A link to a corresponding workflow on myExperiment
Tool Registry Link
Any notes or links on how the solution performed. This will be developed and formalised by the Testbed SP.


Title SO31 Preservation Grade TIFF to JPEG2000 Migration
Detailed description JPEG2000 has some issues as a preservation format. It would be nice to have a tool that can migrate a TIFF to JPEG2000 in a consistent and preservation safe fashion, maintaining (or normalizing) the embedded ICC profile, resolution headers and any other metadata that may emerge as being significant.

The tool should either ensure that all embedded metadata from the source is embedded into the migration or provide parameters to enable the execution workflow to dictate what is significant and should be kept.

Approaches could include a two-step process: migration of the image followed by appropriate cleanup/reinstatement of significant properties (metadata here).
Solution Champion
Peter Cliff (BL)
Corresponding Issue(s)
myExperiment Link
Tool Registry Link
A number of tools exist that can migrate a TIFF to a JPEG2000. The common ones are failing to do this in a consistent fashion. This solution therefore does not exist.


Title SO32 Image Metadata Extractor
Detailed description Simple tool to extract metadata from any given image file and provide a standard output. 
Solution Champion
Peter Cliff (BL)
Corresponding Issue(s)
myExperiment Link
Tool Registry Link
Lots of options for this. One good one might be EXIF to DC XML normalizer


Title SO33 Image Metadata Compare
Detailed description During a migration image metadata from the source should be mapped to image metadata in the migration (in case any external metadata and the image ever part company!).

SO32 is a tool for extracting the metadata from the source and migration in a standard and comparable way.

To complicate things the source and migration field values may have (deliberately) changed during the migration. It may be enough to check for values in the desired fields (not everything in the source needs to have been mapped to the migration - that'll need to be configurable).

Solution Champion
Peter Cliff (BL)
Corresponding Issue(s)
myExperiment Link

Tool Registry Link
I imagine there are plenty of options here!
