The master files from legacy digitized image collections are typically TIFF files that can be costly to store due to their size. A preservation planning exercise at the British Library indicated that migration to JPEG2000 would reduce storage size and costs while at the same time facilitating enhanced user access. Lower cost is, of course, a very important factor in preservation! The cost benefit can only be realized if we can remove the original TIFFs and this can only be done if we can provide evidence of successful migration. This scenario and scenario 3 are aimed at providing this evidence and promoting confidence in the migration process.
Here we define a successful migration as one where:
Scenario 2 will be limited to validating the metadata, colour profile and image format of the migrated image. Scenario 3 enhances the quality assurance process by comparing the original and migrated files using image comparison techniques such as perceptual hashing. This scenario addresses scalability and automation. The dataset is large (80TB, 2 million pages) and any manual quality assurance will be done on a very small sample. As such, any solution must:
|
Dataset:
Title |
JISC1 19th Century Digitised Newspapers |
Description | The collection consists of 2.2 million pages of digitised 19th Century Newspapers. Content includes:
|
Licensing | The collection sample is available for use under a BL licence, which restricts usage to research only. Access is otherwise not limited to SCAPE Project partners. See full licence |
Owner | British Library |
Dataset Location |
TBC |
Collection expert | TBC |
Issues brainstorm | |
List of issues | IS44 QA of migrated images; IS2 Do acquired files conform to an agreed technical profile, are they valid and are they complete?; IS1 Digitised TIFFs do not meet storage and access requirements |
Issue:
Title |
IS1 Digitised TIFFs do not meet storage and access requirements |
Detailed description | An important part of digital preservation is the willingness and financial commitment of a memory institution to preserve the data for the long term. Given the time scales in question, any cost saving is to be welcomed. At the BL, as elsewhere ... As a side benefit, replacing the TIFF images with alternative representations will facilitate access to the materials: smaller files to manipulate and download, and native tool support in browsers and standard OSs. Access metrics also help to obtain the commitment of the memory institution to preserve data. |
Scalability Challenge |
The JISC1 collection is high volume (80TB). There are no specific requirements around performance of migration+QA solutions, although it would be desirable to complete processing within weeks rather than months. |
Issue champion | |
Other interested parties |
Sven Schlarb |
Possible approaches | Migration from TIFF to JPEG2000 |
Context | |
Lessons Learned | |
Training Needs | |
Datasets | JISC1 19th Century Digitised Newspapers |
Solutions | SO31 Preservation Grade TIFF to JPEG2000 Migration |
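The scalability challenge for IS1 ("weeks rather than months" for 80TB) can be turned into a rough throughput target. The sketch below is illustrative arithmetic only: it assumes decimal terabytes and that migration plus QA reads each byte once, and the function name is hypothetical rather than part of any SCAPE tool.

```python
def required_throughput_mb_per_s(volume_tb: float, days: float) -> float:
    """Sustained throughput (MB/s) needed to process volume_tb within `days`.

    Assumption: decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes).
    """
    if days <= 0:
        raise ValueError("days must be positive")
    total_bytes = volume_tb * 1e12
    seconds = days * 24 * 60 * 60
    return total_bytes / seconds / 1e6


if __name__ == "__main__":
    # The 80TB figure comes from the issue text; the week counts are
    # arbitrary sample values for "weeks rather than months".
    for weeks in (2, 4, 8):
        rate = required_throughput_mb_per_s(80, weeks * 7)
        print(f"{weeks} weeks: {rate:.1f} MB/s sustained")
```

The same function can be applied to the IS2 check-in rate (0.25 to 0.5TB per day) by passing `days=1`.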
Issue:
Title |
IS2 Do acquired files conform to an agreed technical profile, are they valid and are they complete? |
Detailed description | Some forms of content arrive at the preserving institution and will be preserved "as is" regardless of how the files have been constructed (e.g. web archived content). Other content can be acquired under a specific agreement with the creator or publisher, and the preserving institution typically expects the content in a particular form. This may go further than describing the formats used, and will actually describe specific technical constraints on the construction of the files. For example, the BL's Technical Guidelines for Digitisation state that digitised TIFFs should be TIFF version 6, LZW compressed, and that each TIFF should contain only one image. These technical constraints are typically described as a "format profile". If content received from the creator or publisher does not conform to the agreed profile, the preserving institution can reject the content and request new/revised/re-scanned content. However, the preserving organisation must have the capability to verify a digital object's compliance with a profile and, if it is not compliant, identify how it fails. It is necessary to perform this check in an automated manner. The SCAPE project proposal calls this "Policy Driven Validation". "Policy" is most likely not the right word - it would be better to call it something like "profile".
Image files may be constructed imperfectly or may be damaged during storage or transfer. It would therefore be useful to be able to verify in an automated fashion that the files are complete (i.e. have not been arbitrarily truncated) and that the files are valid and/or will render in one or more common viewing applications without error. The truncated JPEG2000s found in the JISC1 dataset are typically reported as valid and well formed by JHOVE.
Example 1: JISC1 Newspapers Within this dataset there are a number of truncated JPEG2000 images. These should be checked for completeness, validity and renderability (i.e. that each renders in one or more typical JPEG2000 viewers).
Example 2: Brightsolid Newspapers Digitisation of this collection is ongoing. There is a need to check in and QA new JPEG2000 images. This should involve a check that each image conforms to the new BL JPEG2000 profile, as well as checking for completeness, validity and renderability. The BL profile can be found at the end of this page. |
Scalability Challenge |
Large scale digitisation projects need to check in content and verify its compliance to a profile quickly and efficiently despite the high volume of data. For example, JPEG2000s digitised for a current BL project will be received at between 0.25 and 0.5TB per day. Checking must be performed at a sufficient rate to prevent a build up of material and allow timely rejection of content that does not match the profile (problem pages can be re-digitised if issues are identified in a timely manner: i.e. within days rather than weeks). |
Issue champion | Maureen Pennock |
Other interested parties |
Sven Schlarb; Christy Henshaw (Wellcome Library, UK) (external); Ross Spencer (The National Archives, UK) (external); Bjarne Andersen |
Possible Solution approaches |
|
Context | Details of the institutional context to the Issue. (May be expanded at a later date) |
Lessons Learned | Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices) |
Training Needs | Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP. |
Datasets | JISC1 19th Century Digitised Newspapers; Brightsolid Newspapers (TBC); Danish scanned books (TIFF format) |
Solutions |
|
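As a rough illustration of the completeness check described in IS2, the sketch below tests two byte-level facts about a JP2 file: that it starts with the JP2 signature box and that it ends with the JPEG 2000 end-of-codestream (EOC) marker. It assumes the codestream box is the last box in the file, which is common but not required by the standard, so it is a cheap first-pass heuristic and not a substitute for full validation or a rendering test. The function name is hypothetical.

```python
# Fixed 12-byte JP2 signature box that opens every JP2 file.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")
# End-of-codestream marker that terminates a JPEG 2000 codestream.
EOC_MARKER = b"\xff\xd9"


def quick_jp2_completeness_check(data: bytes) -> list[str]:
    """Return a list of problems found; an empty list means no problem seen.

    Assumption: the contiguous codestream box is the final box in the file,
    so a complete file ends with the EOC marker.
    """
    problems = []
    if not data.startswith(JP2_SIGNATURE):
        problems.append("missing JP2 signature box")
    if not data.endswith(EOC_MARKER):
        problems.append("file does not end with EOC marker (possibly truncated)")
    return problems


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            issues = quick_jp2_completeness_check(handle.read())
        print(path, "OK" if not issues else "; ".join(issues))
```

Reading the whole file is wasteful for large images; a production check would read only the first 12 and last 2 bytes.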
BL JP2 Profile:
Parameter/Field | Value |
Compression | Lossy (detail TBC) |
Number of components | 3 |
Component Transform | Yes (irreversible) |
Tile size | One tile for entire image |
Wavelet Filter | 9-7 irreversible |
Number of levels | Variable; 6 used for test image |
Number of layers | Multiple |
Progression order | RPCL |
Codestream markers | Packet-length markers |
Precincts | 256x256, 256x256, 128x128 |
Codeblock size | 64x64 |
Coder Bypass | Yes |
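A profile like the one above can be expressed as a small set of machine-checkable rules. The sketch below is one possible shape for such a check, not the project's implementation: it assumes the image properties have already been extracted into a plain dictionary by some characterisation tool, and all key names (`components`, `tiles`, `progression_order` and so on) are hypothetical. Rows of the profile that are still TBC or variable (compression detail, number of levels) are deliberately left out, as are precincts and codestream markers.

```python
# Each rule maps a (hypothetical) property name to a predicate on its value.
BL_PROFILE_RULES = {
    "components": lambda v: v == 3,
    "tiles": lambda v: v == 1,                      # one tile for entire image
    "transformation": lambda v: v == "9-7 irreversible",
    "layers": lambda v: v > 1,                      # "Multiple"
    "progression_order": lambda v: v == "RPCL",
    "codeblock_size": lambda v: v == (64, 64),
    "coder_bypass": lambda v: v is True,
}


def check_against_profile(properties: dict, rules: dict = BL_PROFILE_RULES) -> list[str]:
    """Return one message per rule that is missing or violated."""
    failures = []
    for name, rule in rules.items():
        if name not in properties:
            failures.append(f"{name}: missing")
        elif not rule(properties[name]):
            failures.append(f"{name}: unexpected value {properties[name]!r}")
    return failures


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    sample = {
        "components": 3, "tiles": 1, "transformation": "9-7 irreversible",
        "layers": 8, "progression_order": "LRCP",
        "codeblock_size": (64, 64), "coder_bypass": True,
    }
    for failure in check_against_profile(sample):
        print(failure)
```

Reporting every failed rule, instead of stopping at the first, supports the requirement to identify how a non-compliant file fails.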
Issue:
Title |
IS44 Migrated image metadata must map or match to those of the original |
Detailed description | IS2 (Do acquired files conform to an agreed technical profile, are they valid and are they complete?) deals with the format-specific significant properties of a migration and whether those properties either match or have been translated into the migration format in a way sympathetic to the needs of digital preservation. Image files also contain other embedded metadata - EXIF data, etc. In some cases the original will contain data that does not need to be translated to the migration, or would be inappropriate there (a file format value, for example), but often significant properties such as "Creator" or "Creation Date" should be preserved in the migration. There may also be times where original metadata values need to be mapped or translated into the migration (e.g. where a capture agency uses a shorthand for the organization name). |
Scalability Challenge |
The quantity of the images and the size of some of the TIFFs. |
Issue champion | Peter Cliff (BL) |
Other interested parties |
Sven Schlarb |
Possible Solution approaches | The solution used depends on what we consider to be the significant properties of the original and which of those properties need to be successfully captured in the migration.
Each of the originals will have embedded metadata. The first task will be to identify which, if any, of those fields should be migrated to the new format. Those field values should then be extracted from the original and the migration and compared to see if they match. The migration process may need to know which embedded metadata to migrate, possibly as a two-step process: migrate the image, then re-insert the metadata fields. This suggests that the migration tool should accept parameters specifying which metadata fields to keep.
Each of the originals will have some visual properties that we may want to capture and verify on migration. This could include comparing the two images pixel by pixel, comparing their histograms, or using perceptual hashing techniques. Here we may include using OCR to extract text from these images and considering that extracted text to be a significant property. This sounds like a useful approach, but it is specific to migration of images of text, and as such I think this validation method should be considered secondary to a more general image solution.
Finally, while each image in the collection was validated at ingest, it may be worth validating both input and output (migrated) formats, meaning we will need some format validation tools. (Luckily some very good ones exist!) |
Context | See LSDRT2 Validating files migrated from TIFF to JPEG2000 |
Lessons Learned | TBD |
Training Needs | Should be added to the Solution |
Datasets | British Library 19th Century newspapers |
Solutions | SO32 - Metadata Extraction SO33 - Metadata Comparison |
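Of the visual comparison options listed under IS44, perceptual hashing is the least self-explanatory, so a minimal sketch may help. It uses the simple "average hash" variant, chosen here purely for illustration and not confirmed as the technique the project will use: the image is reduced to a small grid of mean brightness values, each cell becomes one bit depending on whether it is above the overall mean, and two images are compared by counting differing bits. It assumes each image has already been decoded to a 2D list of greyscale values; decoding TIFF or JPEG2000 is out of scope.

```python
def average_hash(pixels, hash_size=8):
    """Return a list of hash_size*hash_size bits for a greyscale image.

    pixels: 2D list (rows of equal length) of numeric brightness values.
    """
    height, width = len(pixels), len(pixels[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image is smaller than the hash grid")
    cells = []
    for gy in range(hash_size):
        y0, y1 = gy * height // hash_size, (gy + 1) * height // hash_size
        for gx in range(hash_size):
            x0, x1 = gx * width // hash_size, (gx + 1) * width // hash_size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))  # mean brightness of the cell
    mean = sum(cells) / len(cells)
    return [1 if cell > mean else 0 for cell in cells]


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which the two hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


if __name__ == "__main__":
    # Synthetic 16x16 "page": dark left half, bright right half.
    original = [[10] * 8 + [200] * 8 for _ in range(16)]
    # Stand-in for a lossy migration: every pixel shifted slightly.
    migrated = [[value + 2 for value in row] for row in original]
    print(hamming_distance(average_hash(original), average_hash(migrated)))
```

A small distance suggests the migrated image is visually close to the original; the acceptable threshold is a policy decision that would need to be tuned on real samples.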
Solution
Title | SO15 JP2 validator and properties extractor |
Detailed description | Migration to JPEG2000 can be problematic - both because of the interpretation of the standard and also because migration tools may fail mid-process. We need a post-migration quality assurance tool that can validate a JPEG2000 to ensure that it:
|
Solution Champion |
Johan van der Knijff |
Corresponding Issue(s) |
|
myExperiment Link |
Not yet available |
Tool Registry Link |
jpylyzer |
Evaluation |
Solution
Title | SO30 Automated assessment of JP2 against a technical profile |
Detailed description | A simple method for doing a rule-based assessment of JP2 images using Jpylyzer and the Schematron validation language. See the following OPF blog post for details: http://openplanetsfoundation.org/blogs/2012-09-04-automated-assessment-jp2-against-technical-profile |
Solution Champion |
Johan van der Knijff |
Corresponding Issue(s) |
|
myExperiment Link |
A link to a corresponding workflow on myExperiment |
Tool Registry Link |
Jpylyzer; |
Evaluation |
Any notes or links on how the solution performed. This will be developed and formalised by the Testbed SP. |
Solution
Title | SO31 Preservation Grade TIFF to JPEG2000 Migration |
Detailed description | JPEG2000 has some issues as a preservation format. The tool should either ensure that all embedded metadata from the source is embedded into the migration, or provide parameters that let the execution workflow dictate what is significant and should be kept. Approaches could include a two-step process: migration of the image followed by appropriate cleanup/reinstatement of significant properties (metadata here). |
Solution Champion |
Peter Cliff (BL) |
Corresponding Issue(s) |
|
myExperiment Link |
|
Tool Registry Link |
A number of tools exist that can migrate a TIFF to a JPEG2000. The common ones fail to do this in a consistent fashion, so this solution does not exist as a single tool. |
Evaluation |
|
Solution
Title | SO32 Image Metadata Extractor |
Detailed description | Simple tool to extract metadata from any given image file and provide a standard output. |
Solution Champion |
Peter Cliff (BL) |
Corresponding Issue(s) | |
myExperiment Link |
|
Tool Registry Link |
Lots of options for this. One good one might be EXIF to DC XML normalizer |
Evaluation |
|
Solution
Title | SO33 Image Metadata Compare |
Detailed description | During a migration, image metadata from the source should be mapped to image metadata in the migration (in case any external metadata and the image ever part company!). SO32 is a tool for extracting the metadata from the source and migration in a standard and comparable way. To complicate things, the source and migration field values may have (deliberately) changed during the migration. It may be enough to check for values in the desired fields (not everything in the source needs to have been mapped to the migration - that'll need to be configurable). |
Solution Champion |
Peter Cliff (BL) |
Corresponding Issue(s) |
|
myExperiment Link |
|
Tool Registry Link |
I imagine there are plenty of options here! |
Evaluation |
|
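The comparison described for SO33 can be sketched as follows. This is an illustrative outline under stated assumptions, not an existing tool: it assumes SO32 has already produced a flat dictionary of field names and values for each file, and the function name, the `fields` list and the `value_map` structure are all hypothetical. The `fields` argument covers the "configurable" requirement (only the listed fields are checked), and `value_map` covers deliberate changes such as expanding a shorthand organization name.

```python
def compare_metadata(source, migrated, fields, value_map=None):
    """Compare selected metadata fields of a source and a migrated image.

    source, migrated: flat dicts of field name -> value (assumed SO32 output).
    fields: names of the fields considered significant.
    value_map: optional {field: {source_value: expected_migrated_value}}
               describing deliberate translations made during migration.
    Returns {field: (expected, actual)} for every field that does not match.
    """
    value_map = value_map or {}
    mismatches = {}
    for field in fields:
        expected = source.get(field)
        # Apply the deliberate translation for this field, if one is defined.
        expected = value_map.get(field, {}).get(expected, expected)
        actual = migrated.get(field)
        if expected != actual:
            mismatches[field] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    # Made-up example values for illustration only.
    tiff_meta = {"Creator": "BL", "CreationDate": "2010-05-01", "Format": "TIFF"}
    jp2_meta = {"Creator": "British Library", "CreationDate": "2010-05-01", "Format": "JP2"}
    result = compare_metadata(
        tiff_meta, jp2_meta,
        fields=["Creator", "CreationDate"],          # "Format" is expected to differ
        value_map={"Creator": {"BL": "British Library"}},
    )
    print(result or "all significant fields match")
```

Fields that are absent from both dictionaries compare as equal here; whether a missing significant field should instead count as a failure is a design choice left open.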