Valid and well-formed TIFF's with scanline corruption


Title
Valid and well-formed TIFF's with scanline corruption

Detailed description
At NANETH we sometimes encounter TIFFs that render incorrectly or do not render at all, although JHOVE reports them as 'valid' and 'well-formed'.

Photoshop shows a corrupted 'double image', while IrfanView shows only some random pixels on top of a black image. The ImageMagick viewer cannot render the image, and ImageMagick 'identify' reports that not enough data is available in scanline 'x'. The SPRUCE "Black and White Pixel Detector", which uses the Python Imaging Library (PIL), does not report any corruption (i.e. black or white pixels) either.

This is a major issue because validation tools mark the images as 'valid' and 'well-formed'. A solution to this issue should detect this corruption. Ideally the solution is a Python CLI application that can also be used in an automated workflow. Since the images are quite big (around 10 MB), it would need a clever algorithm and should use parallel/multi-processing.

Issue champion
Maurice de Rooij

Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets.

Possible Solution approaches
http://www.remotesensing.org/libtiff/libtiff.html#scanlines
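
A lighter pre-check than full decoding might also catch the "not enough data in scanline" symptom. The sketch below is not the LibTiff approach and does not decode any pixel data. It only reads the first IFD of a classic (non-BigTIFF) file, checks that every strip lies inside the file, and, for uncompressed chunky images, checks that the strips hold enough bytes for all scanlines. Tiled and compressed images are outside its scope, and the function names (read_first_ifd, check_strips, check_file) are hypothetical, chosen only for illustration.

```python
import struct

# Tag numbers from the TIFF 6.0 specification.
IMAGE_WIDTH, IMAGE_LENGTH = 256, 257
BITS_PER_SAMPLE, COMPRESSION = 258, 259
STRIP_OFFSETS, SAMPLES_PER_PIXEL = 273, 277
STRIP_BYTE_COUNTS, PLANAR_CONFIG = 279, 284

# struct codes for the field types handled here: BYTE, SHORT, LONG.
TYPE_FORMATS = {1: "B", 3: "H", 4: "I"}


def read_first_ifd(data):
    """Return {tag: [values]} for the first IFD of a classic TIFF."""
    if data[:2] == b"II":
        endian = "<"
    elif data[:2] == b"MM":
        endian = ">"
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("unsupported TIFF variant")
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = {}
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i
        tag, ftype, n = struct.unpack_from(endian + "HHI", data, entry)
        fmt = TYPE_FORMATS.get(ftype)
        if fmt is None:
            continue  # field types this check does not need
        size = struct.calcsize(fmt) * n
        if size <= 4:
            pos = entry + 8  # value stored inline in the entry
        else:
            (pos,) = struct.unpack_from(endian + "I", data, entry + 8)
        tags[tag] = list(struct.unpack_from(f"{endian}{n}{fmt}", data, pos))
    return tags


def check_strips(data):
    """Return a list of problems in the strip layout (empty if none found)."""
    tags = read_first_ifd(data)
    offsets = tags.get(STRIP_OFFSETS)
    counts = tags.get(STRIP_BYTE_COUNTS)
    if not offsets or not counts or len(offsets) != len(counts):
        return ["strip offsets/byte counts missing or inconsistent"]
    problems = []
    for i, (off, cnt) in enumerate(zip(offsets, counts)):
        if off + cnt > len(data):
            problems.append(f"strip {i} runs past end of file")
    # Only uncompressed, chunky data has an exactly predictable size.
    uncompressed = tags.get(COMPRESSION, [1])[0] == 1
    chunky = tags.get(PLANAR_CONFIG, [1])[0] == 1
    if uncompressed and chunky and IMAGE_WIDTH in tags and IMAGE_LENGTH in tags:
        spp = tags.get(SAMPLES_PER_PIXEL, [1])[0]
        bps = tags.get(BITS_PER_SAMPLE, [1])
        bits_per_pixel = sum(bps) if len(bps) == spp else bps[0] * spp
        row_bytes = (tags[IMAGE_WIDTH][0] * bits_per_pixel + 7) // 8
        needed = row_bytes * tags[IMAGE_LENGTH][0]
        have = sum(counts)
        if row_bytes > 0 and have < needed:
            problems.append(
                f"not enough data from scanline {have // row_bytes}: "
                f"have {have} bytes, need {needed}"
            )
    return problems


def check_file(path):
    """Run the strip check on one file, reporting unreadable structure."""
    with open(path, "rb") as fh:
        data = fh.read()
    try:
        return check_strips(data)
    except (ValueError, struct.error) as exc:
        return [f"unreadable TIFF structure: {exc}"]
```

In a workflow, check_file could be called once per path (for example over sys.argv[1:]) and the file flagged whenever the returned list is non-empty. An empty list means only that this structural check found nothing, not that the image is intact.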

Context
This issue potentially impacts all TIFF images in our collection. Checking that a file is valid and well-formed does not seem to be enough to prove that it is not corrupted. Ideally we would need a non-visual renderer in our workflows that covers and respects every aspect of the TIFF format specification and reports back any errors.

Lessons Learned
Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice

Datasets
http://wiki.opf-labs.org/display/SPR/Valid+and+well-formed+TIFF%27s+with+scanline+corruption+dataset

Solutions
http://wiki.opf-labs.org/display/SPR/Solving+TIFF+malformation+using+exiftool

Name | Size | Creator | Creation Date | Comment
Screenshot PS4.jpg | 442 kB | Maurice de Rooij | Sep 17, 2012 13:11 | Screenshot Photoshop 4
Screenshot_Irfanview.jpg | 49 kB | Maurice de Rooij | Sep 17, 2012 13:11 | Screenshot Irfanview
Labels: spruce_london, issue, tiff, corruption, scanline, image, jhove, validation, well-formedness, bit_rot
  1. Sep 17, 2012

    Good one! Ahead of the game, as always Maurice!

  2. Sep 17, 2012

    Yes, this one is very nasty!

  3. Sep 18, 2012

    JHOVE doesn't look at raster streams for any format, so it's not surprising that defective TIFFs of that kind aren't caught. Analyzing scan lines is a rather complex task, which is why we skipped it.

  4. Sep 18, 2012

    Here are some quick thoughts.

    TIFF scan lines can be encoded in quite a number of different ways. For a short-term project, it would make more sense to use an existing library, such as LibTiff, rather than implementing all the necessary decodings.

    LibTiff provides three ways to read scan line data: by scanlines, strips, and tiles. Scanlines aren't actually a separate way to store the data, but are provided as a simplified interface and aren't usable in all cases. It probably makes the most sense to use TIFFReadEncodedStrip and TIFFReadTile (or TIFFReadEncodedTile) and check for returned errors. It shouldn't be necessary to do anything more with the data than make sure it's read without reporting an error.

    1. Sep 19, 2012

      Hi Gary, thanks for the explanation. Using LibTiff was our thought as well.