Title
Valid and well-formed TIFF's with scanline corruption
Detailed description
At NANETH we sometimes encounter TIFFs that render incorrectly or do not render at all, even though JHOVE reports them as 'valid' and 'well-formed'.
Photoshop shows a corrupted 'double image', while IrfanView shows only some random pixels on top of a black image. The ImageMagick viewer cannot render the image, and ImageMagick 'identify' reports that there is not enough data available in scanline 'x'. The SPRUCE "Black and White Pixel Detector", which uses the Python Imaging Library (PIL), does not report any corruption (i.e. black or white pixels) either.
This is a major issue because validation tools mark the images as 'valid' and 'well-formed'. A solution to this issue would detect this corruption. Ideally the solution is a Python CLI application that can also be used in an automated workflow. Since the images are quite big (around 10 MB), it would need a clever algorithm and parallel or multi-process execution.
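As a rough illustration of the automated, parallel workflow described above, the following Python sketch runs an external checker over many files at once. It assumes that ImageMagick's `identify` is on the PATH and that it signals the scanline problem through a non-zero exit code or a message on stderr; that behaviour has not been verified against the dataset. The names `check_one`, `scan` and `main` are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Assumption: `identify` is installed and complains on stderr (or exits
# non-zero) when scanline data is missing. Swap in any other checker.
DEFAULT_COMMAND = ["identify"]


def check_one(path, command=DEFAULT_COMMAND):
    """Run the external checker on one file; return (path, ok, message)."""
    try:
        proc = subprocess.run(command + [path], capture_output=True,
                              text=True, errors="replace")
    except OSError as exc:  # checker missing or not executable
        return path, False, str(exc)
    message = proc.stderr.strip()
    # Treat any complaint on stderr as suspect, even with exit code 0.
    ok = proc.returncode == 0 and not message
    return path, ok, message


def scan(paths, command=DEFAULT_COMMAND, workers=4):
    """Check many files concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: check_one(p, command), paths))


def main(argv):
    """Print one line per file; return 1 if any file looks suspect."""
    results = scan(argv)
    for path, ok, message in results:
        status = "OK" if ok else "SUSPECT"
        print(f"{status}\t{path}\t{message}")
    return 0 if all(ok for _, ok, _ in results) else 1
```

Threads are used here instead of a process pool because each check already runs in its own child process, so the files are examined in parallel anyway. To use the sketch from a shell, a script would end with `raise SystemExit(main(sys.argv[1:]))`.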
Issue champion
Maurice de Rooij
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets.
Possible Solution approaches
http://www.remotesensing.org/libtiff/libtiff.html#scanlines
Context
This issue potentially impacts all TIFF images in our collection. Checking that a file is valid and well-formed does not seem to be enough to prove that it is not corrupted. Ideally we would need a non-visual renderer in our workflows which covers and respects each aspect of the TIFF format specification and reports back any error.
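A full non-visual renderer is a large undertaking, but one narrow part of it can be sketched with the Python standard library: comparing the strip table with the amount of pixel data that is actually in the file. The sketch below assumes a classic (non-BigTIFF) file, looks only at the first IFD, and handles only uncompressed, strip-based images with chunky sample layout. The function names are hypothetical, and whether this check catches the corruption seen in our files is untested.

```python
import struct

# struct codes for the TIFF field types needed here:
# 1 = BYTE, 2 = ASCII, 3 = SHORT, 4 = LONG
TYPE_CODES = {1: "B", 2: "B", 3: "H", 4: "I"}


def read_first_ifd(data):
    """Parse the first IFD of a classic TIFF into {tag: tuple_of_ints}."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or struct.unpack(order + "H", data[2:4])[0] != 42:
        raise ValueError("not a classic TIFF")
    (ifd,) = struct.unpack(order + "I", data[4:8])
    (n_entries,) = struct.unpack(order + "H", data[ifd:ifd + 2])
    tags = {}
    for i in range(n_entries):
        start = ifd + 2 + 12 * i
        entry = data[start:start + 12]
        tag, typ, count = struct.unpack(order + "HHI", entry[:8])
        if typ not in TYPE_CODES:
            continue  # RATIONAL and others are not needed for this check
        fmt = order + str(count) + TYPE_CODES[typ]
        size = struct.calcsize(fmt)
        if size <= 4:  # value is stored inline in the entry
            raw = entry[8:8 + size]
        else:          # entry holds an offset to the values
            (offset,) = struct.unpack(order + "I", entry[8:12])
            raw = data[offset:offset + size]
        tags[tag] = struct.unpack(fmt, raw)
    return tags


def strip_problems(data):
    """Return a list of human-readable problems with the strip data."""
    tags = read_first_ifd(data)
    offsets = tags.get(273, ())  # StripOffsets
    counts = tags.get(279, ())   # StripByteCounts
    if not offsets or len(offsets) != len(counts):
        return ["strip tables missing or of different length"]
    problems = []
    for i, (offset, count) in enumerate(zip(offsets, counts)):
        if offset + count > len(data):
            problems.append(f"strip {i} runs past the end of the file")
    # Compression 1 = uncompressed: the expected size can be computed.
    if tags.get(259, (1,))[0] == 1 and 256 in tags and 257 in tags:
        bits_per_pixel = sum(tags.get(258, (1,)))  # BitsPerSample
        row_bytes = (tags[256][0] * bits_per_pixel + 7) // 8
        expected = row_bytes * tags[257][0]
        if sum(counts) < expected:
            missing = expected - sum(counts)
            problems.append(f"pixel data is {missing} bytes short")
    return problems
```

Calling `strip_problems(open(path, "rb").read())` returns an empty list when the strip table is consistent with the file. Compressed or tiled images still need a real decoder.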
Lessons Learned
Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Datasets
http://wiki.opf-labs.org/display/SPR/Valid+and+well-formed+TIFF%27s+with+scanline+corruption+dataset
Solutions
http://wiki.opf-labs.org/display/SPR/Solving+TIFF+malformation+using+exiftool
5 Comments
Sep 17, 2012
Paul Wheatley
Good one! Ahead of the game, as always Maurice!
Sep 17, 2012
Maurice de Rooij
Yes, this one is very nasty!
Sep 18, 2012
Gary McGath
JHOVE doesn't look at raster streams for any format, so it's not surprising that defective TIFFs of that kind aren't caught. Analyzing scan lines is a rather complex task, which is why we skipped it.
Sep 18, 2012
Gary McGath
Here are some quick thoughts.
TIFF scan lines can be encoded in quite a number of different ways. For a short-term project, it would make more sense to use an existing library, such as LibTiff, rather than implementing all the necessary decodings.
LibTiff provides three ways to read scan line data: by scanlines, strips, and tiles. Scanlines aren't actually a separate way to store the data, but are provided as a simplified interface and aren't usable in all cases. It probably makes the most sense to use TIFFReadStrip and TIFFReadTile (or perhaps the "encoded" version) and check for returned errors. It shouldn't be necessary to do anything more with the data than make sure it's read without reporting an error.
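The approach above (read every strip or tile and only look at the return value) could be sketched from Python with `ctypes`. This is a minimal sketch, assuming a LibTiff 4.x shared library that `ctypes.util.find_library` can locate, and assuming that a decode failure is reported as a return value of -1. It uses the "encoded" read functions, checks only the first image in the file, and the helper names `load_libtiff`, `failed_chunks` and `check_tiff` are hypothetical.

```python
import ctypes
import ctypes.util
import os


def load_libtiff():
    """Load the system LibTiff and declare the signatures used below."""
    name = ctypes.util.find_library("tiff")
    if name is None:
        raise RuntimeError("LibTiff shared library not found")
    lib = ctypes.CDLL(name)
    lib.TIFFOpen.restype = ctypes.c_void_p
    lib.TIFFOpen.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
    lib.TIFFClose.restype = None
    lib.TIFFClose.argtypes = [ctypes.c_void_p]
    lib.TIFFIsTiled.restype = ctypes.c_int
    lib.TIFFIsTiled.argtypes = [ctypes.c_void_p]
    for fn in (lib.TIFFNumberOfStrips, lib.TIFFNumberOfTiles):
        fn.restype = ctypes.c_uint32
        fn.argtypes = [ctypes.c_void_p]
    for fn in (lib.TIFFStripSize, lib.TIFFTileSize):
        fn.restype = ctypes.c_ssize_t  # tmsize_t in LibTiff 4.x (assumed)
        fn.argtypes = [ctypes.c_void_p]
    for fn in (lib.TIFFReadEncodedStrip, lib.TIFFReadEncodedTile):
        fn.restype = ctypes.c_ssize_t
        fn.argtypes = [ctypes.c_void_p, ctypes.c_uint32,
                       ctypes.c_void_p, ctypes.c_ssize_t]
    return lib


def failed_chunks(lib, tif):
    """Return the indexes of strips or tiles that could not be decoded."""
    if lib.TIFFIsTiled(tif):
        count = lib.TIFFNumberOfTiles(tif)
        size = lib.TIFFTileSize(tif)
        read = lib.TIFFReadEncodedTile
    else:
        count = lib.TIFFNumberOfStrips(tif)
        size = lib.TIFFStripSize(tif)
        read = lib.TIFFReadEncodedStrip
    buf = ctypes.create_string_buffer(max(size, 1))
    # The decoded data is discarded; only the error return matters.
    return [i for i in range(count) if read(tif, i, buf, size) == -1]


def check_tiff(path):
    """Return failed chunk indexes, or None if the file cannot be opened."""
    lib = load_libtiff()
    tif = lib.TIFFOpen(os.fsencode(path), b"r")
    if not tif:
        return None
    try:
        return failed_chunks(lib, tif)
    finally:
        lib.TIFFClose(tif)
```

An empty list from `check_tiff` means every strip or tile was read without an error. LibTiff also prints warnings and errors to stderr by default, which may carry the scanline number.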
Sep 19, 2012
Maurice de Rooij
Hi Gary, thanks for the explanation. Using LibTiff was our thought as well.