Solving TIFF malformation using exiftool
h5. Detailed description
The [issue page|http://wiki.opf-labs.org/display/SPR/Valid+and+well-formed+TIFF%27s+with+scanline+corruption] describes the problem as, essentially, TIFF files that are unusable despite being "validated" by tools like JHOVE.
h5. Solution Champion
_[aamato|~aamato]_
h5. Corresponding Issue(s)
[Relevant "issues" page.|http://wiki.opf-labs.org/display/SPR/Valid+and+well-formed+TIFF%27s+with+scanline+corruption]
h5. TL;DR
* *Detect*:
{code}
exiftool -n -if '$FileSize * 8 < $ImageWidth * $ImageHeight * $BitsPerSample' \
-p '$filename is TOO SMALL - $FileSize - ($ImageWidth*$ImageHeight*$BitsPerSample)' *.tif
{code}
* *Output*:
{code}
NL-HaNA_2.24.01.09_0_901-2649.tif is TOO SMALL - 9952834 - (3454*2877*16)
NL-HaNA_2.24.01.09_0_901-2809.tif is TOO SMALL - 9952834 - (3454*2877*16)
NL-HaNA_2.24.01.09_0_901-3431.tif is TOO SMALL - 9809944 - (2859*3425*16)
NL-HaNA_2.24.01.09_0_901-4419.tif is TOO SMALL - 9807680 - (3425*2859*16)
NL-HaNA_2.24.01.09_0_901-4451.tif is TOO SMALL - 9807680 - (3425*2859*16)
NL-HaNA_2.24.01.09_0_901-5197.tif is TOO SMALL - 9809944 - (2859*3425*16)
{code}
* *Fix*:
{code}
exiftool -BitsPerSample=8 foobar.tif
{code}
h5. Solution details
A lot of time was spent investigating the file structure of the images that could not be opened, which ultimately proved to be a dead end.
Eventually, we realised that the images claimed to be 16-bit greyscale but were actually *8*\-bit greyscale. (In retrospect, this should have been more obvious from the rare error messages the tools gave us.)
This meant that detection/correction suddenly became a lot easier: check for a discrepancy between the size implied by the image dimensions and bit depth, and the actual file size. (See below.) Similarly, because the pixel data itself was correct, we can fix the images by setting the correct value for the "BitsPerSample" tag.
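To make the arithmetic concrete, here is a minimal Python sketch of the same comparison, using the figures from the first file in the TL;DR output. The function name {{is_too_small}} is hypothetical, and the check assumes single-channel, uncompressed pixel data (an assumption on my part; a compressed TIFF could legitimately be smaller than this estimate).

```python
def is_too_small(file_size_bytes: int, width: int, height: int, bits_per_sample: int) -> bool:
    """Return True when the file cannot hold width*height samples at the claimed depth.

    Assumes a single-channel, uncompressed image (illustrative assumption).
    """
    expected_bits = width * height * bits_per_sample
    return file_size_bytes * 8 < expected_bits


# Figures taken from the first line of the detection output in the TL;DR.
size, width, height = 9952834, 3454, 2877
claimed_16 = is_too_small(size, width, height, 16)  # depth the header claims
actual_8 = is_too_small(size, width, height, 8)     # depth the pixel data really has
print(claimed_16, actual_8)
```

At 16 bits the pixel data alone would need 19,874,316 bytes, roughly twice the 9,952,834 bytes on disk. At 8 bits it needs 9,937,158 bytes, which fits with room left over for the headers.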
We used the {{exiftool}} tool to do the detection/correction, because it supports so many options on the command line. In this case, we can do the maths to check "_expected_ file size" versus "_ACTUAL_ file size"; and we can also re-write the header-value we want to change.
*Caveats*: This technique, as described, will only work with single-channel (i.e., greyscale) images. {{exiftool}} supports a limited (but impressive) amount of command-line manipulation. Unfortunately, for colour images, I couldn't figure out how to get it to do the maths on "number of bits per pixel", since the value is represented (textually) as the number of bits _per channel_ (e.g., "8 8 8" vs. "24"). You could still apply the same technique and continue to use {{exiftool}}, but you'd probably use it only for metadata _extraction_, and wrap it in a shell/python/whatever script to do the analysis.
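One way such a wrapper might look, as a rough Python sketch rather than something we actually ran: {{exiftool}} is used only to extract metadata as JSON ({{-j}}) with numeric values ({{-n}}), and the per-channel depths are summed in Python. All function names here are hypothetical, the check still assumes uncompressed pixel data, and it expects every file to report all four tags.

```python
import json
import subprocess


def bits_per_pixel(bits_per_sample) -> int:
    """Sum the per-channel depths: "8 8 8" gives 24, a plain 16 gives 16."""
    return sum(int(part) for part in str(bits_per_sample).split())


def record_is_too_small(record: dict) -> bool:
    """Same comparison as the exiftool one-liner, but per pixel instead of per sample."""
    expected_bits = (
        record["ImageWidth"]
        * record["ImageHeight"]
        * bits_per_pixel(record["BitsPerSample"])
    )
    return record["FileSize"] * 8 < expected_bits


def read_metadata(paths):
    """Run exiftool (must be on PATH) and parse its JSON output, one dict per file."""
    cmd = ["exiftool", "-n", "-j", "-FileSize", "-ImageWidth",
           "-ImageHeight", "-BitsPerSample", *paths]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def find_suspect_files(paths):
    """Return the names of files whose size cannot hold the claimed pixel data."""
    return [r["SourceFile"] for r in read_metadata(paths) if record_is_too_small(r)]
```

Calling {{find_suspect_files}} with a list of TIFF paths would then return the files worth a closer look. Keeping the comparison in {{record_is_too_small}} separate from the {{exiftool}} call means the logic can be checked without any image files to hand.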
h5. Important questions and takeaways (see slides below)
* What does "valid" mean for a file?
** Well-formed?
** Verified externally by a tool?
** Matching a spec?
** Internally consistent?
* What do we learn from this?
** Don't always assume your vendors/digitisers are doing the job right.
** Don't always assume that "successful validation" is meaningful. (Also: learn the limitations of your tools.)
** The only thing better than double-checking is triple-checking.
** KNOW WHAT YOU ARE "PRESERVING"\!
** (and: exiftool is pretty awesome)
h5. [Tool Registry Link|http://wiki.opf-labs.org/display/TR/Home]
[ExifTool|http://wiki.opf-labs.org/display/TR/ExifTool]
h5. Evaluation
[~techmaurice]
We are going to incorporate this check in our QA process and request our vendors to do so as well. We will notify JHOVE developers as well to request them to incorporate this check.
h5. Slides from final day (images/PPT)
[PowerPoint Slides|^SPRUCE - TIFFs.ppt]
!Slide1.PNG|border=1!
h3. This was the starting point:
!Slide2.PNG|border=1!
h3. This was where we wanted to be:
!Slide3.PNG|border=1!
h3. So what did we do to get there?
!Slide5.PNG|border=1!
h3. We moved one bit\!
(I'd like to suggest this for "least work done to achieve a SPRUCE mashup solution" :-) )


!Slide6.PNG|border=1!
h3. Raising questions about "valid":
!Slide7.PNG|border=1!
h3. So how can we not have to do this again? We can use {{exiftool}} to evaluate the relevant properties of the TIFF images, and flag up an inconsistency:
!Slide8.PNG|border=1!
h3. And exiftool also lets us fix the problem:
!Slide9.PNG|border=1!
h3. What do we learn?
!Slide10.PNG|border=1!
h3. Also: Consequences of (even single-)bit-rot can be subtle and severe\!