
h3. {color:#000000}The research question{color}
{color:#000000}I have never doubted the JHOVE TIFF module. The JHOVE TIFF module is always right. Everybody says so. That's why nobody uses the myriad alternatives to it, although it's so easy to write a TIFF validator, I could almost do it myself.{color}
{color:#000000}But while my colleague Michelle and I were drafting a paper for the IDCC this February, it dawned on me: "Everybody" has never written about the infallibility of JHOVE in a paper or Blogpost so far, nor run any openly available test that I know of. Besides, the "myriad alternatives" often do not seem easy to use for me on my Windows machine, given my limited experience with command-line tools and batch scripting.{color}
Last fall, I compared the validation tools JHOVE and Bad Peggy and how they both deal with JPEG validation (see [OPF Blogpost|http://openpreservation.org/blog/2016/11/29/jpegvalidation/]). My goal was to analyse whether the JHOVE JPEG module is reliable, as we are basing our preservation decisions on it. In theory, my goal is the same with the examination for this Blogpost, but my initial intention was much darker: I wanted to prove that the JHOVE TIFF module is indeed infallible and that TIFF validation is, as I have always known, easy peasy. As the analysis went on, I had to admit that reality is much more complicated.
The verdict of a validation tool is usually relied on without a second thought, although most validation tools are not free from false negatives and false positives. As the JHOVE validation tool is widespread in the digital preservation community and integrated in out-of-the-box digital preservation software like Rosetta and Preservica, the reliability of JHOVE is especially interesting for the formats we possibly all have in our archives, like TIFF images.
My research question: _Is the JHOVE TIFF module really that good in comparison with other tools?_
And, as a side-effect: _Is TIFF validation really easy peasy?_
h3. TIFF validation tools
First, there seems to be a plethora of tools to test TIFF validity, analyse TIFF tags and even repair common errors. Some are listed in [COPTR|http://coptr.digipres.org/index.php?title=Special%3ASearch&profile=default&search=TIFF+validation&fulltext=Search] (search for "TIFF" and "validation", though some tools like ExifTool do some validation and are not marked as validation tools). Furthermore, the libtiff library offers many programs that can be integrated in other tools (see e. g. [this collection of TIFF-tools|http://www.libtiff.org/tools.html]). Most of them are not out-of-the-box tools with a nice GUI like JHOVE, which can be used by you and me on a Windows machine.
I selected the following tools for my test:
|| || Validation Tool \\ || version \\ || How to use \\ || remark ||
| 1 \\ | JHOVE \\ | 1.14.6 \\ | GUI and java library \\ | |
| 2 \\ | ImageMagick \\ | 7.0.3 \\ | Command-line, batch-script \\ | help for the batch-script via twitter from [David Underdown|https://twitter.com/DavidUnderdown9] and the ImageMagick people \\ |
| 3 \\ | ExifTool \\ | 10.37 \\ | Command-line, batch-script | help for the batch-script from Mario from the [German nestor format identification group|https://wiki.dnb.de/display/NESTOR/AG+Formaterkennung]\\ |
| 4 \\ | DPF Manager | 3.1 \\ | GUI | |
| 5 \\ | checkit_tiff \\ | 0.2.0 | runs on Linux only so far \\ | Andreas, checkit_tiff developer from the SLUB Dresden, has run the test suite for me \\ |
| 6 \\ | LibTIFF \\ | 4.0.7 | runs on Linux only \\ | Heinz from the [German nestor format identification group|https://wiki.dnb.de/display/NESTOR/AG+Formaterkennung] has run the test suite for me |
In summary, I was able to analyse the test suite with six tools. I had some help with a batch-script suitable for my needs for ImageMagick and ExifTool, but at least I could run the test on my Windows machine. For checkit_tiff and LibTIFF, two colleagues helped me out and sent me the findings for me to analyse.
h4. JHOVE
*{_}Validation{_}*: JHOVE is my go-to validator for TIFF files. The findings are intelligible and the expectations for the file to follow the TIFF specification seem reasonable. There are almost [70 known TIFF error and info messages|https://docs.google.com/spreadsheets/d/1zyg4eqH6akoehI10fNhzroaW3CgcwUjkNlvvVZvFDyo/edit#gid=537329205] in the JHOVE module. Most of them carry their meaning within the message, like "_TileLength not defined_", and even a passer-by with a minimum of imagination can see why information about the tile length might come in handy for an image.
*{_}Handling{_}*: As the GUI output never suited me, I began long ago to use JHOVE as a Java library and have my own HTML output, which is very user-friendly. Aside from the GUI output not being handy when dealing with many files, JHOVE is very easy to install and use, as one can just throw (drag & drop) files and folders at it and it will swallow and validate them.
h4. ImageMagick
*{_}Validation{_}*: Just to be fair, ImageMagick is not primarily about file validation, but about displaying, migrating and working on images. I have marked every file as invalid that ImageMagick had at least one error message about, even if the error seems to be a minor one that could be ignored, like an unknown TIFF field or incorrect tag contents. As far as I know there is no list of all possible error messages of ImageMagick. For the corpus of this blogpost, the Google Imagetestsuite, however, ImageMagick produced more than 40 different error messages, [which are listed here|https://docs.google.com/spreadsheets/d/1lZmLVrK3vv2-BUxw7YtTvyo7bC6qP7twHZQzTmwz8eU/edit#gid=0].
*{_}Handling{_}*: If I had known how bad the output is before I started this test, I would have skipped this tool altogether. But at first I thought I would only have ImageMagick next to JHOVE and the DPF manager. ImageMagick is a command-line tool and, as far as I know, the batch-processing only has a txt-output, and this txt-output is really a mess (personal opinion). It's difficult to tell which error information belongs to which file, as some files are not even listed by name in the txt-output. I had to test those one by one, which was time-consuming and boring. But I am pigheaded and had already started to tell everybody I was going to test this tool. Even after I had written some Java to help me with the messy output, some stuff could not be automated, as I could not find a regular pattern for everything. I am very sure that ImageMagick is very useful for batch-processing when converting images etc., but obviously nobody has really thought about validating 166 images at once, or about validation in the first place.
h4. ExifTool
*{_}Validation{_}*: [ExifTool|http://coptr.digipres.org/ExifTool] is not really meant for validation, either. It's for metadata extraction. The information about image errors is just a by-product if the tool runs into any problems while trying to extract metadata. So it's not really fair to treat ExifTool like a validation tool, as it would never complain about an absolutely unreadable TIFF which cannot be opened by any viewer, as long as all the metadata can be extracted. That might be the reason why ExifTool has the highest percentage of presumably valid TIFF files within this test.
*{_}Handling{_}*: It's a command-line tool with quite good possibilities to batch whole folders and output human-readable csv (though the csv can have many, many columns, as images can have a myriad of metadata).
h4. DPF Manager
{color:#000000}{*}{_}Validation:_{*}{color} {color:#000000}The DPF Manager is for TIFF validation only and was built only for that.{color}
{color:#000000}{*}{_}Handling:_{*}{color} {color:#000000}The DPF Manager is very easy to install (though you need Admin-rights to do so, it's not portable like JHOVE is) and extremely easy to use: You can just drag and drop a file or folder on the GUI or, alternatively, select a file or folder. The tool is very fast - 80 TIFFs need less than a minute - and the HTML output is very nice and there is also METS and XML (although personally I would like an additional csv). Furthermore, the TIFF files are sorted by the number of errors. The worst ones come first and when you scroll down, the TIFFs get less and less invalid, with the valid ones at the end. So bad news first\! There are also thumbnails of the TIFFs, so one can see immediately if an image preview is possible or not. I think in terms of usability this tool has easily won the contest.{color}
{color:#000000}As a bonus, each error is referenced to the page and section in the TIFF guide, including the exact quotation that the error refers to.{color}
!DPF.jpg|border=1!
{color:#000000}Having seen this, I tend to regard the DPF manager as the reference for whether a TIFF really is valid with respect to the TIFF specification - but this would need much further testing and research, so I leave it at guessing here. It certainly aims at being the go-to-validator for TIFF-files.{color}
h4. checkit_tiff
*{_}Validation{_}*: [checkit_tiff|https://github.com/SLUB-digitalpreservation/fixit_tiff] is too picky for the needs of the examination for this blogpost, as it validates against baseline TIFF and obviously, the TIFFs in the Google Imagetestsuite are no baseline TIFFs. I still want to present the findings of the tool and include it in this Blogpost, as the reader surely has a different TIFF corpus and might very well want to validate baseline TIFF. The part that checks the Baseline-conformance is checkit-tiff, but I call the tool checkit_tiff in this post, as in daily life we usually call the whole bundle checkit_tiff.
*{_}Handling{_}*: There is not yet a Windows version of the tool and it's a command-line tool. A Windows version is about to be released soon, though.
h4. LibTIFF
*{_}Validation{_}*: There are 695 different error messages in the sample ([listed here|https://docs.google.com/spreadsheets/d/1lZmLVrK3vv2-BUxw7YtTvyo7bC6qP7twHZQzTmwz8eU/edit#gid=967738359]), mostly very similar ones dealing with unknown TIFF tags. Judging from the error messages, LibTIFF seems to check the tags only and omits some general file-structure validation, such as the check for end-of-file tags. It does check the TIFF header for the magic number, though (see "_No TIFF magic number_"). As far as I know, there is only a txt-output (".log"), which in general is readable, but for bulk-analysis not much better than that of ImageMagick.
*{_}Handling:_* Well, you need Linux. At least a Linux emulator.


h3. Test corpus
The test was run on the [Google Imagetestsuite for TIFF|https://code.google.com/archive/p/imagetestsuite/], which has the advantage of being openly available and contains some really bad TIFF files. The files are named after their MD5-checksums (see: [About|https://code.google.com/archive/p/imagetestsuite/wikis/TIFFTestSuite.wiki]). Unfortunately, there is no "last truth" somewhere out there that tells whether a TIFF file is valid or not. I have added the information whether the image is renderable in either Windows Photo Preview, Paint or ImageMagick to the [Findings spreadsheet|https://docs.google.com/spreadsheets/d/1AsJNXEjJlfYau1JmCj7YiEr1WNNlA-hHvXoJ4O3A7nw/edit#gid=245183935]. I know that this is not a watertight solution, as an image can be absolutely invalid and still open in a current viewer. Some of the Google images also look bogus in the viewer: either as if something is missing, or just black or white - and I cannot figure out if this is on purpose (= this just is a picture of some black stuff) or if the image is broken.
Images prefixed with an "m" + Number were modified ("mutation"). Although the intention was not necessarily to add errors to the image, the percentage of valid images for these files is much smaller (see table below).
h4. Examination of 166 TIFFs from the Google Imagetestsuite
|| || JHOVE \\ || ImageMagick \\ || ExifTool \\ || DPF Manager (Baseline) \\ || DPF Manager (Extended TIFF) \\ || checkit_tiff \\ || LibTIFF \\ || Renderable in a viewer \\ ||
| *{_}all 166 files{_}* \\ | | | | | | | | |
| valid \\ | 29 \\ | 18 \\ | 56 \\ | 4 \\ | 15 \\ | 0 \\ | 21 \\ | 83 \\ |
| invalid \\ | 129 \\ | 148 \\ | 109 \\ | 151 \\ | 136 \\ | 131 \\ | 145 \\ | 83 \\ |
| could not be analysed \\ | 8 \\ | 0 \\ | 1 \\ | 11 \\ | 11 \\ | 35 \\ | 0 \\ | |
| % valid \\ | 17,5% \\ | 11% \\ | 34% \\ | 2,4% \\ | 9% \\ | 0% \\ | 13% \\ | 50% \\ |
| *{_}47 original files (not mutilated){_}* \\ | | | | | | | | |
| valid \\ | 27 \\ | 18 \\ | 44 \\ | 3 \\ | 14 \\ | 0 \\ | 21 \\ | 47 \\ |
| invalid \\ | 20 \\ | 29 \\ | 3 \\ | 44 \\ | 33 \\ | 47 \\ | 26 \\ | 0 \\ |
| could not be analysed \\ | 0 \\ | 0 \\ | 0 \\ | 0 \\ | 0 \\ | 0 \\ | 0 \\ | |
| % valid \\ | 57% \\ | 38% \\ | 94% \\ | 6% \\ | 30% \\ | 0% \\ | 44% \\ | 100% \\ |
| *{_}119 mutilated files{_}* \\ | | | | | | | | |
| valid \\ | 2 \\ | 0 \\ | 12 \\ | 0 \\ | 1 \\ | 0 \\ | 0 \\ | 36 \\ |
| invalid \\ | 109 \\ | 119 \\ | 107 \\ | 119 \\ | 118 \\ | 84 \\ | 119 \\ | 83 \\ |
| could not be analysed \\ | 8 \\ | 0 \\ | 1 \\ | 11 \\ | 11 \\ | 35 \\ | 0 \\ | |
| % valid \\ | 1,6% \\ | 0% \\ | 10% \\ | 0% \\ | 1% \\ | 0% \\ | 0% \\ | 30% \\ |
| *{_}83 non-renderable files{_}* \\ | | | | | | | | |
| valid \\ | 2 \\ | 0 \\ | 7 \\ | 0 \\ | 0 \\ | 0 \\ | 0 \\ | 0 \\ |
| invalid \\ | 23 \\ | 83 \\ | 75 \\ | 73 \\ | 73 \\ | 80 \\ | 83 \\ | 83 \\ |
| could not be analysed \\ | 5 \\ | 0 \\ | 0 \\ | 11 \\ | 11 \\ | 3 \\ | 0 \\ | 0 \\ |
| % valid \\ | 2,4% \\ | 0% \\ | 8,4% \\ | 0% \\ | 0% \\ | 0% \\ | 0% \\ | 0% \\ |
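For readers who want to recompute the "% valid" rows, here is a minimal Python sketch. It only redoes the arithmetic for a few columns of the "all 166 files" block; the function name `percent_valid` is made up for illustration, and files that could not be analysed are counted as part of the total.

```python
def percent_valid(valid, invalid, not_analysed=0):
    """Share of valid files among all files of a group, in percent."""
    total = valid + invalid + not_analysed
    if total == 0:
        raise ValueError("empty group")
    return round(100 * valid / total, 1)


# Counts (valid, invalid, could not be analysed) copied from the
# "all 166 files" rows of the table above.
all_files = {
    "JHOVE": (29, 129, 8),
    "ImageMagick": (18, 148, 0),
    "ExifTool": (56, 109, 1),
    "Renderable in a viewer": (83, 83, 0),
}

for tool, counts in all_files.items():
    print(f"{tool}: {percent_valid(*counts)} % valid")
```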
As I do not know of the Holy Grail of TIFF validation, I will leave most of the thinking to the reader here. Just a few observations:
checkit_tiff does not analyse the TIFF files if the magic number is missing.
The validation quota for the mutilated files, and, especially, for the non-renderable files is very high. Of course ImageMagick has no false positives here - as ImageMagick also is a viewer, I consider it to be renderable if ImageMagick can display it. There were some files which could not be opened with Paint but with ImageMagick, though (marked with "ImageMagick can open" in the [Spreadsheet|https://docs.google.com/spreadsheets/d/1AsJNXEjJlfYau1JmCj7YiEr1WNNlA-hHvXoJ4O3A7nw/edit#gid=245183935]).
I am reluctant to state that JHOVE has two false positives here, but obviously all the other tools consider these files invalid (except for ExifTool, though ExifTool mostly considers different files to be valid than JHOVE, see [Spreadsheet|https://docs.google.com/spreadsheets/d/1AsJNXEjJlfYau1JmCj7YiEr1WNNlA-hHvXoJ4O3A7nw/edit#gid=686737214]). Furthermore, no viewer can render the files. I tried to analyse the files with my own TIFF Java tools, but they could not process the files (which also is a sign that something is bogus with the files). As much as I hate it, I have to admit: These are false positives. What else should I call it? I certainly would not want these two files going unnoticed in my archive. Furthermore, the other five tools all have detected that something is wrong with these files.
h4. Premature End-of-File
For 5 files of the test corpus JHOVE throws the "_Premature End-of-File_" error, which usually hints at a fatal error with the file. Often parts of the file are missing; a typical cause is that the file was not completely downloaded/uploaded and the last chunk of the file is not there. JHOVE usually realises this, as it is always checking if the End-of-File-tag is there or not.
Four of the five files ([spreadsheet|https://docs.google.com/spreadsheets/d/1AsJNXEjJlfYau1JmCj7YiEr1WNNlA-hHvXoJ4O3A7nw/edit#gid=722911667]) do look very suspicious. Two cannot be opened, two are black, one looks as if parts of the text were missing ([file screenshot|^EOF_screenshot.jpg]). At least most tools agree that something is wrong with the files. Only the DPF manager considers one of the 5 files to be valid. Looking at the error messages of the DPF manager, there is no hint of a premature EOF or any mention of the end-of-file or an EOF-tag at all.
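One simple way a file can end too early is that the first image file directory (IFD) announced in the header does not fit into the file any more. The following Python sketch checks just that. It is an illustration under my own assumptions, not what JHOVE or any other tool in this test does internally; the name `first_ifd_fits` is hypothetical and only classic TIFF (version 42) is handled.

```python
import struct


def first_ifd_fits(data):
    """Return True if the first IFD lies completely inside the file.

    Only classic TIFF (version 42) is handled; everything else,
    including files without a TIFF byte-order mark, returns False.
    """
    if len(data) < 8:
        return False
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        return False
    version, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if version != 42:
        return False
    if ifd_offset + 2 > len(data):
        return False
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    # 2 bytes entry count + 12 bytes per entry + 4 bytes offset of the next IFD
    ifd_end = ifd_offset + 2 + 12 * entry_count + 4
    return ifd_end <= len(data)
```

A file that passes this check can of course still be truncated further down, for example in the image data itself.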
h6. ImageMagick Error: "unexpected end-of-file"
A very similar error occurred with five other files of the corpus ([listed here|https://docs.google.com/spreadsheets/d/1AsJNXEjJlfYau1JmCj7YiEr1WNNlA-hHvXoJ4O3A7nw/edit#gid=722911667]). JHOVE reports other errors for these files, but at least again all validation tools agree on the invalidity of the files. They certainly look bogus and one of them cannot even be opened. The DPF manager could not even analyse these five files; they all were omitted in the analysis.
h4. No TIFF magic number
If the file only purports to be a TIFF file, e. g. by the file extension, but the magic number cannot be found, all tools agree on the error ([spreadsheet|https://docs.google.com/spreadsheets/d/1AsJNXEjJlfYau1JmCj7YiEr1WNNlA-hHvXoJ4O3A7nw/edit#gid=594822972]). None of the three files reporting the error could be opened and JHOVE, ImageMagick, DPF Manager and checkit_tiff (by not handling the file) agree that the magic number is missing and that it therefore cannot be a TIFF file or rather the TIFF signature is incorrect.
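The magic number itself is easy to check by hand: a classic TIFF starts with the byte order mark "II" (little-endian) or "MM" (big-endian), followed by the number 42 in that byte order. The following Python sketch shows the idea; the function name `tiff_signature` is made up for illustration, and BigTIFF (version 43) is deliberately not covered.

```python
def tiff_signature(header):
    """Classify the first four bytes of a file as a TIFF signature or not."""
    if header[:4] == b"II*\x00":   # "II" + 42 as little-endian 16-bit value
        return "little-endian TIFF"
    if header[:4] == b"MM\x00*":   # "MM" + 42 as big-endian 16-bit value
        return "big-endian TIFF"
    return "no TIFF magic number"


def signature_of_file(path):
    """Read the first four bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return tiff_signature(handle.read(4))
```

So a file that merely carries the extension ".tif" but starts with other bytes would end up in the "no TIFF magic number" branch.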
h4. Commonalities in terms of errors
Sometimes, the tools agree on an error and even use very similar words to describe the error. One example is shown in the table below.
|| ImageMagick Error for file _0c84d07e1b22b76f24cccc70d8788e4a_ || JHOVE TIFF Module Error for file _0c84d07e1b22b76f24cccc70d8788e4a_ ||
| Unknown field with tag 37680 (0x9330) encountered | Unknown TIFF IFD tag: 37680 |
| Unknown field with tag 37677 (0x932d) encountered. | Unknown TIFF IFD tag: 37677 |
| Unknown field with tag 37678 (0x932e) encountered. | Unknown TIFF IFD tag: 37678 |
Obviously, both tools check for unknown TIFF tags and report them if they encounter some. ImageMagick also gives the hex value of the field. It does not matter which of these two tools one uses, it will always report unknown tags. At least both tools have done so with the Google Imagetestsuite. In theory, there might be a TIFF file out there for which one of the tools neglects to report an unknown tag, although this is highly unlikely given the structure of a TIFF.
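Why is it so unlikely that a tool misses an unknown tag? Because the tag numbers sit in a plain list, the image file directory (IFD), and a tool only has to walk through it and compare each number with the tags it knows. The following Python sketch does this for the first IFD of a classic TIFF. It is my own illustration, not the code of JHOVE or ImageMagick; `unknown_tags` is a hypothetical name and `KNOWN_TAGS` is a deliberately tiny subset of the tags defined in the TIFF specification.

```python
import struct

# Deliberately small, illustrative subset of baseline TIFF tags.
KNOWN_TAGS = {
    256,  # ImageWidth
    257,  # ImageLength
    258,  # BitsPerSample
    259,  # Compression
    262,  # PhotometricInterpretation
    273,  # StripOffsets
    277,  # SamplesPerPixel
    278,  # RowsPerStrip
    279,  # StripByteCounts
    282,  # XResolution
    283,  # YResolution
    296,  # ResolutionUnit
    306,  # DateTime
}


def unknown_tags(data, known=KNOWN_TAGS):
    """Return the tag numbers of the first IFD that are not in `known`.

    Raises ValueError if the header is not a classic TIFF header and
    struct.error if the file is too short to hold the IFD.
    """
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("no TIFF magic number")
    version, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if version != 42:
        raise ValueError("not a classic TIFF")
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    found = []
    for i in range(count):
        start = ifd_offset + 2 + 12 * i   # each IFD entry is 12 bytes long
        (tag,) = struct.unpack(order + "H", data[start:start + 2])
        if tag not in known:
            found.append(tag)
    return found
```

For a file like the one in the table above, the returned list would contain numbers such as 37680, which `hex(37680)` turns into the 0x9330 that ImageMagick prints.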
h4. Differences in terms of errors
h6. {color:#000000}JHOVE: "Invalid DateTime separator"{color}
{color:#000000}The JHOVE Module reports correctly if the DateTime is invalid and marks the file as "well-formed, but not valid". It has done so with the "invalid_date.tiff" from the{color} [fixit / checkit_tiff testfiles|https://github.com/SLUB-digitalpreservation/fixit_tiff/tree/master/examples]{color:#000000}. ImageMagick, however, completely neglects to realise that there is something wrong with the DateTime Tag in this file and the error goes unnoticed. (ImageMagick does report an error, which seems to be unconnected to the DateTime, however, as it is about the "Photoshop"-tag.) The DPF manager also reports "Incorrect format for DateTime" and quotes the TIFF specification, so this is a false positive for ImageMagick.{color}
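The TIFF specification asks for the DateTime in the form "YYYY:MM:DD HH:MM:SS", with colons in the date and exactly one space before the time. A check for this can be sketched in a few lines of Python. This is an illustration only, not the rule set of JHOVE or the DPF manager; the function name `is_valid_tiff_datetime` is made up, and the check for the terminating NUL byte of the stored value is left out.

```python
import re
from datetime import datetime

# Shape required by the TIFF specification: "YYYY:MM:DD HH:MM:SS"
_SHAPE = re.compile(r"\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}")


def is_valid_tiff_datetime(value):
    """Return True if `value` has the TIFF DateTime shape and is a real date."""
    if not _SHAPE.fullmatch(value):
        return False
    try:
        # Rejects impossible values such as month 13 or hour 25.
        datetime.strptime(value, "%Y:%m:%d %H:%M:%S")
    except ValueError:
        return False
    return True


print(is_valid_tiff_datetime("2016:11:29 10:15:00"))
print(is_valid_tiff_datetime("2016-11-29 10:15:00"))
```

A date written with hyphens, as in the second call, fails the shape check because the separators are wrong.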
h6. {color:#000000}JHOVE: "Value offset not word-aligned"{color}
{color:#000000}The JHOVE module throws this error for the "minimal_valid"-TIFF in the checkit_tiff Examples and marks the TIFF as "not well-formed". ImageMagick does not report any errors for this file. The DPF Manager, however, finds three errors in the file, two related to "bad word alignment in offset" (which sounds pretty much like the JHOVE error) and one inconsistency about the tag planar configuration, which does not sound that fatal{color} {color:#000000}("{color}{color:#000000}{_}PlanarConfiguration is irrelevant if SamplesPerPixel is 1, and need not be included._{color}{color:#000000}").{color}
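"Word-aligned" means that a value which does not fit into the four bytes of its IFD entry has to be stored at an even offset in the file. The following Python sketch lists the entries of the first IFD that break this rule. It is my own simplified illustration, not the check JHOVE or the DPF manager actually runs; `misaligned_entries` is a hypothetical name and only classic TIFF (version 42) is handled.

```python
import struct

# Size in bytes of one value for each TIFF field type (1 = BYTE ... 12 = DOUBLE).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
              7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def misaligned_entries(data):
    """Return (tag, offset) pairs of the first IFD whose value offset is odd."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("no TIFF magic number")
    version, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if version != 42:
        raise ValueError("not a classic TIFF")
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    result = []
    for i in range(count):
        start = ifd_offset + 2 + 12 * i
        tag, field_type, n, value = struct.unpack(
            order + "HHII", data[start:start + 12])
        size = TYPE_SIZES.get(field_type)
        if size is None:
            continue  # unknown field type, not judged in this sketch
        # Only values larger than 4 bytes are stored elsewhere in the file;
        # for those, the last field of the entry is an offset.
        if size * n > 4 and value % 2 != 0:
            result.append((tag, value))
    return result
```

An entry such as XResolution (a RATIONAL of 8 bytes) stored at offset 27 would be reported, while the same value at offset 34 would pass.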
h4. {color:#000000}Fun fact{color}
{color:#000000}Of the 166 files, there are only four for which all the tools (except checkit_tiff, which considers them all to be invalid) agree on validity ({color}{color:#000000}[spreadsheet|https://docs.google.com/spreadsheets/d/1AsJNXEjJlfYau1JmCj7YiEr1WNNlA-hHvXoJ4O3A7nw/edit#gid=1005295245]{color}{color:#000000}). If one decided on a file validity policy which only allows files in an archive about which no tool has any complaints, it would be a very empty archive indeed. It might not even be possible to satisfy them all with real-life-images from different producers.{color}
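Such an "all tools must agree" policy is easy to express once the verdicts are collected per file. The following Python sketch shows the idea. The file names and verdicts are hypothetical and NOT taken from the findings spreadsheet, and `unanimously_valid` is a made-up function name.

```python
def unanimously_valid(verdicts, ignore=()):
    """Return the files that every tool (except those in `ignore`) calls valid."""
    agreed = []
    for filename, per_tool in verdicts.items():
        relevant = [v for tool, v in per_tool.items() if tool not in ignore]
        if relevant and all(v == "valid" for v in relevant):
            agreed.append(filename)
    return sorted(agreed)


# Hypothetical verdicts for illustration only.
sample = {
    "file_a.tif": {"JHOVE": "valid", "ExifTool": "valid", "checkit_tiff": "invalid"},
    "file_b.tif": {"JHOVE": "valid", "ExifTool": "invalid", "checkit_tiff": "invalid"},
}

print(unanimously_valid(sample, ignore={"checkit_tiff"}))
print(unanimously_valid(sample))
```

With checkit_tiff ignored, only the first file passes; with all tools counted, the list is empty.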
h4. Summary and conclusion
Although the tools agree on the "real bad" TIFF files, TIFF validation does not seem to be all that easy-peasy. It has been much easier - at least with the corpus analysed - to determine what is a false positive and what is a false negative with the JPEGs [in my last OPF Blogpost|http://openpreservation.org/blog/2016/11/29/jpegvalidation/]. The JHOVE TIFF module still seems to be a decent choice and I have not found any real gap like I did with the JHOVE JPEG module the other day, although the two false positives leave me nervous.
Findings of the DPF manager seem to be trustworthy to me, as the TIFF specification can be referenced for each error found. Nevertheless, most of the tools - if not all - seem to be too paranoid. Assuming all non-mutilated TIFFs are valid (they are all renderable in a viewer), ExifTool is the only tool that comes close, considering 94% of them to be valid (or "error-free", as in the case of ExifTool). The second-best, JHOVE, still considers almost half of them to be invalid in some way. The DPF manager considers only 30% of them to be valid (Extended TIFF) and is even able to prove every bit of it.
Back to my research questions:
_Is the JHOVE TIFF module really that good in comparison with other tools?_
Well. It's pretty user-friendly, the error messages are intelligible (but most TIFF errors are, with every tool tested), the output can be dealt with, but it's not as user-friendly as the DPF manager, which also has a nicer output. And, the DPF manager has the reference to the specification all the time, which really feels good when talking to my boss about the quality of our TIFF files. Look, the TIFF bible says it's ok / not ok. Who would argue?
Nevertheless, it was the only (real validation) tool with false positives for perfectly invalid and un-renderable files, which would be worth a second look in one of my next posts.
And, as a side-effect: _Is TIFF validation really easy peasy?_
It does not seem so, as the validators agree on very little indeed.
So, how to act?
I might just stick to JHOVE in our production digital preservation environment, but I will at least add the DPF manager to our Pre-Ingest workflows, especially in our digitisation centre, to be sure we stick to the TIFF specification at least with TIFF files we generate ourselves. When receiving files from outsiders, I will be more tolerant, as I always am, but might add a preservation planning workflow to repair the TIFFs, if possible. But that will be the topic of another post at another time.