Draft JHOVE TIFF module vs. ImageMagick

h3. Error detection of JPEG files with JHOVE and Bad Peggy - so who's the real Sherlock Holmes here?

This test was done with the publicly available [Imagetestsuite from Google|https://code.google.com/archive/p/imagetestsuite/]. Furthermore, I included pictures from events like Christmas parties and outdoor events in my library from the last 6 years, pictures contributed by friends and colleagues, some of my own pictures, and even some memes from public fun Facebook pages like "useless facts".

In one case, I even opened a JPEG in an editor to remove some bytes to test if the tools would realise that something was wrong, hoping I would get error messages that were still missing from the list (which, by the way, worked out).

All in all, almost 3000 JPEG files were tested. In the following, I will focus on the files that either the JHOVE JPEG module or Bad Peggy - or both - found objectionable in one way or another.

In general, the JHOVE JPEG module knows eleven different error messages, whereas Bad Peggy can distinguish at least 30 ([source code of KOST-Val|https://github.com/KOST-CECO/KOST-Val/blob/master/KOST-Val/src/main/java/ch/kostceco/tools/kostval/validation/modulejpeg/impl/ValidationAvalidationJpegModuleImpl.java#L536], which uses Bad Peggy to validate JPEG files).

For five of the error messages, no examples could be found, so I won't look at these errors in this blog post. For two further error messages, the examples in the sample were so scarce that I cannot explain them properly yet. If anybody out there has examples for these errors, I will happily extend this post and include the findings.

Let's take a closer look at the different errors.

h5. Unexpected end of file
| image185 \\ | corrupt data: premature end of data segment. | color problems, picture seems to have three parts \\ |
| image183 \\ | corrupt data: 19846 extraneous bytes before marker 0xd9. | color problems, picture seems to have two parts \\ |
As a digital archivist, I would want to know about these errors while ingesting data. I fully agree with Bad Peggy - this data is indeed corrupt. I consider this a false positive finding of the JHOVE JPEG module: the JPEG has serious problems, but JHOVE does not detect them. In this case, Bad Peggy is the better Sherlock Holmes.
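
To get a first impression of whether the end of a file is intact, one can look for the EOI marker "FF D9" by hand. The following is only a minimal sketch and not how JHOVE or Bad Peggy work internally; the function name {{eoi_status}} and its return strings are made up for illustration. Note that the byte pair "FF D9" can also occur inside an embedded thumbnail, so a hit does not prove that the main image is complete.

```python
EOI = b"\xff\xd9"  # end-of-image marker


def eoi_status(data: bytes) -> str:
    """Describe where (if at all) the EOI byte pair occurs in the data."""
    if data.endswith(EOI):
        return "ends with EOI"
    if EOI in data:
        # Could be trailing bytes after the image, or an EOI that only
        # belongs to an embedded thumbnail.
        return "EOI present, but followed by other bytes"
    return "no EOI found"
```

A file would be read with {{open(path, "rb").read()}} and passed to the function.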

{color:#ff0000}{*}TODO: Look at a few examples and show how it looks if the EOI segment is missing{*}{color}
h5. Invalid JPEG header

Easy and straightforward: Both tools check the JPEG header and throw an error if there is no correct JPEG header. This is extremely useful, as tools usually cannot open files with a missing JPEG header. In most cases, the file is unreadable for good - or it's not even a JPEG in the first place.
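
The most basic form of such a header check fits into a few lines. This is a sketch under the assumption that "correct JPEG header" at least means that the file starts with the SOI marker "FF D8"; both tools may well check more than that, and the function name is hypothetical.

```python
SOI = b"\xff\xd8"  # start-of-image marker


def has_jpeg_header(data: bytes) -> bool:
    """Return True if the data starts with the SOI marker FF D8."""
    return data[:2] == SOI
```

A GIF or a text file renamed to ".jpg" fails this test immediately, which matches the "not even a JPEG in the first place" case.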

h5. JFIF APP0 marker not at beginning of file
* EOI-segment (EOI: end of image): "FF D9"

If this error is thrown, the APP0-segment does not follow directly after the SOI-segment. Viewing a JPEG file which throws this error in a hex editor shows that, indeed, the correct APP0-segment cannot be found at the beginning.
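
Without a hex editor at hand, the first bytes can also be printed with a small script. The sketch below assumes that "directly after the SOI-segment" means the APP0 marker "FF E0" sits at byte offset 2; the helper names are made up for illustration, and this is not the code JHOVE uses.

```python
def first_bytes_hex(data: bytes, count: int = 12) -> str:
    """Return the first `count` bytes as space-separated hex pairs."""
    return data[:count].hex(" ")  # the separator argument needs Python 3.8+


def app0_follows_soi(data: bytes) -> bool:
    """Return True if SOI (FF D8) is directly followed by APP0 (FF E0)."""
    return data[:2] == b"\xff\xd8" and data[2:4] == b"\xff\xe0"
```

For a regular JFIF file, the hex output starts with "ff d8 ff e0", followed by two length bytes and the characters "JFIF" ("4a 46 49 46").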

!startofJPEGfile.jpg|border=1!
h5. Expected marker byte 255, got xxx

This error occurs several times within the sample and gives a plethora of marker bytes which have been used instead of 255. So far, none of the affected JPEGs have shown any problems, and Bad Peggy ignores the error altogether. {color:#ff0000}{*}TODO: Check the standard to see what exactly it wants here and why.*{color}


h5. File does not begin with SPIFF, Exif or JFIF segment

A JPEG file usually uses the graphic format JPEG File Interchange Format (JFIF), but can also use Exif or SPIFF - but obviously has to start with one of these three segments and no other. Bad Peggy marks these files as invalid as well, but the error message is quite tight-lipped: "ype." - which was translated by the [KOST|http://kost-ceco.ch/cms/index.php?kost_val_de] (Switzerland) as something like "This JPEG contains characteristics that are not supported" - which does not really enlighten me more. Looking at an affected file in a hex editor has not brought any clues here, as the tags at the beginning look quite okay. At least both tools state that something is wrong here.
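
To see which of the three segments a file starts with, the marker code and the identifier string after SOI can be compared against the known signatures. This is a rough sketch, not the logic of either tool: it assumes the common layout where JFIF uses APP0 ("FF E0") with the identifier "JFIF", Exif uses APP1 ("FF E1") with "Exif", and SPIFF uses APP8 ("FF E8") with "SPIFF". The names {{SIGNATURES}} and {{first_segment_type}} are made up for illustration.

```python
# marker code and identifier expected at the start of the first segment
SIGNATURES = {
    "JFIF": (0xE0, b"JFIF\x00"),
    "Exif": (0xE1, b"Exif\x00\x00"),
    "SPIFF": (0xE8, b"SPIFF\x00"),
}


def first_segment_type(data: bytes):
    """Return "JFIF", "Exif" or "SPIFF" for the first segment, else None."""
    if len(data) < 6 or data[:3] != b"\xff\xd8\xff":
        return None
    marker = data[3]
    payload = data[6:]  # skip SOI, marker bytes and the two length bytes
    for name, (code, identifier) in SIGNATURES.items():
        if marker == code and payload.startswith(identifier):
            return name
    return None
```

A result of {{None}} corresponds to the situation described above: the file starts with some other segment, or with no valid segment at all.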


h5. Bad Peggy: Invalid file structure: Missing SOI between two EOI thumbnail markers.

Bad Peggy also detects an error that is completely ignored by the JHOVE JPEG module, and which Bad Peggy has found in more than 100 files within the sample. This error is almost self-explanatory, knowing what an SOI (start of image) and an EOI (end of image) are. So far, none of the JPEGs look bogus in any way or have had any problems being displayed. {color:#ff0000}{*}TODO: Check the standard to see what exactly it wants here and why.*{color}

h3. Conclusion
!badExamples.jpg|border=1!

Two of the images cannot even be opened and displayed any more, and the rest have missing parts, mixed-up parts and colour problems. For practical reasons, I would want my tool to detect these errors automatically, and not necessarily more than those. {color:#ff0000}One could argue, though, that a file can violate the JPEG standard in a way that contemporary tools can deal with, while future readers will not be able to cope with these errors.{color}

Considering this, Bad Peggy has clearly won: It detects them all.

The JHOVE JPEG module misses 7 out of 18 - these are the files with the Bad Peggy error "_corrupt data: premature end of data segment_" without the additional error "_corrupt data: Truncated File - Missing EOI marker_", and those with "_xxxx extraneous bytes before marker 0xd9._" Maybe JHOVE would be just fine if these two extra tests were included. If there is seriously other stuff missing - well, maybe we'd need a bigger sample to be able to answer this question.