
h3. Error detection of JPEG files with JHOVE and Bad Peggy - so who's the real Sherlock Holmes here?
This test was done with the publicly available [Imagetestsuite from Google|https://code.google.com/archive/p/imagetestsuite/]. Furthermore, I included pictures from events like Christmas parties and outdoor events in my library from the last 6 years, pictures contributed by friends and colleagues, some of my own pictures, and even some memes from public fun Facebook pages like "useless facts".
In one case, I even opened a JPEG in an editor to remove some bytes to test if the tools would realise that something was wrong, hoping I would get error messages that were still missing from the list (which, by the way, worked out).
All in all almost 3000 JPEG files were tested. In the following, I will focus on the files that either the [JHOVE|http://jhove.openpreservation.org/] [JPEG module|http://jhove.sourceforge.net/jpeg-hul.html] or [Bad Peggy|http://coptr.digipres.org/Bad_Peggy] \- or both - found objectionable in one way or the other.
In general, the JHOVE JPEG module knows eleven different error messages, whereas Bad Peggy can distinguish at least 30 ([source code of KOST-Val|https://github.com/KOST-CECO/KOST-Val/blob/master/KOST-Val/src/main/java/ch/kostceco/tools/kostval/validation/modulejpeg/impl/ValidationAvalidationJpegModuleImpl.java#L536], which uses Bad Peggy to validate JPEG files).
h4. _Table: Possible JHOVE errors on JPEG files_
|| || Constant \\ || Message \\ || Examples in the sample? \\ || Comment \\ || Bad Peggy equivalent \\ ||
| 1 \\ | ERR_DTT_SEG_MISSING_PREV_DTI | DTT segment without previous DTI | no \\ | | |
| 2 \\ | ERR_EOF_UNEXPECTED | Unexpected end of file | Yes ([Example|^ImageTestSuite_Unexpected.jpg]) \\ | | corrupt data: premature end of data segment. AND \\
*corrupt data: Truncated File - Missing EOI marker.* \\ |
| 3 \\ | ERR_EXIF_PROCESSING_IO_EXCEP | I/O exception processing Exif metadata: | no \\ | | |
| 4 \\ | ERR_HEADER_INVALID | Invalid JPEG header | Yes ([Example|^ImageTestSuite_Header.jpg]) | | The file is not a JPEG (header). |
| 5 \\ | ERR_JFIF_APP_MARKER_MISSING | JFIF APP0 marker not at beginning of file | Yes (see "Expected marker byte 255, got") | | {color:#339966}{_}Bad Peggy does not recognise this file as invalid{_}{color} |
| 6 \\ | ERR_MARKER_INVALID | Marker not valid in context | Yes ([Example|^ImageTestSuite_Marker.jpg]) | | invalid file structure: two SOI markers \\ |
| 7 \\ | ERR_MARKER_MISSING | Expected marker byte 255, got | Yes ([Example|^Profilbild.jpg]) \\ | | {color:#339966}{_}Bad Peggy does not recognise this file as invalid{_}{color} |
| 8 \\ | ERR_SPIF_MARKER_MISSING | SPIFF marker not at beginning of file | no \\ | | |
| 9 \\ | ERR_START_SEGMENT_MISSING | File does not begin with SPIFF, Exif or JFIF segment | Yes (see "Expected marker byte 255, got") | | {color:#339966}{_}Bad Peggy does not recognise this file as invalid{_}{color} |
| 10 \\ | ERR_TEMP_FILE_CREATION | Error creating temporary file. Check your configuration: | no \\ | | |
| 11 \\ | ERR_TILING_DATA_UNRECOGNISED | Unrecognized tiling data | no \\ | | |
| 12 \\ | | {color:#000000}Value offset not word-aligned: xxx{color} | Yes ([Example|^ImageTestSuite_6Offsets.jpg]) | Is not (yet) part of the documentation, but was thrown for some files in the sample \\ | Bad Peggy considers these files to be invalid as well, but throws different error messages \\ |
| 13 \\ | | {color:#000000}No TIFF magic number: 4906{color} | Yes ([Example|^ImageTestSuite_NoTiff.jpg]) | Is officially a TIFF error message, but was thrown for this presumably JPEG file \\ | corrupt data: bad Huffman code. _(only one file in the sample)_ \\
\\ |
For five of the error messages, no examples could be found, so I won't look at these errors in this blog post. Furthermore, for the two additional error messages, the examples in the sample were so scarce that I cannot explain them properly yet. If anybody out there has examples for these errors, I will happily extend this post and include the findings.
Let's take a closer look at the different errors.
h5. General information about the JPEG structure
Other segments are allowed between the SOI (Start of Image) and the EOI (End of Image). Roughly speaking, a JPEG should be structured as follows:
* SOI-segment (SOI: start of image): "FF D8"
* APP0-segment (JFIF-Tag): "FF E0"
* other segments
* SOS-segment (SOS: start of scan): "FF DA"
* data: compressed data
* EOI-segment (EOI: end of image): "FF D9"
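The structure above can be checked by hand with a few lines of Python. The following is only a minimal sketch and not how JHOVE or Bad Peggy work internally: the function name {{list_segments}} is made up for this post, and the sketch assumes that every marker before the SOS is followed by a two-byte big-endian length field (fill bytes and stand-alone markers are not handled).

```python
import struct

# Names for the handful of markers mentioned in this post.
MARKER_NAMES = {0xD8: "SOI", 0xE0: "APP0", 0xE1: "APP1", 0xDA: "SOS", 0xD9: "EOI"}


def list_segments(data: bytes) -> list[tuple[int, str]]:
    """Return (offset, name) for each marker from SOI up to and including SOS.

    Hypothetical helper: assumes each marker before SOS carries a length field.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("no SOI marker (FF D8) at the start of the data")
    segments = [(0, "SOI")]
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(
                f"expected marker byte 255 at offset {pos}, got {data[pos]}"
            )
        marker = data[pos + 1]
        segments.append((pos, MARKER_NAMES.get(marker, f"FF {marker:02X}")))
        if marker == 0xDA:
            break  # the compressed data follows the SOS segment
        # The length field counts itself (2 bytes) plus the payload.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length
    return segments
```

Called on the bytes of a file, for example {{list_segments(open("some.jpg", "rb").read())}}, it returns the marker sequence, so a file whose second entry is not "APP0" is easy to spot.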
h5. Unexpected end of file
Interestingly, JHOVE does not always detect if parts of the files are missing. Only for files where Bad Peggy throws two errors: "_corrupt data: premature end of data segment_" AND "_corrupt data: Truncated File - Missing EOI marker_", will the JHOVE JPEG module detect that something is indeed wrong with the file. For several files Bad Peggy only throws "_corrupt data: premature end of data segment_" and JHOVE considers them to be valid.
Furthermore, the Bad Peggy error "_corrupt data: xxxx extraneous bytes before marker 0xd9._" goes unnoticed by JHOVE. Images in this sample with these errors, though, do not look healthy to me at all (see example and screenshot in the conclusion). There is clearly something missing, as you can see with these three examples, which are considered to be valid by JHOVE:
|| Name || Bad Peggy Error \\ || Impact \\ ||
| image195 \\ | corrupt data: 83426 extraneous bytes before marker 0xd9. | color problems, picture seems to have two parts that do not belong together \\ |
| image185 \\ | corrupt data: premature end of data segment. | color problems, picture seems to have three parts \\ |
| image183 \\ | corrupt data: 19846 extraneous bytes before marker 0xd9. | color problems, picture seems to have two parts \\ |
As a digital archivist, I would want to know about these errors while ingesting data. I fully agree with Bad Peggy - this data is indeed corrupt. I consider this a false negative of the JHOVE JPEG module: The JPEG has serious problems, but JHOVE does not detect them. In this case, Bad Peggy is the better Sherlock Holmes.
The last bytes of a JPEG should look like this and always end with an EOI (end of image), which is "FF D9":
!visible_EOI.jpg|border=1!
In this example, the JPEG just ends without the necessary EOI:
!EOI_missing.jpg|border=1!
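The same look at the end of a file can be scripted. This is a rough sketch under simple assumptions, with a made-up function name ({{eoi_status}}): it only searches for the byte pair "FF D9" and does not decode the image, so it cannot tell a real EOI from the EOI of an embedded thumbnail, and it will not notice a premature end of the data segment when an EOI is still present.

```python
def eoi_status(data: bytes) -> str:
    """Describe where the last EOI marker (FF D9) sits relative to the file end.

    Hypothetical helper: a plain byte search, not a JPEG decoder.
    """
    if data.endswith(b"\xff\xd9"):
        return "EOI at end of file"
    last = data.rfind(b"\xff\xd9")
    if last == -1:
        return "no EOI marker found"
    trailing = len(data) - (last + 2)
    return f"EOI found, followed by {trailing} trailing bytes"
```

A file like the second screenshot would come back as "no EOI marker found", provided no thumbnail inside the file contributes its own "FF D9".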
h5. Invalid JPEG header
Easy and straightforward: Both tools check the JPEG header and throw an error if there is no correct JPEG header. This is extremely useful, as tools usually cannot open files with a missing JPEG header. In most cases the file is unreadable for good - or it's not even a JPEG in the first place.
h5. JFIF APP0 marker not at beginning of file
After the SOI ("FF D8"), an APP0-segment should follow, which always starts with "FF E0" (see: "General information about the JPEG structure"). If this error is thrown, the APP0-segment does not follow directly after the SOI-segment. A JPEG file which throws this error viewed in a Hex editor shows that the correct APP0-segment cannot be found at the beginning; there is an SOI marker followed by "FF EE" (not "FF E0").
!startofJPEGfile.jpg|border=1!
Bad Peggy, however, completely ignores this error and obviously does not test for it. The JPEG standard clearly states that JPEG files have to be structured like this, but so far, none of the JPEG files of the sample have caused any problems in commonly used viewers. This cannot be marked as a false positive for the JHOVE JPEG module, but currently does not seem to pose any practical risk for the affected data.
h5. Marker not valid in context
The JHOVE error seems far too general, but the Bad Peggy equivalent is almost immediately comprehensible: "_invalid file structure: two SOI markers_". As seen above, a JPEG starts with one SOI-segment. Searching for the SOI-segment in an affected file has shown that it had a second SOI-segment later in the file where it obviously does not belong.
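Searching for SOI markers does not need a hex editor. The snippet below is a naive sketch with a hypothetical helper name ({{find_marker_offsets}}); it simply lists every offset at which "FF D8" occurs. A hit after offset 0 is only a hint, since an embedded Exif thumbnail legitimately brings its own SOI, so the result still has to be judged by a human.

```python
def find_marker_offsets(data: bytes, marker: bytes = b"\xff\xd8") -> list[int]:
    """Return every offset at which the two-byte marker occurs in the data.

    Hypothetical helper: a raw byte search that ignores the segment structure.
    """
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(marker, pos + 1)
    return offsets
```

More than one offset in the result means the file contains a further "FF D8" somewhere after the start, which is where to look first in an affected file.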
h5. Expected marker byte 255, got xxx
This error occurs several times within the sample and gives a plethora of marker bytes which have been used instead of 255. So far, none of the affected JPEGs have shown any problems and Bad Peggy ignores the error altogether. {color:#ff0000}{*}TODO: Check standard what exactly is wanted here and why.*{color}
h5. File does not begin with SPIFF, Exif or JFIF segment
A JPEG file usually uses the JPEG File Interchange Format (JFIF), but can also use Exif or [SPIFF|http://www.digitalpreservation.gov/formats/fdd/fdd000019.shtml] \- but obviously has to start with one of these three segments and no other. Bad Peggy marks these files as invalid as well, but the error message is quite tight-lipped "ype." - which was translated by the [KOST|http://kost-ceco.ch/cms/index.php?kost_val_de] (Switzerland) as something like "This JPEG contains characteristics that are not supported" - which does not really enlighten me. Looking at an affected file in a hex editor has not provided any clues here, as the tags at the beginning look okay. At least both tools state that something is wrong here.
!FileBegin.jpg|border=1!
h5. Bad Peggy: Invalid file structure: Missing SOI between two EOI thumbnail markers.
Bad Peggy also detects an error that is completely ignored by the JHOVE JPEG module, and it has found this error in more than 100 files within the sample. The error is almost self-explanatory once you know what an SOI (start of image) and an EOI (end of image) are. So far, none of the JPEGs look bogus in any way or have had any problems being displayed. {color:#ff0000}{*}TODO: Check standard what exactly is wanted here and why.*{color}
h3. Conclusion
After a closer look at the affected JPEG data, I would not want these JPEG files to go unnoticed in my archive:
!badExamples.jpg|border=1!
Two of the images cannot even be opened and displayed any more, and the rest have missing parts, mixed-up parts and colour problems. For practical reasons, I would want my tool to detect these errors automatically, and not necessarily more than those. {color:#000000}These are the only JPEGs that obviously have problems. Others show errors in JHOVE or Bad Peggy or both, but contemporary tools have no problems displaying them. Of course, it is impossible to say if future tools will be able to display these JPEGs properly.{color}
Considering this, Bad Peggy has clearly won: It detects them all.


The JHOVE JPEG module misses 7 out of 18: the files with the Bad Peggy error "_corrupt data: premature end of data segment_" without the additional error "_corrupt data: Truncated File - Missing EOI marker_", and those with "_xxxx extraneous bytes before marker 0xd9._" Maybe JHOVE would be just fine if these two extra tests were included. Whether anything else of importance is missing - well, we'd probably need a bigger sample to answer this question.