|This page is under construction. Please feel free to add, edit and correct.|
|This page is intended to capture error messages found during testing with JHOVE. It also contains some broad PDF knowledge.|
- PDF module
- Cross-reference tables
- XMP metadata
- PDF header
- PDF trailers
- Pages and page trees
- PDF objects
- Invalid characters, syntactic errors
- Issues with colour management
- Interactive content
- TIFF module
Please note that JHOVE has not (yet) been updated since PDF 1.6. This is why JHOVE cannot determine the validity of PDF 1.7 (and higher) files with certainty, although it still gives a useful indication. For the same reason JHOVE cannot really deal with PDF/A-2, as this is built on PDF 1.7. It is safe to say that JHOVE is not meant for profile checking anyway, so PDF/A validation should be done with other tools.
JHOVE can throw two different types of exception: a PdfMalformedException and a PdfInvalidException.
To be considered well-formed by JHOVE, a PDF must consist of:
- a PDF header (e.g. %PDF-1.0)
- an end-of-file marker (i.e. %%EOF)
- a body consisting of well-formed objects
- a cross-reference table
- a trailer defining the cross-reference table size
- an indirect reference to the document catalog dictionary
A valid PDF must be well-formed, and fulfill the following criteria:
- The document structure conforms to the specification. This includes (when present) outlines, pages, the page label tree, attributes, resources, role maps, name trees...
- Version information in the document catalog dictionary, if present, is properly formed.
- Dates are properly formed.
- File specifications are properly formed.
- Any annotations are properly formed.
- Any ArtBox, BleedBox, MediaBox and TrimBox items are PDF rectangles.
- XMP data, if present, are well-formed.
A PDF file consists of PDF objects referenced by PDF dictionaries.
A PDF dictionary is a collection of objects indexed by name, i.e. a set of name-value pairs.
PDF dictionaries are enclosed between "<<" and ">>" delimiters. The example below has been broken onto multiple lines for clarity:
Each dictionary entry consists of a pair of objects. The first object should be a name object, which begins with a slash ("/"), and is followed by a value, which can be any kind of PDF object.
In the above example we see the following:
- /Subtype paired with another name object,
- /Length paired with a numeric object,
- /Filter paired with an array of name objects; and,
- /Metadata paired with an indirect object reference.
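To make the four entry kinds listed above concrete, here is a minimal Python sketch. The sample dictionary and its values are invented for this page (they are not taken from a real file), and the tokenizer only understands names, numbers, indirect references and flat arrays, so it should not be mistaken for a general PDF parser.

```python
import re

# Hypothetical dictionary, invented for illustration; values are not from a real file.
SAMPLE = b"""<<
  /Subtype /XML
  /Length 3204
  /Filter [/ASCIIHexDecode /FlateDecode]
  /Metadata 7 0 R
>>"""

# Order matters: an indirect reference ("7 0 R") must be tried before a plain number.
TOKEN = re.compile(rb"<<|>>|\[|\]|/[^\s/\[\]<>()]*|\d+ \d+ R|[-+]?\d+(?:\.\d+)?")


def parse_dictionary(data):
    """Split a simple dictionary into {key: value}; arrays become lists of tokens."""
    tokens = TOKEN.findall(data)
    if len(tokens) < 2 or tokens[0] != b"<<" or tokens[-1] != b">>":
        raise ValueError("a dictionary must be enclosed in << and >>")
    body = tokens[1:-1]
    entries = {}
    i = 0
    while i < len(body):
        key = body[i]
        if not key.startswith(b"/"):
            raise ValueError(f"expected a name object as key, got {key!r}")
        if i + 1 >= len(body):
            raise ValueError(f"key {key!r} has no value")
        if body[i + 1] == b"[":
            end = body.index(b"]", i + 1)
            value = body[i + 2:end]
            i = end + 1
        else:
            value = body[i + 1]
            i += 2
        entries[key.decode("ascii")] = value
    return entries


def kind(value):
    """Name the kind of object a parsed value represents."""
    if isinstance(value, list):
        return "array"
    if value.startswith(b"/"):
        return "name"
    if value.endswith(b"R"):
        return "indirect reference"
    return "number"


for key, value in parse_dictionary(SAMPLE).items():
    print(key, "->", kind(value))
```

Every key is checked to be a name object, which mirrors the rule that the first object of each pair should begin with a slash.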
In theory, it is possible to add custom entries, but they will be ignored by Acrobat Reader. For long-term availability, this does not seem to be a good idea anyway.
All error messages concerning the PDF dictionary are listed below, each with a link to GitHub for further explanation and possible example PDF files.
Non-embedded fonts are one of the biggest risks for the correct rendering of PDF files. If one of the fonts used is not embedded in the PDF and the rendering device does not have the font, the PDF might not be rendered as the data producer intended. This can lead to missing text, gaps within words or shifted text. In the worst case, part of the text can no longer be displayed correctly.
Some fonts cannot be embedded for copyright reasons. Furthermore, there can be name conflicts: somebody saves a font as "myfont" and does not embed it, the rendering device also has a font named "myfont" and chooses it to render the text, but it is in fact a very different font and changes the visual impression of the PDF considerably.
ISO 32000 does not make it mandatory to embed fonts. A non-embedded font does not necessarily lead to an invalid PDF. With PDF/A, however, this is different: every font used has to be embedded.
Therefore, a perfectly valid PDF can be at risk for long-term availability if the fonts are not embedded. Here is an extreme example (from a slide shown at the PDF Days in Basel in 2012):
The cross-reference table serves as an index for all the objects in a PDF file. Each item is shown with a "byte offset": the exact number of bytes from the beginning of the file to where the object begins. This allows software to find an object within a PDF file without having to scan the whole PDF. It is like an exact address within the PDF file.
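As a rough illustration of how such an index can be read, the following Python sketch parses a classic (non-stream) cross-reference table into a lookup from object number to byte offset. The sample table and its offsets are made up, and the function name is hypothetical; cross-reference streams, which newer PDF versions may use instead, are not handled.

```python
# Made-up classic cross-reference table with one subsection of three entries.
SAMPLE_XREF = (
    b"xref\n"
    b"0 3\n"
    b"0000000000 65535 f \n"
    b"0000000017 00000 n \n"
    b"0000000081 00000 n \n"
)


def parse_xref_table(table):
    """Map object number -> (byte offset, generation, in_use)."""
    lines = [ln.strip() for ln in table.decode("ascii").splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("a cross-reference table starts with the keyword 'xref'")
    entries = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number and number of entries.
        first, count = (int(part) for part in lines[i].split())
        i += 1
        for obj_num in range(first, first + count):
            # For in-use entries ("n") the first field is the byte offset;
            # for free entries ("f") it is the next free object number.
            offset, generation, flag = lines[i].split()
            entries[obj_num] = (int(offset), int(generation), flag == "n")
            i += 1
    return entries


table = parse_xref_table(SAMPLE_XREF)
offset, generation, in_use = table[1]
print("object 1 starts at byte", offset)
```

With the offset in hand, software can seek straight to the object instead of scanning the file from the start.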
XMP (Extensible Metadata Platform) metadata is based on XML and can be found not only in PDF, but also in TIFF, JPEG and other file formats. The most popular XMP schema is Dublin Core, but there are others as well. XMP metadata has been possible since PDF 1.4; earlier versions should not contain XMP metadata. Adobe provides an SDK (Software Development Kit) for working with XMP directly.
PDF/A requires certain XMP metadata; Preflight can usually fix that easily.
The header is usually 1 or 2 lines. The first is mandatory and can look like this: %PDF-1.7
The first five bytes should be "%PDF-", followed by the PDF version number, such as "1.7" above.
The second line is optional and should contain at least four bytes of binary data, allowing other software, like e-mail or file-transfer clients, to categorise the file as binary instead of plain text.
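The two header rules can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and it interprets "binary data" as bytes with a value of 128 or greater inside a comment line, which is how the PDF specification words its recommendation.

```python
import re

HEADER = re.compile(rb"%PDF-(\d\.\d)")


def inspect_header(data):
    """Return (version, has_binary_marker) for the start of a PDF file."""
    match = HEADER.match(data)
    if match is None:
        raise ValueError("file does not start with '%PDF-' and a version number")
    version = match.group(1).decode("ascii")
    # Only the first bytes matter for the header, so a small slice is enough.
    lines = data[:1024].splitlines()
    second = lines[1] if len(lines) > 1 else b""
    # The optional second line is a comment holding at least four high bytes.
    has_binary_marker = second.startswith(b"%") and sum(b >= 128 for b in second[1:]) >= 4
    return version, has_binary_marker


print(inspect_header(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"))
```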
The trailer is the entry point into the document's structure and should be located at the very end of a PDF file. A PDF that has been incrementally updated can have multiple trailers.
Each trailer should consist of a dictionary object, the byte offset to its cross-reference section, and an end-of-file marker.
A trailer dictionary should contain the total number of objects in the PDF at the time it was written ("Size"), a reference to the document catalogue ("Root"), the byte offset of the previous cross-reference section if one exists ("Prev"), and a few other optional entries.
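The following Python sketch shows how the last trailer of a file could be located and its main entries read. It is a simplification under stated assumptions: it only looks at a classic "trailer" keyword (files that use cross-reference streams have none), it uses regular expressions rather than a real parser, and the function name and sample bytes are made up. "Prev" is read as a plain integer because the PDF specification stores it as a byte offset.

```python
import re


def read_last_trailer(data):
    """Read xref offset, Size, Root and Prev from the last classic trailer."""
    tail = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data[-1024:])
    if tail is None:
        raise ValueError("no 'startxref <offset> %%EOF' found at the end of the file")
    start = data.rfind(b"trailer")
    if start == -1:
        raise ValueError("no 'trailer' keyword found (the file may use a cross-reference stream)")
    trailer = data[start:]
    size = re.search(rb"/Size\s+(\d+)", trailer)
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", trailer)
    prev = re.search(rb"/Prev\s+(\d+)", trailer)
    return {
        "xref_offset": int(tail.group(1)),
        "size": int(size.group(1)) if size else None,
        "root": (int(root.group(1)), int(root.group(2))) if root else None,
        "prev": int(prev.group(1)) if prev else None,
    }


# Made-up end of an incrementally updated file.
SAMPLE_END = b"trailer\n<< /Size 22 /Root 2 0 R /Prev 408 >>\nstartxref\n18799\n%%EOF\n"
print(read_last_trailer(SAMPLE_END))
```

A reader that wants every trailer of an incrementally updated file would follow the "Prev" offsets backwards until none is left.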
Some introductory text is missing.
The error message comes from the Java class PageTreeNode. This example PDF, which was created for test purposes, may be used as a reference. It contains the error "improperly constructed page tree" twice and no other error messages; JHOVE considers it "not well-formed". It seems as if this error is always thrown twice.
During the PDF Hackathon of the OPF (Open Preservation Foundation and ZBW Leibniz Information Centre for Economics and Goportis) in Hamburg, Olaf Drümmer (PDF Association) talked about an interesting JHOVE false alarm:
The pages of a PDF file are most commonly stored as a page tree so that a given page can be reached as fast as possible. This is often implemented as a balanced page tree. The PDF standard describes this possibility, but does not make it mandatory. It is perfectly valid, and fully within the standard, to store the pages in a simple array of pages. This is less efficient (worse performance), especially if the PDF consists of many pages. JHOVE considers it an error if the pages are stored in an array instead of a balanced page tree. However, this is not an error, and such PDFs do not pose a risk for long-term availability. This message can be ignored.
Quoting the PDF standard (ISO 32000-1 aka PDF 1.7), 7.7.3 Page Tree / 7.7.3.1 General:
"NOTE: The simplest structure can consist of a single page tree node that references all of the document’s page objects directly. However, to optimize application performance, a conforming writer can construct trees of a particular form, known as balanced trees. Further information on this form of tree can be found in Data Structures and Algorithms, by Aho, Hopcroft, and Ullman (see the Bibliography)."
That only underlines the usefulness of page trees. Leonard Rosenthol suggests using page trees for PDFs with more than 50 pages (Developing with PDF: Dive Into The Portable Document Format by Leonard Rosenthol, page 24). Olaf Drümmer has heard about a test which found that a page tree is worthwhile from 64 pages upwards, but this depends highly on the material itself and might not be generalisable.
In general, the PDF format supports 8 object types and one special type (so 9 in all). Six are scalar types (containing only one value/object) and three are container types that can contain multiple values: dictionary, array and stream. There are tools from Adobe which can be used for object analysis.
- Boolean Objects: True or false
- Numeric Objects: Integer or real numbers
- String Objects: A sequence of 8-bit bytes, which represent text: Literal Strings, Hexadecimal Strings. PDF 1.7 allows for Text Strings, PDFDocEncoded Strings, ASCII Strings & Byte Strings.
- Name Object: A sequence of characters introduced by a slash ("/"). Whitespace and certain delimiter characters are not allowed in names, but can be represented by using the corresponding hexadecimal code instead.
- Array Object: Only one-dimensional arrays. An array can contain objects of any type, even other arrays. Always delimited by [ ].
- Dictionary Objects:
- Stream Objects: A sequence of bytes whose length is unlimited, in contrast to string objects. A stream object always starts with a dictionary, which describes the byte sequence (size, filter, decoding parameters), followed by the stream data itself, which is enclosed between "stream" and "endstream". An example:
2 0 obj
/F1 12 Tf
72 712 Td (A short text stream.) Tj
- Null Object: It is usually recommended to remove an object completely instead of setting it to "null". The JHOVE code contains many "== null" tests, which cause exceptions if the object equals null.
- Indirect Object:
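The stream object described in the list above (a dictionary that states the size, followed by the data between "stream" and "endstream") can be sketched in Python as follows. The function names are hypothetical, no filter is applied, and the /Length value is computed from the content rather than taken from any real file.

```python
import re


def build_stream_object(obj_num, generation, content):
    """Wrap raw bytes in an indirect stream object with a matching /Length."""
    header = b"%d %d obj\n<< /Length %d >>\nstream\n" % (obj_num, generation, len(content))
    # The line break before "endstream" is not counted in /Length.
    return header + content + b"\nendstream\nendobj\n"


def extract_stream(obj):
    """Read the stream data back, trusting the /Length entry of the dictionary."""
    length = int(re.search(rb"/Length\s+(\d+)", obj).group(1))
    start = obj.index(b"stream\n") + len(b"stream\n")
    return obj[start:start + length]


CONTENT = b"/F1 12 Tf\n72 712 Td (A short text stream.) Tj\n"
obj = build_stream_object(2, 0, CONTENT)
print(obj.decode("ascii"))
```

Reading the data back through /Length, rather than searching for "endstream", is what makes a wrong length value a real problem for parsers.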
java.lang.ClassCastException: PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary
| This is not shown in the GUI, only in the Java library version
I have a long German explanation which I can translate someday.
Seems to be a JHOVE bug and not a real PDF error.
| Example from the BSB.
Another example in a forum PDF.
All annotations need to be well-formed. This is quite similar to the definition of a well-formed XML, but as an XML usually is far less complex, it is easier to tell and to parse.
Introductory text is missing
The PDF format works with image data streams and not with image file formats. The most important filters/compressions are:
- 1-bit data: Fax-compression group 3 or 4, JBIG2
- Greyscale, RGB or CMYK data: JPEG, JPEG 2000 (DCTDecode is the filter JPEG uses)
- usable for all kinds of image data: ZIP (FlateDecode)
- alternatively LZW can be used, though this is not possible in PDF/A-1 as the patent only expired in 2004
- RLE (Run Length Encoding) is possible, but is uncommon due to its inefficiency
A PDF can embed the same kind of data stream that a JPEG or JPEG 2000 file uses. Only the data stream that holds the image itself is used; no additional information such as metadata is added to it.
A TIFF image would be stored in a PDF in re-encoded form, e.g. as JPEG data; unlike a JPEG, a TIFF file itself cannot be embedded one-to-one in a PDF.
Interactive content often depends on external information, which can lead to problems and limited functionality. Sometimes fill-in-forms are presented differently.
In general, JHOVE can deal with password-protected PDF files. Password protection does not lead to invalidity (exception: PDF/A). The boolean value "_encrypted" is simply set to true. Some JHOVE versions even return this value in the output (the German National Library's version does, mine does not). So it should be possible to use JHOVE just to determine password protection, but of course JHOVE might be too "big" for such a relatively small task.
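For the small task of spotting password protection without running JHOVE, a crude check can be sketched in Python. It relies on the fact that an encrypted PDF carries an /Encrypt entry in its trailer dictionary; the function name is hypothetical, and this is a byte-level heuristic, not what JHOVE itself does.

```python
def looks_encrypted(data):
    """Heuristic: look for an /Encrypt entry in the last trailer of the file."""
    start = data.rfind(b"trailer")
    # Files with cross-reference streams have no "trailer" keyword,
    # so fall back to searching the whole file in that case.
    region = data[start:] if start != -1 else data
    return b"/Encrypt" in region


# Made-up trailers for illustration.
PROTECTED = b"trailer\n<< /Size 5 /Root 1 0 R /Encrypt 4 0 R >>\nstartxref\n0\n%%EOF"
OPEN = b"trailer\n<< /Size 5 /Root 1 0 R >>\nstartxref\n0\n%%EOF"
print(looks_encrypted(PROTECTED), looks_encrypted(OPEN))
```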
See "PDF objects" above. Regarding Example 1: there can be 9 different types of object (8 direct types, one indirect type).
In all likelihood, this object falls under "other objects (type9)" (see PDF spec F.3.10), quoting: "Named destinations: These objects include the value of the Dests or Names entry in the document catalogue and all the destination objects that it refers to; see G.3, "Opening at an Arbitrary Page""
A first guess: this is an object which includes a destination, but that destination does not exist or is not correct.
Information about destination in "PDF Explained" (John Whitington):
A destination defines a place in a PDF file, consisting of:
- page number
- position within that page
- magnification to use when viewing that page
Destinations can either be defined explicitly or referenced by a name (and be looked up in the name-tree, that lists all destinations).
So, new guess: The name of the destination might be invalid or not be found in the name tree.
Destinations are defined using an array object. The book summarizes the destination syntax.
So, new guess: The destination syntax might be wrong.
TODO: Someone should take a look at the java code to offer a more educated guess.
Information from "Developing with PDF" (L. Rosenthol):
He also says that a destination refers to a certain page and a smaller subsection of that page.
Destinations are values of keys in specific dictionaries related to parts of the PDF. E.g. the "OpenAction" key can make the viewer jump to the first page when opening the PDF.
Explicit destinations: based on an array:
- 1st element: always an indirect reference to the page
- followed by a name object describing the type of zoom
- additional options needed for that zoom
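The array layout listed above can be turned into a small structural check. The Python sketch below models a destination as a plain list; the IndirectRef class and the function name are hypothetical stand-ins, and the operand counts per zoom type are those given in the PDF specification. It does not reflect how JHOVE validates destinations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndirectRef:
    """Hypothetical stand-in for an indirect reference such as '3 0 R'."""
    obj_num: int
    generation: int


# Number of operands that follow each zoom (fit) type in the PDF specification.
FIT_OPERANDS = {"XYZ": 3, "Fit": 0, "FitH": 1, "FitV": 1,
                "FitR": 4, "FitB": 0, "FitBH": 1, "FitBV": 1}


def check_explicit_destination(dest):
    """Return a list of structural problems; an empty list means none were found."""
    if len(dest) < 2:
        return ["a destination array needs at least a page and a zoom type"]
    page, fit, *operands = dest
    problems = []
    if not isinstance(page, IndirectRef):
        problems.append("first element is not an indirect reference to a page")
    if fit not in FIT_OPERANDS:
        problems.append(f"unknown zoom type: {fit}")
    elif len(operands) != FIT_OPERANDS[fit]:
        problems.append(f"{fit} expects {FIT_OPERANDS[fit]} operand(s), got {len(operands)}")
    return problems


# Equivalent of [3 0 R /XYZ 0 792 null]; None stands for the PDF null object.
print(check_explicit_destination([IndirectRef(3, 0), "XYZ", 0, 792, None]))
print(check_explicit_destination([3, "FitH"]))
```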
There also is a relationship between the name (e. g. a string object) and the destination.
Another guess: Something is wrong with the relation, the name or the destination is invalid or one of them is missing.
Carl's practical experience: Usually it is the whole cross-reference table that is messed up somehow.
Example 2 is quite similar, see the screenshot.
Run PDF with JHOVE's PDF module
Detect the error: "Invalid indirect destination - referenced object 'WEBend-a1' cannot be found"
Open the PDF in a hex viewer.
Find the WEBend-a1 reference, but not the destination itself; it is obviously missing.
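The hex viewer search from the steps above can also be scripted. The Python sketch below lists every byte offset at which a name occurs, with some surrounding context, so that references and a possible definition can be told apart by eye. The function name and sample bytes are made up, and the search only sees uncompressed data: anything stored inside compressed streams will not be found this way.

```python
def find_name(data, name, context=20):
    """Return (offset, surrounding bytes) for every occurrence of name in data."""
    hits = []
    pos = data.find(name)
    while pos != -1:
        start = max(0, pos - context)
        hits.append((pos, data[start:pos + len(name) + context]))
        pos = data.find(name, pos + 1)
    return hits


# Made-up fragment containing a reference to the destination, but no definition.
FRAGMENT = b"<< /Type /Annot /Subtype /Link /Dest (WEBend-a1) >>"
for offset, snippet in find_name(FRAGMENT, b"WEBend-a1"):
    print(offset, snippet)
```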
We seem to understand the error here, but we do not know yet whether this cause is typical for this error message or specific to this example. Carl thinks it is more complex and will manipulate some PDFs and test them afterwards. Pete and Yvonne will look for other real-life PDFs with this kind of error.
Information that might still be missing in the spreadsheet:
| Unexpected error while constructing a destination object; or...
There are several valid destination objects:
An unnamed, direct destination, which refers to the page object.
An unnamed, indirect destination, which refers to a named, direct destination, which refers to the page object.
If the object is neither a PdfArray nor a PdfDictionary, this error is thrown. It can occur more than once in a PDF file.
| PDF opens and looks fine with Adobe Acrobat XI Pro. No errors when the PDF 1.5 is tested with Preflight against Adobe 6.
java.lang.OutOfMemoryError: JHOVE can run out of memory while examining a PDF. Some examples are listed in the SourceForge Bug Reporter. The PDF might be perfectly valid; validating it simply needs too much memory. A possible reason is a very big dictionary caused by a very large number of images. 10,000 images are no problem, but listing an unlimited number of images can cause problems if the PDF is built from very many images. A use case from the German National Library (DNB) follows:
Workaround: very big dictionary because of too many listed pictures
The German National Library in Frankfurt has found that JHOVE exhausts the Java heap space if too many pictures are listed in the PDF dictionary. They have developed a workaround for this issue to keep Java from failing.
The appropriate value of DEFAULT_MAX_IMAGES depends on the environment. PDF/A allows 4095 entries. Tests have shown that 10,000 would also be OK, but with no limit a heap space error occurs at around 1,251,900 entries. These numbers come from a test conducted by the German National Library and will certainly depend on further factors.
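The idea behind such a limit can be sketched in a few lines of Python. This is not the German National Library's code, which is not reproduced on this page: only the name DEFAULT_MAX_IMAGES comes from the text above, the value of 10,000 is the one reported as tested OK, and the function name is hypothetical.

```python
from itertools import islice

DEFAULT_MAX_IMAGES = 10_000  # assumption: the value that tested OK above


def list_images(images, max_images=DEFAULT_MAX_IMAGES):
    """Keep at most max_images entries; count the rest without storing them."""
    iterator = iter(images)
    kept = list(islice(iterator, max_images))
    # The remaining entries are only counted, so memory use stays bounded.
    skipped = sum(1 for _ in iterator)
    return kept, skipped


kept, skipped = list_images(range(25), max_images=10)
print(len(kept), "images listed,", skipped, "skipped")
```

Reporting the number of skipped entries keeps the output honest about the fact that the listing was truncated.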
The following errors have been experienced with image-based materials:
- Technical MD Extract:Fail - Error/s returned during metadata extraction (ColorSpace value out of range: 2)
- Error analysis: This error occurred on a TIF file. JHOVE expected to see either value: “1” or “65535” (based on the TIFF specification). Instead the value it was encountering was "2".
- Technical MD Extract:Fail - Error/s returned during metadata extraction (FocalPlaneResolutionUnit value out of range: 4)
- Error analysis: This error occurred on a TIF file. JHOVE expected to see a value in the range of: 1 - 3 (based on the TIFF specification). The value appearing is 4.
- Technical MD Extract:Fail - Error/s returned during metadata extraction (PhotometricInterpretation not defined,ImageWidth not defined,ImageLength not defined,Neither strips nor tiles defined,Neither strips nor tiles defined)
- Error analysis: This error occurred on a TIF file. File was missing critical information and so image did not render (however it was not clear from this error message that the issue would result in a file not rendering.)
- Technical MD Extract:Fail - Error/s returned during metadata extraction (Count mismatch for tag 306; expecting 20; saw 19,Failed to retrieve extractor properties)
- Error analysis: This error occurred on a TIF file. The DateTime tag should contain 20 bytes, however there were only 19 (and so it did not meet the TIFF DateTime format, "YYYY:MM:DD HH:MM:SS" plus a terminating NUL byte).
- Technical MD Extract:Fail - Error/s returned during metadata extraction (FileSource value out of range: 77)
- Error analysis: This error occurred on a TIF file. File should contain the value 3 or 7 for this field. Instead it contains the value 77.
- Technical MD Extract:Fail - Error/s returned during metadata extraction (Count mismatch for tag 36867; expecting 20; saw 11,Failed to retrieve extractor properties)
- Error analysis: This error occurred on a TIF file. The tag should contain 20 bytes, however there were only 11, and so it did not meet the DateTimeOriginal format for the field, as per the TIFF specification. The TIFF spec states: "When the field is left blank, it is treated as unknown."
- Technical MD Extract:Fail - Error/s returned during metadata extraction (Tag 34665 out of sequence)
- Error analysis: This error occurred on a TIF file. Issue hasn't yet been fully analysed, however info can be found at TIFF specification: http://www.awaresystems.be/imaging/tiff/tifftags/exififd.html
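Two of the checks behind the messages above, the value range test and the count test for date fields, can be sketched in Python as follows. The allowed values are the ones reported in the error analyses above, the first two message formats mirror the quoted JHOVE output, and everything else (function names, the format message) is hypothetical and not taken from the JHOVE source.

```python
import re

# Allowed values as reported in the error analyses above.
ALLOWED_VALUES = {
    "ColorSpace": {1, 65535},
    "FocalPlaneResolutionUnit": {1, 2, 3},
}
# 19 characters "YYYY:MM:DD HH:MM:SS" plus a terminating NUL byte = 20 bytes.
DATETIME = re.compile(rb"\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}\x00")


def check_range(name, value):
    """Return an error message if value is not allowed for the named tag."""
    if value in ALLOWED_VALUES[name]:
        return None
    return f"{name} value out of range: {value}"


def check_datetime(tag, raw):
    """Return an error message if a date tag has the wrong count or format."""
    if len(raw) != 20:
        return f"Count mismatch for tag {tag}; expecting 20; saw {len(raw)}"
    if DATETIME.fullmatch(raw) is None:
        return f"tag {tag} is not formatted as YYYY:MM:DD HH:MM:SS"
    return None


print(check_range("ColorSpace", 2))
print(check_datetime(306, b"2015:03:14 09:26:53"))
```

The second call fails because the terminating NUL byte is missing, which is one plausible way to end up with 19 instead of 20 bytes.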