
{info}This page is intended to capture error messages found during testing with JHOVE. It also contains some broad PDF knowledge.{info}
{note:title=To do}
* Add missing errors.
* Expand and improve explanations.
* Maybe add some kind of impact scale?
* The possible cures should be confirmed. Some are just guesses.
{note}
{toc:maxLevel=5}
h2. PDF module
Please note that JHOVE has not (yet) been updated since PDF 1.6. JHOVE therefore cannot reliably determine the validity of PDF 1.7 (or later) files, although its results still give a useful indication. For the same reason, JHOVE cannot really deal with PDF/A-2, which is built on PDF 1.7.
JHOVE can throw two different types of exception: a _PdfMalformedException_ and a _PdfInvalidException_.
h3. Well-formedness
To be considered well-formed by JHOVE, a PDF must consist of:
* a PDF header (e.g. %PDF-1.0)
* an end-of-file marker (i.e. %%EOF)
* a body consisting of well-formed objects
* a cross-reference table
* a trailer defining the cross-reference table size
* an indirect reference to the document catalog dictionary
h3. Validity
A valid PDF must be well-formed, and fulfill the following criteria:
* The document structure conforms to the specification. This includes (when present) outlines, pages, the page label tree, attributes, resources, role maps, name trees...
* Version information in the document catalog dictionary, if present, is properly formed.
* Dates are properly formed.
* File specifications are properly formed.
* Any annotations are properly formed.
* Any ArtBox, BleedBox, MediaBox and TrimBox items are PDF rectangles.
* XMP data, if present, are well-formed.
h3. Dictionaries
A PDF file consists of PDF objects referenced by PDF dictionaries.
A PDF dictionary is a collection of objects indexed by name, or name–value pairs.
PDF dictionaries are enclosed between "<<" and ">>" delimiters. The example below has been broken onto multiple lines for clarity:
{code:language=none|title=Dictionary example}
<<
/Subtype /OpenType
/Length 886
/Filter [/FlateDecode /LZWDecode]
/Metadata 48 0 R
>>
{code}
Each dictionary entry consists of a pair of objects: a key, which should be a name object beginning with a slash ("/"), followed by a value, which can be any kind of PDF object.
In the above example we see the following:
* {{/Subtype}} paired with another name object,
* {{/Length}} paired with a numeric object,
* {{/Filter}} paired with an array of name objects; and,
* {{/Metadata}} paired with an indirect object reference.
In theory, it is possible to add custom entries, but they will be ignored by Acrobat Reader. For long-term availability, this does not seem to be a good idea anyway.
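The name–value structure described above can be sketched in Python. This is only an illustration, not JHOVE's parser: the function {{pair_entries}} and the flattened token list are hypothetical, and the indirect reference is represented by a stand-in tuple.

```python
from typing import Any, Dict, List


def pair_entries(objects: List[Any]) -> Dict[str, Any]:
    """Group a flat list of already-parsed objects into name-value pairs."""
    if len(objects) % 2 != 0:
        # An odd count means at least one key has no value.
        raise ValueError(f"odd number of objects: {len(objects)}")
    entries: Dict[str, Any] = {}
    for i in range(0, len(objects), 2):
        key, value = objects[i], objects[i + 1]
        # Keys should be name objects, which begin with a slash.
        if not (isinstance(key, str) and key.startswith("/")):
            raise ValueError(f"key is not a name object: {key!r}")
        entries[key] = value
    return entries


# The dictionary example above, as a hypothetical tokenizer might flatten it.
example = [
    "/Subtype", "/OpenType",
    "/Length", 886,
    "/Filter", ["/FlateDecode", "/LZWDecode"],
    "/Metadata", ("ref", 48, 0),  # stand-in for the indirect reference 48 0 R
]
entries = pair_entries(example)
print(sorted(entries))
```

A list with an odd number of objects cannot form complete pairs, so the sketch rejects it before pairing anything.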
h4. "Missing dictionary in document node"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [DocNode, line 104|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/DocNode.java#L104] | PdfMalformedException | A page or page tree is missing its dictionary. All pages and page trees require a dictionary, which provides access to their resources and other attributes. | The page or any pages descending from the page tree will be inaccessible and may not appear in a reader. | Is it possible to build a page's dictionary after the fact? Maybe iText can fix it. We (at ZBW) have an iText-based tool that simply copies each page into a new PDF. This procedure repairs the PDF structure, and I would guess that it also builds brand-new dictionaries in the new PDF. I do not have an example on hand, though, so I cannot check. | |
h4. "Invalid Resources Entry in document"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [DocNode, line 112|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/DocNode.java#L112] | PdfInvalidException | | | | |
| [DocNode, line 115|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/DocNode.java#L115] | PdfInvalidException | | | | |
h4. "Missing expected element in page number dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PageLabelNode, line 178|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageLabelNode.java#L178] | PdfInvalidException | | | | |
h4. "Invalid dictionary data for page"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PageObject, line 74|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageObject.java#L74] | PdfInvalidException | A page's "Contents" entry contains neither a stream nor an array of streams. | | | |
| [PageObject, line 79|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageObject.java#L79] | PdfInvalidException | | | | |
| [PageObject, line 82|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageObject.java#L82] | PdfInvalidException | | | | |
| [PageObject, line 85|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageObject.java#L85] | PdfMalformedException | | | | |
h4. "Improperly nested dictionary delimiters"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [Parser, line 100|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java#L100] | PdfMalformedException | More dictionary closing elements (">>") were encountered than dictionary opening elements ("<<"). | | | |
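The nesting rule in the explanation above can be illustrated with a small depth counter. This is a sketch under stated assumptions, not JHOVE's implementation: the function name {{check_dictionary_nesting}} is hypothetical, and the input is assumed to be a token list in which "<<" and ">>" have already been separated from strings and hex strings.

```python
from typing import Iterable


def check_dictionary_nesting(tokens: Iterable[str]) -> int:
    """Return how many dictionaries are still open at the end of the tokens.

    Raises ValueError as soon as a '>>' appears without a matching '<<'.
    """
    depth = 0
    for token in tokens:
        if token == "<<":
            depth += 1
        elif token == ">>":
            if depth == 0:
                raise ValueError("'>>' without a matching '<<'")
            depth -= 1
    return depth


# A dictionary containing one nested dictionary closes cleanly.
print(check_dictionary_nesting(["<<", "/Font", "<<", "/F1", "5 0 R", ">>", ">>"]))
```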
h4. "Malformed dictionary: Vector must contain an even number of objects, but has ..."
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [Parser, line 366|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java#L366] | PdfMalformedException | The dictionary has an odd number of objects, so cannot have a complete set of name–value pairs. | | | |
h4. "Malformed dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [Parser, line 376|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java#L376] | PdfMalformedException | Unexpected error while parsing dictionary. | | | |
h4. "Root entry missing in cross-ref stream dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1035|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1035] | PdfInvalidException | | | | |
h4. "Invalid Prev offset in trailer dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1079|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1079] | PdfInvalidException | The "Prev" entry of a trailer dictionary does not reference a numeric value. Trailer "Prev" entries should specify the byte offset of the previous cross-reference section in a PDF with multiple cross-reference sections. | | If there is only one cross-reference section in a PDF, the "Prev" entry should be removed. | |
h4. "Invalid Size entry in trailer dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1100|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1100] | PdfInvalidException | The "Size" entry of a trailer dictionary does not contain a numeric value. Trailer "Size" entries should specify the total number of objects in a PDF's cross-reference table. | | | |
h4. "Size entry missing in trailer dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1109|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1109] | PdfInvalidException | Trailer has no "Size" value. Trailer "Size" entries are required to specify the total number of objects in a PDF's cross-reference table. | | | |
h4. "Trailer dictionary Info key is not an indirect reference"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1124|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1124] | PdfInvalidException | The "Info" entry of a trailer dictionary does not contain an indirect object reference (e.g. "1 0 R"). If an "Info" entry exists in a trailer, it should point to the document's information dictionary via an indirect object reference. | | | |
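The "Prev", "Size" and "Info" requirements described in the tables above can be summarised in a short sketch. It assumes a hypothetical representation in which the trailer dictionary is a Python dict mapping names to their raw text values; {{check_trailer}} and its problem descriptions are illustrative and are not JHOVE's own messages.

```python
import re
from typing import Dict, List

INTEGER = re.compile(r"[0-9]+")
INDIRECT_REF = re.compile(r"[0-9]+\s+[0-9]+\s+R")  # e.g. "1 0 R"


def check_trailer(trailer: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a trailer given as raw key/value text."""
    problems = []
    size = trailer.get("/Size")
    if size is None:
        problems.append("Size entry is missing")
    elif not INTEGER.fullmatch(size.strip()):
        problems.append("Size entry is not numeric")
    prev = trailer.get("/Prev")
    if prev is not None and not INTEGER.fullmatch(prev.strip()):
        problems.append("Prev entry is not numeric")
    info = trailer.get("/Info")
    if info is not None and not INDIRECT_REF.fullmatch(info.strip()):
        problems.append("Info entry is not an indirect reference")
    return problems


print(check_trailer({"/Size": "5", "/Root": "1 0 R", "/Info": "2 0 R"}))
```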
h4. "No document catalog dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1339|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1339] | ErrorMessage, \\ Malformed | | | | |
| [PdfModule, line 1355|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1355] | ErrorMessage, \\ Malformed | The document catalogue reference exists but cannot be resolved. | | | |
We are allowed to use and share this [PDF|^grid-system.pdf]; the producer has provided it as an example. It is unclear which of the two errors it triggers.
h4. "Invalid Names dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1457|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1457] | PdfInvalidException | | | | |
| [PdfModule, line 1461|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1461] | PdfMalformedException | | | | |
h4. "Invalid Dests dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1475|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1475] | PdfInvalidException | | | | |
| [PdfModule, line 1479|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1479] | PdfMalformedException | | | | |
h4. "Invalid algorithm value in encryption dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1557|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1557] | PdfInvalidException | | | | |
h4. "Invalid page dictionary object"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1692|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1692] | PdfMalformedException | | | | |
h4. "Expected dictionary for font entry in page resource"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2201|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2201] | ErrorMessage, \\ Malformed | | | | The Cabinet of Horrors has an [example PDF|^test_fontArialNotEmbedded.pdf] that can be openly used. |
h4. "Annotation object is not a dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2732|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2732] | PdfInvalidException | An item in a page's "Annots" array does not point to a dictionary. Each item in an annotation array should point to an annotation dictionary containing that annotation's details. | | | |
h4. "Invalid page dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2826|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2826] | PdfMalformedException | | | | |
h4. "Annotation dictionary missing required type (S) entry"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 3077|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3077] | PdfMalformedException | | | | |
h4. "Outline dictionary missing required entry"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 3789|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3789] \\ Commented out | PdfInvalidException | | | | |
h4. "Malformed outline dictionary"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 3818|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3818] | PdfMalformedException | Unexpected error while parsing outline. | | | |
h4. "Outlines contain recursive references"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 3803|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3803] | InfoMessage | | | | |
| [PdfModule, line 3916|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3916] | InfoMessage | | | | |
| [PdfModule, line 3934|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3934] | InfoMessage | | | | |
h4. "Invalid outline dictionary item"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 3846|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3846] | PdfInvalidException | Outline item has no "Title" value. | | | |
| [PdfModule, line 3854|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3854] | PdfInvalidException | Outline item has no "Parent" reference. | | | |
| [PdfModule, line 3860|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3860] | PdfInvalidException | | | | |
| [PdfModule, line 3951|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3951] | PdfInvalidException | Unexpected object type while parsing an outline item. Possible causes include unexpected "Prev", "Next", "First", or "Last" values. | | | |
| [PdfModule, line 3954|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3954] | PdfInvalidException | Unexpected error while parsing outline item. | | | |
h4. "Outlines exist, but are not displayed; ..."
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 3975|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3975] | InfoMessage | | | | |
h4. "Improperly formed date"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 4074|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L4074] | PdfInvalidException | Date found in dictionary does not conform to the expected format. \\
E.g. this date is not syntactically correct: \\
/CreationDate (Friday, 11 December 1998 14:47) \\
This would be correct: \\
/CreationDate (D:199812111447) | | It may happen that after a "cure" there is no information about the creation date any more, if there are no XMP metadata in the original PDF. \\
The date may be written poorly enough that some tools cannot recognize the date and so do not translate it into the new/corrected PDF. | Use [this|https://econstor.eu/dspace/obitstream/10419/31712/1/605028710.PDF] as a reference, but find (or build) a better example eventually. |
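The two date examples in the table above can be told apart with a simple pattern check. The following is a simplified sketch of the PDF date syntax (D:YYYYMMDDHHmmSS followed by an optional time zone, with everything after the year optional). It is not JHOVE's own date parser, and the names {{PDF_DATE}}, {{LIMITS}} and {{is_pdf_date}} are hypothetical.

```python
import re

# Everything after the four-digit year is optional.
PDF_DATE = re.compile(
    r"D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<tz>Z|[+-]\d{2}'?(\d{2}'?)?)?"
)

# Allowed numeric range for each optional field.
LIMITS = {
    "month": (1, 12),
    "day": (1, 31),
    "hour": (0, 23),
    "minute": (0, 59),
    "second": (0, 59),
}


def is_pdf_date(text: str) -> bool:
    """Return True if text looks like a syntactically correct PDF date string."""
    match = PDF_DATE.fullmatch(text)
    if match is None:
        return False
    for field, (low, high) in LIMITS.items():
        value = match.group(field)
        if value is not None and not low <= int(value) <= high:
            return False
    return True


print(is_pdf_date("D:199812111447"))
print(is_pdf_date("Friday, 11 December 1998 14:47"))
```

The sketch only checks syntax and field ranges; it does not verify, for example, that the day exists in the given month.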
h4. "Unexpected exception ..."
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1676|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1676] | ErrorMessage, \\ Malformed | Unexpected error while parsing the document information dictionary. | | | |
| [PdfModule, line 1836|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1836] | ErrorMessage, \\ Malformed | Unexpected error while finding external streams. | | | |
h4. Variable
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1485|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1485] | ErrorMessage | | | | |
| [PdfModule, line 1493|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1493] | ErrorMessage, \\ Malformed | Unexpected error while parsing the document catalog dictionary. | | | |
| [PdfModule, line 1669|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1669] | ErrorMessage | | | | |
| [PdfModule, line 1832|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1832] | ErrorMessage, \\ Malformed | | | | |
| [PdfModule, line 3981|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3981] | ErrorMessage | | | | |
h3. Fonts
Non-embedded fonts are one of the biggest risks to the correct rendering of PDF files. If a font used in the PDF is not embedded and the rendering device does not have that font, the PDF might not be rendered as the producer intended. This can lead to missing text, gaps within words or shifted text. In the worst case, part of the text can no longer be displayed correctly.
Some fonts cannot be embedded for copyright reasons. Furthermore, there can be name conflicts: somebody saves a font as "myfont" and does not embed it, and the rendering device also has a font named "myfont", which it uses to render the text. That font may in fact be a very different one and change the visual impression of the PDF considerably.
ISO 32000 does not require fonts to be embedded, so a non-embedded font does not necessarily make a PDF invalid. PDF/A is different: every font used has to be embedded.
Therefore, a perfectly valid PDF can still be at risk for long-term availability if its fonts are not embedded. Here is an extreme example (from a slide shown at the PDF Days in Basel in 2012): !fonts_notembedded.jpg|border=1!
h4. "Invalid Font entry in Resources"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [DocNode, line 138|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/DocNode.java#L138] | PdfMalformedException | | | | |
h4. "unexpected error in parsing font property"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 610|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L610] | ErrorMessage | | | | |
h4. "Too many fonts to report; some fonts omitted"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 614|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L614] | InfoMessage | The limit should be 1000 different fonts in one PDF. | | | |
h4. "Fonts exist, but are not displayed; ..."
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2213|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2213] | InfoMessage | | | | |
h4. "Unexpected error in findFonts"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2231|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2231] | ErrorMessage, \\ Malformed | | | | |
h4. Variable
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2223|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2223] | ErrorMessage | | | | |
h3. Cross-reference tables
The cross-reference table serves as an index for all the objects in a PDF file. Each item is shown with a "byte offset": the exact number of bytes from the beginning of the file to where the object begins. This allows software to find an object within a PDF file without having to scan the whole PDF. It is like an exact address within the PDF file.
{code:language=none|title=Cross-reference table example}
xref
0 5
0000000023 00000 n
0000000547 00000 n
0000001140 00000 n
0000000000 00001 f
0000002384 00000 n
{code}
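To make the layout of the example above concrete, here is a minimal Python sketch that reads a classic cross-reference section into a list of entries. It is an illustration under simplifying assumptions (text input, one entry per line), not JHOVE's parser, and the names {{XrefEntry}} and {{parse_xref_section}} are hypothetical.

```python
import re
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    object_number: int
    offset: int       # byte offset for "n" entries; next free object number for "f" entries
    generation: int
    in_use: bool


# Ten-digit number, five-digit generation number, then the keyword "n" or "f".
ENTRY = re.compile(r"(\d{10}) (\d{5}) ([nf])")


def parse_xref_section(text: str) -> List[XrefEntry]:
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("section does not start with 'xref'")
    entries: List[XrefEntry] = []
    i = 1
    while i < len(lines):
        # Subsection header: first object number and number of entries.
        first, count = (int(part) for part in lines[i].split())
        i += 1
        if i + count > len(lines):
            raise ValueError("subsection is shorter than its declared count")
        for n in range(count):
            match = ENTRY.fullmatch(lines[i + n])
            if match is None:
                raise ValueError(f"bad cross-reference entry: {lines[i + n]!r}")
            entries.append(XrefEntry(
                first + n,
                int(match.group(1)),
                int(match.group(2)),
                match.group(3) == "n",
            ))
        i += count
    return entries


table = """xref
0 5
0000000023 00000 n
0000000547 00000 n
0000001140 00000 n
0000000000 00001 f
0000002384 00000 n
"""
for entry in parse_xref_section(table):
    print(entry)
```

An entry whose keyword is neither "n" nor "f" fails the pattern match and is reported as an error.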
h4. "Invalid cross-reference table"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1020|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1020] | PdfInvalidException | | | | |
| [PdfModule, line 1021|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1021] | PdfInvalidException | | | | |
h4. "Invalid object number in cross-reference stream"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1211|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1211] | PdfMalformedException | | | | [PDF|https://wiki.dnb.de/download/attachments/93783881/embedded_video_avi.pdf?version=1&modificationDate=1400574373000] [Files|https://wiki.dnb.de/download/attachments/93783881/webCapture.pdf?version=1&modificationDate=1400574598000] in the Cabinet of Horrors. |
h4. "Malformed cross reference stream"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1238|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1238] | ErrorMessage, \\ Malformed | | | | |
h4. "Illegal operator in xref table"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1306|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1306] | PdfMalformedException | An unexpected keyword was found in a cross-reference entry. Expected keywords are "f" or "n". | | | |
h4. Variable
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1247|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1247] | ErrorMessage | | | | |
| [PdfModule, line 1316|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1316] | ErrorMessage | | | | |
| [PdfModule, line 1322|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1322] | ErrorMessage, \\ Invalid | Unexpected error while parsing the cross-reference table. | | | |
h3. XMP metadata
XMP (Extensible Metadata Platform) metadata is based on XML and can be found not only in PDF but also in TIFF, JPEG and other file formats. The most popular XMP schema is Dublin Core, but there are others as well. XMP metadata has been possible since PDF 1.4; earlier versions should not contain it. Adobe provides an SDK (Software Development Kit) for working with XMP directly.
PDF/A requires certain XMP metadata; Preflight will usually fix that easily.
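Because XMP is XML, a first sanity check is whether the packet parses as XML at all. The sketch below uses Python's standard {{xml.etree.ElementTree}} module. It assumes the XMP packet has already been extracted from the PDF as bytes, and it only tests XML well-formedness, which may be narrower or broader than what JHOVE checks. The function name {{is_well_formed_xmp}} is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xmp(packet: bytes) -> bool:
    """Return True if the extracted XMP packet parses as well-formed XML."""
    try:
        ET.fromstring(packet.strip())
    except ET.ParseError:
        return False
    return True


good = (
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
    b'</x:xmpmeta>'
)
bad = b'<x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF>'  # unclosed, undeclared prefix

print(is_well_formed_xmp(good), is_well_formed_xmp(bad))
```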
h4. "Invalid or ill-formed XMP metadata"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1777|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1777] | PdfInvalidException | | | | |
| [PdfModule, line 1791|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1791] | ErrorMessage, \\ Invalid | | | | |
h4. Variable
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1785|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1785] | ErrorMessage | | | | |
h3. PDF header
The header usually consists of one or two lines. The first is mandatory and can look like this: {{%PDF-1.7}}
The first five bytes should be "%PDF-", followed by the PDF version number, such as "1.7" above.
The second line is optional and should contain at least four bytes of binary data, allowing other software, like e-mail or file-transfer clients, to categorise the file as binary instead of plain text.
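A header check along these lines can be sketched in a few lines of Python. The 1024-byte search window is taken from the "No PDF header" explanation on this page; the function name {{find_pdf_header}} and its return shape are hypothetical, and the sketch is not JHOVE's implementation.

```python
import re
from typing import Optional, Tuple

HEADER = re.compile(rb"%PDF-(\d\.\d)")


def find_pdf_header(data: bytes, search_limit: int = 1024) -> Optional[Tuple[int, str]]:
    """Return (byte offset, version) of the header, or None if it is not found."""
    match = HEADER.search(data[:search_limit])
    if match is None:
        return None
    return match.start(), match.group(1).decode("ascii")


print(find_pdf_header(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))
print(find_pdf_header(b"junk%PDF-1.4\n"))
```

An offset greater than zero means that other bytes precede the header, even though the header itself was found.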
h4. "No PDF header"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 803|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L803] | ErrorMessage, \\ Malformed | The PDF header could not be found within the file's first 1024 bytes. | | | [This PDF|^CERN-2005-009.pdf] can be rendered fine. However, some extra bytes precede the PDF header, which makes the header invalid. |
h4. "File header gives version as ..."
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1418|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1418] | InfoMessage | The PDF version specified in the header is different from the version specified in the document catalogue dictionary. | | | |
h4. "Invalid Version in document catalog"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1430|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1430] | PdfInvalidException | The document's PDF version, from either the file header or document catalog dictionary, cannot be recognised as a number. | | | |
h3. PDF trailers
The trailer is the entry point into the document's structure and should be located at the very end of a PDF file. A PDF that has been incrementally updated can have multiple trailers.
Each trailer should consist of a dictionary object, the byte offset to its cross-reference section, and an end-of-file marker.
A trailer dictionary should contain the total number of objects in the PDF at the time it was written ("Size"), a reference to the document catalogue ("Root"), the byte offset of the previous cross-reference section if one exists ("Prev"), and a few other optional entries.
{code:language=none|title=Trailer example}
trailer
<<
/Size 5
/Root 1 0 R
...
>>
startxref
498
%%EOF
{code}
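The end of the trailer example above can be located with a short Python sketch. It looks only at the tail of the file for a {{startxref}} keyword, its byte offset and the closing {{%%EOF}} marker. The function name {{find_startxref}} and the 1024-byte tail size are assumptions for illustration, not JHOVE's behaviour.

```python
import re
from typing import Optional

# "startxref", the byte offset, then "%%EOF" at the very end of the file.
STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF\s*$")


def find_startxref(data: bytes, tail_size: int = 1024) -> Optional[int]:
    """Return the byte offset given after the last startxref, or None."""
    match = STARTXREF.search(data[-tail_size:])
    return int(match.group(1)) if match else None


complete = b"trailer\n<<\n/Size 5\n/Root 1 0 R\n>>\nstartxref\n498\n%%EOF\n"
truncated = b"trailer\n<<\n/Size 5\n/Root 1 0 R\n>>\nstartxr"

print(find_startxref(complete), find_startxref(truncated))
```

A file whose upload or download was interrupted typically ends before these lines, so the sketch returns None for it.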
h4. "No PDF trailer"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 937|https://github.com/openpreserve/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L937] | ErrorMessage, \\ Malformed | | | Cannot be repaired (I guess), because the PDF is not complete. | [Example PDF|^567147525.pdf] |
h4. "Missing startxref keyword or value"


|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 994|https://github.com/openpreserve/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L994] | ErrorMessage, \\ Malformed | | | | |
h4. "No file trailer"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1060|https://github.com/openpreserve/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1060] | ErrorMessage, \\ Malformed | | | | |
h4. "Invalid ID in trailer"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1139|https://github.com/openpreserve/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1139] | PdfInvalidException | | | | |
| [PdfModule, line 1151|https://github.com/openpreserve/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1151] | PdfInvalidException | | | | |
| [PdfModule, line 1155|https://github.com/openpreserve/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1155] | PdfInvalidException | | | | |
h4. Variable
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 512|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L512] | ErrorMessage, \\ Malformed | | | | |
| [PdfModule, line 1169|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1169] | ErrorMessage | | | | |
h4. Invalid PDF Trailer
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| | PdfMalformedException | Very often the upload of a PDF was interrupted and the last part of the file is missing, so no %%EOF marker can be found. | | | [Example PDF|^567147525.pdf] |
h3. Pages and page trees


|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [DocNode, line 159|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/DocNode.java#L159] | PdfInvalidException | A PDF rectangle is required here. \\
Any ArtBox, BleedBox, MediaBox and TrimBox must be a compliant PDF rectangle: an array of four numbers, e.g. /Rect \[2 3 4 5\], which typically gives the X and Y coordinates of the lower-left and upper-right corners of the rectangle. | | | |
| [DocNode, line 162|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/DocNode.java#L162] | PdfInvalidException | | | | |
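The rectangle requirement in the table above amounts to "an array of exactly four numbers". A minimal Python sketch of that check follows. It assumes the box value has already been parsed into a Python list or tuple, and the name {{is_pdf_rectangle}} is hypothetical rather than part of JHOVE.

```python
from numbers import Real
from typing import Any


def is_pdf_rectangle(value: Any) -> bool:
    """Return True if value is a sequence of exactly four numbers."""
    if not isinstance(value, (list, tuple)) or len(value) != 4:
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    return all(isinstance(n, Real) and not isinstance(n, bool) for n in value)


print(is_pdf_rectangle([2, 3, 4, 5]))
print(is_pdf_rectangle([2, 3, 4]))
```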
h4. "Invalid Page tree node"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PageTreeNode, line 138|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageTreeNode.java#L138] | PdfInvalidException | | | | |
h4. "Document page tree not found"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1687|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1687] | PdfInvalidException | The document catalogue is missing its "Pages" entry. The entry should point to the document's main, or "root", page tree. | | | |
h4. "Bad page labels"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2635|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2635] | PdfMalformedException | | | | |
h4. "Page information is not displayed; ..."
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2670|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2670] | InfoMessage | | | | |
h4. "Invalid page label info"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2715|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2715] | PdfMalformedException | | | | |
h4. "Invalid page label sequence"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2873|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2873] | PdfInvalidException | | | | |
h4. "Problem with page label structure"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2921|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2921] | PdfMalformedException | | | | |
h4. "Unexpected exception ..."
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1732|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1732] | ErrorMessage, \\ Malformed | Unexpected error while parsing the document page label tree. | | | |
h4. Variable
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1700|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1700] | ErrorMessage | | | | |
| [PdfModule, line 1707|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1707] | ErrorMessage, \\ Malformed | Unexpected error while parsing the document page tree. | | | |
| [PdfModule, line 1725|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1725] | ErrorMessage | | | | |
| [PdfModule, line 2679|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2679] | ErrorMessage | | | | |
h4. Improperly constructed page tree
{note:title=To do}
There is more info in the German wiki which has to be translated.
{note}
This message originates in the Java class [PageTreeNode|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageTreeNode.java].
This [example PDF|https://wiki.dnb.de/download/attachments/93783881/ImproperlyConstructedPageTree.pdf?version=1&modificationDate=1400575699000] can be used as a test file, as it was created specifically for testing purposes. It produces the error "improperly constructed page tree" twice and no other error messages, and JHOVE classifies it as "not well-formed".
During the [PDF Hackathon of the OPF|http://openplanetsfoundation.org/blogs/2014-09-03-my-first-hackathon-hacking-pdf-files] (Open Planets Foundation), held together with the ZBW (Deutsche Zentralbibliothek für Wirtschaftswissenschaften) and Goportis (Leibniz-Bibliotheksverbund Forschungsinformation) in Hamburg, Olaf Drümmer of the PDF Association pointed out an interesting false negative error message from JHOVE.
The pages of a PDF file are usually stored in a page [tree|http://en.wikipedia.org/wiki/Tree_%28data_structure%29] so that a particular page can be reached as quickly as possible. This tree is often built as a balanced page tree. Although the PDF standard points out this possibility, it does not prescribe it in any way.
The pages can also be stored in a simple array of pages, which conforms to the PDF standard as well. It is merely less efficient for page access (poorer performance), especially for PDF files with a very large number of pages. JHOVE, however, reports an error when the pages are stored in an array instead of a page tree. As this is not an error and poses no risk for digital long-term preservation, this message can be ignored.
Quotation from the PDF standard (ISO 32000-1, aka PDF 1.7), section 7.7.3 Page Tree / 7.7.3.1 General:
"NOTE: The simplest structure can consist of a single page tree node that references all of the document’s page objects directly. However, to optimize application performance, a conforming writer can construct trees of a particular form, known as balanced trees. Further information on this form of tree can be found in Data Structures and Algorithms, by Aho, Hopcroft, and Ullman (see the Bibliography)."
The standard thus only notes, for information, that balanced page trees are useful. Details on such trees have to be looked up outside the PDF standard (a source is named). Using them is not mandatory in any way. No specific threshold is given. As far as I recall, Leonard Rosenthol mentions 50 in his monograph (Developing with PDF: Dive Into The Portable Document Format by Leonard Rosenthol, page 24), and Olaf Drümmer reported that an Adobe employee told him about a test in which they arrived at 64, but this depends heavily on the material. It can be assumed that the threshold is roughly in that range; the values are certainly not 5 or 1000.
Other PDF files with this error message share the peculiarity that the message appears twice. This could possibly be traced in the source code.
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| | PdfMalformedException | | | | |
h3. PDF objects
In general, the PDF format supports eight object types plus one special type (nine in all). Six are scalar types (containing only one value) and three are container types that can contain multiple values: dictionary, array and stream. There are tools from Adobe which can be [used for object analysis|https://blog.idrsolutions.com/2009/04/viewing-pdf-objects/].
{note:title=To do}
Translate the rest.
{note}
# *Boolean Objects:* True or false
# *Numeric Objects:* Integer or real numbers
# *String Objects:* A sequence of 8-bit bytes, which represent text: Literal Strings, Hexadecimal Strings. PDF 1.7 allows for Text Strings, PDFDocEncoded Strings, ASCII Strings & Byte Strings.
# *Name Object:* A sequence of characters introduced by a slash ("/"). Spaces and certain delimiter characters are not allowed in names, but can be represented by using the corresponding hexadecimal code instead.
# *Array Object:* Arrays are one-dimensional, but can contain objects of any type, including other arrays. Always delimited by \[ \] .
# *Dictionary Objects:*
# *Stream Objects:* A sequence of bytes that can be of unlimited length, in contrast to string objects. A stream object always begins with a dictionary that describes the byte sequence (length, filter, decoding parameters), followed by the stream data enclosed between the keywords "stream" and "endstream". An example:
2 0 obj
<</Length 39>>
stream
BT
/F1 12 Tf
72 712 Td (A short text stream.) Tj
ET
endstream
endobj
*8. * *Null Object:* Some sources recommend deleting an object entirely rather than setting it to null. The JHOVE code contains many "== null" checks, which often lead to an exception when they match.
*9. * *Indirect Object:*
h4. "Invalid name tree"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [NameTreeNode, line 91|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/NameTreeNode.java#L91] | PdfInvalidException | | | | |
| [NameTreeNode, line 94|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/NameTreeNode.java#L94] | PdfInvalidException | | | | |
| [NameTreeNode, line 97|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/NameTreeNode.java#L97] | PdfMalformedException | | | | |
| [NameTreeNode, line 160|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/NameTreeNode.java#L160] | PdfMalformedException | | | | |
| [NameTreeNode, line 166|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/NameTreeNode.java#L166] | PdfMalformedException | | | | |
h4. "Improperly nested array delimiters"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [Parser, line 109|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java#L109] | PdfMalformedException | More array closing elements ("]") were encountered than array opening elements ("["). | | | |
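The condition behind this message can be sketched as a simple depth counter. This is an illustrative simplification and not the code in JHOVE's Parser class: the class and method names are hypothetical, and the sketch works on raw characters, so it ignores that brackets inside strings or comments would not count as array delimiters.

```java
// Hypothetical helper for illustration only; not part of JHOVE.
final class ArrayNestingSketch {

    /**
     * Returns true as soon as a "]" is found that has no matching "[" before it.
     * Unclosed "[" at the end of the input are not reported by this method.
     */
    static boolean hasSurplusClosingDelimiter(String content) {
        int depth = 0; // number of currently open arrays
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c == '[') {
                depth++;
            } else if (c == ']') {
                if (depth == 0) {
                    return true; // closing delimiter without an open array
                }
                depth--;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasSurplusClosingDelimiter("[1 [2 3] 4]"));
        System.out.println(hasSurplusClosingDelimiter("[1 2]]"));
    }
}
```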
h4. "Invalid object definition"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [Parser, line 208|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java#L208] | PdfInvalidException | | | | |
| [Parser, line 225|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java#L225] \\ Commented out | PdfInvalidException | Same as above. | | | |
| [Parser, line 226|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java#L226] | PdfInvalidException | | | | |
| [Parser, line 227|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java#L227] | PdfInvalidException | | | | |
| [Parser, line 229|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java#L229] | PdfMalformedException | | | | [PDF|https://wiki.dnb.de/download/attachments/93783881/corruptionOneByteMissing.pdf?version=1&modificationDate=1400574262000] from the Cabinet of Horrors |
h4. "Improper nesting of object streams"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2390|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2390] | PdfMalformedException | | | | |
h4. "Malformed filter"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfStream, line 204|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PdfStream.java#L204] | PdfMalformedException | A filter has to be an instance of either PdfDictionary or PdfArray. Otherwise, it is malformed. (This is my understanding and needs to be checked.) | | | |
h4. java.lang.ClassCastException: PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| | | This is not shown in the GUI, only in the Java library version. \\
I have a long German explanation which I can translate someday. \\
Seems to be a JHOVE bug and not a real PDF error. | | | Example from the [BSB|^grid-system.pdf]. \\
Another example in a forum [PDF|https://wiki.dnb.de/sourceforge.net/p/jhove/bugs/_discuss/thread/8d3d4539/e700/attachment/test.pdf]. |
h3. Annotations
All annotations need to be well-formed. This is quite similar to the notion of well-formed XML, but since XML is usually far less complex, its well-formedness is easier to determine and to parse.
{code:language=none|title=Annotation example}
22 0 obj
<<
/Type /Annot
/Subtype /Text
/Rect [266 116 430 204]
/Contents (The quick brown fox jumped over the lazy dogs.)
>>
endobj
{code}
h4. "Annotations exist, but are not displayed; ..."
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2748|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2748] | InfoMessage | | | | |
h4. "Invalid Annotation list"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2760|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2760] | PdfMalformedException | | | | |
h4. "Invalid Annotation property"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 3139|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3139] | PdfMalformedException | | | | [PDF|https://wiki.dnb.de/download/attachments/93783881/externalLink.pdf?version=1&modificationDate=1400574455000] from the Cabinet of Horrors |
h3. Invalid characters, syntactic errors
h4. "Invalid character in hex string"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [Literal, line 358|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Literal.java#L358] | PdfMalformedException | An if-statement tests which hex values are valid and which are not. \\
Invalid values lead to an invalid PDF. | | | The NLNZ has an example, but it cannot be shared. |
| [Tokenizer, line 808|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Tokenizer.java#L808] | PdfMalformedException | An if-statement tests which hex values are valid and which are not. \\
Invalid values lead to an invalid PDF. | | | The NLNZ has an example, but it cannot be shared. |
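The kind of test described in the table can be sketched as follows. The sketch assumes the rule from the PDF specification that the body of a hexadecimal string (the part between "<" and ">") may contain only hex digits and white-space characters. The class and method names are hypothetical; the actual conditions in JHOVE's Literal and Tokenizer classes may be structured differently.

```java
// Hypothetical helper for illustration only; not part of JHOVE.
final class HexStringSketch {

    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'F') || (c >= 'a' && c <= 'f');
    }

    // The six PDF white-space characters: NUL, TAB, LF, FF, CR and SPACE.
    private static boolean isPdfWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    /** Checks the text between "<" and ">" of a hexadecimal string. */
    static boolean isValidHexStringBody(String body) {
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (!isHexDigit(c) && !isPdfWhitespace(c)) {
                return false; // this is the situation the error message reports
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidHexStringBody("4E6F76 2073686D6F7A"));
        System.out.println(isValidHexStringBody("4E6G"));
    }
}
```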
h3. Issues with colour management
The PDF format works with image data streams and not with image file formats. The most important filters/compressions are:
* 1-bit data: CCITT Group 3 or Group 4 fax compression, [JBIG2|https://en.wikipedia.org/wiki/JBIG2]
* Greyscale, RGB or CMYK data: *JPEG, JPEG 2000* (*{_}DCTDecode is the filter JPEG uses{_}*)
* usable for all kinds of image data: *ZIP*
* alternatively *[LZW|https://en.wikipedia.org/wiki/Lempel-Ziv-Welch#Patents]* can be used, though this is not possible in PDF/A-1 as the patent only expired in 2004
* *RLE* (Run Length Encoding) is possible, but is uncommon due to its inefficiency
It is possible to embed in a PDF the same kind of data stream that a JPEG or JPEG 2000 file uses. Only the data stream that holds the image itself is used; additional information such as metadata is not included.
A TIFF file, by contrast, cannot be embedded one-to-one in a PDF (which is possible with a JPEG). Its image data would be stored in the PDF in another encoding, e.g. as JPEG.
h4. "Compression method is invalid or unknown to JHOVE"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2435|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2435] | PdfMalformedException | The message is raised in a try-catch block when a "ZipException" is caught. Is ZIP the only kind of compression JHOVE knows? There should be four other ones for image data streams. | | | |
h3. Interactive content
Interactive content often depends on external information, which can lead to problems and limited functionality. Fill-in forms are sometimes rendered differently.
h3. Encryption
In general, JHOVE can deal with password-protected PDF files. Password protection does not lead to invalidity (exception: PDF/A). The boolean value "_encrypted" is simply set to true. Some JHOVE versions even return this value in the output (the German National Library's version does, mine does not). So it should be possible to use JHOVE just to determine password protection, but of course JHOVE might be too "big" for such a relatively small task.
h4. Variable
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1635|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1635] | ErrorMessage | | | | |
h3. Miscellaneous
h4. "Invalid destination object"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [Destination, line 93|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Destination.java#L93] | PdfInvalidException | Unexpected error while constructing a destination object; or... \\
There are several valid destination objects: \\
An unnamed, direct destination, which refers to the page object. \\
An unnamed, indirect destination, which refers to a named, direct destination, which refers to the page object. \\
\\
If the object is neither a PdfArray nor a PdfDictionary, this error is thrown. It can occur more than once in a PDF file. | | | |
h4. "Invalid object number or object stream"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2424|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2424] | PdfMalformedException | | | | |
| [PdfModule, line 2440|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2440] | PdfMalformedException | | | | |
h4. "Lexical error"
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [Tokenizer, line 362|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Tokenizer.java#L362] | PdfMalformedException | | | | |
| [Tokenizer, line 374|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Tokenizer.java#L374] | PdfMalformedException | | | | |
h4. "Unexpected exception ..."
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 2146|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2146] | ErrorMessage, \\ Malformed | Unexpected error while finding images. | | | |
h4. Variable
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| [PdfModule, line 1876|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L1876] | ErrorMessage | | | | |
| [PdfModule, line 2141|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L2141] | ErrorMessage | | | | |
| [PdfModule, line 3191|https://github.com/openpreserve/jhove/blob/release-1.14/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3191] | ErrorMessage, \\ Invalid | | | | |
h4. java.lang.NullPointerException
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| | | A bug in the source code. | Too generic to be able to determine the impact of this error, depends on the location of the occurrence. | Line numbers for these errors need to be noted and reported as issues. | |
h4. java.lang.OutOfMemoryError
JHOVE can run out of memory while examining a PDF. Some examples are listed in the [SourceForge Bug Reporter|http://sourceforge.net/p/jhove/bugs/].
|| Source code || Type || Explanation || Impact || Cure || Example PDF ||
| | | The PDF might be perfectly valid; validating it simply requires more memory than is available. | | | |
A possible cause is a very big dictionary resulting from a very large number of images. 10,000 images are no problem, but an unlimited number of images can cause problems if the PDF is built from very many images. (There is a nice use case from the German National Library, which we can probably borrow.)
h5. Workaround for very big dictionaries caused by too many listed pictures
The German National Library in Frankfurt has found that JHOVE exhausts the Java heap space if too many pictures are listed in the PDF dictionary. They have developed a workaround for this issue to keep Java from failing.
{code:language=java|title=PdfModule.java > findImages}
// Heins, 2014-10-30
if (_imagesList.size() <= DEFAULT_MAX_IMAGES) {
_imagesList.add (prop);
}
{code}
The appropriate value of DEFAULT_MAX_IMAGES depends on the use case. PDF/A allows 4095 entries. Tests have shown that 10,000 would also be OK, whereas having no limit causes a heap space error at around 1,251,900 entries. These numbers come from a test conducted by the German National Library and will surely depend on further factors.
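The idea of the workaround can be shown in isolation with a self-contained sketch. It is not the German National Library's patch: the class, its fields and the counter for skipped entries are hypothetical, and the cap of 10,000 is simply the figure from the test mentioned above. Unlike the snippet above, this sketch compares with a strict "<", so the list never grows beyond the cap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the image list kept while scanning a PDF; not part of JHOVE.
final class BoundedImageListSketch {

    static final int DEFAULT_MAX_IMAGES = 10_000; // assumption, taken from the test figure above

    private final int maxImages;
    private final List<String> images = new ArrayList<>();
    private int skipped = 0;

    BoundedImageListSketch(int maxImages) {
        this.maxImages = maxImages;
    }

    /** Stores the entry only while the cap is not reached; otherwise counts it as skipped. */
    boolean add(String imageProperty) {
        if (images.size() < maxImages) {
            images.add(imageProperty);
            return true;
        }
        skipped++;
        return false;
    }

    int size() {
        return images.size();
    }

    int skipped() {
        return skipped;
    }

    public static void main(String[] args) {
        BoundedImageListSketch list = new BoundedImageListSketch(DEFAULT_MAX_IMAGES);
        for (int i = 0; i < 10_005; i++) {
            list.add("image " + i);
        }
        System.out.println(list.size() + " stored, " + list.skipped() + " skipped");
    }
}
```

Counting the skipped entries is a design choice of this sketch. It makes it possible to report that the image list in the output is incomplete, instead of silently dropping entries.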
{note:title=To do}
As the DNB has agreed to share this use case, this will be described in more detail soon.
{note}
h2. JHOVE metadata extraction errors
The following JHOVE errors were found during the migration of image-based materials to Ex Libris' Rosetta by the State Library of New South Wales (SLNSW). Assistance in analysing some of these errors was provided by Digital Preservation staff at the National Library of New Zealand.
h3. Metadata extraction from TIFF files
The following errors have been experienced with image-based materials:
* Technical MD Extract:Fail - Error/s returned during metadata extraction (ColorSpace value out of range: 2)
** *Error analysis:* This error occurred on a TIF file. JHOVE expected to see either the value “1” or “65535” (based on the TIFF specification). Instead, the value it encountered was "2".
* Technical MD Extract:Fail - Error/s returned during metadata extraction (FocalPlaneResolutionUnit value out of range: 4)
** *Error analysis:* This error occurred on a TIF file. JHOVE expected to see a value in the range 1 - 3 (based on the TIFF specification). The value appearing is 4.
* Technical MD Extract:Fail - Error/s returned during metadata extraction (PhotometricInterpretation not defined,ImageWidth not defined,ImageLength not defined,Neither strips nor tiles defined,Neither strips nor tiles defined)
** *Error analysis:* This error occurred on a TIF file. The file was missing critical information, so the image did not render. (However, it was not clear from this error message that the issue would result in a file that does not render.)
* Technical MD Extract:Fail - Error/s returned during metadata extraction (Count mismatch for tag 306; expecting 20; saw 19,Failed to retrieve extractor properties)
** *Error analysis:* This error occurred on a TIF file. The field should contain 20 bytes; however, there were only 19 (and so it did not meet the ISO datetime standard).
* Technical MD Extract:Fail - Error/s returned during metadata extraction (FileSource value out of range: 77)
** *Error analysis:* This error occurred on a TIF file. The file should contain the value 3 or 7 for this field. Instead, it contains the value 77.
* Technical MD Extract:Fail - Error/s returned during metadata extraction (Count mismatch for tag 36867; expecting 20; saw 11,Failed to retrieve extractor properties)
** *Error analysis:* This error occurred on a TIF file. The field should contain 20 bytes; however, there were only 11, and so it did not meet the DateTimeOriginal standard for the field, as per the TIFF specification. The TIFF spec states: "When the field is left blank, it is treated as unknown."
* Technical MD Extract:Fail - Error/s returned during metadata extraction (Tag 34665 out of sequence)
** *Error analysis:* This error occurred on a TIF file. The issue has not yet been fully analysed; however, information can be found in the TIFF tag reference: [http://www.awaresystems.be/imaging/tiff/tifftags/exififd.html]
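The two "Count mismatch" errors above concern date/time fields, which are expected to hold 20 bytes: the 19 characters of "YYYY:MM:DD HH:MM:SS" plus a terminating NUL byte. A minimal sketch of such a check follows. The class and method names are hypothetical and this is not the code of JHOVE's TIFF module.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of JHOVE.
final class TiffDateTimeSketch {

    static final int EXPECTED_COUNT = 20; // 19 characters plus the terminating NUL

    /** Returns true if the raw field value is "YYYY:MM:DD HH:MM:SS" followed by a NUL byte. */
    static boolean isWellFormed(byte[] value) {
        if (value == null || value.length != EXPECTED_COUNT || value[EXPECTED_COUNT - 1] != 0) {
            return false; // this corresponds to the "count mismatch" case
        }
        String text = new String(value, 0, EXPECTED_COUNT - 1, StandardCharsets.US_ASCII);
        return text.matches("\\d{4}:\\d{2}:\\d{2} \\d{2}:\\d{2}:\\d{2}");
    }

    public static void main(String[] args) {
        byte[] complete = "2014:10:30 12:00:00\0".getBytes(StandardCharsets.US_ASCII);
        byte[] dateOnly = "2014:10:30\0".getBytes(StandardCharsets.US_ASCII); // 11 bytes
        System.out.println(isWellFormed(complete));
        System.out.println(isWellFormed(dateOnly));
    }
}
```

The sketch checks the layout only. It does not check whether the digits form a real calendar date, and it does not handle the blank value that the specification treats as unknown.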