JHOVE issues and error messages

This page is under construction. Please feel free to add, edit and correct.
This page is intended to capture error messages found during testing with JHOVE. It also contains some broad PDF knowledge.
To do
  • Add missing errors.
  • Expand and improve explanations.
  • Maybe there could be some kind of impact gamut?
  • The possible cures should be confirmed. Some are just guesses.

PDF module

Please note that JHOVE's PDF module has not (yet) been updated beyond PDF 1.6. JHOVE therefore cannot determine the validity of PDF 1.7 (and higher) files with certainty, although its result still gives a useful indication. For the same reason JHOVE cannot really deal with PDF/A-2, as this is built on PDF 1.7.

JHOVE can throw two different types of exception: a PdfMalformedException and a PdfInvalidException.

Well-formedness

To be considered well-formed by JHOVE, a PDF must consist of:

  • a PDF header (e.g. %PDF-1.0)
  • an end-of-file marker (i.e. %%EOF)
  • a body consisting of well-formed objects
  • a cross-reference table
  • a trailer defining the cross-reference table size
  • an indirect reference to the document catalog dictionary

Validity

A valid PDF must be well-formed, and fulfill the following criteria:

  • The document structure conforms to the specification. This includes (when present) outlines, pages, the page label tree, attributes, resources, role maps, name trees...
  • Version information in the document catalog dictionary, if present, is properly formed.
  • Dates are properly formed.
  • File specifications are properly formed.
  • Any annotations are properly formed.
  • Any ArtBox, BleedBox, MediaBox and TrimBox items are PDF rectangles.
  • XMP data, if present, are well-formed.

Dictionaries

A PDF file consists of PDF objects referenced by PDF dictionaries.

A PDF dictionary is a collection of objects indexed by name, or name–value pairs.

PDF dictionaries are embedded between "<<" and ">>" elements. The below example has been broken onto multiple lines for clarity:

Dictionary example

Each dictionary entry consists of a pair of objects. The first object should be a name object, which begins with a slash ("/"), and is followed by a value, which can be any kind of PDF object.

In the above example we see the following:

  • /Subtype paired with another name object,
  • /Length paired with a numeric object,
  • /Filter paired with an array of name objects; and,
  • /Metadata paired with an indirect object reference.
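The name–value pairing described above can be sketched in Python. This is only an illustration and not JHOVE's parser: it assumes the dictionary's top-level objects have already been split into a flat list of strings, and the function name `pair_entries` and the sample values are hypothetical.

```python
def pair_entries(objects):
    """Group a flat list of top-level dictionary objects into name-value pairs.

    Assumes tokenizing has already happened: nested objects and indirect
    references (such as "3 0 R") are passed in as single strings.
    """
    if len(objects) % 2 != 0:
        raise ValueError(
            f"dictionary must contain an even number of objects, but has {len(objects)}"
        )
    entries = {}
    for i in range(0, len(objects), 2):
        key, value = objects[i], objects[i + 1]
        # The first object of every pair has to be a name object.
        if not key.startswith("/"):
            raise ValueError(f"expected a name object as key, got {key!r}")
        entries[key] = value
    return entries


# Hypothetical sample data, not taken from a real file.
sample = ["/Subtype", "/Image", "/Length", "1024",
          "/Filter", "[/FlateDecode]", "/Metadata", "3 0 R"]
print(pair_entries(sample))
```

An odd number of objects cannot form complete pairs, which is the situation the "Malformed dictionary: Vector must contain an even number of objects" message further down refers to.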

In theory, it is possible to add custom entries, but they will be ignored by Acrobat Reader. For long-term availability, this does not seem to be a good idea anyway.

"Missing dictionary in document node"

Source code Type Explanation Impact Cure Example PDF
DocNode, line 104 PdfMalformedException A page or page tree is missing its dictionary. All pages and page trees require a dictionary, which provides access to their resources and other attributes. The page or any pages descending from the page tree will be inaccessible and may not appear in a reader. Is it possible to build a page's dictionary after the fact? Maybe iText can fix it. We (at ZBW) have an iText-Tool, which just copies each page into a new PDF. The PDF structure gets repaired by this procedure and I would guess that it would build a brand new PDF Dictionary for the PDF. I do not have any example on hand, though, so I cannot check.  

"Invalid Resources Entry in document"

Source code Type Explanation Impact Cure Example PDF
DocNode, line 112 PdfInvalidException        
DocNode, line 115 PdfInvalidException        

"Missing expected element in page number dictionary"

Source code Type Explanation Impact Cure Example PDF
PageLabelNode, line 178 PdfInvalidException        

"Invalid dictionary data for page"

Source code Type Explanation Impact Cure Example PDF
PageObject, line 74 PdfInvalidException A page's "Contents" entry contains neither a stream nor an array of streams.      
PageObject, line 79 PdfInvalidException        
PageObject, line 82 PdfInvalidException        
PageObject, line 85 PdfMalformedException        

"Improperly nested dictionary delimiters"

Source code Type Explanation Impact Cure Example PDF
Parser, line 100 PdfMalformedException More dictionary closing elements (">>") were encountered than dictionary opening elements ("<<").      
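A minimal sketch of the kind of check behind this message, assuming the file content is already available as a list of tokens. The function name is hypothetical and this is not the code JHOVE uses.

```python
def check_dictionary_nesting(tokens):
    """Return the final nesting depth; raise if a '>>' has no matching '<<'."""
    depth = 0
    for token in tokens:
        if token == "<<":
            depth += 1
        elif token == ">>":
            depth -= 1
            if depth < 0:
                # More closing than opening delimiters seen so far.
                raise ValueError("improperly nested dictionary delimiters")
    return depth


# A nested dictionary that is properly closed ends at depth 0.
print(check_dictionary_nesting(["<<", "/A", "<<", "/B", "1", ">>", ">>"]))
```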

"Malformed dictionary: Vector must contain an even number of objects, but has ..."

Source code Type Explanation Impact Cure Example PDF
Parser, line 366 PdfMalformedException The dictionary has an odd number of objects, so cannot have a complete set of name–value pairs.      

"Malformed dictionary"

Source code Type Explanation Impact Cure Example PDF
Parser, line 376 PdfMalformedException Unexpected error while parsing dictionary.      

"Root entry missing in cross-ref stream dictionary"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1035 PdfInvalidException        

"Invalid Prev offset in trailer dictionary"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1079 PdfInvalidException The "Prev" entry of a trailer dictionary does not reference a numeric value. Trailer "Prev" entries should specify the byte offset of the previous cross-reference section in a PDF with multiple cross-reference sections.   If there is only one cross-reference section in a PDF, the "Prev" entry should be removed.  

"Invalid Size entry in trailer dictionary"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1100 PdfInvalidException The "Size" entry of a trailer dictionary does not contain a numeric value. Trailer "Size" entries should specify the total number of objects in a PDF's cross-reference table.      

"Size entry missing in trailer dictionary"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1109 PdfInvalidException Trailer has no "Size" value. Trailer "Size" entries are required to specify the total number of objects in a PDF's cross-reference table.      

"Trailer dictionary Info key is not an indirect reference"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1124 PdfInvalidException The "Info" entry of a trailer dictionary does not contain an indirect object reference (e.g. "1 0 R"). If an "Info" entry exists in a trailer, it should point to the document's information dictionary via an indirect object reference.      

"No document catalog dictionary"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1339 ErrorMessage, Malformed
PdfModule, line 1355 ErrorMessage, Malformed The document catalogue reference exists but cannot be resolved.

We are allowed to use and share this PDF; the producer has provided it as an example. It is unclear which of the two errors it triggers.

"Invalid Names dictionary"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1457 PdfInvalidException        
PdfModule, line 1461 PdfMalformedException        

"Invalid Dests dictionary"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1475 PdfInvalidException        
PdfModule, line 1479 PdfMalformedException        

"Invalid algorithm value in encryption dictionary"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1557 PdfInvalidException        

"Invalid page dictionary object"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1692 PdfMalformedException        

"Expected dictionary for font entry in page resource"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2201 ErrorMessage, Malformed       The Cabinet of Horrors has an example PDF that can be openly used.

"Annotation object is not a dictionary"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2732 PdfInvalidException An item in a page's "Annots" array does not point to a dictionary. Each item in an annotation array should point to an annotation dictionary containing that annotation's details.      

"Invalid page dictionary"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2826 PdfMalformedException        

"Annotation dictionary missing required type (S) entry"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 3077 PdfMalformedException        

"Outline dictionary missing required entry"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 3789 (commented out) PdfInvalidException

"Malformed outline dictionary"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 3818 PdfMalformedException Unexpected error while parsing outline.      

"Outlines contain recursive references"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 3803 InfoMessage        
PdfModule, line 3916 InfoMessage        
PdfModule, line 3934 InfoMessage        

"Invalid outline dictionary item"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 3846 PdfInvalidException Outline item has no "Title" value.      
PdfModule, line 3854 PdfInvalidException Outline item has no "Parent" reference.      
PdfModule, line 3860 PdfInvalidException        
PdfModule, line 3951 PdfInvalidException Unexpected object type while parsing an outline item. Possible causes include unexpected "Prev", "Next", "First", or "Last" values.      
PdfModule, line 3954 PdfInvalidException Unexpected error while parsing outline item.      

"Outlines exist, but are not displayed; ..."

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 3975 InfoMessage        

"Improperly formed date"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 4074 PdfInvalidException Date found in dictionary does not conform to the expected format.
E.g. this date is not syntactically correct:
/CreationDate (Friday, 11 December 1998 14:47)
This would be correct:
/CreationDate (D:199812111447)
  It may happen that after a "cure" there is no information about the creation date any more, if there are no XMP metadata in the original PDF.
The date may be written poorly enough that some tools cannot recognize the date and so do not translate it into the new/corrected PDF.
Use this as a reference, but find (or build) a better example eventually.
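The expected date syntax can be approximated with a regular expression. The sketch below is an assumption about what "properly formed" means, based on the PDF date form D:YYYYMMDDHHmmSS followed by an optional time zone, where everything after the year may be omitted. It is not JHOVE's own check, and the name `is_pdf_date` is hypothetical.

```python
import re

# Year is required; month, day, hour, minute and second are optional,
# but each one may only appear if the one before it is present.
PDF_DATE = re.compile(
    r"D:(\d{4})"
    r"(?:(\d{2})"
    r"(?:(\d{2})"
    r"(?:(\d{2})"
    r"(?:(\d{2})"
    r"(\d{2})?"
    r")?)?)?)?"
    r"(?:Z|[+-]\d{2}(?:'\d{2}'?)?)?"   # optional time zone, e.g. +01'00'
)


def is_pdf_date(text):
    """Return True if text looks like a syntactically correct PDF date."""
    m = PDF_DATE.fullmatch(text)
    if m is None:
        return False
    # Simple range checks for month, day, hour, minute, second.
    limits = [(1, 12), (1, 31), (0, 23), (0, 59), (0, 59)]
    for value, (low, high) in zip(m.groups()[1:], limits):
        if value is not None and not low <= int(value) <= high:
            return False
    return True


print(is_pdf_date("D:199812111447"))                   # the correct form from the table
print(is_pdf_date("Friday, 11 December 1998 14:47"))   # the incorrect form
```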

"Unexpected exception ..."

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1676 ErrorMessage, Malformed Unexpected error while parsing the document information dictionary.
PdfModule, line 1836 ErrorMessage, Malformed Unexpected error while finding external streams.

Variable

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1485 ErrorMessage        
PdfModule, line 1493 ErrorMessage, Malformed Unexpected error while parsing the document catalog dictionary.
PdfModule, line 1669 ErrorMessage        
PdfModule, line 1832 ErrorMessage, Malformed
PdfModule, line 3981 ErrorMessage        

Fonts

Non-embedded fonts are one of the biggest risks to the correct rendering of PDF files. If a font used in the PDF is not embedded and the rendering device does not have that font, the PDF might not be rendered as the producer originally intended. This can lead to missing text, gaps within words or shifted text. In the worst case, part of the text can no longer be displayed correctly.

Some fonts cannot be embedded for copyright reasons. Furthermore, there can be name conflicts: somebody saves a font as "myfont" and does not embed it; the rendering device also has a font named "myfont" and uses it to render the text, although it is in fact a very different font and changes the visual impression of the PDF considerably.

ISO 32000 does not require fonts to be embedded, so a non-embedded font does not necessarily make a PDF invalid. With PDF/A, however, this is different: every font used has to be embedded.

Therefore, a perfectly valid PDF can still be at risk for long-term availability if its fonts are not embedded. Here is an extreme example (from a slide from the PDF Days in Basel in 2012):

"Invalid Font entry in Resources"

Source code Type Explanation Impact Cure Example PDF
DocNode, line 138 PdfMalformedException        

"unexpected error in parsing font property"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 610 ErrorMessage        

"Too many fonts to report; some fonts omitted"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 614 InfoMessage The limit appears to be 1000 different fonts in one PDF.      

"Fonts exist, but are not displayed; ..."

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2213 InfoMessage        

"Unexpected error in findFonts"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2231 ErrorMessage, Malformed

Variable

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2223 ErrorMessage        

Cross-reference tables

The cross-reference table serves as an index for all the objects in a PDF file. Each item is shown with a "byte offset": the exact number of bytes from the beginning of the file to where the object begins. This allows software to find an object within a PDF file without having to scan the whole PDF. It is like an exact address within the PDF file.

Cross-reference table example
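To illustrate how the byte offsets in a classic cross-reference table can be read, here is a small Python sketch. It assumes the usual entry layout of a 10-digit number, a 5-digit generation number and a keyword; the sample lines are hypothetical and the function is not taken from JHOVE.

```python
import re

# 10-digit number, 5-digit generation number, then a keyword.
XREF_ENTRY = re.compile(r"(\d{10}) (\d{5}) (\S+)\s*")


def parse_xref_entry(line):
    """Return (number, generation, in_use) for one cross-reference entry."""
    m = XREF_ENTRY.fullmatch(line)
    if m is None:
        raise ValueError(f"not a cross-reference entry: {line!r}")
    number, generation, keyword = m.groups()
    if keyword not in ("n", "f"):
        # Only "n" (in use) and "f" (free) are expected here.
        raise ValueError(f"illegal operator in xref table: {keyword!r}")
    # For "n" entries the number is the byte offset of the object;
    # for "f" entries it is the number of the next free object.
    return int(number), int(generation), keyword == "n"


# Hypothetical entries, not taken from a real file.
for entry in ["0000000000 65535 f ", "0000000017 00000 n "]:
    print(parse_xref_entry(entry))
```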

"Invalid cross-reference table"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1020 PdfInvalidException        
PdfModule, line 1021 PdfInvalidException        

"Invalid object number in cross-reference stream"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1211 PdfMalformedException       PDF Files in the Cabinet of Horrors.

"Malformed cross reference stream"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1238 ErrorMessage, Malformed

"Illegal operator in xref table"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1306 PdfMalformedException An unexpected keyword was found in a cross-reference entry. Expected keywords are "f" or "n".      

Variable

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1247 ErrorMessage        
PdfModule, line 1316 ErrorMessage        
PdfModule, line 1322 ErrorMessage, Invalid Unexpected error while parsing the cross-reference table.

XMP metadata

XMP (Extensible Metadata Platform) metadata is based on XML and can be found not only in PDF, but also in TIFF, JPEG, and other file formats. The most popular XMP schema is Dublin Core, but there are others as well. XMP metadata has been possible since PDF 1.4; earlier versions should not contain XMP metadata. Adobe provides an SDK (Software Development Kit) for working with XMP directly.

PDF/A requires certain XMP metadata; Preflight will usually fix that easily.

"Invalid or ill-formed XMP metadata"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1777 PdfInvalidException        
PdfModule, line 1791 ErrorMessage, Invalid

Variable

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1785 ErrorMessage        

PDF header

The header is usually 1 or 2 lines. The first is mandatory and can look like this: %PDF-1.7

The first five bytes should be "%PDF-", followed by the PDF version number, such as "1.7" above.

The second line is optional and should contain at least four bytes of binary data, allowing other software, like e-mail or file-transfer clients, to categorise the file as binary instead of plain text.
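A rough way to look for the header in Python is sketched below. The 1024-byte search window matches the limit mentioned for the "No PDF header" message; the function name and the single-digit version pattern are assumptions for illustration and do not reproduce JHOVE's code.

```python
import re

# "%PDF-" followed by a version number such as 1.7
HEADER = re.compile(rb"%PDF-(\d\.\d)")


def find_pdf_header(data, search_limit=1024):
    """Return (offset, version) of the header, or None if it is not found."""
    m = HEADER.search(data[:search_limit])
    if m is None:
        return None
    return m.start(), m.group(1).decode("ascii")


# A header at the very start of the file, followed by a binary comment line.
print(find_pdf_header(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))
# Extra bytes before the header: it is still found, but not at offset 0.
print(find_pdf_header(b"junk%PDF-1.4\n"))
```

An offset other than 0 means that there are extra bytes in front of the header, which is worth reporting even if a viewer still opens the file.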

"No PDF header"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 803 ErrorMessage, Malformed The PDF header could not be found within the file's first 1024 bytes.     This PDF can be rendered fine – however, there are some extra values prior to the PDF header which make the header invalid.

"File header gives version as ..."

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1418 InfoMessage The PDF version specified in the header is different from the version specified in the document catalogue dictionary.      

"Invalid Version in document catalog"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1430 PdfInvalidException The document's PDF version, from either the file header or document catalog dictionary, cannot be recognised as a number.      

PDF trailers

The trailer is the entry point into the document's structure and should be located at the very end of a PDF file. A PDF that has been incrementally updated can have multiple trailers.

Each trailer should consist of a dictionary object, the byte offset to its cross-reference section, and an end-of-file marker.

A trailer dictionary should contain the total number of objects in the PDF at the time it was written ("Size"), a reference to the document catalogue ("Root"), a reference to the previous trailer if one exists ("Prev"), and a few other optional entries.

Trailer example
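The end of a PDF file can be inspected with a few lines of Python. This sketch only looks for the "startxref" keyword, its byte offset and the end-of-file marker near the end of the data; the size of the search window and the function name are assumptions, and real files may need a more tolerant approach.

```python
import re

# "startxref", the byte offset of the cross-reference section, then "%%EOF".
TAIL = re.compile(rb"startxref\s+(\d+)\s+%%EOF\s*$")


def read_startxref(data, tail_size=1024):
    """Return the byte offset given after 'startxref', or None if the tail is incomplete."""
    m = TAIL.search(data[-tail_size:])
    if m is None:
        return None
    return int(m.group(1))


# Hypothetical file endings, not taken from real files.
complete = b"trailer\n<< /Size 5 /Root 1 0 R >>\nstartxref\n416\n%%EOF\n"
truncated = b"trailer\n<< /Size 5 /Root 1 0 R >>\nstartxr"
print(read_startxref(complete))
print(read_startxref(truncated))
```

A file whose upload or download stopped early typically fails this check because the last part, including the end-of-file marker, is missing.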

"No PDF trailer"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 937 ErrorMessage, Malformed     Cannot be repaired (I guess), because the PDF is not complete.

"Missing startxref keyword or value"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 994 ErrorMessage, Malformed

"No file trailer"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1060 ErrorMessage, Malformed

"Invalid ID in trailer"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1139 PdfInvalidException        
PdfModule, line 1151 PdfInvalidException        
PdfModule, line 1155 PdfInvalidException        

Variable

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 512 ErrorMessage, Malformed
PdfModule, line 1169 ErrorMessage        

Invalid PDF Trailer

Source code Type Explanation Impact Cure Example PDF
  PdfMalformedException Very often the upload of a PDF has stopped and the last part is missing. No %%EOF marker can be found.     Example PDF

Pages and page trees

"Malformed MediaBox in page tree"

Source code Type Explanation Impact Cure Example PDF
DocNode, line 159 PdfInvalidException The MediaBox has to be a PDF rectangle. Any ArtBox, BleedBox, MediaBox and TrimBox must be a compliant PDF rectangle, e.g. /Rect [2 3 4 5], which specifies the X and Y coordinates of the lower-left and upper-right corners of the rectangle.
DocNode, line 162 PdfInvalidException        
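What "is a PDF rectangle" amounts to can be sketched as follows: an array of exactly four numbers. The sketch assumes the array has already been parsed into Python numbers; both function names are hypothetical and this is not JHOVE's implementation.

```python
def is_pdf_rectangle(values):
    """Return True if values is an array of exactly four numbers."""
    return (
        isinstance(values, (list, tuple))
        and len(values) == 4
        and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
    )


def normalise_rectangle(values):
    """Reorder the corners as (lower-left x, lower-left y, upper-right x, upper-right y)."""
    x1, y1, x2, y2 = values
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


print(is_pdf_rectangle([2, 3, 4, 5]))     # four numbers
print(is_pdf_rectangle([2, 3, 4]))        # too few entries
print(normalise_rectangle([4, 5, 2, 3]))  # corners given in the other order
```

The second function is included because a rectangle may be written with any two diagonally opposite corners, so the order of the values alone does not make it malformed.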

"Invalid Page tree node"

Source code Type Explanation Impact Cure Example PDF
PageTreeNode, line 138 PdfInvalidException        

"Document page tree not found"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1687 PdfInvalidException The document catalogue is missing its "Pages" entry. The entry should point to the document's main, or "root", page tree.      

"Bad page labels"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2635 PdfMalformedException        

"Page information is not displayed; ..."

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2670 InfoMessage        

"Invalid page label info"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2715 PdfMalformedException        

"Invalid page label sequence"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2873 PdfInvalidException        

"Problem with page label structure"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2921 PdfMalformedException        

"Unexpected exception ..."

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1732 ErrorMessage, Malformed Unexpected error while parsing the document page label tree.

Variable

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1700 ErrorMessage        
PdfModule, line 1707 ErrorMessage, Malformed Unexpected error while parsing the document page tree.
PdfModule, line 1725 ErrorMessage        
PdfModule, line 2679 ErrorMessage        

Improperly constructed page tree

To do
There is more info in the German wiki which has to be translated.

This message comes from the Java class PageTreeNode.

This example PDF can be used as an example because it was created specifically for test purposes. It produces the error "improperly constructed page tree" twice and no other error messages, and JHOVE classifies it as "not well-formed".

During the PDF Hackathon held by the OPF (Open Preservation Foundation) together with the ZBW (German National Library of Economics) and Goportis (Leibniz Library Network for Research Information) in Hamburg, Olaf Drümmer of the PDF Association pointed out an interesting false negative error message from JHOVE.

The pages of a PDF file are usually stored in a page tree so that a particular page can be reached as quickly as possible. This is often built as a balanced page tree. Although the PDF standard mentions this possibility, it does not prescribe it in any way.

The pages can also be stored in a simple array of pages; this also conforms to the PDF standard. It is merely less efficient for page access (poorer performance), especially for a PDF file with a very large number of pages. JHOVE, however, reports an error when the pages are stored in an array instead of a page tree. As this is not an error and poses no risk for digital long-term archiving, this message can be ignored.

Quote from the PDF standard (ISO 32000-1 aka PDF 1.7), section 7.7.3 Page Tree / 7.7.3.1 General:

"NOTE: The simplest structure can consist of a single page tree node that references all of the document’s page objects directly. However, to optimize application performance, a conforming writer can construct trees of a particular form, known as balanced trees. Further information on this form of tree can be found in Data Structures and Algorithms, by Aho, Hopcroft, and Ullman (see the Bibliography)."

So the standard notes, purely for information, that page trees are useful. However, one has to look outside the PDF standard for details about page trees (a source is given). It is in no way prescribed that page trees must be used. No specific threshold is given. I believe Leonard Rosenthol mentioned 50 in his monograph (Developing with PDF: Dive Into The Portable Document Format by Leonard Rosenthol, page 24); Olaf Drümmer reported that an Adobe employee told him about a test in which they arrived at 64, but this depends heavily on the material. It can be assumed that the value is roughly in that range; it is certainly not 5 or 1000.

Other PDF files with this error message share the peculiarity that the error message appears twice. It might be possible to trace this in the source code.

Source code Type Explanation Impact Cure Example PDF
  PdfMalformedException        
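The difference between a single page tree node that references all pages directly and a nested tree can be illustrated with plain Python dictionaries. This is only a sketch of the data structure: the "Label" key and the helper names are hypothetical, and real page tree nodes carry more entries than shown here.

```python
def collect_pages(node):
    """Walk a page tree depth-first and return its leaf pages in order."""
    if node["Type"] == "Page":
        return [node]
    pages = []
    for kid in node["Kids"]:
        pages.extend(collect_pages(kid))
    return pages


def page(label):
    """Build a minimal stand-in for a page object."""
    return {"Type": "Page", "Label": label}


# One node that references all pages directly.
flat = {"Type": "Pages", "Kids": [page(n) for n in range(1, 5)]}

# The same four pages arranged in a tree with intermediate nodes.
nested = {"Type": "Pages", "Kids": [
    {"Type": "Pages", "Kids": [page(1), page(2)]},
    {"Type": "Pages", "Kids": [page(3), page(4)]},
]}

print([p["Label"] for p in collect_pages(flat)])
print([p["Label"] for p in collect_pages(nested)])
```

Both structures yield the same pages in the same order; the nested form only changes how quickly a reader can reach a given page in a large document.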

PDF objects

In general, the PDF format supports eight object types plus one special type (nine in all). Six are scalar types (containing only one value or object) and three are container types that can contain multiple values: dictionary, array and stream. There are tools from Adobe which can be used for object analysis.

  1. Boolean Objects: True or false
  2. Numeric Objects: Integer or real numbers
  3. String Objects: A sequence of 8-bit bytes, which represent text: Literal Strings, Hexadecimal Strings. PDF 1.7 allows for Text Strings, PDFDocEncoded Strings, ASCII Strings & Byte Strings.
  4. Name Object: A sequence of characters introduced by a slash ("/"). Spaces and certain delimiter characters are not allowed in names, but can be represented by using the corresponding hexadecimal code instead.
  5. Array Object: Only one-dimensional arrays. All object types in an array are possible, even other arrays. Always displayed with [ ] .
  6. Dictionary Objects:
  7. Stream Objects: A sequence of bytes that can be of unlimited length, in contrast to String Objects. A Stream Object always begins with a dictionary that describes the byte sequence (size, filter, decoding parameters), followed by the stream itself, which is enclosed between "stream" and "endstream". Here is an example:

2 0 obj
<</Length 39>>
stream
BT
/F1 12 Tf
72 712 Td (A short text stream.) Tj
ET
endstream
endobj

8.     Null Object: In some places it is recommended to delete an object completely rather than set it to null. The JHOVE code contains many "== null" checks, which often lead to an exception when they are true.

9.     Indirect Object:
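For name objects, the replacement of disallowed characters by their hexadecimal code can be sketched like this. The function name is hypothetical, and the choice of UTF-8 for turning the text into bytes is an assumption made for the example.

```python
# Delimiter characters that may not appear literally in a name.
DELIMITERS = b"()<>[]{}/%"


def encode_pdf_name(raw):
    """Write a name as a PDF name object, escaping bytes as #xx where needed."""
    out = ["/"]
    for byte in raw.encode("utf-8"):
        needs_escape = (
            byte < 0x21 or byte > 0x7E      # white space and non-printable bytes
            or byte in DELIMITERS
            or byte == ord("#")             # the escape character itself
        )
        out.append(f"#{byte:02X}" if needs_escape else chr(byte))
    return "".join(out)


print(encode_pdf_name("Lime Green"))   # the space becomes #20
print(encode_pdf_name("paired()"))     # parentheses become #28 and #29
```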

"Invalid name tree"

Source code Type Explanation Impact Cure Example PDF
NameTreeNode, line 91 PdfInvalidException        
NameTreeNode, line 94 PdfInvalidException        
NameTreeNode, line 97 PdfMalformedException        
NameTreeNode, line 160 PdfMalformedException        
NameTreeNode, line 166 PdfMalformedException        

"Improperly nested array delimiters"

Source code Type Explanation Impact Cure Example PDF
Parser, line 109 PdfMalformedException More array closing elements ("]") were encountered than array opening elements ("[").      

"Invalid object definition"

Source code Type Explanation Impact Cure Example PDF
Parser, line 208 PdfInvalidException        
Parser, line 225 (commented out) PdfInvalidException Same as above.
Parser, line 226 PdfInvalidException        
Parser, line 227 PdfInvalidException        
Parser, line 229 PdfMalformedException       PDF from the Cabinet of Horrors

"Improper nesting of object streams"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2390 PdfMalformedException        

"Malformed filter"

Source code Type Explanation Impact Cure Example PDF
PdfStream, line 204 PdfMalformedException A filter has to be either an instance of the PdfDictionary or of the PdfArray. Otherwise, it is malformed. (To my humble understanding, needs to be checked.)      

java.lang.ClassCastException: PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary

Source code Type Explanation Impact Cure Example PDF
    This does not show in the GUI, only in the Java library version.
I have a long German explanation which I can translate someday.
This seems to be a JHOVE bug and not a real PDF error.
    Example from the BSB.
Another example in a forum PDF.

Annotations

All annotations need to be well-formed. This is quite similar to the definition of well-formed XML, but as XML is usually far less complex, its well-formedness is easier to determine and to parse.

Annotation example

"Annotations exist, but are not displayed; ..."

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2748 InfoMessage        

"Invalid Annotation list"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2760 PdfMalformedException        

"Invalid Annotation property"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 3139 PdfMalformedException       PDF from the Cabinet of Horrors

Invalid characters, syntactic errors

"Invalid character in hex string"

Source code Type Explanation Impact Cure Example PDF
Literal, line 358 PdfMalformedException There is an if-statement which tests which hex values are allowed/valid and which are not.
Invalid values lead to an invalid PDF.
    The NLNZ has an example but it's not possible to share it.
Tokenizer, line 808 PdfMalformedException There is an if-statement which tests which hex values are allowed/valid and which are not.
Invalid values lead to an invalid PDF.
    The NLNZ has an example but it's not possible to share it.
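A check of this kind might look like the following sketch, which reports every byte inside a hexadecimal string that is neither a hex digit nor white space. It works on the bytes between "<" and ">" and is an illustration under that assumption, not the if-statement from the JHOVE source.

```python
import string

HEX_DIGITS = set(string.hexdigits.encode("ascii"))   # 0-9, a-f, A-F
WHITESPACE = set(b"\x00\t\n\x0c\r ")                  # white space is ignored


def invalid_hex_string_bytes(body):
    """Return (position, byte) pairs that are not allowed in a hex string."""
    return [
        (i, b) for i, b in enumerate(body)
        if b not in HEX_DIGITS and b not in WHITESPACE
    ]


print(invalid_hex_string_bytes(b"48 65 6C6c6F"))   # nothing to report
print(invalid_hex_string_bytes(b"48G5"))           # "G" at position 2
```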

Issues with colour management

The PDF format works with image data streams and not with image file formats. The most important filters/compressions are:

  • 1-bit data: Fax-compression group 3 or 4, JBIG2
  • Greyscale, RGB or CMYK data: JPEG, JPEG 2000 (DCTDecode is the filter JPEG uses)
  • usable for all kinds of image data: ZIP
  • alternatively LZW can be used, though this is not possible in PDF/A-1 as the patent only expired in 2004
  • RLE (Run Length Encoding) is possible, but is uncommon due to its inefficiency

It is possible to embed the kind of data stream in a PDF which would also be used by a JPEG or JPEG 2000 file. Only the data stream is used which deals with the image itself, no information like metadata is added to that.

A TIFF image would be stored in a PDF as, for example, a JPEG data stream; a TIFF file itself cannot be embedded one-to-one in a PDF (which is possible with a JPEG).

"Compression method is invalid or unknown to JHOVE"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2435 PdfMalformedException Try-catch if "ZipException". Is zip the only kind of compression JHOVE knows? But there should be 4 other ones for image data streams.      

Interactive content

Interactive content often depends on external information, which can lead to problems and limited functionality. Sometimes fill-in-forms are presented differently.

Encryption

In general, JHOVE can deal with password-protected PDF files. Password protection does not lead to invalidity (exception: PDF/A). The boolean value "_encrypted" is simply set to true. Some JHOVE versions even return this value in the output (the German National Library's version does, mine does not). So it should be possible to use JHOVE just to determine password protection, but of course JHOVE might be too "big" for such a relatively small task.

Variable

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1635 ErrorMessage        

Miscellaneous

"Invalid destination object"

Source code Type Explanation Impact Cure Example PDF
Destination, line 93 PdfInvalidException Unexpected error while constructing a destination object; or...
There are several valid destination objects:
An unnamed, direct destination, which refers to the page object.
An unnamed, indirect destination, which refers to a named, direct destination, which refers to the page object.

If it is no PdfArray and no PdfDictionary, this error is thrown. Can occur more than once in a PDF file.
     

"Invalid object number or object stream"

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2424 PdfMalformedException        
PdfModule, line 2440 PdfMalformedException        

"Lexical error"

Source code Type Explanation Impact Cure Example PDF
Tokenizer, line 362 PdfMalformedException        
Tokenizer, line 374 PdfMalformedException        

"Unexpected exception ..."

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 2146 ErrorMessage, Malformed Unexpected error while finding images.

Variable

Source code Type Explanation Impact Cure Example PDF
PdfModule, line 1876 ErrorMessage        
PdfModule, line 2141 ErrorMessage        
PdfModule, line 3191 ErrorMessage, Invalid

java.lang.NullPointerException

Source code Type Explanation Impact Cure Example PDF
    A bug in the source code. Too generic to be able to determine the impact of this error, depends on the location of the occurrence. Line numbers for these errors need to be noted and reported as issues.  

java.lang.OutOfMemoryError

JHOVE can run out of memory space during the PDF examination. Some examples are listed in the SourceForge Bug Reporter.

Source code Type Explanation Impact Cure Example PDF
    The PDF might be perfectly valid, there is just too much space needed to validate      

A possible reason might be a very big dictionary caused by a very large number of images. 10,000 images are no problem, but an unlimited number of images can lead to problems if the PDF is built from very many images. (There is a nice use case from the German National Library, which we can probably borrow.)

Workaround for a very big dictionary caused by too many listed pictures

The German National Library in Frankfurt has found that JHOVE causes the Java heap space to fill up if there are too many pictures listed in the PDF dictionary. They have developed a workaround for this issue to keep Java from failing.

PdfModule.java > findImages

The right value for DEFAULT_MAX_IMAGES depends on the circumstances. PDF/A allows 4095 entries. Tests have shown that 10,000 would also be OK, but without any limit a heap space error occurs at around 1,251,900 entries. This will surely depend on further factors; these numbers come from a test the German National Library has conducted.

To do
As the DNB has agreed to share this use case, this will be described in more detail soon.

JHOVE metadata extraction errors

These JHOVE errors were found during the migration of image-based materials to Ex Libris' Rosetta by the State Library of New South Wales (SLNSW). Assistance in analysing some of these errors was provided by Digital Preservation staff at the National Library of New Zealand.

Metadata extraction from TIFF files

The following errors have been experienced with image-based materials:

  • Technical MD Extract:Fail - Error/s returned during metadata extraction (ColorSpace value out of range: 2)
    • Error analysis: This error occurred on a TIF file. JHOVE expected to see either value: “1” or “65535” (based on the TIFF specification). Instead the value it was encountering was "2".
  • Technical MD Extract:Fail - Error/s returned during metadata extraction (FocalPlaneResolutionUnit value out of range: 4)
    • Error analysis: This error occurred on a TIF file. JHOVE expected to see a value in the range of: 1 - 3 (based on the TIFF specification). The value appearing is 4.
  • Technical MD Extract:Fail - Error/s returned during metadata extraction (PhotometricInterpretation not defined,ImageWidth not defined,ImageLength not defined,Neither strips nor tiles defined,Neither strips nor tiles defined)
    • Error analysis: This error occurred on a TIF file. File was missing critical information and so image did not render (however it was not clear from this error message that the issue would result in a file not rendering.)
  • Technical MD Extract:Fail - Error/s returned during metadata extraction (Count mismatch for tag 306; expecting 20; saw 19,Failed to retrieve extractor properties)
    • Error analysis: This error occurred on a TIF file. File should contain 20 bytes however there were only 19 (and so it did not meet the ISO datetime standard).
  • Technical MD Extract:Fail - Error/s returned during metadata extraction (FileSource value out of range: 77)
    • Error analysis: This error occurred on a TIF file. File should contain the value 3 or 7 for this field. Instead it contains the value 77.
  • Technical MD Extract:Fail - Error/s returned during metadata extraction (Count mismatch for tag 36867; expecting 20; saw 11,Failed to retrieve extractor properties)
    • Error analysis: This error occurred on a TIF file. File should contain 20 bytes however there were only 11 (and so it did not meet the DateTimeOriginal standard for the field, as per the TIFF specification. The TIFF spec states: "When the field is left blank, it is treated as unknown.").
  • Technical MD Extract:Fail - Error/s returned during metadata extraction (Tag 34665 out of sequence)
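The "Count mismatch for tag 306" case can be reproduced with a small check. In TIFF, tag 306 is the DateTime field, which is expected to hold 20 bytes: 19 characters in the form "YYYY:MM:DD HH:MM:SS" plus a terminating NUL byte. The sketch below works on the raw field value; the function name and message texts are made up for illustration and are not JHOVE's.

```python
import re

# 19 characters "YYYY:MM:DD HH:MM:SS" followed by a terminating NUL byte.
DATETIME = re.compile(rb"\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}\x00")


def check_tiff_datetime(value):
    """Return a list of problems found in a raw TIFF DateTime (tag 306) value."""
    problems = []
    if len(value) != 20:
        problems.append(f"count mismatch; expecting 20; saw {len(value)}")
    if not DATETIME.fullmatch(value):
        problems.append("not in 'YYYY:MM:DD HH:MM:SS' form")
    return problems


# Hypothetical values, not taken from the SLNSW files.
print(check_tiff_datetime(b"2014:09:01 10:30:00\x00"))   # no problems
print(check_tiff_datetime(b"2014:09:01 10:30:00"))       # 19 bytes, NUL missing
```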