{note}This page is under construction. Please feel free to contribute, add, edit and correct :-){note}
Page description: This page is intended to capture error messages encountered during testing with JHOVE. It also contains some general PDF background knowledge.
\\


{toc:maxLevel=5}
h2. PDF Module
Please note that JHOVE's PDF module has not been updated (yet) since PDF 1.6. JHOVE therefore cannot reliably determine validity for PDF 1.7 (and higher), although its output still gives a useful indication. For the same reason JHOVE cannot really deal with PDF/A-2, which is built on PDF 1.7.
JHOVE can throw two different kinds of exceptions: a _PdfMalformedException_ and a _PdfInvalidException_.
h3. Well-Formedness
To be considered well-formed by JHOVE, a PDF must consist of:
* a PDF header (starts with %PDF and the version of the PDF)
* an EOF tag (end-of-file tag)
* a body consisting of well-formed objects
* a cross-reference table
* a trailer defining the cross-reference table size
* an indirect reference to the document catalog dictionary
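The first two items of the list above can be checked cheaply before running a full validation. The following is a minimal sketch, not JHOVE's own code: the class name {{PdfQuickCheck}}, its method names and the 1024-byte search window are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of JHOVE. */
final class PdfQuickCheck {

    /** Returns true if the data starts with the "%PDF-" signature. */
    static boolean hasPdfHeader(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns true if the "%%EOF" marker occurs in the last 1024 bytes.
     * The window size is an assumption, not a value taken from JHOVE.
     */
    static boolean hasEofMarker(byte[] data) {
        int from = Math.max(0, data.length - 1024);
        // ISO-8859-1 maps every byte to exactly one char, so binary data is safe here.
        String tail = new String(data, from, data.length - from, StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }
}
```

A file that passes both checks is not necessarily well-formed; the sketch only rules out the most obvious problems, such as a truncated upload.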
h3. Validity
A valid PDF must be well-formed. In addition, it has to fulfill the following criteria:
* The file is well-formed.
* The document structure conforms to the specification. This includes (when present) outlines, pages, the page label tree, attributes, resources, role maps, name trees...
* Version information in the document catalog dictionary, if present, is properly formed.
* Dates are properly formed.
* File specifications are properly formed.
* Any annotations are properly formed.
* Any ArtBox, BleedBox, MediaBox and TrimBox items are PDF rectangles.
* XMP data, if present, are well-formed.
h3. Error Messages (Exceptions) in the JHOVE PDF Module
//TODO: at the moment all the links to the source code lead to the CarlWilson-Version of JHOVE. Eventually, this should be changed to the openpreserve-Version of JHOVE.
//TODO: Maybe there could be some kind of impact gamut?
//TODO: Of course the possible cures have to be tested. Sometimes I just have to guess.
//TODO: The explanations can evolve to be much better, this is only a first try.
h4. Issues with the PDF Dictionary
//TODO: Add Info about Info Dictionary. It would be good to add some summarizing info about the PDF Dictionary.
There is some info about dictionary objects as well.
h5. Missing dictionary in document node (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [line 104 in DocNode class|https://github.com/carlwilson/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/DocNode.java] | This exception is thrown if there is no PdfDictionary. The code checks whether "_dict == null"; if it is null (i.e. not there), the exception is thrown. As a PDF dictionary is mandatory for a well-formed PDF, this error makes the PDF malformed. | A missing PDF dictionary is a real defect, which should not be accepted. | Is it possible to build a PDF dictionary after the fact? Maybe iText can fix it. We (at ZBW) have an iText tool which just copies each page into a new PDF. This procedure repairs the PDF structure, and I would guess that it builds a brand-new PDF dictionary for the PDF. I do not have an example at hand, though, so I cannot check. | |
h5. Invalid page dictionary (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line 2846 of the class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java]. | | | | |
h5. Annotation dictionary missing required type (S) entry (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line 3097 of the class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java]\\
Thrown like this: \\
throw new PdfMalformedException ("Annotation dictionary " + "missing required type (S) entry"); | | | | |
h5. Invalid page dictionary object (PdfMalformedException)
|| Source Code || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line 1708 of the class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java]. | | | | |
h5. No document catalog dictionary (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule line 1347|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] String "nocat" | If the catalog entry == null, the error is thrown and SetWellFormed is set to false. \\ | | | We are allowed to use and share this [PDF|^grid-system.pdf]; the producer has provided it as an example. \\ |
h5. Malformed dictionary (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [Parser in line 364|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java]\\
\\ | If the error is thrown in the catch block, there is no further information. \\
Otherwise, the message is built from the String "invalidDict" plus some details. \\
For example, the error occurs if the vector has an odd number of objects. An example of such an error is: \\
Malformed dictionary: Vector must contain an even number of objects, but has 29 | | | |
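The "even number of objects" message makes sense because a dictionary is a flat sequence of key/value pairs: every key needs exactly one value. The following is a minimal sketch of that pairing step and not the code from JHOVE's Parser class. The class {{DictionaryPairing}} and the use of plain strings as tokens are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of key/value pairing; not JHOVE's Parser. */
final class DictionaryPairing {

    /** Pairs up a flat token list (key, value, key, value, ...) into a map. */
    static Map<String, String> pair(List<String> tokens) {
        if (tokens.size() % 2 != 0) {
            // An odd count means at least one key has no value (or vice versa).
            throw new IllegalArgumentException(
                "Malformed dictionary: Vector must contain an even number of objects, but has "
                    + tokens.size());
        }
        Map<String, String> dict = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i += 2) {
            dict.put(tokens.get(i), tokens.get(i + 1));
        }
        return dict;
    }
}
```

A vector with 29 objects, as in the message above, would leave one key without a value.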
h5. Malformed outline dictionary (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [PdfModule in line 3840|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java]\\
but it seems to be commented out? \\ | | | | |
h5. Improperly nested dictionary delimiters (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| line 100 of the class [Parser|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java]. | If a certain value is less than 0, something about the nesting order is wrong and the error is thrown. \\ | | | |
h5. Invalid outline dictionary item (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| not found in source code \\ | | | | \\ |
h5. Expected dictionary for font entry in page resource (+example) (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| not found in source code | | | | Cabinet of Horrors Sample has a [PDF example|^test_fontArialNotEmbedded.pdf] that can be openly used |
h5. Root entry missing in cross-ref stream dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 1050 | error thrown if root entry == null \\
Example for a root entry: \\
<</Root 335 0 R/Info 333 0 R/ID\[\]/Size 347/Prev 37150797>> | | | |
h5. Invalid Prev offset in trailer dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 1094 \\ | An if/else checks whether some value is less than 0. \\ | | | |
h5. Invalid Size entry in trailer dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 1115 \\ | If the value is either less than 0 or greater than 8388607, this exception is thrown. \\
The upper bound apparently corresponds to the implementation limit for the number of indirect objects in Appendix C of the PDF Reference, which a PDF/A file is not allowed to exceed. \\ | | | |
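The range check described in the row above can be written down in a few lines. This is a minimal sketch and not the code from PdfModule: the class and constant names are hypothetical, and only the bounds (0 and 8388607, which is 2^23 - 1) come from the explanation above.

```java
/** Hypothetical illustration of the Size range check; not JHOVE's code. */
final class TrailerSizeCheck {

    /** Upper bound named in the explanation above (2^23 - 1). */
    static final long MAX_SIZE = 8_388_607L;

    /** Returns true if the trailer's Size value lies within the accepted range. */
    static boolean isValidSize(long size) {
        return size >= 0 && size <= MAX_SIZE;
    }
}
```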
h5. Size entry missing in trailer dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| PdfModule in [line 1129|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | | | | |
h5. Trailer dictionary Info key is not an indirect reference (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 1138 \\ | | | | |
h5. Annotation object is not a dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 2752 \\ | | | | |
h5. Invalid algorithm value in encryption dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 1574 \\ | | | | |
h5. Outline dictionary missing required entry (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 3811, but it's commented? \\ | | | | |
h5. Invalid dictionary data for page (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [PageObject in line 74|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageObject.java] as String "badPageStr" \\ | If entries in the dictionary are == null, this error is thrown. \\ | | | |
h5. Invalid Names dictionary (invalid and/or malformed)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 1453 | | | | |
h5. Invalid Dests dictionary (invalid and/or malformed)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 1491 | | | | |
h5. Missing expected element in page number dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| line 182 in class [PageLabelNode|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageLabelNode.java]\\ | If the PdfArray object == null, this error is thrown. \\ | | | |
h5. Invalid destination object (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [Destination|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Destination.java] in line 93 | There are several valid destination objects: \\
An unnamed, direct destination, which refers to the page object. \\
An unnamed, indirect destination, which refers to a named, direct destination, which refers to the page object. \\
\\
If the object is neither a PdfArray nor a PdfDictionary, this error is thrown. It can occur more than once in one PDF file. \\ | | | |
h5. Invalid Resources Entry in document (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h5. Improperly formed date (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Class PdfModule [line 4099|https://github.com/openpreserve/jhove/blob/e573f424184c1b12c0445955ee79f559e94cf554/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java]\\ | | | | Use [this|https://econstor.eu/dspace/obitstream/10419/31712/1/605028710.PDF] as a reference for now, but eventually find a better example (or build one). \\ |
h5. Invalid outline dictionary object (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h5. OutOfMemoryError: Very big dictionary because of too many listed pictures
The German National Library in Frankfurt has found that JHOVE exhausts the Java heap space if too many images are listed in the PDF dictionary. They have developed a workaround for this issue that keeps Java from failing.
PdfModule.java > findImages
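{code}
// DNB
// heins, 2014-10-30
// Only record the image property while the list is within the limit,
// so that the list cannot grow until the heap is exhausted.
// Note: with "<=" the list can hold at most DEFAULT_MAX_IMAGES + 1 entries.
if (_imagesList.size() <= DEFAULT_MAX_IMAGES) {
    _imagesList.add (prop);
}
{code}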
The appropriate value for DEFAULT_MAX_IMAGES depends on the environment. PDF/A allows 4095 entries. Tests have shown that 10,000 is also fine, whereas having no limit causes a heap space error at around 1,251,900 entries. These numbers come from a test conducted by the German National Library and will surely depend on further factors.
(//TODO: as the DNB has agreed to share this use case, this will be described in more detail soon)
h4. Fonts
Non-embedded fonts are one of the biggest risks for the correct rendering of PDF files. If one of the fonts used is not embedded in the PDF and the rendering device does not have the font, the PDF might not be rendered as the producer intended. This can lead to missing text, gaps within words or shifted text. In the worst case, part of the text can no longer be displayed correctly.
Some fonts cannot be embedded for copyright reasons. Furthermore, there can be name conflicts: somebody saves a font as "myfont" and does not embed it; the rendering device also has a font named "myfont" and uses it to render the text, although it is in fact a very different font, which changes the visual impression of the PDF considerably.
ISO 32000 does not make it mandatory to embed fonts. A non-embedded font does not necessarily lead to an invalid PDF. With PDF/A, however, this is different: every font used has to be embedded.
Therefore, a perfectly valid PDF can be at risk for long-term availability if the fonts are not embedded. Here is an extreme example (from a slide from the PDF Days in Basel in 2012): !fonts_notembedded.jpg|border=1!
h5. Invalid Font entry in Resources (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [DocNode, line 138|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/DocNode.java] | A try-catch block catches entries in the PdfDictionary if something is amiss. \\ | | | |
h5. Unexpected error in findFonts (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line [2248 in PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | | | | |
h5. Too many fonts to report; some fonts omitted (Info Messages)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | The limit appears to be 1000 different fonts in one PDF. \\ | | | |
h4. Cross-Reference Table
The cross-reference table indexes all the objects in the PDF file. Each entry gives a "byte offset": the exact number of bytes from the beginning of the file to the place where the object starts. This is useful because software can find an object within the PDF file without having to scan the whole PDF. It is like an exact address within the PDF file.
In contrast, this is not possible with a TIFF file, because TIFF is not linearised; this is why a TIFF file cannot be streamed.
An example of a cross-reference table:
xref
334 13
0000000023 00000 n
0000000547 00000 n
0000001140 00000 n
0000001328 00000 n
0000002384 00000 n
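Each entry line in the example above consists of a 10-digit byte offset, a 5-digit generation number and a one-letter keyword ("n" for an object in use, "f" for a free entry). The following is a minimal sketch of how such a line could be read; it is not JHOVE's parser, and the class {{XrefEntry}} and its fields are names assumed for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value class for one cross-reference entry; not part of JHOVE. */
final class XrefEntry {

    // 10-digit offset, 5-digit generation number, then "n" (in use) or "f" (free).
    private static final Pattern ENTRY = Pattern.compile("(\\d{10}) (\\d{5}) ([nf])");

    final long offset;
    final int generation;
    final boolean inUse;

    private XrefEntry(long offset, int generation, boolean inUse) {
        this.offset = offset;
        this.generation = generation;
        this.inUse = inUse;
    }

    /** Parses one entry line; surrounding whitespace and line endings are ignored. */
    static XrefEntry parse(String line) {
        Matcher m = ENTRY.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Illegal xref entry: " + line);
        }
        return new XrefEntry(
            Long.parseLong(m.group(1)),
            Integer.parseInt(m.group(2)),
            m.group(3).equals("n"));
    }
}
```

For the first entry of the example, {{XrefEntry.parse("0000000023 00000 n")}} yields offset 23, generation 0 and an object that is in use.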
h5. Invalid cross-reference table
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule Line 1022|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | | | | |
h5. Invalid object number in cross-reference stream (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| line 1228 class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | | | | [PDF|https://wiki.dnb.de/download/attachments/93783881/embedded_video_avi.pdf?version=1&modificationDate=1400574373000] [Files|https://wiki.dnb.de/download/attachments/93783881/webCapture.pdf?version=1&modificationDate=1400574598000] in Cabinet of Horrors \\ |
h5. Illegal operator in xref table (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| line 1323 in class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | Legal operators seem to be "n" (object in use) and "f" (free entry). \\ | | | |
h4. (XMP-)Metadata
XMP stands for Extensible Metadata Platform.
XMP is based on XML, and XMP metadata can be found not only in PDF but also in TIFF, JPEG and other file formats. The most popular XMP schema is Dublin Core, but there are others as well. XMP metadata has been possible since PDF 1.4; earlier versions should not contain XMP metadata. Adobe provides an SDK (Software Development Kit) for working with XMP directly.
PDF/A requires certain XMP metadata; usually Preflight will fix that easily.
h5. Invalid or ill-formed XMP-metadata (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 1757 | | | | |
h4. File-Header
The header usually has one or two lines. The first one is mandatory and can look like this: %PDF-1.7
The first four bytes have to be "%PDF", which is a handy way to check whether a file is a PDF, because you only have to read the first four bytes.
The data that follows in the header usually signals to applications such as email clients or file-transfer software that the file contains binary data and not just plain ASCII text.
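Both header lines can be inspected with very little code. The following is a minimal sketch, not JHOVE's implementation: the class {{PdfHeaderLine}} and its methods are hypothetical. The binary-marker check assumes the convention recommended by the PDF specification, namely a comment line containing at least four bytes with values of 128 or greater.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading the PDF header; not part of JHOVE. */
final class PdfHeaderLine {

    private static final Pattern HEADER = Pattern.compile("^%PDF-(\\d)\\.(\\d)");

    /** Returns the version (for example "1.7"), or null if the line is not a PDF header. */
    static String version(String firstLine) {
        Matcher m = HEADER.matcher(firstLine);
        return m.find() ? m.group(1) + "." + m.group(2) : null;
    }

    /** Returns true if the line is a comment containing at least four bytes >= 128. */
    static boolean isBinaryMarker(byte[] secondLine) {
        if (secondLine.length == 0 || secondLine[0] != '%') {
            return false;
        }
        int highBytes = 0;
        for (byte b : secondLine) {
            if ((b & 0xFF) >= 128) {
                highBytes++;
            }
        }
        return highBytes >= 4;
    }
}
```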
h5. Invalid Version in document catalog (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 1447 | If the header and the dictionary do not show the same version, only an InfoMessage is shown. But the catch-block throws an error. \\ | | | |
h5. No PDF Header (+example) (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Not found in source code | | | | [PDF example|^fonts_notembedded.jpg] |
h4. PDF Trailer
"_that specifies the location of some special objects (amongst which the cross-reference table)_" "_The trailer contains the location (byte position) of the cross-reference table, as well as some other special objects._"
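The byte position mentioned in the quotes above is written after the {{startxref}} keyword near the end of the file, directly before the %%EOF marker. The following is a minimal sketch of how a tool could read it; it is not taken from JHOVE. The class {{StartXrefReader}} and the 1024-byte search window are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for locating the cross-reference table; not part of JHOVE. */
final class StartXrefReader {

    // "startxref", the decimal byte offset, then the end-of-file marker.
    private static final Pattern START_XREF =
        Pattern.compile("startxref\\s+(\\d{1,18})\\s+%%EOF");

    /** Returns the offset following the last "startxref" in the file tail, or -1 if none is found. */
    static long lastStartXref(byte[] data) {
        int from = Math.max(0, data.length - 1024); // window size is an assumption
        String tail = new String(data, from, data.length - from, StandardCharsets.ISO_8859_1);
        Matcher m = START_XREF.matcher(tail);
        long offset = -1;
        while (m.find()) {
            // Keep the last match: updated files may contain several trailers.
            offset = Long.parseLong(m.group(1));
        }
        return offset;
    }
}
```

A result of -1 is typical for files whose upload was interrupted, because the end of the file, and with it the trailer, is missing.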
h5. Invalid PDF Trailer (+example) (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| \\ | Very often the upload of a PDF has been interrupted and the last part is missing, so no %%EOF marker can be found. \\ | | | [PDF example|^567147525.pdf]\\ |
h5. Invalid ID in trailer
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h5. No PDF Trailer
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h4. Page Tree & Pages
h5. Improperly constructed page tree (+example) (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h5. Malformed MediaBox in page tree (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h5. Document page tree not found (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h5. Invalid page label sequence (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h5. Invalid Page tree node (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h5. Problem with page label structure (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h5. Bad page labels (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h5. Invalid page label info (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h4. PDF Objects
In general the PDF format supports 8 object types plus one special object type (so 9 all in all). Six are scalar types (containing only one value) and three are container types that can contain more than one value: dictionary, array and stream. There are tools from Adobe which can be used for object analysis: [https://blog.idrsolutions.com/2009/04/viewing-pdf-objects/]
# *Boolean Objects:* true / false
# *Numeric Objects:* integer or real numbers
# *String Objects:* A sequence of 8-bit bytes which represents text: literal strings, hexadecimal strings. PDF 1.7 allows for text strings, PDFDocEncoded strings, ASCII strings and byte strings.
# *Name Objects:* A sequence of characters introduced by a slash ("/"). Spaces and certain delimiter characters are not allowed in names, but they can be represented by using the corresponding hexadecimal code instead.
# *Array Objects:* only one-dimensional arrays. All object types are possible in an array, even other arrays. Always written with \[ \] .
# *Dictionary Objects:*
# *Stream Objects:* A sequence of bytes that can be of unlimited length, in contrast to String Objects. A Stream Object always begins with a dictionary that describes the byte sequence (length, filters, decoding parameters), followed by the stream itself, which is enclosed between "stream" and "endstream". Here is an example:
2 0 obj
<</Length 39>>
stream
BT
/F1 12 Tf
72 712 Td (A short text stream.) Tj
ET
endstream
endobj
*8. * *Null Object:* In some places it is recommended to delete an object entirely rather than setting it to null. The JHOVE code contains many "== null" checks, which often lead to an exception when they match.
*9. * *Indirect Object:*
h5. PdfMalformedException: Invalid name tree Offset: 541014 (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| thrown at runtime, not in source code \\ | can occur more than once in one PDF \\ | | | |
h5. java.lang.ClassCastException: PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | This does not show up in the GUI, only in the Java library version. \\
I have a long German explanation which I can translate someday. \\
Seems to be a JHOVE bug and not a real PDF error. \\ | | | example from the [BSB |^grid-system.pdf]\\
another example in a forum [PDF|https://wiki.dnb.de/sourceforge.net/p/jhove/bugs/_discuss/thread/8d3d4539/e700/attachment/test.pdf] |
h5. Improperly nested array delimiters (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [Parser|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java] line 109 \\ | If a certain value is less than 0 this is an indicator for a wrong order. \\ | | | |
h5. Invalid object definition (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [class Parser|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java]\\ | | | | [PDF|https://wiki.dnb.de/download/attachments/93783881/corruptionOneByteMissing.pdf?version=1&modificationDate=1400574262000] from the Cabinet of Horrors \\ |
h5. Malformed filter (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfStream|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PdfStream.java] line 204 \\ | A filter has to be either an instance of the PdfDictionary or of the PdfArray. Otherwise, it is malformed. (To my humble understanding, needs to be checked.) \\ | | | |
h5. Improper nesting of object streams (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 2408 \\ | | | | |
h4. Annotations
All annotations need to be well-formed. This is quite similar to the definition of well-formed XML, but as XML is usually far less complex, it is easier to tell and to parse.
Example:
22 0 obj
<< /Type /Annot
/Subtype /Text
/Rect \[266 116 430 204\]
/Contents (The quick brown fox jumped over the lazy dogs.)
>>
endobj
h5. Invalid Annotation property (+example) (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 3159 \\ | | | | [PDF|https://wiki.dnb.de/download/attachments/93783881/externalLink.pdf?version=1&modificationDate=1400574455000] (from the Cabinet of Horrors) \\ |
h5. Invalid annotation list (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 2780 \\ | | | | |
h4. Invalid characters, syntactic errors
h5. Invalid character in hex string (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class Literal, line [360|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Literal.java], and \\
class Tokenizer, line [820|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Tokenizer.java]. | There is an if/else which tests which hex values are allowed/valid and which are not. \\
Invalid values lead to an invalid PDF. \\ | | | The NLNZ has an example but it's not possible to share it. \\ |
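The kind of test described in the row above can be sketched as follows. This is not the code from the Literal or Tokenizer class. The class {{HexStringCheck}} is hypothetical, and it assumes the rule of the PDF specification that a hex string (the part between "<" and ">") may contain only the digits 0-9, the letters A-F or a-f, and white space, which is ignored.

```java
/** Hypothetical validator for the body of a PDF hex string; not part of JHOVE. */
final class HexStringCheck {

    /** Returns the index of the first invalid character, or -1 if the body is valid. */
    static int firstInvalidIndex(String body) {
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (!isHexDigit(c) && !isPdfWhitespace(c)) {
                return i;
            }
        }
        return -1;
    }

    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'F') || (c >= 'a' && c <= 'f');
    }

    // White-space characters of the PDF syntax: NUL, TAB, LF, FF, CR, SPACE.
    private static boolean isPdfWhitespace(char c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }
}
```

Returning the position instead of a plain boolean makes it easier to report where in the string the problem occurs.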
h4. Issues with Colour-Management
The PDF format works with image data streams and not with image file formats. The most important filters/compressions are:
* 1-bit data: fax compression (Group 3 or 4), [JBIG2|http://de.wikipedia.org/wiki/JBIG2]
* greyscale, RGB or CMYK data: *JPEG, JPEG2000* (*{_}DCTDecode is the filter JPEG uses{_}*)
* usable for all kinds of image data: *ZIP*
* alternatively, [LZW|http://de.wikipedia.org/wiki/Lempel-Ziv-Welch-Algorithmus#Patente] can be used; this is not possible in PDF/A-1 as the patent did not expire before 2004
* *RLE (Run Length Encoding)* is possible, but is usually not used because it is not efficient
It is possible to embed in a PDF the same kind of data stream that a JPEG or JPEG2000 file would use. Only the data stream that carries the image itself is used; no further information such as metadata is added to it.
A TIFF image would be stored in a PDF e.g. as a JPEG; a TIFF itself cannot be embedded one-to-one in a PDF (which is possible with a JPEG).
h5. Compression method is invalid or unknown to JHOVE (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 2454 | try/catch for a "ZipException". Is ZIP the only kind of compression JHOVE knows? But there should be 4 other ones for image data streams. \\ | | | |
h4. Interactive Content
Interactive content often depends on external information, which can lead to problems and limited functionality. Sometimes fill-in forms are presented differently.
h4. Password-protected PDF files
In general, JHOVE can deal with password-protected PDF files. Password protection does not lead to invalidity (exception: PDF/A). The boolean value "_encrypted" is simply set to true. Some JHOVE versions even return this value in the output (the German National Library's version does, mine does not). So it should be possible to use JHOVE just to detect password protection, but of course JHOVE might be too "big" for such a relatively small task.
h4. Miscellaneous
h5. Lexical Error (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Not directly found in source code. (check "TokenMgrError") \\ | | | | |
h5. java.lang.NullPointerException (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | Can occur whenever some needed object is null. \\ | Too generic to be able to determine the impact for this error in general, depends on the occasion. \\ | | |
h5. java.lang.OutOfMemoryError (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | The PDF might be perfectly valid, there is just too much space needed to validate \\ | | |
A possible reason might be a very big dictionary caused by a very large number of images. 10,000 images are no problem, but an unlimited number of images can lead to problems if the PDF is built from very many images. (There is a nice use case from the German National Library, which we can probably borrow.)