
Page description: This page is intended to capture error messages encountered during testing with JHOVE. It also contains some general PDF knowledge.
\\
{toc:maxLevel=5}
//TODO: at the moment all the links to the source code lead to the CarlWilson-Version of JHOVE. Eventually, this should be changed to the openpreserve-Version of JHOVE.
//TODO: Maybe there could be some kind of impact gamut?
//TODO: Of course the possible cures have to be tested. Sometimes I just have to guess.
//TODO: The explanations can evolve to be much better, this is only a first try.
h1. PDF Module
Please note that the PDF module of JHOVE has not (yet) been updated beyond PDF 1.6. This is why JHOVE cannot determine the validity of PDF 1.7 (and higher) with certainty, although its result still gives a good indication. For the same reason, JHOVE cannot really deal with PDF/A-2, as it is based on PDF 1.7.
JHOVE can throw two different kinds of exceptions: a _PdfMalformedException_ and a _PdfInvalidException_.
h3. Well-Formedness
To be considered as well-formed by JHOVE, a PDF must consist of:
* a PDF Header (starts with %PDF and the version of the PDF)
* an EOF tag (end-of-file tag, %%EOF)
* a body consisting of well-formed objects
* a cross-reference table
* a trailer defining the cross-reference table size
* an indirect reference to the document catalog dictionary
h3. Validity
A valid PDF must be well-formed. In addition, it has to fulfill the following criteria:
* The file is well-formed.
* The document structure conforms to the specification. This includes (when present) outlines, pages, the page label tree, attributes, resources, role maps, name trees...
* Version information in the document catalog dictionary, if present, is properly formed.
* Dates are properly formed.
* File specifications are properly formed.
* Any annotations are properly formed.
* Any ArtBox, BleedBox, MediaBox and TrimBox items are PDF rectangles.
* XMP data, if present, are well-formed.
h2. Dictionary
Dictionary: "collections of objects indexed by Names" (Wikipedia)
A PDF file consists of PDF objects, which are referenced from dictionaries. A PDF dictionary is enclosed in double angle brackets: << ... >>. An example: <</Subtype /Type1C/Length 886/Filter /FlateDecode>>
A PDF dictionary entry consists of a key-value pair. The key always comes first; it is a name introduced by a slash ("/") and is followed by the value. Keys described in the PDF specification include:
* /LastChar: takes a number as a value
* /BaseFont: takes a name as a value
* /Type
* /Encoding
* /Subtype
* /Filter
* /FontDescriptor: points to another object
In theory, it is possible to add custom keys, but these are then ignored by Acrobat Reader. For long-term availability, this does not seem to be a good idea anyway.
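The key/value structure described above can be illustrated with a minimal Python sketch. This is not how JHOVE parses dictionaries. The function name {{parse_flat_dictionary}} is made up for this page, and the sketch only handles flat dictionaries whose values are names or numbers (no strings, arrays, nested dictionaries or indirect references).

```python
import re

# A token is either a name (a slash followed by regular characters) or a
# bare token such as a number. Strings, arrays and nested dictionaries are
# deliberately not supported in this sketch.
TOKEN = re.compile(r"/[^\s/<>\[\]()]*|[^\s/<>\[\]()]+")


def parse_flat_dictionary(text):
    """Split a flat PDF dictionary into a {key: value} mapping of strings."""
    text = text.strip()
    if not (text.startswith("<<") and text.endswith(">>")):
        raise ValueError("dictionary must be enclosed in << and >>")
    tokens = TOKEN.findall(text[2:-2])
    if len(tokens) % 2 != 0:
        # Every key needs exactly one value.
        raise ValueError("odd number of objects: %d" % len(tokens))
    entries = {}
    for key, value in zip(tokens[0::2], tokens[1::2]):
        if not key.startswith("/"):
            raise ValueError("key is not a name: %r" % key)
        entries[key] = value
    return entries


example = "<</Subtype /Type1C/Length 886/Filter /FlateDecode>>"
print(parse_flat_dictionary(example))
```

The sketch rejects input with an odd number of tokens, because a key without a value (or a value without a key) cannot form a pair.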
h4. Missing dictionary in document node (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [line 104 in DocNode class|https://github.com/carlwilson/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/DocNode.java] | This exception is thrown if there is no PdfDictionary. The code checks whether "_dict == null" and throws the error if it is null (i.e. not present). As a PDF dictionary is mandatory for a well-formed PDF, this error means the PDF is not well-formed. | A missing PDF dictionary is a real defect that should not be accepted. | Is it possible to build a PDF dictionary after the fact? Maybe iText can fix it. We (at ZBW) have an iText tool that simply copies each page into a new PDF. This procedure repairs the PDF structure, and I would guess that it builds a brand-new PDF dictionary for the PDF. I do not have an example at hand, though, so I cannot check. | |
h4. Invalid page dictionary (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line 2846 of the class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java]. | | | | |
h4. Annotation dictionary missing required type (S) entry (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line 3097 of the class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java]\\
Thrown like this: \\
throw new PdfMalformedException ("Annotation dictionary " + "missing required type (S) entry"); | | | | |
h4. Invalid page dictionary object (PdfMalformedException)
|| Source Code || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line 1708 of the class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java]. | | | | |
h4. No document catalog dictionary (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule line 1347|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] String "nocat" | If the catalog entry == null, the error is thrown and the well-formed status is set to false (SetWellFormed). \\ | | | We are allowed to use and share this [PDF|^grid-system.pdf]; the producer has provided it as an example. \\ |
h4. Malformed dictionary (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [Parser in line 364|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java]\\
\\ | If the error is thrown in the catch block, no further information is available. \\
Otherwise, the message is stored as the string "invalidDict" and the error can read "invalidDict" + some details. \\
One example is a vector with an odd number of objects, which produces the following error message: \\
Malformed dictionary: Vector must contain an even number of objects, but has 29 | | | |
h4. Malformed outline dictionary (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [PdfModule in line 3840|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java]\\
but the code seems to be commented out? \\ | | | | |
h4. Improperly nested dictionary delimiters (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| line 100 of the class[Parser|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java]. | If a certain value is less than 0, something about the order is wrong and the error is thrown \\ | | | |
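The "value less than 0" check can be pictured as a depth counter that is incremented on every opening delimiter and decremented on every closing one. The following Python sketch is an illustration only, not the JHOVE code: the names {{check_nesting}}, {{dict_depth}} and {{array_depth}} are hypothetical, and delimiters inside strings or streams are not treated specially.

```python
import re

# Only the four delimiter tokens are of interest here.
DELIMITER = re.compile(r"<<|>>|\[|\]")


def check_nesting(text):
    """Return None if no depth counter drops below 0, else an error message."""
    dict_depth = 0
    array_depth = 0
    for match in DELIMITER.finditer(text):
        token = match.group()
        if token == "<<":
            dict_depth += 1
        elif token == ">>":
            dict_depth -= 1
            if dict_depth < 0:
                # A dictionary was closed that had never been opened.
                return "Improperly nested dictionary delimiters"
        elif token == "[":
            array_depth += 1
        else:
            array_depth -= 1
            if array_depth < 0:
                # An array was closed that had never been opened.
                return "Improperly nested array delimiters"
    return None


print(check_nesting("<< /Kids [1 0 R 2 0 R] >>"))
print(check_nesting("<< /Length 39 >> >>"))
```

A counter that goes negative means a closing delimiter appeared before its opening counterpart, which is the wrong order mentioned in the explanation.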
h4. Invalid outline dictionary item (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [PdfModule in line 3858|https://github.com/openpreserve/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3858] \\ | | | | \\ |
h4. Expected dictionary for font entry in page resource (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| not found in source code | | | | Cabinet of Horrors Sample has a [PDF example|^test_fontArialNotEmbedded.pdf] that can be openly used |
h4. Root entry missing in cross-ref stream dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 1050 | error thrown if root entry == null \\
Example for a root entry: \\
<</Root 335 0 R/Info 333 0 R/ID\[\]/Size 347/Prev 37150797>> | | | |
h4. Invalid Prev offset in trailer dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 1094 \\ | An if/else checks whether some value is less than 0 \\ | | | |
h4. Invalid Size entry in trailer dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 1115 \\ | If the value is either less than 0 or greater than 8388607, this exception is thrown. \\
This upper bound apparently comes from the implementation limits in Appendix C of the PDF Reference, which a PDF/A file is not allowed to exceed. \\ | | | |
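The range check described in this row can be written down in a few lines. This is a sketch in Python rather than the Java code of JHOVE; the names {{MAX_SIZE}} and {{is_valid_size_entry}} are made up for illustration, and only the two bounds are taken from the explanation above.

```python
# Upper bound quoted in the explanation above.
MAX_SIZE = 8388607


def is_valid_size_entry(size):
    """Return True if a trailer /Size value lies within the accepted range."""
    return isinstance(size, int) and 0 <= size <= MAX_SIZE


print(is_valid_size_entry(347))
print(is_valid_size_entry(-1))
print(is_valid_size_entry(8388608))
```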
h4. Size entry missing in trailer dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 1129 \\ | | | | |
h4. Trailer dictionary Info key is not an indirect reference (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 1138 \\ | | | | |
h4. Annotation object is not a dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 2752 \\ | | | | |
h4. Invalid algorithm value in encryption dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 1574 \\ | | | | |
h4. Outline dictionary missing required entry (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] in line 3811, but the code seems to be commented out? \\ | | | | |
h4. Invalid dictionary data for page (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class[PageObject in line 74|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageObject.java] as String "badPageStr" \\ | if entries in the dictionary are == null, this error is thrown \\ | | | |
h4. Invalid Names dictionary (invalid and/or malformed)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 1453 | | | | |
h4. Invalid Dests dictionary (invalid and/or malformed)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 1491 | | | | |
h4. Missing expected element in page number dictionary (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| 182 in class [PageLabelNode|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageLabelNode.java]\\ | if the PdfArray object == null this error is thrown \\ | | | |
h4. Invalid destination object (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class[Destination|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Destination.java] in line 93 | There are several valid destination objects: \\
An unnamed, direct destination, which refers to the page object. \\
An unnamed, indirect destination, which refers to a named, direct destination, which refers to the page object. \\
\\
If the object is neither a PdfArray nor a PdfDictionary, this error is thrown. It can occur more than once in one PDF file. \\ | | | |
h4. Invalid Resources Entry in document (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line[102 in DocNode|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/DocNode.java]\\
\\
saved as string "invres" \\ | Can be thrown in 2 cases. If the entries are null (i.e. not there), this leads to a different error instead ("Missing dictionary in document node"). \\ | | | |
h4. Improperly formed date (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Class PdfModule [line 4099|https://github.com/openpreserve/jhove/blob/e573f424184c1b12c0445955ee79f559e94cf554/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java]\\ | For example, this date is not syntactically correct: \\
/CreationDate (Freitag, 11. Dezember 1998 14:47) \\
These would be correct: \\
XMP: \\
\\
<xmp:CreateDate>2014-12-09T10:25:10+01:00</xmp:CreateDate> \\
<xmp:ModifyDate>2014-12-09T10:25:10+01:00</xmp:ModifyDate> \\
Document information dictionary: \\
<</Keywords()/ModDate(D:20141209102510+01'00')/CreationDate(D:20141209102510+01'00')/Producer(iText® 5.1.0 ©2000-2011 1T3XT BVBA)/Author(Andreas Knorr)/Title(Diskussionspapier Nr. 12)>> | | After a "cure", the information about the creation date may be lost if the original PDF contains no XMP metadata. \\
The date is written "badly enough" that some tools might not recognize the entry and therefore do not carry it over into the new / corrected PDF. \\ | Use [this|https://econstor.eu/dspace/obitstream/10419/31712/1/605028710.PDF] as a reference for now, but eventually find (or build) a better example. \\ |
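To make the difference between the two date examples above concrete, here is a small Python sketch that checks a string against the usual PDF date layout D:YYYYMMDDHHmmSSOHH'mm'. It is a simplified illustration and not the check JHOVE performs: the name {{is_pdf_date}} is hypothetical, the "D:" prefix is required here, and calendar details such as the number of days per month are not verified.

```python
import re

# Everything after the year is optional, but a later field may only be
# present if all earlier fields are present as well.
PDF_DATE = re.compile(
    r"""
    D:
    \d{4}                                   # year
    (?:(?:0[1-9]|1[0-2])                    # month
       (?:(?:0[1-9]|[12]\d|3[01])           # day
          (?:(?:[01]\d|2[0-3])              # hour
             (?:[0-5]\d                     # minute
                (?:[0-5]\d)?                # second
             )?
          )?
       )?
    )?
    (?:Z|[+-](?:[01]\d|2[0-3])'[0-5]\d'?)?  # time zone
    """,
    re.VERBOSE,
)


def is_pdf_date(value):
    """Return True if the value follows the simplified PDF date layout."""
    return PDF_DATE.fullmatch(value) is not None


print(is_pdf_date("D:20141209102510+01'00'"))
print(is_pdf_date("Freitag, 11. Dezember 1998 14:47"))
```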
h4. Invalid outline dictionary object (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| not found in source code \\ | | | | |
h3. Fonts
Non-embedded fonts are one of the biggest risks to the correct rendering of PDF files. If one of the fonts used is not embedded in the PDF and the rendering device does not have the font, the PDF might not be rendered as the producer intended. This can lead to missing text, gaps within words or shifted text. In the worst case, part of the text can no longer be displayed correctly.
Some fonts cannot be embedded for copyright reasons. Furthermore, there can be name conflicts: somebody saves a font as "myfont" and does not embed it; the rendering device also has a font named "myfont" and uses it to render the text, although it is in fact a very different font and changes the visual impression of the PDF considerably.
ISO 32000 does not require fonts to be embedded, so a non-embedded font does not necessarily lead to an invalid PDF. PDF/A is different: every font used has to be embedded.
Therefore, a perfectly valid PDF can still be at risk for long-term availability if its fonts are not embedded. Here is an extreme example (from a slide shown at the PDF Days in Basel in 2012): !fonts_notembedded.jpg|border=1!
h4. Invalid Font entry in Resources (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [DocNode, line 138|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/DocNode.java] | A try-catch block catches the case that something is amiss with the entries in the PdfDictionary. \\ | | | |
h4. Unexpected error in findFonts (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line [2248 in PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | | | | |
h4. Too many fonts to report; some fonts omitted (Info Messages)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | The limit appears to be 1000 different fonts in one PDF. \\ | | | |
h3. Cross-Reference Table
The cross-reference table indexes all the objects in the PDF file. Each entry is a byte offset, i.e. the exact number of bytes from the beginning of the file to the start of the object. This is useful because software can find an object within the PDF file without having to scan the whole file. It works like an exact address within the PDF file.
In contrast, this is not possible with a TIFF file, which is not linearised; this is why a TIFF file cannot be streamed.
An example for a cross-reference table:
xref
334 13
0000000023 00000 n
0000000547 00000 n
0000001140 00000 n
0000001328 00000 n
0000002384 00000 n
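How such a table can be read is sketched below in Python. This is an illustration under simplifying assumptions, not the JHOVE implementation: the function name {{parse_xref_section}} is hypothetical, the table is expected to contain a single subsection, and each entry is expected to be a 10-digit offset, a 5-digit generation number and one of the keywords "n" (in use) or "f" (free).

```python
import re

# One entry: 10-digit byte offset, 5-digit generation number, keyword n or f.
ENTRY = re.compile(r"^(\d{10}) (\d{5}) ([nf])$")


def parse_xref_section(lines):
    """Parse 'xref', the subsection header and the entries that follow.

    Returns (first_object_number, announced_count, entries), where entries
    maps object numbers to (offset, generation, in_use) tuples.
    """
    if len(lines) < 2 or lines[0].strip() != "xref":
        raise ValueError("missing xref keyword or subsection header")
    first, count = (int(part) for part in lines[1].split())
    entries = {}
    for index, line in enumerate(lines[2:]):
        match = ENTRY.match(line.strip())
        if match is None:
            raise ValueError("illegal xref entry: %r" % line)
        offset, generation, keyword = match.groups()
        entries[first + index] = (int(offset), int(generation), keyword == "n")
    return first, count, entries


sample = """xref
334 13
0000000023 00000 n
0000000547 00000 n
0000001140 00000 n
0000001328 00000 n
0000002384 00000 n""".splitlines()

first, count, entries = parse_xref_section(sample)
print(first, count, len(entries))
print(entries[334])
```

The function returns the announced entry count separately, so a caller can compare it with the number of entries actually found (the excerpt above announces 13 entries but lists only 5).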
h4. Invalid cross-reference table
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule Line 1022|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | | | | |
h4. Invalid object number in cross-reference stream (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| line 1228 class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | | | | [PDF|https://wiki.dnb.de/download/attachments/93783881/embedded_video_avi.pdf?version=1&modificationDate=1400574373000] [Files|https://wiki.dnb.de/download/attachments/93783881/webCapture.pdf?version=1&modificationDate=1400574598000] in Cabinet of Horrors \\ |
h4. Illegal operator in xref table (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| line 1323 in class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | The legal operators seem to be "n" and "f". \\ | | | |
h3. (XMP-)Metadata
XMP stands for Extensible Metadata Platform.
XMP is based on XML. XMP metadata can be found not only in PDF but also in TIFF, JPEG and other file formats. The most popular XMP schema is Dublin Core, but there are others as well. XMP metadata has been possible since PDF 1.4; earlier versions should not contain XMP metadata. Adobe provides an SDK (Software Development Kit) for working with XMP.
PDF/A requires certain XMP metadata; Preflight can usually fix that easily.
h4. Invalid or ill-formed XMP-metadata (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 1757 | | | | |
h3. File-Header
The header usually consists of one or two lines. The first line is mandatory and can look like this: %PDF-1.7
The first four bytes have to be "%PDF", which makes it easy to check whether a file is a PDF: you only have to read the first four bytes.
The data that follows in the header usually signals to applications such as email clients or file-transfer software that the file contains binary data and not just plain ASCII text.
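The header check can be sketched in Python as follows. This is a hedged illustration, not what JHOVE does internally: the function name {{find_pdf_header}} and the search window of 1024 bytes are assumptions of this sketch. A returned offset greater than 0 means that there are extra bytes in front of the header.

```python
import re

# The header marker followed by a one-digit major and minor version.
HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def find_pdf_header(data, search_window=1024):
    """Return (offset, version) of the first %PDF-x.y marker, or None."""
    match = HEADER.search(data[:search_window])
    if match is None:
        return None
    version = "%s.%s" % (
        match.group(1).decode("ascii"),
        match.group(2).decode("ascii"),
    )
    return match.start(), version


print(find_pdf_header(b"%PDF-1.7\n1 0 obj\n"))
print(find_pdf_header(b"junk\n%PDF-1.4\n1 0 obj\n"))
print(find_pdf_header(b"<?xml version='1.0'?><error/>"))
```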
h4. Invalid Version in document catalog (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 1447 | If the header and the dictionary do not show the same version, only an InfoMessage is shown. But the catch-block throws an error. \\ | | | |
h4. No PDF Header (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Not found in source code | | | | [PDF example|^fonts_notembedded.jpg]\\
This file is in fact an XML file; the user did not realise that the PDF had not been downloaded. \\
\\
[PDF example 2 |^CERN-2005-009.pdf]\\
The PDF renders fine; however, there are some extra bytes before the PDF header, which make the header invalid. \\ |


h3. Trailer
The trailer is described as the part "_that specifies the location of some special objects (amongst which the cross-reference table)_", or: "_The trailer contains the location (byte position) of the cross-reference table, as well as some other special objects._"
The structure looks like this:
<</Size s /Root r v R ... any other data >>
startxref #
498
%%EOF
s = number of entries (objects) in the xref table
Root = root node of the PDF file
r = object number of the referenced object
v = version (generation number) of the object
startxref # gives the byte offset at which the xref table starts; the EOF tag follows afterwards.
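The end of the file can be inspected with a few lines of Python. The sketch below is an illustration only: the function name {{read_startxref}} and the tail size of 1024 bytes are assumptions and are not taken from JHOVE. It returns the byte offset that follows "startxref" and returns None if the tail of the file is incomplete, for example after an interrupted upload.

```python
import re

# 'startxref', the byte offset of the xref table, then the %%EOF marker.
TAIL = re.compile(rb"startxref\s+(\d+)\s+%%EOF\s*$")


def read_startxref(data, tail_size=1024):
    """Return the offset given after 'startxref', or None if it is missing."""
    match = TAIL.search(data[-tail_size:])
    if match is None:
        return None
    return int(match.group(1))


complete = b"trailer\n<</Size 347/Root 335 0 R>>\nstartxref\n498\n%%EOF\n"
truncated = b"trailer\n<</Size 347/Root 335 0 R>>\nstartxref\n49"

print(read_startxref(complete))
print(read_startxref(truncated))
```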
h4. Invalid PDF Trailer (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| \\ | Very often the upload of the PDF was interrupted and the last part is missing, so no %%EOF can be found. \\ | | | [PDF example|^567147525.pdf]\\ |
h4. Invalid ID in trailer
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| line 1146 in [PdfModule|https://github.com/openpreserve/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java]\\ | | | | |
h4. No PDF Trailer
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | Cannot be repaired (I guess), because the PDF is not complete \\ | |
h3. Page Tree & Pages
h4. Improperly constructed page tree (PdfMalformedException)
TODO: There is more info in the german wiki which has to be translated.
This comes from the Java class [PageTreeNode|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageTreeNode.java].
This [example PDF|https://wiki.dnb.de/download/attachments/93783881/ImproperlyConstructedPageTree.pdf?version=1&modificationDate=1400575699000] can be used as an example, as it was created specifically for testing purposes. It produces the error "improperly constructed page tree" twice and no other error messages, and JHOVE classifies it as "not well-formed".
During the [PDF Hackathon of the OPF|http://openplanetsfoundation.org/blogs/2014-09-03-my-first-hackathon-hacking-pdf-files] (Open Planets Foundation), held together with the ZBW (German National Library of Economics) and Goportis (Leibniz Library Network for Research Information) in Hamburg, Olaf Drümmer of the PDF Association pointed out an interesting false negative error message from JHOVE.
The pages of a PDF file are usually stored in a page [tree|http://en.wikipedia.org/wiki/Tree_%28data_structure%29] so that a particular page can be reached as quickly as possible. This tree is often built as a balanced page tree. Although the PDF standard points out this possibility, it does not prescribe it in any way.
The pages can also be stored in a simple array of pages, which also conforms to the PDF standard. It is merely less efficient for page access (poorer performance), especially for PDF files with a very large number of pages. JHOVE, however, reports an error if the pages are stored in an array instead of a page tree. As this is not an error and poses no risk for digital long-term preservation, this message can be ignored.
Quote from the PDF standard (ISO 32000-1 aka PDF 1.7), section 7.7.3 Page Tree / 7.7.3.1 General:
"NOTE The simplest structure can consist of a single page tree node that references all of the document’s page objects directly. However, to optimize application performance, a conforming writer can construct trees of a particular form, known as balanced trees. Further information on this form of tree can be found in Data Structures and Algorithms, by Aho, Hopcroft, and Ullman (see the Bibliography)."
So the standard merely notes, for information, that page trees are useful. However, one has to consult sources outside the PDF standard to learn about page trees (a source is given). It is in no way mandatory to use page trees. No specific threshold is stated. As far as I recall, Leonard Rosenthol mentions 50 in his monograph (Developing with PDF: Dive Into The Portable Document Format by Leonard Rosenthol, page 24); Olaf Drümmer reported that an Adobe employee told him about a test in which they arrived at 64, but this depends heavily on the material. It is safe to assume that the value is roughly in that range; it is certainly not 5 or 1000.
Other PDF files with this error message also show the peculiarity that the error message appears twice. This could possibly be traced in the source code.
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | | | |
h4. Malformed MediaBox in page tree (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line 154 in [DocNode|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/DocNode.java]\\ | The value has to be a rectangle: \\
PDF Rectangle: Any ArtBox, BleedBox, MediaBox and TrimBox must be compliant PDF rectangles, e.g. /Rect \[2 3 4 5\], which specifies the X and Y coordinates of the lower-left and upper-right corners of the rectangle. \\ | | | |
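What "must be a PDF rectangle" means can be sketched in Python: the value has to be an array of exactly four numbers. The function name {{parse_rectangle}} is hypothetical, and the sketch only accepts direct numbers (no indirect references), so it is stricter than a real PDF parser and is not the check JHOVE performs.

```python
import re

# An integer or real number in PDF notation, e.g. 2, -3, 595.28 or .5
NUMBER = r"[+-]?(?:\d+\.?\d*|\.\d+)"
RECTANGLE = re.compile(
    r"\[\s*(%s)\s+(%s)\s+(%s)\s+(%s)\s*\]" % ((NUMBER,) * 4)
)


def parse_rectangle(text):
    """Return the four coordinates as floats, or None if not a rectangle."""
    match = RECTANGLE.fullmatch(text.strip())
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())


print(parse_rectangle("[2 3 4 5]"))
print(parse_rectangle("[0 0 595.28 841.89]"))
print(parse_rectangle("[2 3 4]"))
```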
h4. Document page tree not found (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line 1703 in [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | The error is thrown if \_pagesDictRef == null. \\
The variable is initialised with null but should be assigned a value later on. \\ | | | |
h4. Invalid page label sequence (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| line 2893 [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | | | | |
h4. Invalid Page tree node (PdfInvalidException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [PageTreeNode|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PageTreeNode.java] | | | | |
h4. Problem with page label structure (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line 2941 of the class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | | | | |
h4. Bad page labels (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line 2655 of the class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | | | | |
h4. Invalid page label info (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Line 2735 of the class [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] | | | | |
h3. PDF Objects
In general, the PDF format supports 8 object types plus one special object type (9 in total). Six are scalar types (they contain only one value/object) and three are container types that can contain more than one value: dictionary, array and stream. There are tools from Adobe which can be used for object analysis, see [https://blog.idrsolutions.com/2009/04/viewing-pdf-objects/].
# *Boolean Objects:* true / false
# *Numeric Objects:* integer or real numbers
# *String Objects:* a sequence of 8-bit bytes that represents text: literal strings and hexadecimal strings. PDF 1.7 allows for text strings, PDFDocEncoded strings, ASCII strings and byte strings.
# *Name Objects:* a sequence of characters introduced by a slash ("/"). Spaces and certain delimiter characters are not allowed in names, but can be represented by using the corresponding hexadecimal code instead.
# *Array Objects:* only one-dimensional arrays. An array can contain objects of any type, even other arrays. Always written with \[ \] .
# *Dictionary Objects:*
# *Stream Objects:* a sequence of bytes that can be of unlimited length, in contrast to string objects. A stream object always begins with a dictionary that describes the byte sequence (length, filters, decoding parameters), followed by the stream data, which is enclosed between "stream" and "endstream". Here is an example:
2 0 obj
<</Length 39>>
stream
BT
/F1 12 Tf
72 712 Td (A short text stream.) Tj
ET
endstream
endobj
*8. * *Null Object:* Some sources recommend deleting an object entirely rather than setting it to null. The JHOVE code contains many "== null" checks, which often lead to an exception when they are true.
*9. * *Indirect Object:*
h4. PdfMalformedException: Invalid name tree Offset: 541014 (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| thrown at runtime, not in source code \\ | can occur more than once in one PDF \\ | | | |
h4. java.lang.ClassCastException: PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | This is not shown in the GUI, only in the Java library version. \\
I have a long German explanation which I can translate someday. \\
This seems to be a JHOVE bug and not a real PDF error. \\ | | | example from the [BSB |^grid-system.pdf]\\
another example in a forum [PDF|https://wiki.dnb.de/sourceforge.net/p/jhove/bugs/_discuss/thread/8d3d4539/e700/attachment/test.pdf] |
h4. Improperly nested array delimiters (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class [Parser|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java] line 109 \\ | If a certain value is less than 0 this is an indicator for a wrong order. \\ | | | |
h4. Invalid object definition (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [class Parser|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Parser.java]\\ | | | | [PDF|https://wiki.dnb.de/download/attachments/93783881/corruptionOneByteMissing.pdf?version=1&modificationDate=1400574262000] from the Cabinet of Horrors \\ |
h4. Malformed filter (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfStream|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/PdfStream.java] line 204 \\ | A filter has to be either an instance of the PdfDictionary or of the PdfArray. Otherwise, it is malformed. (To my humble understanding, needs to be checked.) \\ | | | |
h4. Improper nesting of object streams (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 2408 \\ | | | | |
h3. Annotations
All annotations need to be well-formed. This is quite similar to the definition of well-formed XML, but as XML is usually far less complex, it is easier to check and to parse.
Example:
22 0 obj
<< /Type /Annot
/Subtype /Text
/Rect [266 116 430 204]
/Contents (The quick brown fox jumped over the lazy dogs.)
>>
endobj
h4. Invalid Annotation property (with example) (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 3159 \\ | | | | [PDF|https://wiki.dnb.de/download/attachments/93783881/externalLink.pdf?version=1&modificationDate=1400574455000] (from the Cabinet of Horrors) \\ |
h4. Invalid annotation list (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 2780 \\ | | | | |
h3. Invalid characters, syntactic errors
h4. Invalid character in hex string (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| class Literal line [360|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Literal.java] and \\
class Tokenizer line [820|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Tokenizer.java]. | An if/else tests which hex values are allowed/valid and which are not. \\
Invalid values lead to an invalid PDF. \\ | | | The NLNZ has an example but it's not possible to share it. \\ |
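The kind of test described in the table above can be sketched as follows. This is only a minimal illustration and not JHOVE's actual code: the class name {{HexStringCheck}} and its methods are hypothetical. It assumes that the body of a PDF hex string (the text between < and >) may contain only the digits 0-9, the letters A-F or a-f, and white-space characters.

```java
/** Minimal sketch of a hex-string check; hypothetical, not JHOVE's actual code. */
final class HexStringCheck {

    /** Returns true if c is a hexadecimal digit (0-9, A-F, a-f). */
    static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9')
            || (c >= 'A' && c <= 'F')
            || (c >= 'a' && c <= 'f');
    }

    /** PDF white-space characters: NUL, tab, line feed, form feed, carriage return, space. */
    static boolean isPdfWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    /**
     * Returns the index of the first invalid character in the body of a
     * hex string (the text between '<' and '>'), or -1 if every character is valid.
     */
    static int firstInvalidIndex(String body) {
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (!isHexDigit(c) && !isPdfWhitespace(c)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidIndex("48656C6C 6F")); // valid body
        System.out.println(firstInvalidIndex("4865ZZ"));      // 'Z' is not a hex digit
    }
}
```

Returning the position of the offending character instead of a plain boolean makes it easier to locate the problem in the file.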
h3. Issues with Colour-Management
The PDF format works with image data streams and not with image file formats. The most important filters/compressions are:
* 1-bit data: fax compression Group 3 or 4, [JBIG2|http://de.wikipedia.org/wiki/JBIG2]
* greyscale, RGB or CMYK data: *JPEG, JPEG2000* (*{_}DCTDecode is the filter JPEG uses{_}*)
* usable for all kinds of image data: *ZIP*
* alternatively, [LZW|http://de.wikipedia.org/wiki/Lempel-Ziv-Welch-Algorithmus#Patente] can be used; this is not possible in PDF/A-1, as the patent did not expire before 2004
* *RLE (Run Length Encoding)* is possible, but is rarely used because it is not efficient
It is possible to embed in a PDF the same kind of data stream that a JPEG or JPEG2000 file uses. Only the data stream carrying the image itself is used; additional information such as metadata is not included.
A TIFF image would be stored in a PDF e.g. as a JPEG data stream; a TIFF file itself cannot be embedded one-to-one in a PDF (which is possible with a JPEG).
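The compressions listed above correspond to filter names in the PDF file. As a rough sketch, the mapping could be expressed as a lookup table. The filter names are the standard PDF ones; the class {{PdfImageFilters}} and the method {{describe}} are hypothetical and not part of JHOVE, and the table covers only the image compressions named on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical lookup table from PDF filter names to the compressions listed above. */
final class PdfImageFilters {

    static final Map<String, String> FILTERS = new LinkedHashMap<>();

    static {
        FILTERS.put("CCITTFaxDecode", "Fax compression Group 3 or 4 (1-bit data)");
        FILTERS.put("JBIG2Decode", "JBIG2 (1-bit data)");
        FILTERS.put("DCTDecode", "JPEG (greyscale, RGB or CMYK data)");
        FILTERS.put("JPXDecode", "JPEG2000 (greyscale, RGB or CMYK data)");
        FILTERS.put("FlateDecode", "ZIP (all kinds of image data)");
        FILTERS.put("LZWDecode", "LZW (not possible in PDF/A-1)");
        FILTERS.put("RunLengthDecode", "RLE (Run Length Encoding)");
    }

    /** Accepts the name with or without the leading slash used in PDF syntax. */
    static String describe(String filterName) {
        String name = filterName.startsWith("/") ? filterName.substring(1) : filterName;
        return FILTERS.getOrDefault(name, "unknown filter");
    }

    public static void main(String[] args) {
        System.out.println(describe("/DCTDecode"));
        System.out.println(describe("/SomethingElse"));
    }
}
```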
h4. Compression method is invalid or unknown to JHOVE (PdfMalformedException)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| [PdfModule|https://github.com/gmcgath/jhove/blob/master/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java] line 2454 | A try/catch handles a "ZipException". Is ZIP the only kind of compression JHOVE knows? There should be four other ones for image data streams. \\ | | | |
h3. Interactive Content
Interactive content often depends on external information, which can lead to problems and limited functionality. Sometimes fill-in forms are presented differently.
h3. Password-protected PDF files
In general, JHOVE can deal with password-protected PDF files. Password protection does not lead to invalidity (exception: PDF/A). The boolean value "_encrypted" is simply set to true. Some JHOVE versions even return this value in the output (the German National Library's version does, mine does not). So it should be possible to use JHOVE just to detect password protection, but of course JHOVE might be too "big" for such a relatively small task.
h3. Miscellaneous
h4. Lexical Error (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| Not directly found in source code. (check "TokenMgrError") \\ | | | | |
h4. java.lang.NullPointerException (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | Can occur whenever some needed object is null. \\ | Too generic to be able to determine the impact for this error in general, depends on the occasion. \\ | | |
h4. java.lang.OutOfMemoryError (thrown at runtime)
|| Source Code \\ || Explanation \\ || Impact \\ || Cure \\ || PDF Example \\ ||
| | | The PDF might be perfectly valid; validation simply needs too much memory. \\ | | |
A possible reason might be a very large dictionary caused by a very large number of images. 10,000 images are no problem, but an unlimited number of images can lead to problems if the PDF is built from very many images. (There is a nice use case from the German National Library, which we can probably borrow.)
h2. Workaround for OutOfMemoryError (Use Case German National Library)
JHOVE can run out of memory during the PDF examination. Some examples are listed in the [SourceForge Bug Reporter|http://sourceforge.net/p/jhove/bugs/?source=navbar].
h4. Very big dictionary because of too many listed pictures
The German National Library in Frankfurt has found that JHOVE exhausts the Java heap space if too many pictures are listed in the PDF dictionary. They have developed a workaround for this issue to keep Java from failing.
PDFModule.java > findImages
// DNB
// heins, 2014-10-30
if (_imagesList.size() <= DEFAULT_MAX_IMAGES) {
    _imagesList.add(prop);
}
The right value for DEFAULT_MAX_IMAGES depends on the setup. PDF/A allows 4095 entries. Tests have shown that 10,000 would also be fine. Without a limit, however, a heap space error occurs at around 1,251,900 entries. This figure surely depends on further factors; these numbers come from a test conducted by the German National Library.
(//TODO: as the DNB has agreed to share this use case, this will be described in more detail soon)
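To illustrate the idea behind the cap outside of JHOVE, here is a self-contained sketch. It is not the DNB patch itself: the class {{CappedImageList}} is hypothetical, a plain String stands in for JHOVE's image property object, and the default of 10,000 is taken from the test figures mentioned above. Unlike the DNB snippet, which compares with <=, this sketch compares with < so that the list never holds more than the configured maximum.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a size-capped image list; not the DNB patch itself. */
final class CappedImageList {

    /** Assumed default, based on the 10,000 figure mentioned in the text. */
    static final int DEFAULT_MAX_IMAGES = 10_000;

    private final List<String> imagesList = new ArrayList<>();
    private final int maxImages;
    private long skipped = 0;

    CappedImageList(int maxImages) {
        this.maxImages = maxImages;
    }

    /** Adds the entry only while the cap is not reached; counts what was dropped. */
    boolean add(String prop) {
        if (imagesList.size() < maxImages) {
            imagesList.add(prop);
            return true;
        }
        skipped++;
        return false;
    }

    int size() {
        return imagesList.size();
    }

    long skipped() {
        return skipped;
    }

    public static void main(String[] args) {
        CappedImageList list = new CappedImageList(DEFAULT_MAX_IMAGES);
        for (int i = 0; i < 1_251_900; i++) {
            list.add("image " + i);
        }
        System.out.println(list.size() + " kept, " + list.skipped() + " skipped");
    }
}
```

Counting the skipped entries is an addition of this sketch; it would allow a report to state that the image list was truncated rather than silently dropping entries.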
h1. JHOVE Metadata Extraction Errors
These JHOVE errors were found during the migration of image-based materials to Ex Libris' Rosetta by the State Library of New South Wales (SLNSW). Digital Preservation staff at the National Library of New Zealand assisted in analysing some of these errors.
h3. Metadata Extraction from TIF files
The following errors have been experienced with image-based materials:
* Technical MD Extract:Fail - Error/s returned during metadata extraction (ColorSpace value out of range: 2)
** *Error analysis:* This error occurred on a TIF file. JHOVE expected to see either the value "1" or "65535" (based on the TIFF specification). Instead, it encountered the value "2".
* Technical MD Extract:Fail - Error/s returned during metadata extraction (FocalPlaneResolutionUnit value out of range: 4)
** *Error analysis:* This error occurred on a TIF file. JHOVE expected to see a value in the range of: 1 - 3 (based on the TIFF specification). The value appearing is 4.
* Technical MD Extract:Fail - Error/s returned during metadata extraction (PhotometricInterpretation not defined,ImageWidth not defined,ImageLength not defined,Neither strips nor tiles defined,Neither strips nor tiles defined)
** *Error analysis:* This error occurred on a TIF file. The file was missing critical information, so the image did not render (however, it was not clear from this error message that the issue would prevent the file from rendering).
* Technical MD Extract:Fail - Error/s returned during metadata extraction (Count mismatch for tag 306; expecting 20; saw 19,Failed to retrieve extractor properties)
** *Error analysis:* This error occurred on a TIF file. The field should contain 20 bytes; however, there were only 19 (and so it did not meet the ISO datetime standard).
* Technical MD Extract:Fail - Error/s returned during metadata extraction (FileSource value out of range: 77)
** *Error analysis:* This error occurred on a TIF file. File should contain the value 3 or 7 for this field. Instead it contains the value 77.
* Technical MD Extract:Fail - Error/s returned during metadata extraction (Count mismatch for tag 36867; expecting 20; saw 11,Failed to retrieve extractor properties)
** *Error analysis:* This error occurred on a TIF file. The field should contain 20 bytes; however, there were only 11, and so it did not meet the DateTimeOriginal standard for the field, as per the TIFF specification. The TIFF spec states: "When the field is left blank, it is treated as unknown."
* Technical MD Extract:Fail - Error/s returned during metadata extraction (Tag 34665 out of sequence)
** *Error analysis:* This error occurred on a TIF file. This issue has not yet been fully analysed; however, information can be found in the TIFF specification: [http://www.awaresystems.be/imaging/tiff/tifftags/exififd.html]
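Two of the error types above, count mismatches and values out of range, can be reproduced with a small sketch. It is not JHOVE's code: the class {{TiffFieldCheck}} and its methods are hypothetical, and the message texts simply mirror the ones quoted above. It assumes that a date field holds the text "YYYY:MM:DD HH:MM:SS" plus a terminating NUL byte, which gives the expected count of 20.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of two TIFF field checks; not JHOVE's actual code. */
final class TiffFieldCheck {

    /** "YYYY:MM:DD HH:MM:SS" is 19 characters; the terminating NUL makes 20. */
    static final int EXPECTED_DATE_COUNT = 20;

    /** Count of an ASCII field: the bytes of the text plus the terminating NUL. */
    static int asciiCount(String value) {
        return value.getBytes(StandardCharsets.US_ASCII).length + 1;
    }

    /** Returns an error message for a date field, or null if the count matches. */
    static String checkDateCount(int tag, String value) {
        int count = asciiCount(value);
        if (count != EXPECTED_DATE_COUNT) {
            return "Count mismatch for tag " + tag + "; expecting "
                + EXPECTED_DATE_COUNT + "; saw " + count;
        }
        return null;
    }

    /** Returns an error message if value is not among the allowed values, else null. */
    static String checkAllowed(String name, int value, int... allowed) {
        for (int a : allowed) {
            if (a == value) {
                return null;
            }
        }
        return name + " value out of range: " + value;
    }

    public static void main(String[] args) {
        System.out.println(checkDateCount(306, "2014:10:30 12:00:0"));
        System.out.println(checkDateCount(36867, "2014:10:30"));
        System.out.println(checkAllowed("ColorSpace", 2, 1, 65535));
        System.out.println(checkAllowed("FocalPlaneResolutionUnit", 4, 1, 2, 3));
    }
}
```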