
In preparation for the PDF Hackathon, I list some of the error messages I have found in my large corpus of broken PDF files. If I have some more time on my hands, I might do the same with the error messages of JHOVE and PDFTron. For Adobe Acrobat Preflight, I think the [Callas Page|http://www.callassoftware.com/callas/doku.php/de:pdfakompakt:kapitel_7] already offers a great summary and possible solutions. I have documented 48 different error messages concerning PDF/A-1b from my corpus (in German) in our own [wiki|http://zbwintern/wiki/display/dLZA/PDF+Fehlermeldungen] (not publicly viewable, but available during the event).

My sample of 174 valid and 675 invalid PDF/A files contains 995 different error messages; some of them differ only slightly and share the same number (e.g. 1.0). ([Findings File|^PdfAValidationShortSummary.txt])

Across all error messages, the totals per category are:

Errors in Syntax: 11196
Errors in Graphics: 117457
Errors in Fonts: 85993
Errors in Transparency: 60
Errors in Annotations: 2960
Errors in Action: 79
Errors in MetaData: 1520
Errors Summary: 219265

This means PDFBox has given about 325 error messages per PDF file on average. Can a tool be TOO talkative? (A single error message can easily appear several dozen times per PDF file.)
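The figure can be rechecked from the category totals listed above. The following is only a minimal sketch; the class and method names are made up for illustration, and it assumes the 675 invalid files as the divisor, which gives 219265 / 675, roughly 324.8.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: re-adds the category totals from the list above.
class ErrorTotals {

    // Category totals copied from the list above.
    static Map<String, Integer> categoryTotals() {
        Map<String, Integer> totals = new LinkedHashMap<>();
        totals.put("Syntax", 11196);
        totals.put("Graphics", 117457);
        totals.put("Fonts", 85993);
        totals.put("Transparency", 60);
        totals.put("Annotations", 2960);
        totals.put("Action", 79);
        totals.put("MetaData", 1520);
        return totals;
    }

    // Adds up all category totals.
    static int sum(Map<String, Integer> totals) {
        int sum = 0;
        for (int n : totals.values()) {
            sum += n;
        }
        return sum;
    }

    public static void main(String[] args) {
        int all = sum(categoryTotals());
        int invalidFiles = 675; // assumption: only the files with errors are counted
        System.out.printf("%d errors, %.1f per file%n", all, (double) all / invalidFiles);
    }
}
```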




h4. Apache PDFBox Error Messages found

Comments:
* Some error messages tend to appear more than once per file.
* What appears within the quotation marks is only an example and differs from file to file.

h6. 1 Syntax

1.0 : Syntax error, "Error: Expected a long type at offset 38776, instead got 'ref'" or "Syntax error, XREF for 19:0 points to wrong object: 6:0"
1.0.11 : Syntax error, Hexa string shall contain even number of non white space char
1.0.2 : Syntax error, Array too long : 10000
1.0.3 : Syntax error, Name too long
1.0.6 : Syntax error, Invalid integer range in a Number operands \|\| Numeric is too long or too small: 2147483648
1.2.1 : Body Syntax error, EOL expected before the 'endobj' keyword
1.2.10 : Body Syntax error, The operator "BX" isn't supported.
1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword
1.2.5 : Body Syntax error, Stream length is invalide

h6. 2 Graphics and Colours

2.1.2 : Invalid Graphis object, The OutputIntentCondition is missing
2.4 : Invalid Color space, Unable to read ICCBase color space. Caused by : Error: Unknown colorspace 'CalRGB'
2.4.3 : Invalid Color space, The operator "f" can't be used without Color Profile

h6. 3 Fonts

3.1.6 : Invalid Font definition, Width of the character "2" in the font program "EIOBCO+Times.New.Roman.Fett0217"is inconsistent with the width in the PDF dictionary.

h6. 4 Transparency


h6. 5 Annotation

5.1 : Missing field in an annotation definition

h6. 7 Metadata

7.1 : Error on MetaData, No type defined for \{http://ns.adobe.com/xap/1.0/\}Title

Comments:


//TODO: It might be interesting to count the error messages per PDF file and overall, to check which ones occur most often. 3.1.6 seems to be a candidate in my own sample. This should be easy because of the numbering at the beginning of each error message. With JHOVE this was a bit more challenging, as the messages have to be searched as strings (e.g. a string-contains check on a distinctive part of each error message).
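Counting by the leading number could look roughly like the sketch below. It assumes each message line starts with its code followed by " : ", as in the lists above. The class and method names and the sample lines in `main` are made up for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: counts PDFBox-style messages by their leading error code.
class ErrorCodeCounter {

    // Matches the leading code of a line such as "3.1.6 : Invalid Font definition, ..."
    private static final Pattern CODE = Pattern.compile("^\\s*(\\d+(?:\\.\\d+)*)\\s*:");

    // Counts how often each error code occurs; lines without a leading code are skipped.
    static Map<String, Integer> countByCode(List<String> messages) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String message : messages) {
            Matcher m = CODE.matcher(message);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the codes ordered by descending count.
    static List<Map.Entry<String, Integer>> mostFrequent(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries;
    }

    public static void main(String[] args) {
        // Made-up input; in practice these would be the lines of one or more findings files.
        List<String> lines = Arrays.asList(
            "3.1.6 : Invalid Font definition, ...",
            "1.0.3 : Syntax error, Name too long",
            "3.1.6 : Invalid Font definition, ...",
            "5.1 : Missing field in an annotation definition");
        for (Map.Entry<String, Integer> e : mostFrequent(countByCode(lines))) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Calling `countByCode` once per file gives the per-file counts, and calling it on all lines together gives the overall ranking.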

h6. Errors occurring while running the tool

At least in my own application of the PDFBox library, there are some PDF files that do not get any validation result and therefore throw a NullPointerException. I have found 44 of them and will bring them to the Hackathon.

The problem is exactly here:

{noformat}
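try {
    // validate the already parsed document and fetch the result
    document.validate();
    result = document.getResult();
    document.close();
} catch (NullPointerException e) {
    // reached for 44 files: no validation result is available for them
}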
{noformat}
The tool does not crash because the exception is caught, but obviously the files cannot be examined. The document is generated fine (I have tested this); it is the results that do not work.

I have used the code described [here|https://pdfbox.apache.org/cookbook/pdfavalidation.html].

h5. Summary of my PDF/A corpus (891 PDF files)

Valid PDF/A-1b-files: 172

PDF/A files with errors: 673


PDF/A files that could not be parsed: 2


Could not be examined because of a NullPointerException (as described above): 44
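A summary like the one above can be produced by classifying every file into one of four outcomes. The following is only a sketch of that bookkeeping, not the code I actually ran. `Validator` is a hypothetical stand-in for the PDFBox call so that the example runs without the library, and the file names are invented.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Illustrative only: tallies validation outcomes per file.
class CorpusSummary {

    enum Outcome { VALID, INVALID, NOT_PARSED, NO_RESULT }

    // Hypothetical stand-in for the real validation call: returns true or false
    // for valid or invalid, and may throw for files that cannot be handled.
    interface Validator {
        Boolean isValid(String fileName) throws Exception;
    }

    // Maps one file to an outcome; a NullPointerException counts as "no result".
    static Outcome classify(String fileName, Validator validator) {
        try {
            Boolean valid = validator.isValid(fileName);
            if (valid == null) {
                return Outcome.NO_RESULT;
            }
            return valid ? Outcome.VALID : Outcome.INVALID;
        } catch (NullPointerException e) {
            return Outcome.NO_RESULT;
        } catch (Exception e) {
            return Outcome.NOT_PARSED;
        }
    }

    // Counts the outcomes over all files; every outcome starts at zero.
    static Map<Outcome, Integer> summarize(List<String> files, Validator validator) {
        Map<Outcome, Integer> summary = new EnumMap<>(Outcome.class);
        for (Outcome o : Outcome.values()) {
            summary.put(o, 0);
        }
        for (String file : files) {
            summary.merge(classify(file, validator), 1, Integer::sum);
        }
        return summary;
    }

    public static void main(String[] args) {
        // Fake validator that decides by file name only (assumption for the demo).
        Validator fake = name -> {
            if (name.startsWith("npe")) throw new NullPointerException();
            if (name.startsWith("broken")) throw new java.io.IOException("cannot parse");
            return name.startsWith("valid");
        };
        List<String> files = Arrays.asList("valid-1.pdf", "bad-1.pdf", "npe-1.pdf", "broken-1.pdf");
        System.out.println(summarize(files, fake));
    }
}
```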

It would be interesting to see whether PDFTron comes to the same conclusion, and what JHOVE says about these files.
Testing ahead\!

h4. Testing PDFBox vs. PDFTron

I have used the [KOST-Val Tool|http://kost-ceco.ch/cms/index.php?id=248,434,0,0,1,0] (which has [PDFTron|http://www.pdftron.com/] embedded) by our Swiss colleagues.

h6. PDFBox valid PDF files

PDFBox has considered 172 files as PDF/A-1b compliant. PDFTron considers 128 of these files valid and the other 44 invalid. The most frequent error messages PDFTron gives for the "PDFBox PDF/A-1b-compliant" files are:
* The N entry does not match the number of color components in the embedded ICC profile (e_PDFA233)
* Device-specific color space used, but no GTS_PDFA1 OutputIntent (e_PDFA2331)
* CIDSet in subset font is incomplete (e_PDFA356)
* An interactive form field contains an action (e_PDFA91)
* Annotation is missing AP entry (e_PDFA5340)

h6. PDFBox invalid PDF files

I was curious if PDFTron considers any of the PDFBox-invalid PDF/A-1b files as valid.

There are indeed six files that PDFTron considers valid and PDFBox does not. Luckily, these are PDF files we can actually publish, as they come from the Isartor test suite and were edited with iText.
* Mig_iTextisartor-6-7-3-t01-fail-a.pdf
* Mig_iTextisartor-6-7-3-t01-fail-b.pdf
* Mig_iTextisartor-6-7-3-t01-fail-b_iText.pdf
* Mig_iTextisartor-6-7-3-t01-fail-c_iText.pdf
* Mig_iTextisartor-6-7-3-t01-fail-a_iText.pdf
* Mig_iTextisartor-6-7-3-t01-fail-c.pdf

All six files consist of a single blank page.
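Finding the files on which two tools disagree comes down to comparing two verdict tables. A minimal sketch, assuming each tool's verdicts are available as a map from file name to valid/invalid; the class and method names and the file names in `main` are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: compares the verdicts of two validators.
class VerdictComparison {

    // Returns the files that tool A accepts and tool B rejects.
    // Files that are missing from either map are ignored.
    static Set<String> acceptedOnlyBy(Map<String, Boolean> a, Map<String, Boolean> b) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Boolean> e : a.entrySet()) {
            Boolean other = b.get(e.getKey());
            if (Boolean.TRUE.equals(e.getValue()) && Boolean.FALSE.equals(other)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up verdicts for three files.
        Map<String, Boolean> pdfBox = new LinkedHashMap<>();
        pdfBox.put("a.pdf", true);
        pdfBox.put("b.pdf", false);
        pdfBox.put("c.pdf", true);

        Map<String, Boolean> pdfTron = new LinkedHashMap<>();
        pdfTron.put("a.pdf", true);
        pdfTron.put("b.pdf", true);
        pdfTron.put("c.pdf", false);

        System.out.println("PDFBox valid, PDFTron invalid: " + acceptedOnlyBy(pdfBox, pdfTron));
        System.out.println("PDFTron valid, PDFBox invalid: " + acceptedOnlyBy(pdfTron, pdfBox));
    }
}
```

Swapping the two arguments gives the opposite direction, so the same method covers both lists above.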



h4. Testing JHOVE with PDF/A files

As JHOVE only has a very limited PDF/A validation test, I would not expect too much. I need to work on my output module to have a convenient way to check via JHOVE.
* 44 PDF files that caused a null pointer exception with PDFBox: Well-Formed and valid
* 44 files that are PDFBox valid, but PDFTron invalid: Well-Formed and valid
* 128 files that are PDFBox valid and PDFTron valid: Well-Formed and valid
* 6 files that are PDFBox invalid and PDFTron valid: Well-Formed and valid
* 664 PDFBox invalid and PDFTron invalid: 662 Well-Formed and valid, 1 malformed and 1 invalid
* 2 PDFBox could not parse: 1 Well-Formed and valid, 1 invalid

In conclusion, JHOVE considers most of these files to be valid. Just for testing, I have run it on my "PdfHorrorFiles" folder, and only 2 out of 37 files are considered invalid. Let's just call JHOVE flexible. Admittedly, standard PDF files do not have to be PDF/A-1b compliant, but some of the horror files definitely cause problems in real life.

h5. Adobe Acrobat XI Preflight

As this is only a manual test, I have decided to pick some PDF files to try - especially the files that PDFTron thinks are valid and PDFBox does not.

Preflight findings: Invalid. XMP property is predefined but is not used in accordance with the definition.

h5. About the full PDF Sample brought to the Hackathon

Some are openly publishable, but some are only for testing during the Hackathon. There are 1299 PDF files all in all (81 files within the folders are txt findings files).

+JHOVE examination states:+


PDF files Well-Formed and valid: 1363
PDF files malformed: 1
PDF files invalid: 16
The sample contains 21 different JHOVE error messages:
1: 2 x   ErrorMessage: Annotation object is not a dictionary
2: 46 x   ErrorMessage: Compression method is invalid or unknown to JHOVE (This is due to the encrypted PDF files in the sample)
3: 10 x   ErrorMessage: Expected dictionary for font entry in page resource
4: 8 x   ErrorMessage: Improperly constructed page tree
5: 1 x   ErrorMessage: Improperly formed date
6: 1 x   ErrorMessage: Improperly nested array delimiters
7: 5 x   ErrorMessage: Invalid Annotation property
8: 1 x   ErrorMessage: Invalid Names dictionary
9: 1 x   ErrorMessage: Invalid Resources Entry in document
10: 9 x   ErrorMessage: Invalid destination object
11: 5 x   ErrorMessage: Invalid object definition
12: 19 x   ErrorMessage: Invalid object number in cross-reference stream
13: 1 x   ErrorMessage: Invalid object number or object stream
14: 13 x   ErrorMessage: Invalid page dictionary object
15: 4 x   ErrorMessage: Invalid page tree node
16: 2 x   ErrorMessage: Lexical error
17: 2 x   ErrorMessage: Malformed dictionary
18: 1 x   ErrorMessage: Malformed dictionary: Vector must contain an even number of objects, but has 29
19: 1 x   ErrorMessage: Malformed filter
20: 81 x   ErrorMessage: No PDF header (mostly txt-files)
21: 3 x   ErrorMessage: No PDF trailer
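Since JHOVE messages carry no number, a tally like the one above has to group them by their text. A minimal sketch of such a string-contains count follows. It assumes the raw messages are already available as strings, and the class and method names are made up for illustration. Longer fragments are listed first so that a message such as number 18 is not swallowed by the shorter "Malformed dictionary".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: groups JHOVE-style messages by a distinctive text fragment.
class JhoveMessageCounter {

    // Counts each message under the first fragment it contains.
    // Messages that match no fragment are not counted.
    static Map<String, Integer> countByFragment(List<String> messages, List<String> fragments) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String fragment : fragments) {
            counts.put(fragment, 0);
        }
        for (String message : messages) {
            for (String fragment : fragments) {
                if (message.contains(fragment)) {
                    counts.merge(fragment, 1, Integer::sum);
                    break; // count every message only once
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Fragments taken from the list above; the more specific one comes first.
        List<String> fragments = Arrays.asList(
            "Malformed dictionary: Vector must contain an even number of objects",
            "Malformed dictionary",
            "No PDF header");
        // Made-up raw messages.
        List<String> messages = Arrays.asList(
            "ErrorMessage: Malformed dictionary",
            "ErrorMessage: Malformed dictionary: Vector must contain an even number of objects, but has 29",
            "ErrorMessage: No PDF header");
        System.out.println(countByFragment(messages, fragments));
    }
}
```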