Description
Some software applications produce PDFs that do not conform to the PDF format specification (PDF 1.7 / ISO 32000-1, or the earlier pre-ISO specifications).
Risks
- The PDF may not render correctly, or may not render at all
- Future migration to an alternative format may result in loss of data, or may fail altogether
Assessment
Validation is problematic for PDF, mainly because of the complexity of the format and the lack of reliable tools.
- JHOVE includes a PDF module, but it does not support PDF 1.7 / ISO 32000 (yet?). In addition, its principal author considers JHOVE to be "approaching the end of its life".
- The website of the PDF Association lists a number of commercially available tools that validate PDF (presumably ISO 32000?), PDF/A, or both.
Apache Preflight (part of Apache PDFBox) does not validate against the PDF format specification. However, it does include a Processing error category (error codes starting with 8), which is described as "not necessarily a specific PDF/A validation error but a PDF specification requirement that isn't respected". An exception raised by Preflight may also indicate a malformed file.
Reference file | Description | Error Code(s) | Details |
sample file needed | Malformed PDF | 8 | Processing error – replace with actual error message |
sample file needed | Malformed PDF | 8.1 | Mandatory element missing (possibly malformed PDF) |
sample file needed | Malformed PDF | Exception | |
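The rule implied by the table can be written down as a small check. The following is only a minimal sketch in Python, assuming the Preflight error codes for a file have already been collected as strings; the function names `is_processing_error` and `looks_malformed` are hypothetical and are not part of Preflight.

```python
def is_processing_error(code: str) -> bool:
    """Return True for codes in Preflight's category 8, such as "8" or "8.1"."""
    # Compare only the leading component, so "1.8" or "18" do not match.
    return code.strip().split(".")[0] == "8"


def looks_malformed(error_codes, raised_exception=False):
    """Flag a file if Preflight raised an exception or reported a category 8 code.

    error_codes: iterable of code strings reported for one file (assumed input).
    raised_exception: True if Preflight did not complete for this file.
    """
    return raised_exception or any(is_processing_error(c) for c in error_codes)
```

A file flagged this way is a candidate for closer inspection, not a confirmed malformed PDF, since Preflight is not a validator for the PDF format specification itself.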
Recommendations
Pre-ingest
- No authoritative or generally accepted tools exist for PDF validation, but using Apache Preflight and checking its output for processing errors will at least detect PDFs that are seriously malformed.
Existing collections
- Use Apache Preflight and check for processing errors.
- In some cases it may be possible to obtain an intact version of malformed files from the original depositor/publisher.
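Checking Preflight output across a whole collection could be scripted along the following lines. This is a sketch under stated assumptions, not a tested workflow: it assumes the text output of each Preflight run has already been captured, and that each reported error appears on its own line as `<error code> : <details>`. That line shape should be verified against the Preflight version in use. The names `processing_errors` and `triage` are hypothetical.

```python
import re

# Assumed report line shape: "<error code> : <details>", e.g. "8.1 : ...".
ERROR_LINE = re.compile(r"^\s*(\d+(?:\.\d+)*)\s*:\s*(.*)$")


def processing_errors(report_text):
    """Return (code, details) pairs for category 8 lines in one report."""
    found = []
    for line in report_text.splitlines():
        match = ERROR_LINE.match(line)
        if match and match.group(1).split(".")[0] == "8":
            found.append((match.group(1), match.group(2).strip()))
    return found


def triage(reports):
    """Map file names to their findings, leaving out files with none.

    reports: dict of file name -> captured Preflight output, or None
    when the Preflight run raised an exception for that file.
    """
    flagged = {}
    for name, text in reports.items():
        if text is None:
            # An exception may itself indicate a malformed file.
            flagged[name] = [("Exception", "Preflight did not complete")]
            continue
        errors = processing_errors(text)
        if errors:
            flagged[name] = errors
    return flagged
```

The resulting dictionary gives a short list of files to inspect manually or to request again from the original depositor/publisher.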
References
- Are your documents readable? How would you know? (blog post by Duff Johnson; contains a link to a presentation on a proposal for an open source PDF validator that may be developed at some point)