h2. Description

Some software applications produce PDFs that do not conform to the PDF format specification ([PDF 1.7 / ISO 32000-1|Portable Document Format^PDF32000_2008.pdf] or the earlier pre-ISO specifications).

h2. Risks

* The PDF may not render correctly (or may not render at all)
* Future migration to an alternative format may result in loss of data (or may fail altogether)

h2. Assessment
Validation is problematic for PDF, mainly because of the complexity of the format and the lack of reliable tools.

* [JHOVE] includes a PDF module, but it doesn't support PDF 1.7 / ISO 32000 (yet?). In addition, its principal author considers JHOVE to be ["approaching the end of its life"|http://fileformats.wordpress.com/2013/10/01/tools/].
* The website of the PDF Association lists [a number of commercially available tools|http://www.pdfa.org/tag/pdfa-validation/] that validate PDF (presumably ISO 32000?), PDF/A, or both.


_Apache Preflight_ (part of [Apache PDFBox]) does not validate against the PDF format specification. However, it does include a _Processing error_ category, which is described as "_not necessarily a specific PDF/A validation error but a PDF specification requirement that isn't respected_". In addition, an exception raised by _Preflight_ may indicate a malformed file.

|*Reference file*|*Description*|*Error Code(s)*|*Details*|
|[*sample file needed*|http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/updateThisLink]|Malformed PDF|8|*Processing error -- replace with actual error message*|
|[*sample file needed*|http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/updateThisLink]|Malformed PDF|8.1|Mandatory element missing (possibly malformed PDF)|
|[*sample file needed*|http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/updateThisLink]|Malformed PDF|Exception| |
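
The check implied by the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the error codes have already been extracted from Preflight's report as strings (such as "8" or "8.1"), and that a separate flag records whether Preflight raised an exception. The function names {{is_processing_error}} and {{looks_malformed}} are hypothetical and are not part of Preflight or PDFBox.

```python
from typing import Iterable


def is_processing_error(code: str) -> bool:
    """Return True for the 'Processing error' category (code 8 and 8.x subcodes).

    Assumption: codes are dotted strings such as "8" or "8.1", as in the table.
    """
    code = code.strip()
    return code == "8" or code.startswith("8.")


def looks_malformed(error_codes: Iterable[str], raised_exception: bool = False) -> bool:
    """Flag a file as possibly malformed.

    A file is flagged if Preflight raised an exception, or if any of the
    reported error codes falls in the processing error category.
    """
    return raised_exception or any(is_processing_error(c) for c in error_codes)


if __name__ == "__main__":
    # Hypothetical code lists, for illustration only.
    print(looks_malformed(["1.2.1", "8.1"]))
    print(looks_malformed(["2.4.3"]))
    print(looks_malformed([], raised_exception=True))
```

A positive result only suggests that the file may be malformed; it is a trigger for closer inspection, not proof that the file violates the PDF specification.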

h2. Recommendations

h3. Pre-ingest

* No authoritative or generally accepted tools exist for PDF validation, but using [Apache Preflight|Apache PDFBox] and checking its output for processing errors will at least detect PDFs that are seriously malformed.

h3. Existing collections

* Use [Apache Preflight|Apache PDFBox] and check for processing errors.
* In some cases it may be possible to obtain an intact version of malformed files from the original depositor/publisher.

h2. References

* [Are your documents readable? How would you know?|http://duff-johnson.com/2014/01/24/are-your-documents-readable-how-would-you-know/] - blog post by Duff Johnson; contains a link to a [presentation|http://vimeopro.com/pdfassociation/technical-conference-europe-2013/video/68945979] on a proposal for an open source PDF validator that may be developed at some point.