Not valid PDF


Description

Some software applications produce PDFs that do not conform to the PDF format specification (PDF 1.7/ISO 32000-1 or the earlier pre-ISO specifications).

Risks

  • The PDF may not render correctly, or may not render at all
  • Future migration to an alternative format may result in loss of data, or may fail altogether

Assessment

Validation is problematic for PDF, mainly because of the complexity of the format and the lack of reliable tools.

Apache Preflight (part of Apache PDFBox) is a PDF/A validator and does not validate against the PDF format specification itself. However, it does include a Processing error category, which is described as "not necessarily a specific PDF/A validation error but a PDF specification requirement that isn't respected". An exception raised by Preflight may also indicate a malformed file.

Reference file      Description    Error Code(s)  Details
sample file needed  Malformed PDF  8              Processing error – replace with actual error message
sample file needed  Malformed PDF  8.1            Mandatory element missing (possibly malformed PDF)
sample file needed  Malformed PDF  Exception
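
The check described in this section could be automated along the following lines. This is a minimal sketch, not part of Preflight: it assumes the error codes have already been extracted from Preflight's output as strings (how to do that depends on the Preflight version and is not shown), and the function names `is_processing_error` and `looks_malformed` are hypothetical helpers introduced for illustration only.

```python
from typing import Iterable

# Preflight's "Processing error" category, as listed in the table above.
PROCESSING_ERROR_CATEGORY = "8"


def is_processing_error(code: str) -> bool:
    """Return True if an error code is "8" or a sub-code such as "8.1"."""
    code = code.strip()
    return (
        code == PROCESSING_ERROR_CATEGORY
        or code.startswith(PROCESSING_ERROR_CATEGORY + ".")
    )


def looks_malformed(error_codes: Iterable[str], raised_exception: bool = False) -> bool:
    """Flag a file as possibly malformed.

    A file is flagged if Preflight raised an exception (assumption: the
    caller records this as a boolean) or if any reported error code falls
    in the processing error category.
    """
    return raised_exception or any(is_processing_error(c) for c in error_codes)
```

For example, `looks_malformed(["7.1", "8.1"])` flags the file because of the `8.1` code, while `looks_malformed(["7.1"])` does not (here `7.1` simply stands in for any code outside category 8). A file flagged this way is a candidate for closer inspection, not a confirmed malformed PDF.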

Recommendations

Pre-ingest

  • No authoritative or generally accepted tools exist for PDF validation, but using Apache Preflight and checking its output for processing errors will at least detect PDFs that are seriously malformed.

Existing collections

  • Use Apache Preflight and check for processing errors.
  • In some cases it may be possible to obtain an intact version of malformed files from the original depositor/publisher.

References
