|Name||Portable Document Format|
|Description||File format for platform-independent representation of formatted documents|
|PRONOM ID(s)||fmt/14, fmt/15, fmt/16, fmt/17, fmt/18, fmt/19, fmt/20, fmt/276|
|UDFR ID(s)||u1f91, u1f102, u1f113, u1f124, u1f135, u1f146, u1f158|
|Archive Team Wiki|
|Library of Congress Digital Formats|
|Format specification||Adobe PDF References|
The Portable Document Format is intended to provide a platform-independent representation of formatted documents. It has its origins in (and is based on) the PostScript page description language. For preservation the most relevant aspects of the format are:
1. Its ubiquity
2. Its complexity and feature-richness
3. The inclusion of features that may be at odds with long-term accessibility
Versions and backward compatibility
Eight versions of the format have been published by Adobe (1.0 to 1.7); version 1.7 was later published as an ISO standard (ISO 32000-1:2008). In principle, newer versions are always backward-inclusive; however, the ISO 32000 edition contains the following statement:
The specifications for PDF are backward inclusive, meaning that PDF 1.7 includes all of the functionality previously documented in the Adobe PDF Specifications for versions 1.0 through 1.6. It should be noted that where Adobe removed certain features of PDF from their standard, they too are not contained herein.
ISO 32000 does not provide any information on which features have been removed during the evolution of the format.
Finally, a number of formalised subsets (profiles) exist. Most relevant to digital preservation are PDF/A-1 (a subset of PDF 1.4), and PDF/A-2 and PDF/A-3 (both subsets of PDF 1.7). These profiles define sets of features that are aimed at optimising long-term accessibility. Two other profiles that are relevant to digital preservation are PDF/UA (Universal Accessibility), which is aimed at accessibility for people with disabilities, and PDF/X, which is targeted at the print industry.
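As a rough illustration of how the declared version can be checked, the sketch below reads the `%PDF-M.m` header comment that a PDF file starts with. The function names and the 1024-byte search window are assumptions made for this example. Note that the header is only a declaration: since PDF 1.4 a `Version` entry in the document catalog may override it, and this sketch does not look at the catalog, nor does it say anything about conformance to a profile such as PDF/A.

```python
import re
from typing import Optional

# A PDF file is expected to start with a header comment of the form "%PDF-M.m".
# Some writers put a few stray bytes in front of it, so we search a small
# window instead of insisting on offset 0 (the window size is an assumption).
HEADER_PATTERN = re.compile(rb"%PDF-(\d)\.(\d)")


def declared_pdf_version(data: bytes, window: int = 1024) -> Optional[str]:
    """Return the version declared in the header (e.g. '1.4'), or None."""
    match = HEADER_PATTERN.search(data[:window])
    if match is None:
        return None
    major = match.group(1).decode("ascii")
    minor = match.group(2).decode("ascii")
    return f"{major}.{minor}"


def declared_pdf_version_of_file(path: str, window: int = 1024) -> Optional[str]:
    """Convenience wrapper (hypothetical helper) that reads only the file start."""
    with open(path, "rb") as handle:
        return declared_pdf_version(handle.read(window), window)
```

A file for which this returns `None` has no recognisable header near its start and is unlikely to be identified as PDF by most tools.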
Not valid PDF
Fonts missing, damaged or incomplete
References to external files
Detecting format issues with Apache Preflight
The following page summarises the detailed information from the individual 'format issue' pages above:
Summary of Apache Preflight errors
The following link points to a demo that shows how to automatically assess the output of Preflight against most of the issues mentioned above (includes elaborate Schematron rules file):
- Adobe Acrobat Engineering website - Technical information on PDF and example files
- PDF - Inventory of long-term preservation risks
- What preservation risks are associated with the PDF file format? - Libraries and Information Sciences Stack Exchange (archived)
- Identification of PDF preservation risks with Apache Preflight: a first impression
- Identification of PDF preservation risks: the sequel
- What do we mean by "embedded" files in PDF?
PDFBox is an open-source PDF library that includes a PDF/A-1b validator called Preflight. Validating a PDF against PDF/A-1b reveals information about many features that are potential preservation risks (e.g. encryption, non-embedded fonts, multimedia). In principle this works with any PDF, not just actual PDF/A documents. The important step is to pick out the error messages (i.e. violations of the PDF/A-1b profile) that correspond to specific risks.
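The selection step could look roughly like the sketch below. It is not Preflight's own API: it assumes the validator's report has already been captured as text, that each error sits on its own line in the form `<dotted code> : <message>`, and that a mapping from code prefixes to risks is supplied by the user. The mapping and the sample report shown here are placeholders for illustration only; real codes and their meaning should be taken from the Preflight documentation or from the summary of Preflight errors.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed line shape: "<dotted numeric code> : <message>".
ERROR_LINE = re.compile(r"^\s*(\d+(?:\.\d+)*)\s*:\s*(.+?)\s*$")


def risks_from_report(report: str, risk_by_prefix: Dict[str, str]) -> Dict[str, List[str]]:
    """Group the error lines of a report by the risk their code maps to.

    Lines that do not look like errors, and errors whose code matches no
    prefix in the mapping, are ignored.
    """
    found: Dict[str, List[str]] = defaultdict(list)
    # Try longer prefixes first so the most specific mapping entry wins.
    prefixes = sorted(risk_by_prefix, key=len, reverse=True)
    for line in report.splitlines():
        match = ERROR_LINE.match(line)
        if match is None:
            continue
        code, message = match.groups()
        for prefix in prefixes:
            if code == prefix or code.startswith(prefix + "."):
                found[risk_by_prefix[prefix]].append(f"{code}: {message}")
                break
    return dict(found)


# Placeholder mapping and report (hypothetical, not real Preflight output).
EXAMPLE_MAPPING = {"1.4": "encryption", "3.1": "fonts"}
EXAMPLE_REPORT = """\
1.4.2 : example message about encryption
3.1.3 : example message about a font
9.9 : example message that maps to no risk
"""

if __name__ == "__main__":
    for risk, errors in risks_from_report(EXAMPLE_REPORT, EXAMPLE_MAPPING).items():
        print(risk, errors)
```

Keeping the mapping outside the function means the same code can be reused when the list of risks of interest changes, or when a different Preflight version reports different codes.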
Peepdf is a tool for analyzing PDFs. It is mainly aimed at security/forensics applications (detecting harmful content), but much of its functionality looks really useful for preservation as well.
ExifTool's abilities to extract information from PDF files are quite limited, but it is one of the few tools that provide detailed information about access rights and restrictions in encrypted/password-protected PDFs.