
h2. Summary

| Name | Portable Document Format |
| Description |File format for platform-independent representation of formatted documents|
| MIME Type(s) | [application/pdf |] |
| PRONOM ID(s) | [fmt/14 |],[fmt/15 |],[fmt/16 |],[fmt/17 |],[fmt/18 |],[fmt/19 |],[fmt/20 |],[fmt/276 |] |
| UDFR ID(s) | [u1f91 |],[u1f102 |],[u1f113 |],[u1f124 |],[u1f135 |],[u1f146 |],[u1f158 |] |
|Archive Team Wiki|[PDF|]|
|Library of Congress Digital Formats|[PDF|]|
| Wikipedia page(s) | |
| File extension(s) | pdf |
|Format specification | [Adobe PDF References|] |

h2. Description

The Portable Document Format is intended to provide a platform-independent representation of formatted documents. It has its origins in (and is based on) the PostScript page description language. For preservation the most relevant aspects of the format are:

1. Its ubiquity
2. Its complexity and feature-richness
3. The inclusion of features that may be at odds with long-term accessibility

h3. Versions and backward compatibility
Eight versions of the format have been published by Adobe (1.0-1.7); version 1.7 was later published as an ISO standard (ISO 32000). In principle, each newer version is backward-inclusive. However, the ISO 32000 edition qualifies this with the following statement:

bq. The specifications for PDF are backward inclusive, meaning that PDF 1.7 includes all of the functionality previously documented in the Adobe PDF Specifications for versions 1.0 through 1.6. It should be noted that where Adobe removed certain features of PDF from their standard, they too are not contained herein.

ISO 32000 does not provide any information on _which_ features have been removed during the evolution of the format.
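
The version a file declares can be read from its header line, which has the form {{%PDF-1.N}}. The following is a minimal Python sketch of that check, not part of any tool mentioned on this page. The function names are illustrative, and the 1024-byte search window is an assumption based on the tolerance some readers show for leading junk before the header. Note also that from PDF 1.4 onwards a {{/Version}} entry in the document catalog may override the header value, which this sketch does not inspect.

```python
import re

# Header pattern: "%PDF-" followed by major.minor, e.g. "%PDF-1.4".
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(data: bytes):
    """Return (major, minor) from the PDF header, or None if none is found.

    Only the first 1024 bytes are searched (an assumption; see lead-in).
    """
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def read_header_version(path):
    """Read the start of the file at `path` and return its header version."""
    with open(path, "rb") as handle:
        return header_version(handle.read(1024))
```

A result of {{None}} means that no header was found in the searched window, which is itself worth flagging as a possible validity problem.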

h3. PDF profiles
A number of formalised subsets (profiles) of the format exist. Most relevant to digital preservation are PDF/A-1 (a subset of PDF 1.4), and PDF/A-2 and PDF/A-3 (both subsets of PDF 1.7). These profiles define sets of features that are aimed at optimising long-term accessibility. Two other profiles that are relevant to digital preservation are PDF/UA (Universal Accessibility), which is aimed at accessibility for people with disabilities, and PDF/X, which is targeted at the print industry.

h2. Format issues

h3. [Not valid PDF]

h3. [Encryption]

h3. [Fonts missing, damaged or incomplete]

h3. [JavaScript]

h3. [References to external files]

h3. [File attachments]

h3. [Multimedia content]

h2. Detecting format issues with Apache Preflight

The following page summarises the detailed information from the individual 'format issue' pages above:

[Summary of Apache Preflight errors]

The following link points to a demo that shows how to automatically assess the output of Preflight against most of the issues mentioned above (includes elaborate Schematron rules file):


h2. Resources

* [Adobe Acrobat Engineering website|] - Technical information on PDF and example files
* [PDF - Inventory of long-term preservation risks|]
* [What preservation risks are associated with the PDF file format? - Libraries and Information Sciences Stack Exchange (archived)|]
* [Identification of PDF preservation risks with Apache Preflight: a first impression|]
* [Identification of PDF preservation risks: the sequel|]
* [What do we mean by "embedded" files in PDF?|]

h2. User Experiences

* [Analysis of Acrobat Engineering PDFs with Acrobat Preflight and Apache Preflight]

h2. Tools

h3. [Apache PDFBox]
PDFBox is an open-source PDF library that includes a _PDF/A-1b_ validator called _Preflight_. Validating a _PDF_ against _PDF/A-1b_ reveals information about many features that are potential preservation risks (e.g. encryption, non-embedded fonts, multimedia). In principle this will work with _any_ PDF (not just actual _PDF/A_ documents!). The important thing is to filter out the error messages (i.e. violations of the _PDF/A-1b_ profile) that correspond to specific risks.
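
The filtering step described above could be sketched in Python roughly as follows. This is only an illustration: it assumes the validator output has already been parsed into {{(code, message)}} pairs, and the code prefixes in {{RISK_PREFIXES}} are hypothetical placeholders, not Preflight's actual error codes. The real codes would have to be taken from Preflight's own output.

```python
from collections import defaultdict

# HYPOTHETICAL mapping from error-code prefix to preservation risk.
# These prefixes are placeholders, not real Preflight error codes.
RISK_PREFIXES = {
    "X.1": "encryption",
    "X.2": "fonts",
    "X.3": "multimedia",
}


def group_by_risk(errors, prefixes=RISK_PREFIXES):
    """Group (code, message) pairs by risk; unmatched errors are dropped."""
    grouped = defaultdict(list)
    for code, message in errors:
        for prefix, risk in prefixes.items():
            # Match the prefix itself or any sub-code below it, so that
            # "X.1" matches "X.1.4" but not "X.10.1".
            if code == prefix or code.startswith(prefix + "."):
                grouped[risk].append((code, message))
                break
    return dict(grouped)
```

Errors that match no prefix are discarded on purpose, since most _PDF/A-1b_ violations do not correspond to one of the risks of interest.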

h3. [peepdf]
Peepdf is a tool for analysing PDFs. It is mainly aimed at security and forensics applications (detecting harmful content), but much of its functionality looks useful for preservation as well.

h3. [ExifTool]
ExifTool's abilities to extract information from PDF files are quite limited, but it is one of the few tools that provide detailed information about access rights and restrictions in encrypted/password-protected PDFs.