Portable Document Format

Summary

Name Portable Document Format
Description File format for platform-independent representation of formatted documents
MIME Type(s) application/pdf
PRONOM ID(s) fmt/14, fmt/15, fmt/16, fmt/17, fmt/18, fmt/19, fmt/20, fmt/276
UDFR ID(s) u1f91, u1f102, u1f113, u1f124, u1f135, u1f146, u1f158
Archive Team Wiki PDF
Library of Congress Digital Formats PDF
Wikipedia page(s) http://en.wikipedia.org/wiki/Portable_Document_Format
File extension(s) pdf
Format specification Adobe PDF References

Description

The Portable Document Format is intended to provide a platform-independent representation of formatted documents. It originated from, and is based on, the PostScript page description language. For preservation, the most relevant aspects of the format are:

1. Its ubiquity
2. Its complexity and feature-richness
3. The inclusion of features that may be at odds with long-term accessibility

Versions and backward compatibility

Eight versions of the format have been published by Adobe (1.0-1.7); version 1.7 was later published as an ISO standard (ISO 32000-1). In principle, each newer version is backward-inclusive; however, the ISO 32000 edition qualifies this with the following statement:

The specifications for PDF are backward inclusive, meaning that PDF 1.7 includes all of the functionality previously documented in the Adobe PDF Specifications for versions 1.0 through 1.6. It should be noted that where Adobe removed certain features of PDF from their standard, they too are not contained herein.

ISO 32000 does not provide any information on which features have been removed during the evolution of the format.
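As a practical illustration, the version a file claims can be read from its header comment, which has the form %PDF-M.m. The following is a minimal sketch in Python; the function names are hypothetical, and the 1024-byte search window is an assumption about how much leading junk to tolerate. Note that from PDF 1.4 onwards the document catalog may carry a Version entry that overrides the header, which this sketch does not inspect.

```python
import re

# The header comment "%PDF-M.m" normally sits at the very start of the file.
# Searching a small window (assumed: 1024 bytes) tolerates leading junk.
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(data: bytes):
    """Return (major, minor) from the PDF header bytes, or None if absent."""
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def read_header_version(path):
    """Read the first bytes of a file and report its claimed PDF version."""
    with open(path, "rb") as handle:
        return header_version(handle.read(1024))
```

A missing header (a None result) is itself worth flagging, since it suggests the file may not be valid PDF.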

PDF profiles

A number of formalised subsets (profiles) of PDF exist. Most relevant to digital preservation are PDF/A-1 (a subset of PDF 1.4), and PDF/A-2 and PDF/A-3 (both subsets of PDF 1.7). These profiles define sets of features that are aimed at optimising long-term accessibility. Two other profiles that are relevant to digital preservation are PDF/UA (Universal Accessibility), which is aimed at accessibility for people with disabilities, and PDF/X, which is targeted at the print industry.

Format issues

Not valid PDF

Encryption

Fonts missing, damaged or incomplete

JavaScript

References to external files

File attachments

Multimedia content
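Several of the issues listed above leave recognisable name tokens in the raw bytes of a file, so a crude first-pass check is possible without a full parser. The sketch below is a heuristic only, under stated assumptions: the grouping of tokens per issue is illustrative rather than exhaustive, the function name is hypothetical, and the scan misses names that are stored in compressed object streams or written with #xx escapes. It can also produce false positives, and it says nothing about validity or fonts.

```python
# Name tokens that commonly accompany each issue. This grouping is an
# illustrative assumption, not an exhaustive or authoritative list.
RISK_MARKERS = {
    "encryption": (b"/Encrypt",),
    "javascript": (b"/JavaScript", b"/JS"),
    "external references": (b"/GoToR", b"/Launch", b"/URI"),
    "file attachments": (b"/EmbeddedFile",),
    "multimedia": (b"/RichMedia", b"/Movie", b"/Sound"),
}


def scan_markers(data: bytes) -> dict:
    """Map each issue to the marker tokens found in the raw bytes."""
    found = {}
    for risk, markers in RISK_MARKERS.items():
        hits = [m.decode("ascii") for m in markers if m in data]
        if hits:
            found[risk] = hits
    return found
```

A hit is a reason to look more closely with a proper tool, not a verdict; an empty result does not show that a file is free of these features.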

Detecting format issues with Apache Preflight

The following page summarises the detailed information from the individual 'format issue' pages above:

Summary of Apache Preflight errors

The following link points to a demo that shows how to automatically assess the output of Preflight against most of the issues mentioned above (it includes an elaborate Schematron rules file):

https://github.com/openplanets/pdfPolicyValidate

Resources

User Experiences

Tools

Apache PDFBox

PDFBox is an open-source PDF library that includes a PDF/A-1b validator called Preflight. Validating a PDF against PDF/A-1b reveals information about many features that are potential preservation risks (e.g. encryption, non-embedded fonts, multimedia). In principle this works with any PDF, not just actual PDF/A documents. The important step is to filter the error messages (i.e. violations of the PDF/A-1b profile) down to those that correspond to specific risks.
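The filtering step can be sketched as follows in Python. This is a simplified illustration, not the approach of any particular tool: it assumes the validator's messages are already available as plain strings, and the keyword lists are hypothetical placeholders that would need to be checked against real Preflight output (matching on its error codes would likely be more robust than matching on message text).

```python
# Hypothetical keyword lists; real Preflight messages and error codes
# should be inspected before relying on any of these.
RISK_KEYWORDS = {
    "encryption": ("encrypt",),
    "fonts": ("font",),
    "javascript": ("javascript",),
    "multimedia": ("movie", "sound"),
}


def group_by_risk(messages, keywords=RISK_KEYWORDS):
    """Group validator messages under the risks whose keywords they mention.

    Messages that match no keyword are dropped, which is the point of the
    filter: most PDF/A-1b violations are not preservation risks as such.
    """
    grouped = {}
    for message in messages:
        lowered = message.lower()
        for risk, needles in keywords.items():
            if any(needle in lowered for needle in needles):
                grouped.setdefault(risk, []).append(message)
    return grouped
```

Keeping the mapping in a separate table makes it easy to adjust as an institution's policy on acceptable features changes.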

peepdf

peepdf is a tool for analyzing PDFs. It is mainly aimed at security and forensics applications (detecting harmful content), but much of its functionality looks useful for preservation as well.

ExifTool

ExifTool's abilities to extract information from PDF files are quite limited, but it is one of the few tools that provide detailed information about access rights and restrictions in encrypted or password-protected PDFs.
