h2. Description
PDFs may use fonts that are not embedded in the file, or embedded fonts that are damaged or incomplete.

h2. Risks
If fonts are not embedded, or if embedded fonts are damaged or otherwise incomplete, PDFs may be rendered incorrectly.

h2. Assessment
The following table shows the relevant output of _Apache Preflight_ (part of [Apache PDFBox]) for PDFs with non-embedded fonts. Results obtained with _Preflight_ 2.0.0:

|*Reference file*|*Description*|*Error Code(s)*|*Details*|
|[text_only_fontsNotEmbedded.pdf|http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/text_only_fontsNotEmbedded.pdf]|Used fonts are not embedded|3.1.3|Invalid Font definition, FontFile entry is missing from FontDescriptor for TimesNewRomanPSMT|
|[test_fontArialNotEmbedded.pdf|http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/test_fontArialNotEmbedded.pdf]|Some fonts not embedded (other fonts are)|3.1.3|Invalid Font definition, FontFile entry is missing from FontDescriptor for Arial,BoldItalic / Arial /TimesNewRoman / ...(multiple messages)|

However, _Preflight_ can report many more font issues, which are broadly subdivided into the following categories:

# Invalid or incomplete font dictionary errors. This includes a wide range of problems, including fonts that are not embedded.
# Damaged embedded font errors.
# Glyph errors.

The table below shows all possible errors; the descriptions are taken from the comments in [Preflight's source code|http://svn.apache.org/repos/asf/pdfbox/trunk/preflight/src/main/java/org/apache/pdfbox/preflight/PreflightConstants.java].

|*Error code*|*Description*|
|3|Main error code for font problems|
| |*Invalid or incomplete font data errors*|
|3.1|Main error code for invalid data in font|
|3.1.1|Some mandatory fields are missing from the FONT Dictionary|
|3.1.2|Some mandatory fields are missing from the FONT Descriptor Dictionary|
|3.1.3|Error on the "Font File x" in the Font Descriptor (ex : FontFile and FontFile2 are present in the same dictionary)|
|3.1.4|Charset declaration is missing in a Type 1 Subset|
|3.1.5|Encoding is inconsistent with the Font (ex : Symbolic TrueType mustn't declare encoding)|
|3.1.6|Width array and Font program Width are inconsistent|
|3.1.7|Required entry in a Composite Font dictionary is missing|
|3.1.8|The CIDSystemInfo dictionary is invalid|
|3.1.9|The CIDToGID is invalid|
|3.1.10|The CMap of the Composite Font is missing or invalid|
|3.1.11|The CIDSet entry is mandatory for a subset of a composite font|
|3.1.12|The CMap of the Composite Font is missing or invalid|
|3.1.13|Encoding entry can't be read due to IOException|
|3.1.14|The font type is unknown|
| |*Damaged embedded font errors*|
|3.2|The embedded font is damaged|
|3.2.1|The embedded Type1 font is damaged|
|3.2.2|The embedded TrueType font is damaged|
|3.2.3|The embedded composite font is damaged|
|3.2.4|The embedded type 3 font is damaged|
|3.2.5|The embedded CID Map is damaged|
| |*Glyph errors*|
|3.3|Common error for a Glyph problem|
|3.3.1|a glyph is missing|
|3.3.2|a glyph is missing|
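
The numbering scheme in the table above can be turned into a small lookup. The following Python sketch is an illustration only and is not part of _Preflight_; the function name {{font_error_category}} and the category labels are assumptions that mirror the three groups in the table.

```python
# Category labels follow the three groups in the table above.
FONT_ERROR_CATEGORIES = {
    "3.1": "Invalid or incomplete font data",
    "3.2": "Damaged embedded font",
    "3.3": "Glyph error",
}


def font_error_category(code):
    """Return the category for a Preflight error code, or None if it is not a known font error."""
    parts = code.strip().split(".")
    if parts[0] != "3":
        return None  # outside the font error range
    if len(parts) == 1:
        return "Font problem (main code)"
    # Compare on the first two components so that "3.1.10" maps to "3.1"
    return FONT_ERROR_CATEGORIES.get(".".join(parts[:2]))
```

For example, {{font_error_category("3.1.3")}} yields the "Invalid or incomplete font data" label, while a non-font code such as {{"2.4.3"}} yields {{None}}.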

Not all of these errors are equally "serious" (e.g. errors 3.1.4, 3.1.5 and 3.1.6 appear to be relatively harmless). It may be advisable to treat the presence of _any_ of the above errors (possibly excluding 3.1.4, 3.1.5 and 3.1.6) as indicative of a font-related issue, although this may be overly restrictive in some cases (this section needs more work / examples / evidence).
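
The rule of thumb above could be expressed as a simple filter. This is a minimal sketch, not an established policy: the names {{LOW_SEVERITY}}, {{is_font_error}} and {{has_font_issue}} are hypothetical, and the exclusion list merely encodes the tentative judgement that three of the codes appear relatively harmless.

```python
# Codes tentatively regarded as relatively harmless in the text above.
# This is an assumption of this sketch, not a Preflight setting.
LOW_SEVERITY = frozenset({"3.1.4", "3.1.5", "3.1.6"})


def is_font_error(code):
    """True if the code is 3 or one of its sub-codes (3.x, 3.x.y)."""
    return code.strip().split(".")[0] == "3"


def has_font_issue(codes, ignore=LOW_SEVERITY):
    """True if any reported code is a font error that is not on the ignore list."""
    return any(is_font_error(c) and c.strip() not in ignore for c in codes)
```

Passing {{ignore=frozenset()}} gives the stricter variant in which every font error counts.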

h3. Note on non-embedded fonts
Based on a number of tests, non-embedded fonts usually appear to return error code 3.1.3, although the description of that error indicates that it may include other font issues as well. Also, the results of this [Analysis of Acrobat Engineering PDFs with Acrobat Preflight and Apache Preflight] indicate that in some cases non-embedded fonts may produce other error codes. This is not yet fully understood and may need further investigation.
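
As an illustration of the pattern observed in these tests, the sketch below flags a report entry as a likely non-embedded font when it carries code 3.1.3 and its detail text mentions a missing FontFile entry. The message wording is taken from the _Preflight_ 2.0.0 results quoted in the Assessment section and may differ in other versions; the function name and the (code, details) input shape are assumptions. Given the uncertainty noted above, a negative result does not prove that all fonts are embedded.

```python
import re

# Wording as seen in the Preflight 2.0.0 results quoted in this document;
# other versions may phrase the message differently.
MISSING_FONTFILE = re.compile(
    r"FontFile entry is missing from FontDescriptor for (?P<font>.+)$"
)


def likely_non_embedded_font(code, details):
    """Return the font name if the entry looks like a non-embedded font, else None."""
    if code != "3.1.3":
        return None
    match = MISSING_FONTFILE.search(details.strip())
    return match.group("font").strip() if match else None
```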

h2. Recommendations

h3. Pre-ingest

* Formulate a policy on how to deal with non-embedded, damaged or incomplete fonts.
* Use [Apache Preflight|Apache PDFBox] to check for font errors. Depending on the provenance of the PDFs, this may result in many font errors being reported. As the meaning of Preflight's font error codes is not entirely clear, this may not (yet) be a viable solution in operational workflows.

h3. Existing collections

* Use [Apache Preflight|Apache PDFBox] to check for errors. However, this may not be a practical solution yet for the reason listed above.

h2. Example files
* [http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/] - PDF Cabinet of Horrors on OPF Format Corpus
* [http://acroeng.adobe.com/wp/?page_id=101] - Font Testing PDFs on Adobe Acrobat Engineering website