h2. Description
PDFs may contain file attachments. There are two ways to include an attachment in a PDF:

# Page-level attachments which use a _File Attachment Annotation_ (section 12.5.6.15 of [ISO32000|http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf])
# Document-level attachments which are defined by the _EmbeddedFiles_ entry in the document’s _name_ dictionary (section 7.7.4 of [ISO32000|http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf])

Both mechanisms only reference the actual attachment data, which in either case are stored in the document as an _Embedded File Stream_. However, an _Embedded File Stream_ can also be used to represent multimedia content (see also [this blog post on embedded files in PDF|http://www.openplanetsfoundation.org/blogs/2013-01-09-what-do-we-mean-embedded-files-pdf]), so the presence of such a stream alone does not identify a file attachment.
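
As a rough illustration of the two mechanisms, the Python sketch below scans the raw bytes of a PDF for the name tokens involved: _/EmbeddedFiles_ (document-level), _/Subtype /FileAttachment_ (page-level) and the _/EF_ entry of a file specification dictionary. The function name and the result keys are made up for this example. It is a heuristic only: names stored inside compressed object streams or written with hex escapes are missed, and a token that happens to occur inside stream data gives a false positive, so it is no substitute for a real parser such as Preflight.

```python
import re

# A PDF name token ends at whitespace or a delimiter, so require one of those
# (or end of data) right after the name to avoid matching longer names.
_NAME_END = rb"(?=[\s/<>\[\](){}%]|$)"

DOC_LEVEL = re.compile(rb"/EmbeddedFiles" + _NAME_END)
PAGE_LEVEL = re.compile(rb"/Subtype\s*/FileAttachment" + _NAME_END)
FILE_SPEC_EF = re.compile(rb"/EF" + _NAME_END)


def scan_pdf_bytes(data: bytes) -> dict:
    """Heuristically flag attachment-related name tokens in raw PDF bytes."""
    return {
        "document_level": bool(DOC_LEVEL.search(data)),
        "page_level": bool(PAGE_LEVEL.search(data)),
        "file_spec_embedded_file": bool(FILE_SPEC_EF.search(data)),
    }
```

For a file on disk this could be called as {{scan_pdf_bytes(open(path, "rb").read())}}; any hit should be treated as a reason to inspect the file further, not as proof of an attachment.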

h2. Risks
Attachments can be in any format, so their long-term accessibility may be at risk. An attachment that contains malicious software is also a security risk.

h2. Assessment
The following table shows the relevant output of _Apache Preflight_ (part of [Apache PDFBox]) for PDFs with file attachments. Results obtained with _Preflight_ 2.0.0:

|*Reference file*|*Description*|*Error Code(s)*|*Details*|
|[fileAttachment.pdf|http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/fileAttachment.pdf]|Contains a document-level file attachment that is defined using the _EmbeddedFiles_ entry in the document’s name dictionary|1.2.9; 1.4.7|Body Syntax error, EmbeddedFile entry is present in a FileSpecification dictionary; Trailer Syntax error, EmbeddedFile entry is present in the Names dictionary|
|[fileAttachment_fileAttachmentAnnotation.pdf|http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/fileAttachment_fileAttachmentAnnotation.pdf]|Contains a page-level file attachment that is defined using a _File Attachment Annotation_|1.2.9; 5.2.1|Body Syntax error, EmbeddedFile entry is present in a FileSpecification dictionary; Forbidden field in an annotation definition, The subtype isn't authorized : FileAttachment|
|[PDF___FileAttachment.pdf|http://acroeng.adobe.com/Test_Files/file_attachments//PDF___FileAttachment.pdf]|From File Attachment Testing on Adobe Acrobat Engineering website|1.4.7|Trailer Syntax error, EmbeddedFile entry is present in the Names dictionary|
|[non_ACRO___FileAttachment.pdf|http://acroeng.adobe.com/Test_Files/file_attachments//non_ACRO___FileAttachment.pdf]|From File Attachment Testing on Adobe Acrobat Engineering website|1.4.7|Trailer Syntax error, EmbeddedFile entry is present in the Names dictionary|
|[non_PDF_ACRO___FileAttachment.pdf|http://acroeng.adobe.com/Test_Files/file_attachments//non_PDF_ACRO___FileAttachment.pdf]|From File Attachment Testing on Adobe Acrobat Engineering website|1.4.7|Trailer Syntax error, EmbeddedFile entry is present in the Names dictionary|

h2. Notes

h3. Error 1.2.9 may also indicate multimedia content
Error code 1.2.9 ('EmbeddedFile entry is present in a FileSpecification dictionary') is also reported for PDFs that contain [Multimedia content] that is represented as _Embedded File Streams_ (see above).

h3. Page-level and document-level attachments result in different errors
The results above also show that a document-level attachment produces error 1.4.7 ('Trailer Syntax error, EmbeddedFile entry is present in the Names dictionary'), whereas a page-level file attachment results in error 5.2.1 ('Forbidden field in an annotation definition, The subtype isn't authorized : FileAttachment'). In the second case the error message as a whole must be taken into account, because 5.2.1 is a generic error code that covers a number of different annotation types.
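
The mapping from error codes to attachment types can be sketched in Python as follows. This assumes the Preflight results have already been collected as (code, details) string pairs; how those pairs are obtained from Preflight's output is not covered here, and the function name and labels are invented for the example.

```python
def classify_preflight_errors(errors):
    """Map (code, details) pairs from a Preflight report to attachment findings."""
    findings = set()
    for code, details in errors:
        if code == "1.4.7":
            # EmbeddedFile entry in the Names dictionary
            findings.add("document-level attachment")
        elif code == "5.2.1" and "FileAttachment" in details:
            # 5.2.1 is generic, so the annotation subtype in the message decides
            findings.add("page-level attachment")
        elif code == "1.2.9":
            # Also reported for multimedia content, so not conclusive by itself
            findings.add("embedded file stream (attachment or multimedia)")
    return findings
```

Because 1.2.9 is kept as a separate, inconclusive label, a file that only yields that code can be routed to manual inspection instead of being counted as containing an attachment.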

h3. Preflight doesn't report _Embedded File Stream_ for Acrobat Engineering PDFs
The table above shows that error 1.2.9 isn't reported for the Acrobat Engineering PDFs, even though a manual check in a hex editor confirms that these files do contain embedded file streams. This is most likely a bug in Preflight (reported [here|https://issues.apache.org/jira/browse/PDFBOX-1758]).

h2. Recommendations

h3. Pre-ingest

* Formulate a policy on how to deal with file attachments, and on the long-term accessibility requirements of attached files.
* Use [Apache Preflight|Apache PDFBox] to establish whether files contain file attachments.
* If attached files are to remain accessible in the long term, a possible option would be to extract attached files before ingest, and ingest the attachments as supplementary file objects to the PDF.
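
If attachments are extracted before ingest, their names come from inside the PDF and should be treated as untrusted. The Python sketch below only plans where the supplementary files could be stored; it does not extract anything. The {{_attachments}} directory suffix, the function name and the sanitising rules are assumptions made for this example, not an established convention.

```python
import re
from pathlib import PurePosixPath


def supplementary_paths(pdf_path, attachment_names):
    """Return (original name, planned path) pairs in a directory beside the PDF."""
    pdf = PurePosixPath(pdf_path)
    target_dir = pdf.with_name(pdf.stem + "_attachments")
    planned, used = [], set()
    for name in attachment_names:
        # Keep only the last path component so names cannot escape target_dir.
        base = re.split(r"[\\/]", name)[-1]
        # Replace unusual characters and drop leading dots (hidden files, "..").
        base = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".") or "attachment"
        # Prefix a counter until the file name is unique within this PDF.
        candidate, counter = base, 1
        while candidate in used:
            counter += 1
            candidate = f"{counter}_{base}"
        used.add(candidate)
        planned.append((name, target_dir / candidate))
    return planned
```

Returning the original name alongside the planned path keeps the link between each supplementary file object and the attachment it came from.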

h3. Existing collections

* Use [Apache Preflight|Apache PDFBox] to detect files with file attachments in the collection.

h2. Example files
* [http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/] - PDF Cabinet of Horrors on OPF Format Corpus
* [http://acroeng.adobe.com/wp/?page_id=276] - File Attachment Testing on Adobe Acrobat Engineering website

h2. References

* [Van der Knijff, J.M. What do we mean by "embedded" files in PDF?|http://www.openplanetsfoundation.org/blogs/2013-01-09-what-do-we-mean-embedded-files-pdf] - explains use of _Embedded File Streams_ for both file attachments and multimedia.