Problem: the PDF document standard contains various features that pose a direct threat to the long-term accessibility of PDF files. Examples are password protection, external dependencies, non-embedded fonts, and the use of filter encodings that are subject to intellectual property constraints. In addition, features such as print and copy protection, and the use of fonts that cannot legally be embedded, may make future migrations impossible. Repositories may have policies that specify whether materials containing such features should be accepted. However, implementing such policies requires that PDF files can be checked automatically for these features before ingest. Such checks may also be needed for preservation watch.
The BL (led by Carl Wilson) is currently conducting some work to identify risks related to PDF files and will make this information available shortly.
PDF is one of the top 5 file formats held by the BL, and the files are numerous. The PDF format is also extremely complex.
Any other parties who are also interested in applying Issue Solutions to their Datasets. Identify the party with a link to their contact page on the SCAPE Sharepoint site, as well as identifying their institution in brackets. E.g.: Schlarb Sven (ONB)
Possible Solution approaches
Solutions that address these issues might vary depending on the particular dataset and context. Identification of risks would be a likely first stage. Mitigation of identified risks may also be appropriate in some cases. This would, however, require careful preservation planning to select the most appropriate technique (e.g. using an alternative rendering tool, making new fonts available, recording external dependencies, or migrating from PDF format to (fixed) PDF format) before taking action. QA of any migrations will be essential.
Details of the institutional context to the Issue. (May be expanded at a later date)
Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Being able to identify all potentially risky features in PDF files (effectively everything that is not allowed in PDF/A(-1?))
Ability to translate low-level control policies to machine-readable business rules
Ability to evaluate set of all potentially risky features against these business rules (e.g. using Schematron)
Solution should be sufficiently scalable (performance for profiling large data sets / stability for large and complex PDFs)
Which specific features are we able to identify reliably and successfully (e.g. encryption, non-embedded fonts, external references, multimedia content)?
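As a rough illustration of how detected features could be evaluated against machine-readable business rules, the sketch below checks a small XML report against a list of path-based rules, in the spirit of Schematron `report` rules. Everything in it is an assumption made for illustration: the XML report layout, the element and attribute names, the rule identifiers, and the `evaluate` helper are hypothetical and do not reflect the actual output of Preflight or any other tool. It uses the limited XPath subset of Python's standard library rather than a real Schematron processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML report describing one PDF. This layout is invented for
# illustration and is NOT the output format of Preflight or any other tool.
REPORT = """
<pdfReport file="example.pdf">
  <encryption present="true"/>
  <fonts>
    <font name="Helvetica" embedded="false"/>
    <font name="MyFont" embedded="true"/>
  </fonts>
  <externalReferences count="0"/>
</pdfReport>
"""

# Hypothetical business rules derived from a control policy. Each rule is
# (identifier, ElementTree path matching a risky feature, message).
# A rule is violated when its path matches at least one element.
POLICY_RULES = [
    ("no-encryption", ".//encryption[@present='true']", "File is encrypted"),
    ("fonts-embedded", ".//font[@embedded='false']", "Font is not embedded"),
    ("no-multimedia", ".//multimedia", "File contains multimedia content"),
]


def evaluate(report_xml, rules):
    """Return a list of (rule id, message, match count) for violated rules."""
    root = ET.fromstring(report_xml)
    violations = []
    for rule_id, path, message in rules:
        matches = root.findall(path)
        if matches:
            violations.append((rule_id, message, len(matches)))
    return violations


if __name__ == "__main__":
    for rule_id, message, count in evaluate(REPORT, POLICY_RULES):
        print(f"{rule_id}: {message} ({count} occurrence(s))")
```

Keeping the rules as data, separate from the checking code, means a repository could change its policy without touching the code that extracts features. A real Schematron schema would express the same rules as XPath assertions over the tool's actual XML output.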
Links to actual evaluations of this Issue/Scenario
TBC. A possible approach would be to use the Apache Preflight PDF/A validator (part of PDFBox) to identify all potential risks, and then evaluate the output against a set of business rules that correspond to low-level (control) policies. This could be done with Schematron (which requires the development of an XML output handler for Preflight), resulting in an approach similar to the JPEG 2000 / jpylyzer work. See also: