
| *Title* \\ | IS11 PDF files may face preservation risks |
| *Detailed description* | Problem: the PDF document standard contains various features that pose a direct threat to the long-term accessibility of PDF files. Examples are password protection, external dependencies, non-embedded fonts, and filter encodings that are subject to intellectual property constraints. Features such as print and copy protection, and fonts that cannot be legally embedded, may also make future migrations impossible. Repositories may have policies that specify whether materials containing such features should be accepted. However, implementing such policies requires that PDF files can be checked automatically for these features before ingest. Such checks may also be needed for preservation watch. \\
The BL (led by Carl Wilson) is currently working to identify risks related to PDF files and will make this information available shortly. \\ |
| *Scalability Challenge* \\ | PDFs are in the top 5 of file formats held by the BL and are numerous. The PDF format is extremely complex. \\ |
| *[Issue champion|SP:Responsibilities of the roles described on these pages]* | [Maureen Pennock|] (BL) |
| *Other interested parties* \\ | _Any other parties who are also interested in applying Issue Solutions to their Datasets. Identify the party with a link to their contact page on the SCAPE Sharepoint site, as well as identifying their institution in brackets. Eg:_ [Schlarb Sven|] _(ONB)_ |
| *Possible Solution approaches* | Solutions that address these issues might vary depending on the particular dataset and context. Identification of risks would be a likely first stage. Mitigation of identified risks may also be appropriate in some cases. This would, however, require careful preservation planning to select the most appropriate technique (e.g. using an alternative rendering tool, making new fonts available, recording external dependencies, migrating from PDF format to (fixed) PDF format) before taking action. QA of any migrations will be essential. |
| *Context* | _Details of the institutional context to the Issue. (May be expanded at a later date)_ \\ |
| *Lessons Learned* | _Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)_ \\ |
| *Training Needs* | _Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP._ \\ |
| *Datasets* | _TBC_ \\ |
| *Solutions* | [SO25 Rosetta v3.0 Implementation Integrated with DROID 6|] \\ |
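
The automatic pre-ingest check described above could start from something like the following. This is only a minimal sketch under stated assumptions, not the BL's method: it scans the raw bytes of a file for PDF name tokens that hint at a risky feature. The marker table and the function names {{scan_pdf_bytes}} and {{scan_pdf_file}} are hypothetical. The approach is a heuristic that misses anything stored inside compressed object streams and cannot judge font embedding, so a real solution would need a proper PDF parser.

```python
import re

# Hypothetical marker table (an assumption for illustration): PDF name
# tokens mapped to the risk they hint at. \b keeps /Encrypt from matching
# the unrelated /EncryptMetadata key.
RISK_MARKERS = {
    "encryption": rb"/Encrypt\b",
    "javascript": rb"/JavaScript\b",
    "launch_action": rb"/Launch\b",
    "embedded_file": rb"/EmbeddedFiles?\b",
    "external_uri": rb"/URI\b",
}


def scan_pdf_bytes(data: bytes) -> dict:
    """Return {risk_name: bool}, True where the marker occurs in the raw bytes."""
    return {
        name: re.search(pattern, data) is not None
        for name, pattern in RISK_MARKERS.items()
    }


def scan_pdf_file(path: str) -> dict:
    """Read a file from disk and scan it with scan_pdf_bytes."""
    with open(path, "rb") as handle:
        return scan_pdf_bytes(handle.read())
```

A hit only signals that a feature may be present and deserves closer inspection; the absence of a hit does not prove that a file is free of that feature.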

h1. Evaluation

| *Objectives* | scalability, coverage, preciseness, automation |
| *Success criteria* | * Ability to identify all potentially risky features in PDF files (effectively everything that is not allowed in PDF/A(-1?))
* Ability to translate low-level control policies to machine-readable business rules
* Ability to evaluate the set of all potentially risky features against these business rules (e.g. using Schematron)
* Solution should be sufficiently scalable (performance for profiling large data sets / stability for large and complex PDFs) |
| *Automatic measures* | TBC |
| *Manual assessment* | Which specific features are we able to identify reliably and successfully (e.g. encryption, non-embedded fonts, external references, multimedia content)? \\ |
| *Actual evaluations* | Links to actual evaluations of this Issue/Scenario |
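
The success criteria above call for control policies expressed as machine-readable business rules and evaluated against the features found in a file. The sketch below shows one way this could look, assuming a feature profile is available as a simple dictionary. It uses plain Python predicates as a stand-in for Schematron rather than Schematron itself, and the rule names, profile keys and the {{evaluate}} function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Profile = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """One business rule: a name, a message, and a violation test."""
    name: str
    message: str
    is_violated: Callable[[Profile], bool]


# Hypothetical policy (an assumption for illustration): each rule inspects
# one key of the feature profile reported for a PDF file.
POLICY: List[Rule] = [
    Rule("no-encryption", "File must not be encrypted",
         lambda p: bool(p.get("encrypted"))),
    Rule("fonts-embedded", "All fonts must be embedded",
         lambda p: p.get("non_embedded_fonts", 0) > 0),
    Rule("no-external-references", "File must not depend on external resources",
         lambda p: bool(p.get("external_references"))),
]


def evaluate(profile: Profile, policy: List[Rule] = POLICY) -> List[str]:
    """Return the names of all rules that the profile violates, in policy order."""
    return [rule.name for rule in policy if rule.is_violated(profile)]
```

For example, {{evaluate({"encrypted": True, "non_embedded_fonts": 2})}} returns {{["no-encryption", "fonts-embedded"]}}, while an empty profile violates no rule. Keeping the rules as data separate from the evaluation loop means a repository could change its policy without touching the code that produces the profiles.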