This scenario focuses on the identification of preservation risks in
PDF files in order to ensure that rendering services can continue to be supported.

Dataset:

Title
KB Open Access Journals PDFs
Description Large (approx. 380,000 files) dataset of PDF files from open access journals.
Licensing TBC
Owner KB
Dataset Location TBC (Dataset currently in preparation)
Collection expert Johan van der Knijff (KB)
Issues brainstorm
List of Issues IS11 PDF files may face preservation risks


Issue:

Title
IS11 PDF files may face preservation risks
Detailed description Problem: the PDF document standard contains various features that pose a direct threat to the long-term accessibility of PDF files. Examples include password protection, external dependencies, non-embedded fonts, and the use of filter encodings that are subject to intellectual property constraints. In addition, features such as print and copy protection, and the use of fonts that cannot be legally embedded, may make future migrations impossible. Repositories may have policies that specify whether materials containing such features should be accepted. However, implementing such policies requires that PDF files can be checked automatically for these features before ingest. Such checks may also be needed for preservation watch.
The BL (led by Carl Wilson) is currently conducting work to identify risks related to PDF files and will make this information available shortly.
Scalability Challenge
PDF is among the top 5 file formats held by the BL, so the number of files is large. The PDF format is also extremely complex.
Issue champion Maureen Pennock (BL)
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets. Identify the party with a link to their contact page on the SCAPE Sharepoint site, as well as identifying their institution in brackets. Eg: Schlarb Sven (ONB)
Possible Solution approaches Solutions that address these issues might vary depending on the particular dataset and context. Identification of risks would be a likely first stage. Mitigation of identified risks may also be appropriate in some cases. This would, however, require careful preservation planning to select the most appropriate technique (e.g. using an alternative rendering tool, making new fonts available, recording external dependencies, or migrating from PDF format to (fixed) PDF format) before taking action. QA of any migrations will be essential.
Context Details of the institutional context to the Issue. (May be expanded at a later date)
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Training Needs Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Datasets TBC
Solutions SO25 Rosetta v3.0 Implementation Integrated with DROID 6
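
As a rough illustration of the automatic pre-ingest check mentioned in the detailed description, the following minimal Python sketch scans the raw bytes of a file for PDF dictionary keys that hint at risky features. The risk labels and the `scan_pdf_bytes` function are hypothetical names chosen for this example. Matching raw bytes is only a heuristic and is not a substitute for a real PDF parser: keys held inside compressed object streams are missed, and a key that merely appears in page text gives a false positive.

```python
import re

# Hypothetical mapping from a risk label to a PDF name token that hints at it.
# The tokens are real PDF dictionary keys, but searching raw bytes is a
# heuristic only: keys inside compressed object streams will be missed.
RISK_MARKERS = {
    "encryption": rb"/Encrypt\b",
    "javascript": rb"/JavaScript\b",
    "launch_action": rb"/Launch\b",
    "embedded_file": rb"/EmbeddedFile\b",
    "uri_action": rb"/URI\b",
}


def scan_pdf_bytes(data: bytes) -> dict:
    """Return {risk_label: bool}, True where the marker occurs in the bytes."""
    return {
        label: re.search(pattern, data) is not None
        for label, pattern in RISK_MARKERS.items()
    }


# Usage with an in-memory fragment; for a real file, pass the result of
# open(path, "rb").read() instead.
sample = b"%PDF-1.4\ntrailer << /Root 1 0 R /Encrypt 5 0 R >>\n%%EOF"
print(scan_pdf_bytes(sample))
```

A scan of this kind could at most serve as a cheap first filter on a large collection; reliable identification would still need a validator that actually parses the file.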

Evaluation

Objectives scalability, coverage, preciseness, automation
Success criteria
  • Being able to identify all potentially risky features in PDF files (effectively everything that is not allowed in PDF/A(-1?))
  • Ability to translate low-level control policies to machine-readable business rules
  • Ability to evaluate the set of all potentially risky features against these business rules (e.g. using Schematron)
  • Solution should be sufficiently scalable (performance for profiling large data sets / stability for large and complex PDFs)
Automatic measures TBC
Manual assessment Which specific features are we able to identify reliably and successfully (e.g. encryption, non-embedded fonts, external references, multimedia content)
Actual evaluations Links to actual evaluations of this Issue/Scenario
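
One possible way to support the manual assessment above is to compare a tool's detections with a small hand-labelled set of files and report precision and recall per feature. The Python sketch below assumes both inputs are dictionaries that map a file name to a set of feature labels. The function name, the labels and the sample data are hypothetical and do not come from any actual evaluation.

```python
from collections import Counter


def per_feature_scores(detected, truth):
    """Compute precision and recall for each feature label.

    detected, truth: dicts mapping file name -> set of feature labels.
    Returns {feature: (precision, recall)}; a value is None when undefined.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for name, expected in truth.items():
        found = detected.get(name, set())
        for feature in found & expected:
            tp[feature] += 1          # correctly detected
        for feature in found - expected:
            fp[feature] += 1          # detected but not actually present
        for feature in expected - found:
            fn[feature] += 1          # present but missed
    scores = {}
    for feature in set(tp) | set(fp) | set(fn):
        flagged = tp[feature] + fp[feature]
        present = tp[feature] + fn[feature]
        precision = tp[feature] / flagged if flagged else None
        recall = tp[feature] / present if present else None
        scores[feature] = (precision, recall)
    return scores


# Made-up ground truth and detections for three files.
truth = {
    "a.pdf": {"encryption"},
    "b.pdf": {"encryption", "non_embedded_fonts"},
    "c.pdf": set(),
}
detected = {
    "a.pdf": {"encryption"},
    "b.pdf": {"non_embedded_fonts"},
    "c.pdf": {"non_embedded_fonts"},
}
print(per_feature_scores(detected, truth))
```

Low recall for a feature would indicate a coverage gap, while low precision would indicate that the feature is being reported where it does not occur.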

Solutions:

TBC. A possible approach would be to use the Apache Preflight PDF/A validator (part of PDFBox) to identify all potential risks, and then to evaluate its output against a set of business rules that correspond to low-level (control) policies. This could be done with Schematron (which requires the development of an XML output handler for Preflight), resulting in an approach similar to the JPEG 2000 / jpylyzer work. See also:

http://www.openplanetsfoundation.org/comment/385#comment-385
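
The rule-evaluation step of the approach described above could look roughly like the following Python sketch. It is not Schematron: it uses the limited XPath support in Python's standard `xml.etree.ElementTree` as a stand-in, with each rule behaving like a Schematron report that fires when its path matches. The XML report format is entirely hypothetical, since Preflight would first need an XML output handler, and the rule identifiers and the `evaluate` function are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical report format. None of these element or attribute names
# come from Preflight, which has no XML output handler yet.
REPORT = """
<pdfProfile file="example.pdf">
  <encryption present="false"/>
  <fonts>
    <font name="Helvetica" embedded="false"/>
    <font name="TimesNewRoman" embedded="true"/>
  </fonts>
</pdfProfile>
"""

# Each business rule is (id, ElementTree path, message). A rule fires when
# its path matches at least one element in the report.
RULES = [
    ("no-encryption", ".//encryption[@present='true']", "File is encrypted"),
    ("fonts-embedded", ".//font[@embedded='false']", "Font is not embedded"),
]


def evaluate(report_xml, rules):
    """Return a list of (rule_id, message, attributes) for every violation."""
    root = ET.fromstring(report_xml)
    violations = []
    for rule_id, path, message in rules:
        for element in root.findall(path):
            violations.append((rule_id, message, dict(element.attrib)))
    return violations


for violation in evaluate(REPORT, RULES):
    print(violation)
```

Keeping the rules as data rather than code means that a change in policy only requires a change to the rule list, which is the same separation that Schematron would provide.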
