Johan van der Knijff
William Palmer, BL (william (.) palmer (@) bl (.) uk)
Digital repositories typically hold large numbers of electronic documents from various sources. Common document formats such as PDF and EPUB include features that are potential risks for long-term accessibility and preservation. Hence, in order to sustainably manage their collections, institutions may want to identify specific preservation risks, either at ingest or at some later stage.
- We need to be able to identify "preservation risks" for a given document. These risks include, but are not limited to:
- password protection
- print protection
- copy protection
- other DRM
- embedded proprietary content such as commercial fonts (JvdK: I think commercial fonts are only a problem if they are not embedded?)
- missing or damaged fonts
- multimedia content
- other external dependencies
- We need to be able to assess legacy files and deal with them appropriately
- We need to be able to assess files prior to ingest and deal with them appropriately
- We would ideally do both assessments (of legacy files and of files prior to ingest) on the basis of some machine-readable policy
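As a rough illustration of what a machine-readable policy could look like, the sketch below maps each risk from the list above to an action and derives an overall decision for a file. The risk identifiers, the action names (`accept`, `warn`, `reject`) and the `assess` function are all hypothetical and are not taken from any existing BL or SCAPE policy.

```python
# Hypothetical policy: risk identifier -> action. Names are illustrative only.
POLICY = {
    "password_protection": "reject",
    "print_protection": "warn",
    "copy_protection": "warn",
    "other_drm": "reject",
    "missing_or_damaged_fonts": "warn",
    "multimedia_content": "warn",
    "external_dependencies": "warn",
}

# Ordering of actions from least to most severe.
SEVERITY = {"accept": 0, "warn": 1, "reject": 2}


def assess(detected_risks, policy=POLICY, default_action="warn"):
    """Return (overall_action, per_risk_actions) for the risks found in a file.

    Risks that the policy does not mention get `default_action`.
    A file with no detected risks is accepted.
    """
    per_risk = {risk: policy.get(risk, default_action) for risk in detected_risks}
    overall = max(per_risk.values(), key=SEVERITY.__getitem__, default="accept")
    return overall, per_risk


if __name__ == "__main__":
    print(assess(["print_protection", "password_protection"]))
```

The same policy table could serve both legacy files and files prior to ingest; only the follow-up action taken on a `warn` or `reject` would differ.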
Characterisation of ebook formats to identify DRM, etc. as per BL ingest policy (PC)
Data: No. Awaiting test data from publishers. Will not be public.
Issues: See data.
Wrap tool for use in Rosetta & execute over some content (OK)
Issues: Not yet!
TBC. For PDF, a possible approach would be to use the Apache Preflight PDF/A validator (part of PDFBox) to identify all potential risks, and then evaluate its output against a set of business rules that correspond to low-level (control) policies. This could be done with Schematron (which would require developing an XML output handler for Preflight), resulting in an approach similar to the JPEG 2000 / jpylyzer work. See also:
For EPUB, something similar could be done using the EpubCheck tool.
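The idea of evaluating validator output against business rules can be sketched as follows. This is only an illustration: the XML report layout, the `category` attribute and the rule set are invented for the example (Preflight had no XML output handler at the time of writing), and plain XPath checks from the Python standard library stand in for Schematron here.

```python
import xml.etree.ElementTree as ET

# Hypothetical validator report; the element and attribute names are assumptions.
SAMPLE_REPORT = """
<preflight>
  <isValid>false</isValid>
  <errors>
    <error category="encryption"><details>Document is encrypted</details></error>
    <error category="font"><details>Font is not embedded</details></error>
  </errors>
</preflight>
"""

# Each rule: (identifier, XPath that matches when the rule is violated, message).
RULES = [
    ("no-encryption", ".//error[@category='encryption']", "File must not be encrypted"),
    ("fonts-embedded", ".//error[@category='font']", "All fonts must be embedded"),
    ("no-multimedia", ".//error[@category='multimedia']", "File must not contain multimedia"),
]


def evaluate(report_xml, rules=RULES):
    """Return the (identifier, message) pairs of all rules the report violates."""
    root = ET.fromstring(report_xml.strip())
    return [
        (rule_id, message)
        for rule_id, xpath, message in rules
        if root.find(xpath) is not None
    ]


if __name__ == "__main__":
    for rule_id, message in evaluate(SAMPLE_REPORT):
        print(f"FAILED {rule_id}: {message}")
```

Keeping the rules as data rather than code means they could later be expressed as Schematron assertions without changing the validator step.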
This policy validation is also something that SCAPE's SCOUT should or could deal with.