



Clemens Neudecker
Johan van der Knijff
William Palmer, BL (william (.) palmer (@) bl (.) uk)

User Story

Digital repositories typically hold large numbers of electronic documents from various sources. Common document formats such as PDF and EPUB include features that are potential risks for long-term accessibility and preservation. Hence, in order to sustainably manage their collections, institutions may want to identify specific preservation risks, either at ingest or at some later stage.

User Requirements/Components

  1. We need to be able to identify "preservation risks" for a given document. These risks include, but are not limited to:
    1. password protection
    2. print protection
    3. copy protection
    4. other DRM
    5. embedded proprietary content such as commercial fonts (JvdK: I think commercial fonts are only a problem if they are not embedded?)
    6. missing or damaged fonts
    7. JavaScript (which may present several security risks)
    8. multimedia content
    9. other external dependencies
  2. We need to be able to assess legacy files and deal with them appropriately
  3. We need to be able to assess files prior to ingest and deal with them appropriately
  4. We would ideally do 2 and 3 on the basis of some machine-readable policy
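
As a rough illustration of requirement 4, a machine-readable policy could be as simple as a mapping from detected risks to actions. The sketch below is only an assumption about how such a policy might look: the risk identifiers, the `Action` values, the `POLICY` table and the `assess` function are all hypothetical names introduced here, not part of any existing tool or of the BL ingest policy.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    WARN = "warn"
    REJECT = "reject"


# Hypothetical policy table: risk identifiers and actions are illustrative only.
POLICY = {
    "password_protection": Action.REJECT,
    "print_protection": Action.WARN,
    "copy_protection": Action.WARN,
    "missing_fonts": Action.REJECT,
    "javascript": Action.WARN,
    "multimedia": Action.WARN,
}

# Actions ordered from least to most severe.
_SEVERITY = [Action.ACCEPT, Action.WARN, Action.REJECT]


def assess(detected_risks, policy=POLICY, default=Action.WARN):
    """Return (overall action, per-risk decisions) for a list of detected risks.

    Risks that the policy does not mention fall back to `default`.
    A document with no detected risks is accepted.
    """
    decisions = {risk: policy.get(risk, default) for risk in detected_risks}
    overall = max(decisions.values(), key=_SEVERITY.index, default=Action.ACCEPT)
    return overall, decisions
```

The same function could serve both legacy files and files assessed prior to ingest; only the policy table passed in would need to differ.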


Create experiments as child pages and they should appear automatically here

Characterisation of ebook formats to identify DRM, etc. as per BL ingest policy (PC)
Data: No. Awaiting test data from publishers. Will not be public.
Workflow: No.
Issues: See data.

Wrap tool for use in Rosetta & execute over some content (OK)
Data: TBD
Workflow: No
Issues: Not yet!

Developer Notes

To be confirmed. For PDF, a possible approach would be to use the Apache Preflight PDF/A validator (part of PDFBox) to identify all potential risks, and then evaluate the output against a set of business rules that correspond to low-level (control) policies. This could be done with Schematron (which requires development of an XML output handler for Preflight), resulting in an approach similar to the JPEG 2000 / jpylyzer work. See also:

For EPUB something similar could be done using the EpubCheck tool.
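
The "validate, then evaluate the output against business rules" idea can be sketched in plain Python. This is a minimal sketch under stated assumptions: the XML report layout, the error codes and the rule names below are invented for illustration (Preflight has no XML output handler yet, as noted above), and the rules use the limited XPath subset of the standard library instead of real Schematron.

```python
import xml.etree.ElementTree as ET

# Hypothetical validator report; neither Preflight nor EpubCheck is known
# to produce this exact structure or these error codes.
SAMPLE_REPORT = """
<report>
  <errors>
    <error><code>ENCRYPTION</code><details>Encrypt dictionary found</details></error>
    <error><code>FONT_NOT_EMBEDDED</code><details>Font is not embedded</details></error>
  </errors>
</report>
"""

# Each rule: (rule id, XPath that must NOT match, message shown on failure).
# This mirrors the intent of a Schematron "report" rule.
RULES = [
    ("no-encryption", ".//error[code='ENCRYPTION']", "Document must not be encrypted"),
    ("fonts-embedded", ".//error[code='FONT_NOT_EMBEDDED']", "All fonts must be embedded"),
    ("no-javascript", ".//error[code='JAVASCRIPT']", "Document must not contain JavaScript"),
]


def evaluate(report_xml, rules=RULES):
    """Return the (rule id, message) pairs for every rule the report violates."""
    root = ET.fromstring(report_xml.strip())
    return [
        (rule_id, message)
        for rule_id, xpath, message in rules
        if root.findall(xpath)  # any match means the rule is violated
    ]


if __name__ == "__main__":
    for rule_id, message in evaluate(SAMPLE_REPORT):
        print(f"FAILED {rule_id}: {message}")
```

Keeping the rules as data, separate from the evaluation code, is what would allow them to be replaced later by an actual Schematron schema derived from the control policies.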

This policy validation is also something that SCAPE's SCOUT could or should deal with.

Related Documents
