PDFBox Preflight 2 - Uses and Abuses

Detailed description
PDFBox Preflight v2.0 is a validation tool developed under the Apache Software Foundation as part of the PDFBox PDF toolkit. At a previous event we enhanced Preflight to generate XML output - a code change that is now in the latest (as yet unreleased) version of the tool. During that event a few things became apparent:

  • Preflight is thorough and unforgiving (as it should be), so it reports many errors, and it was not clear to us which of them matter.
  • Preflight's error messages are not very user-friendly (probably a reflection of the PDF spec).
  • Failure to validate with Preflight does not mean a PDF is wholly useless or unworthy of preservation - in nearly all cases invalid PDFs are still renderable.

As such we set out to understand PDFs better, to see whether PDF/A meets the needs of digital preservation (failure to comply with PDF/A-1b might not indicate a significant preservation risk) and to make Preflight usable.

At the previous event an XSLT filter representing a policy was created, in which each validation problem identified by Preflight could be flagged as "fail", "warn" or "ignore". In this way a knowledgeable repository owner could refine the validation, for example by ignoring font errors. During this event Dave Tarrant investigated how this XSLT could be integrated with repository software - preliminary work for the OR2013 Developer Challenge.
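
To make the policy idea concrete, here is a minimal sketch in Python rather than XSLT. It is not the filter written at the event: the rule patterns, the error codes and the `apply_policy` helper are all hypothetical, and the overall "pass" result is an added convenience for documents with nothing worse than ignored errors.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: (error-code pattern, action) pairs, first match wins.
# The code prefixes are placeholders, not taken from Preflight's real output.
POLICY = [
    ("3.*", "ignore"),  # e.g. a repository that chooses to ignore font errors
    ("7.*", "warn"),    # e.g. problems worth a look but not fatal
    ("*", "fail"),      # anything not listed above fails
]

SEVERITY = {"ignore": 0, "warn": 1, "fail": 2}


def classify(code, policy=POLICY):
    """Return the action of the first policy rule whose pattern matches code."""
    for pattern, action in policy:
        if fnmatchcase(code, pattern):
            return action
    return "fail"  # be strict when no rule matches


def apply_policy(errors, policy=POLICY):
    """Annotate (code, message) pairs and compute an overall result.

    Returns (overall, annotated) where overall is "pass", "warn" or "fail"
    and annotated is a list of (action, code, message) tuples.
    """
    annotated = [(classify(code, policy), code, msg) for code, msg in errors]
    worst = max((SEVERITY[action] for action, _, _ in annotated), default=0)
    overall = {0: "pass", 1: "warn", 2: "fail"}[worst]
    return overall, annotated
```

With this policy a document whose only problems match `3.*` passes, while any unlisted error code fails it; a repository owner would tune the rule list to their own risk appetite.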

We also created a very simple (and buggy) wrapper for Preflight that exposes validation as a simple HTTP POST web service. Details are available on GitHub.
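
As a rough illustration of how a client might call such a service, the sketch below POSTs a PDF using only the Python standard library. The endpoint URL, the raw-bytes request body and the content type are assumptions made for the example; the wrapper's actual interface is described in its repository and may differ (it may expect a multipart form upload, for instance).

```python
import urllib.request

# Hypothetical endpoint: replace with wherever the wrapper is actually running.
SERVICE_URL = "http://localhost:8080/validate"


def build_validation_request(pdf_bytes, url=SERVICE_URL):
    """Build (but do not send) a POST request carrying the raw PDF bytes."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},  # assumed content type
        method="POST",
    )


def validate(pdf_path, url=SERVICE_URL, timeout=60):
    """Send a PDF to the service and return the response body as text."""
    with open(pdf_path, "rb") as fh:
        request = build_validation_request(fh.read(), url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Keeping request construction separate from sending makes the client easy to check without a running server.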

Finally we spent a fair amount of time running Preflight against the PDFs provided by ADS and Middlesex, and put the validation results on GitHub so people could continue the fight once we'd all gone home.

Solution Champion
Peter Cliff, Graham Seaman, Dave Tarrant

Corresponding Issue(s)
PDFA Validation tools give different results

Tool/code link
https://github.com/openplanets/pdfeh/
https://github.com/petecliff/preflight-server/

Tool Registry Link
Add an entry to the OPF Tool Registry, and provide a link to it here.

Evaluation
What did we learn? That PDF and PDF/A validation is hard - the PDF spec is complicated! Because of this, validators often give mixed messages: PDFTron validated a number of PDF/As that Preflight did not. However, we all lacked the PDF knowledge to know which was right, or whether the problems Preflight highlighted really mattered. Working that out would be a great next step, but it will be difficult (isn't it what the PDF/A creators have been wrestling with for ages?).

Subsequently I've found a couple of PDFs within the BL that did not validate using an older version of JHOVE (version unknown) but do validate with the latest (1.10, with PDF-hul 1.7).

This all suggests there is a raft of work to be done here to keep PDFs alive. We may also want to look further at pragmatic preservation - if it renders, is that good enough? What are the implications of that kind of decision?

Finally, shouldn't we just use EPUB3 instead?

Labels: spruce_london_2, solution, validation, pdf, nodejs, server
  1. Jul 26, 2013

    In addition to this, I think it's worth adding that Preflight may not be as thorough (yet!) as is suggested here; see blog post below for an explanation:

    http://www.openplanetsfoundation.org/blogs/2013-07-25-identification-pdf-preservation-risks-sequel