PDF/A Validation tools give different results

Title
PDF/A Validation tools give different results

Detailed description

There is a disparity between the results of the different tools which perform validation of PDF/A files.

The tools tried include:

  • PDFTron PDF/A Manager
  • Adobe Preflight
  • Apache PDFBox Preflight

Each tool validates the PDF/A files with different results: some decide that a PDF file is PDF/A compliant, while others decide that the same file is not. The errors reported are not necessarily the same across the different tools, although they can be.

It would be useful to know which errors are genuine problems and which are not a preservation issue. For example, the font-width error 'Widths in embedded font are inconsistent with /Widths entry in the font dictionary.' is reported consistently by all the tools, but there are forum posts (https://groups.google.com/forum/#!topic/pdfnet-sdk/L2osfwaap98) which suggest that this information is not actually used when rendering a PDF or PDF/A file, and so it should arguably be reported as a warning rather than an error.

Ten sample files and the sample output from PDF/eh? are available on the PDF/eh? GitHub here.

Output from PDFBox Preflight is attached to this page for 307 PDF/A files from the ADS Grey Lit library. To download a PDF/A file mentioned in the results, search on the OASIS id field using the filename of the PDF file with the extension and suffix removed, i.e. withamar1-84621_1.pdf would become withamar1-84621.
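The filename-to-OASIS-id step can be sketched in a few lines of Python. This is only an illustration: the function name oasis_id is made up here, and it assumes the suffix is always an underscore followed by digits, as in the single example given above.

```python
import re

def oasis_id(filename):
    """Drop the .pdf extension and a trailing _<digits> suffix.

    Assumption: the suffix always looks like "_1", "_2", etc.
    """
    stem = re.sub(r"\.pdf$", "", filename, flags=re.IGNORECASE)
    return re.sub(r"_\d+$", "", stem)

print(oasis_id("withamar1-84621_1.pdf"))  # withamar1-84621
```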

Name | Size | Creator | Creation Date | Comment
pdftron-report-pdfa-1b.xml | 2 kB | Andrew Jackson | Jul 08, 2013 15:11 | PDFTron (PDFTron PDF/A Manager V6.000.) output, for comparison.
pdfbox.preflight.xml | 302 kB | Jo Gilham | Jul 04, 2013 11:43 | xml output from PDFBox preflight on 307 PDF/A files mostly created by PDFTron

Issue champion

Jo Gilham

Other interested parties

Aran Lewis, Peter Cliff, Graham Seaman, Anne Archer, David Tarrant

Possible Solution approaches

  • Use one of the more sensitive validators (e.g. PDFBox) and apply an XSLT to filter the XML results, marking each file as good, bad or ugly (or, in fact, green, amber or red). Amber would cover files whose only errors are on the list of warning-style errors, and red would mean a full-on preservation issue.
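The traffic-light idea can be sketched as follows, in Python rather than XSLT to keep the logic easy to read. This is a rough illustration, not a working filter for the attached report. The XML layout (file elements with a name attribute and error children) is hypothetical and would need adapting to the real PDFBox Preflight output. The amber list is also an assumption: it contains only the font-width message quoted above, and would need to be curated.

```python
import xml.etree.ElementTree as ET

# Assumption: messages regarded as warnings rather than preservation risks.
# Only the font-width message comes from this page; the list needs curating.
AMBER_MESSAGES = (
    "Widths in embedded font are inconsistent with /Widths entry",
)

def classify(messages):
    """Return 'green', 'amber' or 'red' for one file's validation messages."""
    if not messages:
        return "green"  # no errors reported at all
    if all(any(known in m for known in AMBER_MESSAGES) for m in messages):
        return "amber"  # every error is on the warning-style list
    return "red"        # at least one error is not on the list

def classify_report(xml_text):
    """Classify every file in a report.

    Hypothetical layout: <file name="..."><error>text</error>...</file>
    """
    root = ET.fromstring(xml_text)
    return {
        f.get("name"): classify([e.text or "" for e in f.iter("error")])
        for f in root.iter("file")
    }

SAMPLE = """<report>
  <file name="a.pdf"/>
  <file name="b.pdf">
    <error>Widths in embedded font are inconsistent with /Widths entry in the font dictionary.</error>
  </file>
  <file name="c.pdf">
    <error>Some other problem</error>
  </file>
</report>"""

for name, colour in classify_report(SAMPLE).items():
    print(name, colour)
```

A file is marked red as soon as it has one error that is not on the amber list, so unknown errors are treated as preservation issues until someone reviews them.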

Context

Details of the institutional context to the Issue.

Lessons Learned
Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice

Datasets

PDF files from the Archaeology Data Service's grey literature library collection

Middlesex University eprints repository full text documents

Solutions

Visual Analysis of Preflight Output
PDFBox Preflight 2 - Uses and Abuses

Labels: spruce_london_2, issue, conformance