Title
PDF/A Validation tools give different results
Detailed description
There is a disparity between the results of the different tools that validate PDF/A files.
The tools tried include:
- PDFTron PDF/A Manager
- Adobe Preflight
- Apache PDFBox Preflight
Each tool validates the same PDF/A files with different results: some decide that a file is PDF/A compliant, while others decide that the same file is not. The errors reported are not necessarily the same across tools, although they can be.
It would be useful to know which errors are genuine problems and which are not a preservation issue. For example, the font width error 'Widths in embedded font are inconsistent with /Widths entry in the font dictionary.' is reported consistently by all tools, but a forum post (https://groups.google.com/forum/#!topic/pdfnet-sdk/L2osfwaap98) suggests that this information is not actually used when rendering a PDF or PDF/A file, so it should arguably be reported as a warning rather than an error.
Ten sample files and the sample output from PDF/eh? are available on the PDF/eh? GitHub here.
Output from PDFBox Preflight for 307 PDF/A files from the ADS Grey Lit library is attached to this page. To download a PDF/A file mentioned in the results, search on the OASIS id field using the filename of the PDF file with the extension and suffix removed, i.e. withamar1-84621_1.pdf would become withamar1-84621.
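The filename-to-id step described above can be sketched in Python. This is only an illustration: the function name oasis_id is hypothetical, and it assumes the suffix is always an underscore followed by digits (as in the _1 of the example), which may not hold for every file in the collection.

```python
import re
from pathlib import Path


def oasis_id(filename):
    """Derive the OASIS id search term from a PDF filename.

    Assumption: the suffix to strip is an underscore followed by
    digits at the end of the name, e.g. "_1".
    """
    stem = Path(filename).stem          # drop the ".pdf" extension
    return re.sub(r"_\d+$", "", stem)   # drop the trailing "_<digits>" suffix


print(oasis_id("withamar1-84621_1.pdf"))  # withamar1-84621
```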
Issue champion
Other interested parties
Aran Lewis, Peter Cliff, Graham Seaman, Anne Archer, David Tarrant
Possible Solution approaches
- Use one of the more sensitive validators (e.g. PDFBox Preflight) and apply an XSLT to filter the XML results, marking each file as good, bad or ugly (or, in fact, green, amber or red): amber/bad for files with only warning-style errors, and red for a full-on preservation issue.
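The red/amber/green idea could be prototyped along the following lines. This sketch uses Python rather than XSLT, purely to show the classification logic. The XML layout (a report element containing error elements, each with code and details children), the error code FONT_WIDTHS and the contents of AMBER_CODES are all assumptions for illustration, not the actual output format of PDFBox Preflight or any other validator; the element names and codes would need to be adjusted to the real reports, and the list of warning-style errors would need to be agreed.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of error codes judged to be warning-style (amber)
# rather than genuine preservation problems.
AMBER_CODES = {"FONT_WIDTHS"}


def classify(report_xml, amber_codes=AMBER_CODES):
    """Return 'green', 'amber' or 'red' for one validator report.

    Assumed (not real) layout: <report><error><code>..</code>
    <details>..</details></error>...</report>
    """
    root = ET.fromstring(report_xml)
    codes = [e.findtext("code", default="").strip() for e in root.iter("error")]
    if not codes:
        return "green"                      # no errors at all
    if all(c in amber_codes for c in codes):
        return "amber"                      # only warning-style errors
    return "red"                            # at least one genuine problem


SAMPLE = """<report>
  <error>
    <code>FONT_WIDTHS</code>
    <details>Widths in embedded font are inconsistent with /Widths entry in the font dictionary.</details>
  </error>
</report>"""

print(classify(SAMPLE))  # amber
```

A file is marked red as soon as any error falls outside the agreed warning list, so an incomplete list errs on the side of flagging files for review.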
Context
Details of the institutional context to the Issue.
Lessons Learned
Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Datasets
PDF files from the Archaeology Data Service's grey literature library collection
Middlesex University eprints repository full text documents
Solutions
Visual Analysis of Preflight Output
PDFBox Preflight 2 - Uses and Abuses