PDF to PDF/A Conversion Pre-Processor

Title PDF to PDF/A Conversion Pre-Processing
Detailed description It's unlikely that a batch PDF/A conversion utility could be written in just a couple of days:
  • The PDF format is complex: the specification document runs to 700+ pages, plus extensions
  • Conversion to PDF/A is a difficult problem; here is some of what is required:
    • All fonts embedded
    • Well-formed XMP metadata that is synchronised with the document information
    • No external links
    • No embedded objects (except other PDF/A files)
    • Audio and video content is forbidden
    • No encryption
    • No LZW compression
    • and more
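Several of the requirements above leave recognisable name tokens in a PDF's raw bytes, so a crude first-pass screen is possible. The sketch below is only a heuristic, not a validator. The `MARKERS` table and the `scan_markers` and `scan_file` functions are illustrative names, not part of any tool mentioned here. It does not parse the PDF, so it will miss tokens hidden inside compressed object streams and may report false positives when a token happens to appear in ordinary content. Font embedding and XMP checks need a real PDF parser and are not attempted.

```python
from typing import List

# Heuristic pre-screen: look for PDF name tokens commonly associated with
# PDF/A problems. Assumption: the tokens are visible in the raw bytes
# (i.e. not hidden inside compressed object streams).
MARKERS = {
    b"/Encrypt": "encryption",
    b"/LZWDecode": "LZW compression",
    b"/EmbeddedFile": "embedded file",
    b"/Launch": "launch action",
    b"/Sound": "audio content",
    b"/Movie": "video content",
}


def scan_markers(data: bytes) -> List[str]:
    """Return sorted descriptions of the suspicious tokens found in raw PDF bytes."""
    return sorted(desc for token, desc in MARKERS.items() if token in data)


def scan_file(path: str) -> List[str]:
    """Read a file from disk and run the same byte-level scan on it."""
    with open(path, "rb") as handle:
        return scan_markers(handle.read())


if __name__ == "__main__":
    # Tiny hand-written fragment, not a valid PDF; for illustration only.
    sample = (
        b"%PDF-1.4\n1 0 obj\n<< /Filter /LZWDecode >>\nendobj\n"
        b"trailer\n<< /Encrypt 5 0 R >>\n"
    )
    print(scan_markers(sample))
```

A screen like this could only ever flag candidates for closer inspection; a clean result says nothing about compliance.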
The approach chosen was to characterise the test sample and the Isartor PDF/A test corpus:
  • Discover which sample files aren't PDF/A compliant, and why
  • Compare results with characterisation of the test corpus
  • Try to find the 20% of problems that cause 80% of the failures, and see if they can be fixed
  • Create a tool that recognises the "hard" cases and highlights them as such to save effort
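The 80/20 step above can be sketched as a simple tally. This is a minimal illustration, assuming each non-compliant file has already been mapped to a list of failure reasons by some characterisation tool. The `failures` data and the `top_reasons` helper are hypothetical, not results or code from the hackathon. Because a single file can fail for several reasons, the sketch counts reason occurrences rather than files.

```python
from collections import Counter
from typing import Dict, List


def top_reasons(failures: Dict[str, List[str]], share: float = 0.8) -> List[str]:
    """Return the most frequent reasons that together cover at least
    `share` of all reported failure occurrences."""
    counts = Counter(reason for reasons in failures.values() for reason in reasons)
    total = sum(counts.values())
    selected: List[str] = []
    covered = 0
    for reason, count in counts.most_common():
        # Stop once the reasons picked so far already cover the target share.
        if covered / total >= share:
            break
        selected.append(reason)
        covered += count
    return selected


if __name__ == "__main__":
    # Made-up example data, purely for illustration.
    failures = {
        "a.pdf": ["font not embedded"],
        "b.pdf": ["font not embedded", "missing XMP"],
        "c.pdf": ["font not embedded", "encrypted"],
        "d.pdf": ["missing XMP"],
        "e.pdf": ["font not embedded"],
    }
    print(top_reasons(failures))
```

The reasons returned are the candidates worth trying to fix first; whatever falls outside that set is a natural starting point for the "hard" pile.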
At the British Library we already use Java utilities based on Apache PDFBox and JHOVE 1.6 for PDF characterisation. These tools can spot password-protected and encrypted documents, embedded objects, and external links. They can also extract document information and XMP metadata, so they seemed to provide a reasonable fit. In practice the issues proved very difficult:
  • Many files fail for multiple reasons
  • Some of the most common reasons
Solution Champion Carl Wilson [email protected]
Corresponding Issue(s)
Tool/code link A link to code on Git hub or a corresponding myExperiment if applicable
Tool Registry Link Apache PDFBox
Evaluation Notes from final presentations at the hackathon:
CO: This problem is at the top of the list of challenges. A complete solution was described in DPC What's New as the CO's dream preservation tool.
DEV: The work-in-progress characterisation toolset already identifies ~30% of issues
DEV: PDFTron identifies all problem cases! It looks to have been very thoroughly tested.
DEV: PDFTron is a $600 tool
