
| *Title* | PDF to PDF/A Conversion Pre-Processing |
| *Detailed description* | It is unlikely that a batch PDF/A conversion utility could be written in just a couple of days: \\
* The PDF format is complex: the specification document runs to 700\+ pages, plus extensions
* Conversion to PDF/A is a difficult problem; here is some of what is required:
** All fonts embedded
** Well-formed XMP metadata that is synchronised with the document information
** No external links
** No embedded objects (except other PDF/A files)
** Audio and video content is forbidden
** No encryption
** No LZW compression
** and more...\\
 The approach chosen was to characterise the test sample and the Isartor PDF/A test corpus: \\
* Discover which sample files aren't PDF/A compliant, and why
* Compare results with characterisation of the test corpus
* Try to find the 20% of problems that cause 80% of the failures, and see if they can be fixed
* Create a tool that recognises the "hard" cases and highlights them as such to save effort\\
 At the British Library we already use Java utilities based on Apache PDFBox and JHOVE 1.6 for PDF characterisation. These tools can spot password-protected and encrypted documents, embedded objects, and external links. They can also extract document information and XMP metadata, so they seemed to provide a reasonable fit. In reality the issues proved very difficult: \\
* Many files fail for multiple reasons
* Some of the most common reasons |
| *Solution Champion* | Carl Wilson |
| *Corresponding Issue(s)* | * _[PDF to PDF-A conversion|REQ:PDF to PDF-A conversion]_ |
| *Tool/code link* | _A link to code on GitHub or a corresponding_ _[myExperiment|]_ _if applicable_ |
| *[Tool Registry Link|]* | _[TR:Apache PDFBox]_ \\ |
| *Evaluation* |Notes from final presentations at the hackathon:\\
CO: This problem is at the top of the list of challenges. A complete solution was described in DPC What's New as the CO's dream preservation tool.\\
DEV: The work-in-progress characterisation toolset already identifies \~30% of issues\\
DEV: PDFTron identifies all problem cases\! It looks to have been very thoroughly tested.\\
DEV: PDFTron is a $600 tool\\
\\ |
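
The "20% of problems that cause 80% of the failures" step and the flagging of "hard" cases can be sketched roughly as follows. This is a minimal illustration in Python, not the British Library toolset: the reason labels, the sample data, the choice of which reasons count as "hard", and the function names `rank_reasons`, `reasons_covering` and `classify` are all assumptions made for the example. It takes per-file failure reasons (as a characterisation tool might report them), ranks the reasons by frequency with a cumulative share, and labels each file as compliant, fixable or hard.

```python
from collections import Counter

# Hypothetical characterisation output: file name -> reasons it fails PDF/A.
# Both the file names and the reason labels are invented for illustration.
SAMPLE_RESULTS = {
    "a.pdf": ["font_not_embedded"],
    "b.pdf": ["font_not_embedded", "xmp_mismatch"],
    "c.pdf": ["encrypted"],
    "d.pdf": [],
    "e.pdf": ["font_not_embedded", "external_link", "lzw_compression"],
    "f.pdf": ["xmp_mismatch"],
}

# Assumption: reasons treated as too hard to fix automatically.
HARD_REASONS = {"encrypted", "embedded_object", "audio_video"}


def rank_reasons(results):
    """Return [(reason, count, cumulative_share)], most frequent reason first."""
    counts = Counter(r for reasons in results.values() for r in reasons)
    total = sum(counts.values())
    # Sort by descending count, then by name so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    ranked, running = [], 0
    for reason, count in ordered:
        running += count
        ranked.append((reason, count, running / total))
    return ranked


def reasons_covering(results, target=0.8):
    """Smallest prefix of the ranking whose cumulative share reaches target."""
    chosen = []
    for reason, _count, share in rank_reasons(results):
        chosen.append(reason)
        if share >= target:
            break
    return chosen


def classify(results):
    """Label each file 'compliant', 'fixable' or 'hard'."""
    labels = {}
    for name, reasons in results.items():
        if not reasons:
            labels[name] = "compliant"
        elif HARD_REASONS.intersection(reasons):
            labels[name] = "hard"
        else:
            labels[name] = "fixable"
    return labels


if __name__ == "__main__":
    for reason, count, share in rank_reasons(SAMPLE_RESULTS):
        print(f"{reason:20} {count:3} {share:6.1%}")
    print(reasons_covering(SAMPLE_RESULTS))
    print(classify(SAMPLE_RESULTS))
```

Counting reason occurrences rather than files is a deliberate simplification: because many files fail for multiple reasons, fixing the top-ranked reason does not by itself make every affected file compliant.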