One line summary | Detect and identify embedded objects in PDFs, then, where appropriate, extract and analyse them further |
Detailed description | The PDF specification is complex, and PDF files can contain other objects, embedded at the file or page level. Several Open Source PDF libraries exist; the best known is Apache PDFBox, written in Java. The aim was to use a combination of these libraries. Well-known image types (BMP, JPG, PNG, GIF, and TIFF) are the simplest case, as these are natively understood by PDF internal constructs and supported by Open Source PDF libraries. PDFBox could be used to traverse all of the pages of a PDF document and extract the images from each page, along with the image type and dimensions. These images could be extracted as Java image objects for further analysis (fingerprint generation, characterisation, etc.). Embedded files are a trickier proposition. PDFBox could spot these at the document level, and attachments made at the page level should still show up in the document's embedded files list. The complication is that the embedding of files appears to be platform dependent (the PDF specification differentiates between DOS, Unix, and Mac formats), so the Java implementation would have to detect the platform and call the appropriate method. One final complication came when examining the 19th-century digitised book PDFs. These are made up of a JPEG2000 image for each page, an image format that is not natively supported. PDFBox did not detect these images. Further investigation showed that the PDF contained a set of stream objects, one for each page image, and prescribed the JPXFilter (JPEG2000 filter) to render them. More work is required here. |
Solution champion | Carl Wilson |
Git link | |
Group Evaluation Notes |
|
Tool (link) | |
Issue |
Embedded objects in PDFs |
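
The detailed description notes that the JPEG2000 page images were only found by looking at the raw stream objects. As a rough illustration of that kind of first-pass check, the sketch below counts a few PDF name tokens in the raw bytes of a file using only the Java standard library. It is not the code written for this solution and does not use PDFBox. The class name `PdfMarkerScan` and its methods are hypothetical; the token list is an assumption based on the PDF specification, where `/JPXDecode` is the filter name for JPEG2000 streams, `/DCTDecode` is the filter for JPEG streams, and `/EmbeddedFile` and `/EmbeddedFiles` mark embedded file streams and the document-level embedded files name tree. The scan is only a heuristic: it will miss objects stored inside compressed object streams and names written with `#xx` escapes.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: a crude byte-level scan, not a PDF parser.
public class PdfMarkerScan {

    // Name tokens to look for (an assumed, non-exhaustive list).
    static final String[] MARKERS = {
        "/JPXDecode", "/DCTDecode", "/EmbeddedFile", "/EmbeddedFiles"
    };

    // True if c can continue a PDF name, i.e. it is neither whitespace
    // nor one of the PDF delimiter characters.
    static boolean isRegular(char c) {
        return c != 0 && !Character.isWhitespace(c) && "()<>[]{}/%".indexOf(c) < 0;
    }

    // Count occurrences of a complete name token, so that "/EmbeddedFile"
    // is not counted when the text actually contains "/EmbeddedFiles".
    static int countName(String text, String name) {
        int count = 0;
        int from = 0;
        while (true) {
            int idx = text.indexOf(name, from);
            if (idx < 0) {
                return count;
            }
            int end = idx + name.length();
            if (end >= text.length() || !isRegular(text.charAt(end))) {
                count++;
            }
            from = end;
        }
    }

    // Decode the bytes as ISO-8859-1 (one char per byte) and count each marker.
    public static Map<String, Integer> scan(byte[] pdfBytes) {
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String marker : MARKERS) {
            counts.put(marker, countName(text, marker));
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfMarkerScan <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        scan(data).forEach((name, n) -> System.out.println(name + ": " + n));
    }
}
```

A non-zero `/JPXDecode` count would suggest that a file contains JPEG2000 streams of the kind found in the digitised book PDFs, and could be used to flag files for closer inspection with a full PDF library.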