Detect, extract and analyse embedded objects in PDFs

One line summary Detect and identify embedded objects in PDFs, then where appropriate extract and analyse further
Detailed description The PDF specification is complex, and PDF files can contain other objects embedded at the file or page level.  Several Open Source PDF libraries exist; the best known is Apache PDFBox, written in Java. The aim was to use a combination of these libraries.

Well known image types (BMP, JPG, PNG, GIF, and TIFF) are the simplest case as these are natively understood by PDF internal constructs and supported by Open Source PDF libraries.  PDFBox could be used to traverse all of the pages of a PDF document and extract the images from each page, along with the image type and dimensions.  These images could be extracted as a Java image for further analysis (fingerprint generation, characterisation, etc.).
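As an illustration of the "further analysis" step, the following is a minimal sketch of one possible fingerprint for an extracted Java image. It assumes the image has already been obtained as a `BufferedImage`; the class name `ImageFingerprint`, the choice of SHA-256 over the ARGB pixel values, and the `WIDTHxHEIGHT:hash` output format are all illustrative assumptions, not something the original work prescribes. It uses only the Java standard library.

```java
import java.awt.image.BufferedImage;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: not part of PDFBox or of the original solution.
final class ImageFingerprint {

    /** Returns "WIDTHxHEIGHT:sha256hex", hashing the ARGB value of every pixel. */
    static String fingerprint(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
        int[] pixels = new int[width];
        ByteBuffer row = ByteBuffer.allocate(width * 4);
        for (int y = 0; y < height; y++) {
            // Read one row at a time to keep memory use small for large page images.
            image.getRGB(0, y, width, 1, pixels, 0, width);
            row.clear();
            for (int pixel : pixels) {
                row.putInt(pixel);
            }
            digest.update(row.array(), 0, width * 4);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return width + "x" + height + ":" + hex;
    }
}
```

Because the hash is taken over decoded pixel values rather than the encoded bytes, two images with identical pixels but different encodings would be expected to produce the same fingerprint. Any change to a single pixel changes it.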

Embedded files are a trickier proposition.  PDFBox could spot these at the document level, and attachments made at the page level should still show up in the document's embedded files list.  The complication is that the embedding of files appears to be platform dependent (the PDF specification differentiates between DOS, Unix, and Mac formats), so the Java implementation would have to detect the platform and call the appropriate method.
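One possible alternative to detecting the platform is to try each platform-specific entry in a fixed order and take the first one present. The sketch below illustrates that idea only. It assumes the embedded-file entries have already been read from the file specification into a map keyed by the PDF key names (`UF`, `F`, `Unix`, `Mac`, `DOS`); the class `EmbeddedFileSelector`, the map representation, and the preference order are hypothetical and not taken from PDFBox.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: the map stands in for whatever a PDF library exposes
// for the per-platform embedded file entries.
final class EmbeddedFileSelector {

    // Order in which this sketch tries the entries (an assumption, not a rule
    // from the specification): generic entries first, then platform-specific ones.
    static final List<String> KEY_ORDER = Arrays.asList("UF", "F", "Unix", "Mac", "DOS");

    /** Returns the first entry found in KEY_ORDER, or empty if none is present. */
    static <T> Optional<T> select(Map<String, T> embeddedFileEntries) {
        for (String key : KEY_ORDER) {
            T value = embeddedFileEntries.get(key);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }
}
```

With this approach the same code path handles a PDF regardless of the platform it was produced on, at the cost of ignoring the other entries when more than one is present.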

One final complication came when examining the 19th-century digitised book PDFs.  These are made up of a JPEG2000 image for each page, an image format not natively supported.  PDFBox didn't detect these images.  Further investigation showed that the PDF contained a set of stream objects, one for each page image, each specifying the JPEG2000 filter (JPXDecode in the PDF specification, JPXFilter in PDFBox) to decode it.  More work is required here.
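As a rough first check for such files, the raw bytes of a PDF can be scanned for references to the JPEG2000 filter. The sketch below is a crude heuristic under stated assumptions: it assumes the filter is recorded under the name `/JPXDecode` in uncompressed stream dictionaries, so it would miss dictionaries stored inside compressed object streams and names written with escape sequences, and it could in principle match the same byte sequence inside binary data. The class `JpxScan` is hypothetical and uses only the Java standard library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: a byte-level heuristic, not a PDF parser.
final class JpxScan {

    // Assumed name of the JPEG2000 filter as written in stream dictionaries.
    static final String JPX_FILTER_NAME = "/JPXDecode";

    /** Counts non-overlapping occurrences of the filter name in the raw bytes. */
    static int countJpxFilters(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary stream
        // data cannot cause a decoding failure or shift offsets.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(JPX_FILTER_NAME, from)) >= 0) {
            count++;
            from += JPX_FILTER_NAME.length();
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(countJpxFilters(bytes) + " JPXDecode filter reference(s) found");
    }
}
```

For a book digitised as one JPEG2000 image per page, the count would be expected to be close to the page count, which gives a quick way to flag PDFs that need the extra handling described above.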
Solution champion Carl Wilson
Git link  
Group Evaluation Notes
  •  Understanding of the problem around characterising embedded images is now clear, but the solution appears to be challenging. This needs to be documented here!
Tool (link)  
Issue
Embedded objects in PDFs
Labels:
pdf
objects
bmp
jpg
png
gif
tiff
pdfbox
jpxfilter
aqua
solution
embedded_objects
characterisation