PDF Characterisation Tool

One line summary Java program to characterise PDF files, looking for preservation concerns.
Detailed description Currently checks for the following:

Is the document encrypted?
Can the document be printed?
Can the document be amended?
How many pages does the document have?
Does the document contain embedded JavaScript?
Does the document contain external links? If so, the URIs are extracted.
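
The print and amend checks come down to permission flags. As a minimal sketch (not the tool's own code, which is not shown on this page), the following decodes those two flags from the /P integer of a PDF's standard security handler, where bit 3 governs printing and bit 4 governs modifying the contents. It assumes the /P value has already been read from the encryption dictionary by whatever PDF library is in use; the class name PermissionFlags and the sample values are hypothetical. A document with no encryption dictionary carries no such restrictions.

```java
// Sketch: decode the print and modify flags from a PDF /P permissions integer.
final class PermissionFlags {
    // The PDF specification numbers bits from 1, so bit 3 has the value 1 << 2.
    static final int PRINT = 1 << 2;   // bit 3: print the document
    static final int MODIFY = 1 << 3;  // bit 4: modify the document contents

    static boolean canPrint(int p) {
        return (p & PRINT) != 0;
    }

    static boolean canModify(int p) {
        return (p & MODIFY) != 0;
    }

    public static void main(String[] args) {
        // Hypothetical /P values, chosen only to exercise the two flags.
        int[] samples = { -4, -3900, -3904 };
        for (int p : samples) {
            System.out.printf("P=%d print=%b modify=%b%n", p, canPrint(p), canModify(p));
        }
    }
}
```

These flags express the author's stated restrictions only; a conforming reader enforces them, but they are not a technical barrier on their own.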

Embedded fonts proved challenging. Here is a summary of why:

Fonts are used in three places:
  • the document's pages.
  • the embedded AcroForm (if present).
  • the form fields on pages.

All three of these areas therefore have to be crawled to extract the fonts used.

The PDFBox API does not provide an easy method to detect whether a font is embedded within a PDF document. The iText and JPod APIs both supply methods that do this, which should allow implementation and automated cross-testing (JPod vs. iText).

There is a final twist to the puzzle at this point: detecting that a font is embedded isn't enough, because the embedded font may be corrupt or incomplete.
  • Corrupt: the font itself should be parsed to confirm that it is a valid font (FontBox could be used for this).
  • Incomplete: PDF allows font subsets to be embedded, so a check is needed that every character used in the document is contained in the embedded font.
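
The subset check can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the tool's implementation: it assumes the text used in the document and the set of characters covered by the embedded font have already been extracted by a PDF library, and it compares Unicode code points, whereas a real check would have to work with the font's own character codes and glyphs. The class and method names are hypothetical. The one detail taken from the PDF format itself is that a subset font's name starts with a tag of six uppercase letters followed by a plus sign.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Sketch: flag subset fonts and list the characters an embedded font lacks.
final class FontSubsetCheck {
    // Subset font names look like "ABCDEF+Arial": six uppercase letters, then '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+.+");

    static boolean isSubset(String baseFontName) {
        return SUBSET_TAG.matcher(baseFontName).matches();
    }

    // Returns the code points used in the text that the embedded font does not cover.
    static Set<Integer> missing(String usedText, Set<Integer> embedded) {
        Set<Integer> result = new TreeSet<>();
        usedText.codePoints()
                .filter(cp -> !embedded.contains(cp))
                .forEach(result::add);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical inputs: the font covers only 'H' and 'i'.
        Set<Integer> embedded = new HashSet<>(Arrays.asList((int) 'H', (int) 'i'));
        System.out.println("subset font: " + isSubset("ABCDEF+Arial"));
        System.out.println("missing code points: " + missing("Hi!", embedded));
    }
}
```

An empty result from missing means the embedded subset covers all of the text it was compared against; any code points returned are candidates for a preservation warning.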
Solution champion Carl Wilson
Git link  
Group Evaluation Notes
  • Embedded fonts issue exploration. Solution partial, but interesting discoveries in the journey. This needs to be documented here!
  • Some extraction working. Requires further experimentation.
  • Roger suggested another tool with potential for interpreting embedded JPEG2000: iText
Tool (link) http://itextpdf.com/
Issue
Unknown PDF characteristics
Labels: pdf, characterise, pdfbox, api, fonts, issue, acroform, embedded, jpeg2000, java, extraction, itext, jpod, obsolescence, aqua, solution, characterisation, validation