PDF Characterisation Tool

Version 5 by Carl Wilson
on Apr 13, 2011 15:26.

compared with
Current by Paul Wheatley
on Aug 10, 2011 15:56.


Can the document be printed? \\
Can the document be amended? \\
Number of pages. \\
Embedded JavaScript. \\
External links, and extracts the URIs. \\
\\
Embedded fonts proved challenging. Here is a summary of why: \\
\\
Fonts are used in three places: \\
* the document's pages.
* the embedded AcroForm (if present).
* the form fields on pages.\\
Therefore all of these areas have to be crawled to extract the fonts used. \\
\\
The PDFBox API does not provide an easy method to detect whether a font is embedded within a PDF document. The iText and JPod APIs both supply methods that do this, which should allow implementation and automated cross-testing (JPod vs. iText). \\
\\
There is a final twist: detecting that a font is embedded isn't enough, because the font may be corrupt or incomplete. \\
* The embedded font may be corrupt, so the font itself should be parsed to ensure that it is a valid font (FontBox could be used for this).
* PDF allows the embedding of font subsets, so a check needs to be made that all of the characters used in the document are contained in the embedded font. |
| *Solution champion* | Carl Wilson |
* Some extraction working. Requires further experimentation.
* Roger suggested another tool with potential in interpreting embedded JPEG2000: iText |
| *Tool* (link) | [http://itextpdf.com/] |
| *Issue*\\ | [AQuA:Unknown PDF characteristics]\\ |
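
The font subset check mentioned in the description could be sketched roughly as follows. This is only an illustration in plain Java, not the tool's actual code: the class {{FontSubsetCheck}} and both of its methods are hypothetical names, and it assumes the font name, the extracted text and the set of code points covered by the embedded font have already been obtained from a PDF library. It relies on the PDF convention that a subset font's name starts with a tag of six uppercase letters followed by a plus sign (for example ABCDEF+Arial). Skipping control characters is an assumption, made because line breaks in extracted text are not drawn with glyphs.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of PDFBox, iText or JPod. */
final class FontSubsetCheck {

    // Subset fonts in PDF are named with a six-uppercase-letter tag and '+',
    // e.g. "ABCDEF+Arial".
    private static final Pattern SUBSET_TAG = Pattern.compile("[A-Z]{6}\\+.+");

    private FontSubsetCheck() {
    }

    /** Returns true if the font name carries a subset tag. */
    static boolean isSubsetName(String baseFontName) {
        return baseFontName != null && SUBSET_TAG.matcher(baseFontName).matches();
    }

    /**
     * Returns the code points used in the text that the embedded font does
     * not cover, in ascending order. An empty result means the subset is
     * sufficient for this text.
     */
    static Set<Integer> missingCodePoints(String text, Set<Integer> covered) {
        Set<Integer> missing = new TreeSet<>();
        text.codePoints()
            // Assumption: control characters (line breaks, tabs) need no glyph.
            .filter(cp -> !Character.isISOControl(cp))
            .filter(cp -> !covered.contains(cp))
            .forEach(missing::add);
        return missing;
    }
}
```

A caller would first use {{isSubsetName}} to decide whether the coverage check is needed at all, then report any code points returned by {{missingCodePoints}} as characters the embedded subset cannot render.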