Visual Analysis of Preflight Output
During the mashup we ran Apache's PDFBox Preflight over 4000+ PDFs sourced from ADS and Middlesex. I was curious about how we might make sense of the output and, having read all about it and inspired by my previous efforts with TIFFs, I decided to give Gephi a go.
Starting with the XML output from Preflight I needed a way to describe this data in terms of nodes and relationships (edges).
I opted for "PDF" (or "Item") and "Error" for the nodes and one relationship "hasError". In this way we could draw a graph that showed every PDF connected to the errors it displayed. This graph provided quick analysis of the dataset, e.g. showing frequently occurring errors or groups of PDFs that all had the same error (or set of errors).
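As a rough sketch of that model (not the code used at the mashup), the graph can be held as a list of (PDF, error) pairs, one per "hasError" edge. The class name, method names and input map of PDF name to error identifiers are all assumptions made for illustration. Counting how many edges point at each error then gives the "frequently occurring errors" directly.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: models "PDF hasError Error" as simple string pairs.
class HasErrorGraphSketch {

    // Flatten per-PDF error lists into {pdf, error} pairs, one per hasError edge.
    // Repeated errors within one PDF are collapsed into a single edge.
    static List<String[]> edges(Map<String, List<String>> errorsByPdf) {
        List<String[]> edges = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : errorsByPdf.entrySet()) {
            for (String error : new LinkedHashSet<>(entry.getValue())) {
                edges.add(new String[] {entry.getKey(), error});
            }
        }
        return edges;
    }

    // Degree of each Error node: the number of distinct PDFs linked to it.
    static Map<String, Integer> errorDegrees(List<String[]> edges) {
        Map<String, Integer> degree = new TreeMap<>();
        for (String[] edge : edges) {
            degree.merge(edge[1], 1, Integer::sum);
        }
        return degree;
    }
}
```

A PDF with an empty error list contributes no edges, which is what leaves it as an unlinked node in the drawn graph.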
Here is an example graph showing 300ish PDFs from ADS - all of which were in theory valid PDF/A-1b.
The blue dots are the PDFs, the red dots are the errors.
The ring of blue dots without links is made up of all the PDFs that Preflight also validated - i.e. they are not linked to any errors.
The bulk of the blue dots in the middle show that there are a couple of error types that crop up in most invalid PDFs and using the Gephi tool you can find out what these errors/PDFs are.
More interesting perhaps are the little clusters around the outside, showing small groups of PDFs that only have a couple of errors and are not connected to others. In one instance for the ADS data, the PDFs in one of the clusters were all from the same source. This technique could be used to identify systematic faults and try to get them fixed.
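A crude way to look for similar groupings without drawing anything is to bucket PDFs by their exact set of errors. The sketch below assumes a hypothetical map of PDF name to error identifiers. It is only a proxy for what the graph shows, since it finds PDFs with identical error sets rather than the looser clusters a layout algorithm pulls together.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: groups PDFs that share exactly the same errors.
class ErrorClusterSketch {

    // Key: a sorted set of error identifiers. Value: the PDFs showing exactly that set.
    // The empty set collects the PDFs that validated (the unlinked dots).
    static Map<Set<String>, List<String>> groupByErrorSet(Map<String, List<String>> errorsByPdf) {
        Map<Set<String>, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : errorsByPdf.entrySet()) {
            Set<String> key = new TreeSet<>(entry.getValue());
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(entry.getKey());
        }
        return groups;
    }
}
```

Small groups with an unusual error set would be the ones worth checking for a common source.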
Gephi is happy importing this data from CSV files, so I set about creating suitable CSVs from the XML output using a small script written in Java. I needed two files. The first holds all of the nodes, one line per error/PDF: a simple file that has an ID for each node along with its metadata as key/value pairs, where the column name is the key and the cell the value. The second holds the relationships, one line per relationship, giving the type of relationship and then the IDs of its source node and target node. Gephi defines a couple of required headings in the CSV but otherwise any additional columns are just included as attributes of the node or the edge. In this way I could use the short error identifiers to make the graph cleaner to draw, but also read the full error string for errors of interest.
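To give a feel for the two files, here is a minimal sketch in Java. It is not the script from GitHub. It starts from a hypothetical in-memory map of PDF name to error identifiers rather than parsing the Preflight XML. The headings "Id", "Label", "Source", "Target" and "Type" are the ones I understand Gephi's spreadsheet import to recognise. "Kind" and "Description" are made-up extra columns that would simply come through as node attributes, and putting "hasError" in the edge "Label" is likewise an assumption.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: builds the node and edge CSVs as strings.
class GephiCsvSketch {

    // Quote a field if it contains a comma, a double quote or a newline.
    static String csv(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Nodes file: one row per PDF, then one row per distinct error.
    // errorText maps a short error identifier to its full message (assumed input).
    static String nodesCsv(Map<String, List<String>> errorsByPdf, Map<String, String> errorText) {
        StringBuilder out = new StringBuilder("Id,Label,Kind,Description\n");
        Set<String> errorIds = new LinkedHashSet<>();
        for (Map.Entry<String, List<String>> entry : errorsByPdf.entrySet()) {
            String pdf = csv(entry.getKey());
            out.append(pdf).append(',').append(pdf).append(",PDF,\n");
            errorIds.addAll(entry.getValue());
        }
        for (String id : errorIds) {
            out.append(csv(id)).append(',').append(csv(id)).append(",Error,")
               .append(csv(errorText.getOrDefault(id, ""))).append('\n');
        }
        return out.toString();
    }

    // Edges file: one row per (PDF, error) pair, directed from the PDF to the error.
    static String edgesCsv(Map<String, List<String>> errorsByPdf) {
        StringBuilder out = new StringBuilder("Source,Target,Type,Label\n");
        for (Map.Entry<String, List<String>> entry : errorsByPdf.entrySet()) {
            for (String error : new LinkedHashSet<>(entry.getValue())) {
                out.append(csv(entry.getKey())).append(',').append(csv(error))
                   .append(",Directed,hasError\n");
            }
        }
        return out.toString();
    }
}
```

The two strings could then be written to disk (for example with `java.nio.file.Files.writeString`) and loaded as a nodes table and an edges table. Keeping the short identifier as the label and the full message in a separate column is what lets the drawn graph stay uncluttered.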
The code used to make this XML to CSV transformation is available on GitHub. Jo also discovered that the Preflight XML could be loaded directly into Excel, but I think this was just one file at a time.
What is nice about all of this is you only need a couple of very simple CSVs to create a graph - that is something anyone could do, not just a developer, and there is plenty of scope to explore these ideas further (and I intend to do so!).
Peter Cliff, firstname.lastname@example.org
PDFA Validation tools give different results
I'm still fairly sure there is more we (DP) can do here. A similar problem to the TIFF2RDF work came up at the BL recently and I'd love to explore this more. Add to this that just about every domain model I've seen in my career in libraries is expressed as some sort of graph (e.g. the PREMIS data model), and I'm sure there is more scope for graphs - both for visualisation and metadata storage/querying.
Just a few thoughts here:
- Thought needs to be given to the presentation, the dataset size, etc. before creating the graph - else you end up with a bit of a mess!
- Different arrangement algorithms gave different degrees of success - here Gephi's "Force Atlas" worked very well.
- This is easy to do with simple tools.