Visual Analysis of Preflight Output

Version 9 by Peter Cliff
on Jul 17, 2013 16:29.

I opted for "PDF" (or "Item") and "Error" for the nodes, and a single relationship, "hasError". In this way we could draw a graph that showed every PDF connected to the errors it exhibited. This graph provided quick analysis of the dataset, e.g. showing frequently occurring errors or groups of PDFs that all had the same error (or set of errors).
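The model can be sketched as a simple map from PDF nodes to their Error nodes; the "frequently occurring errors" then fall out as the Error nodes with the highest degree. This is only an illustration, assuming made-up PDF names and error identifiers - the class and method names are my own, not part of any Preflight API:

```java
import java.util.*;

public class ErrorGraph {
    // hasError relationships: each PDF node mapped to its Error nodes.
    static Map<String, Set<String>> hasError = new LinkedHashMap<>();

    static void addError(String pdf, String error) {
        hasError.computeIfAbsent(pdf, k -> new LinkedHashSet<>()).add(error);
    }

    // Count how many PDFs exhibit each error: the frequently occurring
    // errors are simply the Error nodes with the highest degree.
    static Map<String, Integer> errorFrequency() {
        Map<String, Integer> freq = new TreeMap<>();
        for (Set<String> errors : hasError.values())
            for (String e : errors)
                freq.merge(e, 1, Integer::sum);
        return freq;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not real Preflight output.
        addError("a.pdf", "1.0");
        addError("b.pdf", "1.0");
        addError("b.pdf", "2.4.3");
        System.out.println(errorFrequency()); // prints {1.0=2, 2.4.3=1}
    }
}
```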

Here is an example graph showing 300-ish PDFs from ADS - all of which were, in theory, valid PDF/A-1b.

The blue dots are the PDFs, the red dots are the errors.


The ring of blue dots without links are the PDFs that Preflight validated successfully - i.e. they are not linked to any errors.

The bulk of the blue dots in the middle shows that a couple of error types crop up in most invalid PDFs, and using the Gephi tool you can find out which errors/PDFs these are.

More interesting, perhaps, are the little clusters around the outside: small groups of PDFs that share only a couple of errors and are not connected to the others. In one instance in the ADS data, all of the PDFs in one cluster came from the same source. This technique could be used to identify systematic faults and try to get them fixed.

Gephi is happy importing this data from CSV files, so I set about creating suitable CSVs from the XML output using a small script written in Java. I needed two files: a node file with one line per error/PDF (a simple file that gives each node an ID along with its metadata as key/value pairs, where the column name is the key and the cell is the value), and an edge file for the relationships - a simple output giving the type of each relationship and the IDs of its source and target nodes. Gephi defines a couple of required headings in the CSV, but any additional columns are simply included as attributes of the node or edge. In this way I could use the short error identifiers to keep the drawn graph clean, while still being able to read the full error string for errors of interest.
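A minimal sketch of that CSV generation might look like the following. The class name, the extra "Label"/"Type" attribute columns and the sample data are my own assumptions (the real script parsed Preflight's XML output rather than using an in-memory map); "Id" in the node table and "Source"/"Target" in the edge table are the required Gephi headings referred to above:

```java
import java.util.*;

public class GephiCsv {

    // Node table: Gephi requires an "Id" column; the extra "Label" and
    // "Type" columns become attributes of each node.
    public static List<String> nodeLines(Map<String, List<String>> pdfErrors) {
        List<String> lines = new ArrayList<>();
        lines.add("Id,Label,Type");
        Set<String> errors = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : pdfErrors.entrySet()) {
            lines.add(e.getKey() + "," + e.getKey() + ",PDF");
            errors.addAll(e.getValue());
        }
        for (String err : errors)
            lines.add(err + "," + err + ",Error");
        return lines;
    }

    // Edge table: "Source" and "Target" are required; one row per
    // hasError relationship.
    public static List<String> edgeLines(Map<String, List<String>> pdfErrors) {
        List<String> lines = new ArrayList<>();
        lines.add("Source,Target,Type");
        for (Map.Entry<String, List<String>> e : pdfErrors.entrySet())
            for (String err : e.getValue())
                lines.add(e.getKey() + "," + err + ",Directed");
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for parsed Preflight XML.
        Map<String, List<String>> pdfErrors = new LinkedHashMap<>();
        pdfErrors.put("doc1.pdf", Arrays.asList("1.0", "2.4.3"));
        pdfErrors.put("doc2.pdf", Arrays.asList("1.0"));
        pdfErrors.put("doc3.pdf", Collections.emptyList()); // valid PDF: node, no edges
        nodeLines(pdfErrors).forEach(System.out::println);
        edgeLines(pdfErrors).forEach(System.out::println);
    }
}
```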

I'm still fairly sure there is more we (DP) can do here. A similar problem to the TIFF2RDF work came up at the BL recently and I'd love to explore this more. Add to this that just about every domain model I've seen in my career in libraries is expressed as some sort of graph (e.g. the PREMIS data model) I'm sure there is more scope for graphs - both for visualisation and metadata storage/querying.

Just a few thoughts here:

* Thought needs to be given to the presentation, the dataset size, etc. before creating the graph - else you end up with a bit of a mess!
* Different arrangement algorithms gave different degrees of success at visualising - here Gephi's "Force Atlas" worked very well.
* This is easy to do with simple tools.