

William Palmer, British Library

Evaluation points

Assessment of measurable points
Metric                          Metric goal   January 2014 (Baseline)   July 2014
TotalRuntime                    -             4:28:00 (hh:mm:ss)        13:22:42 (hh:mm:ss) [0]
NumberOfObjects                 -             -                         231671 [1]
NumberOfObjectsPerHour          -             51869.33                  17316
ThroughputGbytesPerHour         -             28.62                     9.55
ReliableAndStableAssessment     TRUE          TRUE                      TRUE [2]
NumberOfFailedFiles             0             0                         12 [2][3] + 660 Exceptions [4]
NumberOfFailedFilesAcceptable   -             TRUE                      TRUE [1]

Note: the January 2014 baseline run performed only DRM and validity checks.
For the July 2014 run the tool was renamed Flint; it contained DRM and validity checks, along with a policy check for PDF files.

[0] Note that this runtime includes additional policy checks (see the selected policy validation (PV) results in the table below)

[1] Twelve fewer files were used, as those files were found to crash or hang the JVM (in the policy check stage). This is due to a combination of factors, including the use of a relatively old JVM on the Hadoop cluster, the use of in-development software, and the files themselves potentially being corrupt; further investigation is required. Upgrading the JVMs across the entire cluster may well resolve these issues. Removing these files from the input data meant that the test run could be completed successfully.

[2] The run completed successfully; however, twelve files had to be excluded from the final run (see [1]). Additionally, policy validation execution failed for 26184 files. We have found issues with corrupt/broken files in the Govdocs1 corpus and have reported these issues to Apache PDFBox.

[3] The excluded failed files were: 020087.pdf 165487.pdf 289451.pdf 289452.pdf 383325.pdf 299694.pdf 375118.pdf 451665.pdf 451675.pdf 526572.pdf 924677.pdf 870521.pdf
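Excluding the known-bad files before a run can be automated. The sketch below filters the twelve files listed in [3] out of an input path list; the function name and the shape of the input (a list of HDFS paths) are illustrative, not part of Flint's actual interface.

```python
# Sketch: drop the twelve Govdocs1 files known to crash/hang the JVM
# (listed in [3] above) from an input file list before submitting the job.
EXCLUDED = {
    "020087.pdf", "165487.pdf", "289451.pdf", "289452.pdf",
    "383325.pdf", "299694.pdf", "375118.pdf", "451665.pdf",
    "451675.pdf", "526572.pdf", "924677.pdf", "870521.pdf",
}

def filter_input_list(paths):
    """Return only the paths whose basename is not in the excluded set."""
    return [p for p in paths if p.rsplit("/", 1)[-1] not in EXCLUDED]
```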

[4] 660 exceptions were present in the output: 626 were due to a failure after the text extraction step of PDF validation (as no output file was present), 31 were null pointer exceptions, and 3 were other miscellaneous exceptions. Due to the nature of the dataset, these sorts of errors are expected, and given the small number of issues they are fixable. The files that failed are identifiable, so testing could be run against those in particular.
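Because the failed files are identifiable from the output, the exception breakdown above can be produced mechanically. A minimal sketch, assuming each output line has the illustrative form "path<TAB>ExceptionClass: message" (not Flint's actual log format):

```python
from collections import Counter

def summarise_exceptions(log_lines):
    """Tally exception class names from lines of the assumed form
    'path<TAB>ExceptionClass: message'."""
    counts = Counter()
    for line in log_lines:
        _, _, rest = line.partition("\t")
        exc = rest.split(":", 1)[0].strip()
        if exc:
            counts[exc] += 1
    return counts
```

The resulting per-class counts also give the list of affected files for targeted re-testing.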

Selected results from Flint output:

Tests reported in the Flint output (with the number of files per test):
- DRM detected (note: checks are not currently made against print/copy restrictions etc.)
- Well-formed failure
- Policy validation (PV) execution failure, i.e. where policy validation failed and no PV results are output (26184 files, see [2])
- PV encryption check (present)
- PV damaged fonts present
- PV javascript present
- PV embedded files present
- PV multimedia present

As Flint provides results as a table, different policies can be checked against the Flint output at various times. For example, whether or not files contain multimedia may later become an issue, and this can then be determined from the existing results.
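Checking a policy after the fact then amounts to filtering the results table on the relevant column. A minimal sketch, assuming the results are exported as CSV with illustrative column names ("file", "multimediaPresent"); the real Flint column names may differ:

```python
import csv
import io

def files_matching(results_csv_text, column):
    """Return the files flagged TRUE in the given column of a CSV-format
    results table (column names here are assumed, not Flint's own)."""
    reader = csv.DictReader(io.StringIO(results_csv_text))
    return [row["file"] for row in reader
            if row.get(column, "").strip().upper() == "TRUE"]
```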

Assessment of non-measurable points

For some evaluation points it makes most sense to give a textual description/explanation.

Please include a note about goals-objectives omitted, and why.

Technical details

Remember to include relevant information, links, versions about workflow, tools, APIs (e.g. Taverna, command line, Hadoop, links to MyExperiment, link to tools or SCAPE name, links to distinct versions of specific components/tools in the component registry)


Source code: [Jan 14]

[July 2014]

Evaluation notes

Could be such things as identified issues, workarounds, data preparation, if not already included above

Files are kept in HDFS; as the files themselves are relatively small, Hadoop SequenceFiles might help in this instance.

There are issues with the JVM, but it was not possible to easily upgrade the JVM on the cluster.

Some really broken files crashed the JVM/libraries and it is not possible to protect against JVM crashes. There are known to be issues with some of the input files in Govdocs1 (see above issue reports).


Testing with the policy checks takes approximately three times as long as the basic checks. Extrapolating from the test dataset used for this evaluation, it would be possible to process 1TB of PDF files, with policy checks, in under 4.5 days on our Hadoop cluster. This is acceptable for routine use, should that be necessary. Although PDF files can be relatively small, Flint's execution speed does not appear to suffer from the small-files problem: its processing is CPU bound, not I/O bound, as evidenced by the 9.55GB/hour processing speed (i.e. a 2.7MB/s read speed).
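The extrapolation above can be verified from the measured July 2014 throughput alone:

```python
# Back-of-the-envelope check of the extrapolation, using the measured
# July 2014 throughput of 9.55 GB/hour (1 TB taken as 1024 GB).
gb_per_hour = 9.55
hours_per_tb = 1024 / gb_per_hour           # ~107 hours for 1 TB
days_per_tb = hours_per_tb / 24             # ~4.5 days
mb_per_second = gb_per_hour * 1024 / 3600   # ~2.7 MB/s sustained read speed
```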
