William Palmer, British Library
||Metric||Metric goal||January 2014 (Baseline)||July 2014||
|TotalRuntime| |4:28:00 (hh:mm:ss)|13:22:42 (hh:mm:ss)|
| | | |231671|
|ReliableAndStableAssessment|TRUE|TRUE|TRUE|
|NumberOfFailedFiles|0|0|12 + 660 exceptions|
|NumberOfFailedFilesAcceptable|-|TRUE|TRUE|
Note (January 2014 baseline): this run comprised just the DRM and validity checks.
Note (July 2014): for this run the tool was renamed Flint; it contained the DRM and validity checks, along with a policy check for PDF files. This runtime therefore includes the additional policy checks (see the selected policy validation (PV) results in the table below).
Twelve fewer files were used, as those files were found to crash or hang the JVM (in the policy check stage). This is due to a combination of factors, including the use of a relatively old JVM on the Hadoop cluster, the use of in-development software, and the files potentially being corrupt; further investigation is required. Upgrading the JVMs on the entire cluster may well solve these issues. Removing these files from the input data meant that the test run could be completed successfully.
The run completed successfully; however, twelve files had to be excluded from the final run (see the list below). Additionally, policy validation execution failed for 26184 files. We found issues with corrupt/broken files in the Govdocs1 corpus and reported them to Apache PDFBox (see https://issues.apache.org/jira/browse/PDFBOX-1756, https://issues.apache.org/jira/browse/PDFBOX-1757, https://issues.apache.org/jira/browse/PDFBOX-1761, https://issues.apache.org/jira/browse/PDFBOX-1762, https://issues.apache.org/jira/browse/PDFBOX-1769, https://issues.apache.org/jira/browse/PDFBOX-1774 & https://issues.apache.org/jira/browse/PDFBOX-1795).
The twelve excluded files were: 020087.pdf 165487.pdf 289451.pdf 289452.pdf 383325.pdf 299694.pdf 375118.pdf 451665.pdf 451675.pdf 526572.pdf 924677.pdf 870521.pdf
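Since the failed files are known by name, one way to exclude them is to filter the job's input listing before submission. The sketch below is illustrative only: the listing format (one HDFS path per line) and the helper name are assumptions, not part of the Flint tooling.

```python
# Filter the twelve known-bad files out of an input listing before
# submitting the Hadoop job. The listing format is an assumption.
EXCLUDE = {
    "020087.pdf", "165487.pdf", "289451.pdf", "289452.pdf",
    "383325.pdf", "299694.pdf", "375118.pdf", "451665.pdf",
    "451675.pdf", "526572.pdf", "924677.pdf", "870521.pdf",
}

def filter_input(paths):
    """Keep only paths whose basename is not in the exclusion set."""
    return [p for p in paths if p.rsplit("/", 1)[-1] not in EXCLUDE]

paths = ["hdfs:///govdocs1/020087.pdf", "hdfs:///govdocs1/000001.pdf"]
print(filter_input(paths))  # only 000001.pdf survives
```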
660 exceptions were present in the output: 626 were due to a failure after the text extraction step of PDF validation (no output file was present), 31 were null pointer exceptions, and 3 were other miscellaneous exceptions. Due to the nature of the dataset these sorts of errors are expected, and given the small number of issues they are fixable. The files that failed are identifiable, so testing could be run against those in particular.
Selected results from Flint output:
||Test||Number of files||
|DRM detected (NOTE: checks are not currently made against print/copy restrictions etc.)| |
|Well formed failure| |
|Policy validation (PV) execution failure (where policy validation failed, no PV results are output)|26184|
|PV encryption check (present)| |
|PV damaged fonts present| |
|PV embedded files present| |
|PV multimedia present| |
As Flint provides results as a table, different policies can be checked against the Flint output at various times. For example, whether or not files contain multimedia may later become an issue, and this can then be determined from the existing results.
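To illustrate applying a policy after the fact, here is a minimal sketch that filters a tabular report for files containing multimedia. The CSV layout and column names are assumptions for illustration, not Flint's actual output schema.

```python
import csv
import io

# Hypothetical tabular report; the column names are assumptions,
# not Flint's actual output schema.
report = """file,drm_detected,multimedia_present
000001.pdf,false,true
000002.pdf,false,false
"""

def files_with_multimedia(report_text):
    """Return the names of files flagged as containing multimedia."""
    rows = csv.DictReader(io.StringIO(report_text))
    return [r["file"] for r in rows if r["multimedia_present"] == "true"]

print(files_with_multimedia(report))  # ['000001.pdf']
```

Because the policy check is just a filter over stored results, no re-processing of the PDF files themselves is needed when a new policy question arises.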
Source code: [January 2014] https://github.com/bl-dpt/drmlint/commit/ecca9a28fe095bed6b770e59046d17d7e595fd09
[July 2014] https://github.com/openplanets/flint
Files are kept in HDFS; as the files themselves are relatively small, packing them into SequenceFiles might help in this instance.
There are issues with the JVM, but it was not possible to easily upgrade the JVM on the cluster.
Some really broken files crashed the JVM/libraries, and it is not possible to protect against JVM crashes. There are known issues with some of the input files in Govdocs1 (see the issue reports above).
Testing with the policy checks takes approximately three times as long as the basic checks. Extrapolating from the test dataset for this evaluation, it would be possible to process 1TB of PDF files, with policy checks, in less than 4.5 days on our Hadoop cluster. This is acceptable for use on a routine basis, should that be necessary. Although PDF files can be relatively small, Flint's execution speed does not appear to suffer from the small files problem: its processing is CPU-bound, not I/O-bound, as evidenced by the 9.55GB/h processing speed (i.e. a 2.7MB/s read speed).
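The extrapolation above can be checked from the stated 9.55GB/h figure alone:

```python
# Sanity-check the extrapolation using only the stated 9.55 GB/h figure.
GB_PER_HOUR = 9.55

# Express 9.55 GB/h as a read speed in MB/s (1 GB = 1024 MB).
mb_per_s = GB_PER_HOUR * 1024 / 3600
print(f"read speed: {mb_per_s:.1f} MB/s")  # ~2.7 MB/s

# Time to process 1 TB (1024 GB) at that rate.
days_for_1tb = 1024 / GB_PER_HOUR / 24
print(f"1 TB in {days_for_1tb:.2f} days")  # ~4.47 days, i.e. under 4.5
```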