
Evaluator(s)

William Palmer, British Library

Evaluation points

Assessment of measurable points
Metric | Metric goal | January 2014 (Baseline) | July 2014
TotalRuntime | | 4:28:00 (hh:mm:ss) | 13:22:42 (hh:mm:ss) [0]
TotalObjects | | 231683 | 231671 [1]
NumberOfObjectsPerHour | | 51869.32836 | 17316
ThroughputGbytesPerHour | | 28.61674477 | 9.55
ReliableAndStableAssessment | TRUE | TRUE | TRUE [2]
NumberOfFailedFiles | 0 | 0 | 12 [2] [3] + 660 Exceptions [4]
NumberOfFailedFilesAcceptable | - | TRUE | TRUE [1]

Note on the January 2014 (Baseline) run: this covered only the DRM and validity checks.
Note on the July 2014 run: the tool was renamed Flint and performed the DRM and validity checks, along with a policy check for PDF files.

[0] Note that this runtime includes additional policy checks (see the selected policy validation (PV) results in the table below)

[1] Twelve fewer files were used, as those files were found to crash or hang the JVM (in the policy check stage). This is due to a combination of factors, including the use of a relatively old JVM on the Hadoop cluster, the use of in-development software, and the files potentially being corrupt; further investigation is required. Upgrading the JVMs on the entire cluster may well solve these issues. Removing these files from the input data meant that the test run could be completed successfully.

[2] The run completed successfully; however, twelve files had to be excluded from the final run (see [1]). Additionally, policy validation execution failed for 26184 files. We have found issues with corrupt/broken files in the Govdocs1 corpus and have reported them to Apache PDFBox (see https://issues.apache.org/jira/browse/PDFBOX-1756, https://issues.apache.org/jira/browse/PDFBOX-1757, https://issues.apache.org/jira/browse/PDFBOX-1761, https://issues.apache.org/jira/browse/PDFBOX-1762, https://issues.apache.org/jira/browse/PDFBOX-1769, https://issues.apache.org/jira/browse/PDFBOX-1774 and https://issues.apache.org/jira/browse/PDFBOX-1795).

[3] The excluded failed files were: 020087.pdf 165487.pdf 289451.pdf 289452.pdf 383325.pdf 299694.pdf 375118.pdf 451665.pdf 451675.pdf 526572.pdf 924677.pdf 870521.pdf

[4] 660 exceptions were present in the output. 626 were due to a failure after the text extraction step of PDF validation (no output file was present), 31 were null pointer exceptions, and 3 were other miscellaneous exceptions. Due to the nature of the dataset these sorts of errors are expected, and given the small number of issues they are fixable. The files that failed are identifiable, so testing could be run against those files in particular.
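For reference, the sketch below shows one way the failing files could be pulled out of the job output for targeted re-testing. It is a minimal sketch, assuming a hypothetical output layout in which each record is a single tab-separated line with the filename in the first column and any exception text appearing on the same line; the real Flint/Hadoop output format may differ.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

public class ListFailedFiles {
    public static void main(String[] args) throws IOException {
        // Assumed layout: one tab-separated record per file, filename in the
        // first column, exception text (if any) somewhere on the same line.
        Set<String> failed = new TreeSet<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            if (line.contains("Exception")) {
                failed.add(line.split("\t")[0]);
            }
        }
        // Print the distinct filenames so they can be re-run in isolation
        failed.forEach(System.out::println);
    }
}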

Selected results from Flint output:

Test | Number of files | % of total objects | Notes
DRM detected | 9791 | 4.2% | Checks are not currently made against print/copy restrictions etc.
Well formed failure | 19166 | 8.3% |
Policy Validation execution failure | 26184 | 11.3% | Cases where policy validation failed (and no PV results are output)
PV encryption check (present) | 6609 | 2.9% |
PV damaged fonts present | 89 | 0.04% |
PV javascript present | 221 | 0.1% |
PV embedded files present | 4873 | 2.1% |
PV multimedia present | 66 | 0.03% |

As Flint provides its results as a table, different policies can be checked against the Flint output at various times. For example, whether or not files contain multimedia may become an issue, and this can then be determined from the existing results.
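As an illustration, the sketch below filters such a results table for one policy (multimedia present). It is a minimal sketch, assuming a hypothetical tab-separated export of the Flint results with a header row containing "file" and "multimediaPresent" columns; the actual column names and output format used by Flint may differ.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class MultimediaPolicyCheck {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            // Assumed layout: tab-separated, header row first, one column per
            // check result, filename in a "file" column.
            List<String> header = Arrays.asList(in.readLine().split("\t"));
            int fileCol = header.indexOf("file");
            int mmCol = header.indexOf("multimediaPresent");
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\t");
                // Report every file flagged as containing multimedia
                if (mmCol < cols.length && "true".equalsIgnoreCase(cols[mmCol])) {
                    System.out.println(cols[fileCol]);
                }
            }
        }
    }
}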

Assessment of non-measurable points

For some evaluation points it makes most sense to give a textual description/explanation.

Please include a note about any goals/objectives omitted, and why.

Technical details

Remember to include relevant information, links and versions about the workflow, tools and APIs used (e.g. Taverna, command line, Hadoop, links to MyExperiment, links to tools or SCAPE names, links to distinct versions of specific components/tools in the component registry).

Platform: http://wiki.opf-labs.org/display/SP/BL+Hadoop+Platform

Source code: [January 2014] https://github.com/bl-dpt/drmlint/commit/ecca9a28fe095bed6b770e59046d17d7e595fd09

[July 2014] https://github.com/openplanets/flint

Evaluation notes

Could be such things as identified issues, workarounds, data preparation, if not already included above

Files are kept in HDFS; it is possible that SequenceFiles might help in this instance, as the files themselves are relatively small.
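The sketch below shows one way the small PDFs could be packed into a Hadoop SequenceFile before processing (filename as key, raw bytes as value), using the standard org.apache.hadoop.io.SequenceFile API. It is a minimal sketch, not part of the current workflow; the local/HDFS paths and the Text/BytesWritable key/value choice are assumptions.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackPdfsToSequenceFile {
    // args[0]: local directory containing PDFs; args[1]: target SequenceFile path on HDFS
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
                new Path(args[1]), Text.class, BytesWritable.class)) {
            for (File pdf : new File(args[0]).listFiles((dir, name) -> name.endsWith(".pdf"))) {
                // One key/value pair per PDF: key = filename, value = raw file bytes
                byte[] data = Files.readAllBytes(pdf.toPath());
                writer.append(new Text(pdf.getName()), new BytesWritable(data));
            }
        }
    }
}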

There are issues with the JVM, but it was not possible to easily upgrade the JVM on the cluster.

Some badly broken files crashed the JVM/libraries, and it is not possible to protect against JVM crashes. There are known issues with some of the input files in Govdocs1 (see the issue reports above).

Conclusion

Testing with the policy checks takes approximately three times as long as the basic checks. Extrapolating from the test dataset used in this evaluation, it would be possible to process 1 TB of PDF files, with policy checks, in less than 4.5 days on our Hadoop cluster. This is acceptable for use on a routine basis, should that be necessary. Although PDF files can be relatively small, Flint's execution speed does not appear to suffer from the small-files problem: its processing is CPU bound, not I/O bound, as evidenced by the 9.55 GB/hour processing speed (i.e. a read speed of only around 2.7 MB/s).
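As a rough check on that extrapolation, using the measured July 2014 throughput and taking 1 TB as 1024 GB: 1024 GB / 9.55 GB per hour ≈ 107 hours ≈ 4.5 days; similarly, 9.55 GB per hour ≈ (9.55 × 1024) MB / 3600 s ≈ 2.7 MB/s.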
