Investigator(s)

William Palmer, British Library

Datasets

Govdocs1 Corpus

  1. Using the Govdocs1 corpus (231,683 PDFs / 127.8 GB) for initial testing - http://digitalcorpora.org/corpora/files/
  2. Seeking access to an internal dataset of PDFs (~40k) (not currently tested)
  3. A very small internal-only dataset of EPUBs (not currently tested)

Platform

BL Hadoop Platform

Workflow

The workflow uses a simple Hadoop MapReduce program (FlintHadoop) to execute Flint over input files stored in HDFS. Using sequence files for input would require additional changes to the code, and the benefit may be minimal.

Flint currently uses Apache PDFBox, iText, JHOVE, EpubCheck, Tika and Calibre, along with its own code, to determine whether files are valid and whether or not they contain DRM. Results from each test/tool are provided in the output XML.

For PDF files, all of the tools/libraries used for analysis are written in Java.

DRMLint: https://github.com/willp-bl/drmlint

Flint source code: https://github.com/openplanets/flint

The MapReduce steps are as follows:

Map: retrieve the file from HDFS and run Flint on it. Flint checks for validity and DRM, validates against a policy, produces a report XML file, and additionally extracts text from the PDF. The report and extracted text are placed into a zip file that is stored in HDFS.

Reduce: process each of the outputs from Flint and produce a CSV that has one line per file containing the detailed results.
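
As an illustration of these two steps, the sketch below shows a cut-down mapper and reducer in the style of FlintHadoop. The class names and the runFlint placeholder are simplified assumptions for this example, not the actual FlintHadoop API; the real code is in the repository linked above.

    import java.io.File;
    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Simplified sketch of the map and reduce steps described above.
    public class FlintHadoopSketch {

        public static class FlintMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Each input record is assumed to contain the HDFS path of one file.
                Path hdfsFile = new Path(value.toString());
                FileSystem fs = FileSystem.get(context.getConfiguration());

                // Copy the file from HDFS to the local disk of the task node.
                File local = File.createTempFile("flint", ".tmp");
                local.delete();
                fs.copyToLocalFile(hdfsFile, new Path(local.getAbsolutePath()));

                // Placeholder for running Flint on the local copy: validity, DRM and
                // policy checks plus text extraction, with the report XML and
                // extracted text zipped and written back to HDFS.
                String summary = runFlint(local);

                context.write(new Text(hdfsFile.getName()), new Text(summary));
            }

            // Placeholder only; the real invocation lives in FlintHadoop.
            private String runFlint(File f) {
                return f.getName() + ",not-checked";
            }
        }

        public static class FlintReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text fileName, Iterable<Text> summaries, Context context)
                    throws IOException, InterruptedException {
                // Emit one CSV line per input file containing the per-tool results.
                for (Text line : summaries) {
                    context.write(fileName, line);
                }
            }
        }
    }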

Flint contains the following checks for PDF files (all Java code):

Validity:

  • Check with Apache PDFBox
    • Runs Apache Preflight - if a PDF syntax error is detected, the file fails validation
    • Then tries to extract text from the PDF using Apache PDFBox
  • Check with iText
    • Try to extract text from each page of the PDF; the file fails the validity check if errors are encountered
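
The snippet below is a rough sketch of these two validity checks, calling the PDFBox 1.8.x Preflight and iText 5.x APIs directly. Flint wraps the libraries in its own classes, so the class and method names here are illustrative only.

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.preflight.PreflightDocument;
    import org.apache.pdfbox.preflight.ValidationResult;
    import org.apache.pdfbox.preflight.parser.PreflightParser;
    import org.apache.pdfbox.util.PDFTextStripper;

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;

    // Illustrative only: Flint wraps these libraries in its own classes.
    public class ValiditySketch {

        // PDFBox: run Preflight, then attempt text extraction.
        static boolean checkWithPdfbox(File pdf) {
            try {
                PreflightParser parser = new PreflightParser(pdf);
                parser.parse();
                PreflightDocument preflight = parser.getPreflightDocument();
                preflight.validate();
                ValidationResult result = preflight.getResult();
                preflight.close();
                if (!result.isValid()) {
                    return false; // PDF syntax error detected -> fail validation
                }
                PDDocument doc = PDDocument.load(pdf);
                new PDFTextStripper().getText(doc); // extraction errors -> fail
                doc.close();
                return true;
            } catch (IOException e) {
                return false;
            }
        }

        // iText: try to extract text from every page; any error fails the check.
        static boolean checkWithItext(File pdf) {
            try {
                PdfReader reader = new PdfReader(pdf.getAbsolutePath());
                for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                    PdfTextExtractor.getTextFromPage(reader, page);
                }
                reader.close();
                return true;
            } catch (Exception e) {
                return false;
            }
        }
    }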

DRM:

  • Check PDDocument.isEncrypted() with Apache PDFBox
  • Manual scan for "/encrypt" keyword in the PDF
  • Check PdfReader.isEncrypted() with iText
  • NOTE: checks are not currently made against print/copy restrictions, etc.
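
A minimal sketch of the three DRM checks, again using PDFBox 1.8.x and iText 5.x directly. The keyword scan shown here reads the whole file into memory and searches case-insensitively, which is an assumption made for this example rather than a description of Flint's actual code.

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    import org.apache.pdfbox.pdmodel.PDDocument;

    import com.itextpdf.text.pdf.PdfReader;

    // Illustrative only: Flint's own DRM-checking code may differ in detail.
    public class DrmSketch {

        // Apache PDFBox encryption flag.
        static boolean pdfboxReportsEncryption(File pdf) throws IOException {
            PDDocument doc = PDDocument.load(pdf);
            try {
                return doc.isEncrypted();
            } finally {
                doc.close();
            }
        }

        // iText encryption flag.
        static boolean itextReportsEncryption(File pdf) throws IOException {
            PdfReader reader = new PdfReader(pdf.getAbsolutePath());
            try {
                return reader.isEncrypted();
            } finally {
                reader.close();
            }
        }

        // Naive keyword scan: read the whole file and look for "/encrypt",
        // ignoring case. Assumes the file fits comfortably in memory.
        static boolean containsEncryptKeyword(File pdf) throws IOException {
            String raw = new String(Files.readAllBytes(pdf.toPath()), "ISO-8859-1");
            return raw.toLowerCase().contains("/encrypt");
        }
    }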

Ideally, the current checks for validity and DRM will be validated against a set of files with a known ground truth.

For the July 2014 evaluation:

DRMLint, now renamed Flint, has been further developed and extended to check PDF files against an institutional policy. This work is based on the Apache PDFBox Preflight/Schematron work from the KB. For example, JavaScript or embedded files can now be detected, and a policy can be defined that will "fail" such PDF files, allowing institutions to validate against their own policies. Full results are in the Reducer output and can easily be post-processed.
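
To give a feel for the kind of rule such a policy can contain, the fragment below is a minimal Schematron sketch in the spirit of the KB work: it fails any report that records JavaScript or embedded files. The element names in the context and test expressions (report, javascript, embeddedFile) are invented for this example and do not match the actual schema used by Flint.

    <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <sch:pattern id="institutional-policy">
        <!-- Element names below are illustrative only -->
        <sch:rule context="/report">
          <sch:assert test="count(//javascript) = 0">PDF must not contain JavaScript</sch:assert>
          <sch:assert test="count(//embeddedFile) = 0">PDF must not contain embedded files</sch:assert>
        </sch:rule>
      </sch:pattern>
    </sch:schema>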

The evaluation completed in July 2014 performs the same checks as before and adds the policy validation check.

Requirements and Policies

Policy statements that relate to this experiment and any evaluation criteria taken from SCAPE metrics

ReliableAndStableAssessment = Is the code reliable and robust and does it handle errors sensibly with good reporting?
NumberOfFailedFiles = 0

Evaluations

Links to results of the experiment using the evaluation template.
