William Palmer, British Library
- Using the Govdocs1 corpus (231,683 PDFs, 127.8 GB) for initial testing: http://digitalcorpora.org/corpora/files/
- Seeking access to an internal dataset of PDFs (~40k) (not currently tested)
- Very small internal-only dataset of EPUBs (not currently tested)
The experiment uses a simple Hadoop MapReduce program (FlintHadoop) to execute Flint over input files stored in HDFS. Using sequence files for input would require additional code changes, and the benefit may be minimal.
Flint currently uses Apache PDFBox, iText, JHOVE, EpubCheck, Tika and Calibre, along with its own code, to determine whether files are valid and whether they contain DRM. Results from each test/tool are provided in the output XML.
For PDF files, all the tools/libraries used for analysis are written in Java.
Flint source code: https://github.com/openplanets/flint
The MapReduce steps are as follows:
Map: retrieve each file from HDFS and run Flint on it. Flint checks for validity and DRM, validates the file against a policy, produces a report XML file, and additionally extracts text from the PDF. The report and extracted text are placed into a zip file that is stored in HDFS.
Reduce: process each of the outputs from Flint and produce a CSV that has one line per file containing detailed results.
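The two steps can be illustrated with a small standard-library sketch. It is not taken from the FlintHadoop source: the class name, the zip entry names (`report.xml`, `text.txt`) and the CSV columns are all hypothetical, and HDFS access and the Flint call itself are left out. It only shows the shape of the data: one zip per input file on the map side, one CSV line per file on the reduce side.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Illustrative sketch only: all names here are hypothetical,
// not taken from the FlintHadoop source.
class FlintStepsSketch {

    // Map side: bundle the report XML and the extracted text into one zip archive.
    static byte[] packageOutputs(String reportXml, String extractedText) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry("report.xml")); // assumed entry name
            zip.write(reportXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("text.txt")); // assumed entry name
            zip.write(extractedText.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reduce side: turn the results for one file into one CSV line.
    // The column set (name, valid, DRM, policy) is an assumption for illustration.
    static String toCsvLine(String fileName, boolean valid, boolean drmDetected, boolean policyPassed) {
        return String.join(",",
                quote(fileName),
                String.valueOf(valid),
                String.valueOf(drmDetected),
                String.valueOf(policyPassed));
    }

    // Quote a field and double any embedded quotes so commas in file names stay safe.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }
}
```

In a real job the bytes returned by `packageOutputs` would be written to HDFS by the mapper, and the reducer would emit one `toCsvLine` result per input file.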
Flint contains the following checks for PDF files (all Java code):
- Check with Apache PDFBox
  - Run Apache Preflight; if a PDF syntax error is detected, the file fails validation
  - Then try to extract text from the PDF using Apache PDFBox
- Check with iText
  - Try to extract text from each page of the PDF; fail validity checks if errors are encountered
- Check PDDocument.isEncrypted() with Apache PDFBox
- Manual scan for "/encrypt" keyword in the PDF
- Check PdfReader.isEncrypted() with iText
- NOTE: checks are not currently made against print/copy restrictions etc.
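The manual keyword scan in the list above could look roughly like the following. This is a minimal sketch, not Flint's actual implementation: the class and method names are hypothetical, and the case-insensitive match is an assumption (the PDF trailer dictionary key is normally written `/Encrypt`). It reads the whole file into memory, which may not suit very large PDFs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative sketch only: not Flint's actual implementation.
class EncryptKeywordScan {

    // Returns true if the raw PDF bytes contain "/encrypt" in any letter case.
    static boolean containsEncryptKeyword(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary
        // stream data in the PDF cannot cause decoding errors.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        // Assumption: match case-insensitively, since the list above
        // writes the keyword in lower case.
        return raw.toLowerCase(Locale.ROOT).contains("/encrypt");
    }
}
```

A scan like this is only a heuristic: the keyword may also appear inside a string or stream unrelated to encryption, which is one reason to report it alongside the PDFBox and iText results rather than on its own.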
Ideally, the current checks for validity and DRM will be validated against a set of files with a known ground truth.
For the July 2014 evaluation:
The evaluation completed in July 2014 performs the same checks as before and adds a policy validation check.
Policy statements that relate to this experiment and any evaluation criteria taken from SCAPE metrics
ReliableAndStableAssessment = Is the code reliable and robust and does it handle errors sensibly with good reporting?
NumberOfFailedFiles = 0
Links to results of the experiment using the evaluation template.