Resources
GoPortis GitHub project
https://github.com/friesey/PdfEventPrep
Installation of VirtualBox
To check that your VirtualBox and Vagrant installs are working, please open a terminal (command line interface) in an empty directory and type:
You should see:
If so, try typing:
This should start up a virtual machine image. It will take a minute or two, and the output should start:
Once it has finished, test that it has worked by trying:
which should give the output:
If that's the case, tidy up by typing:
If anything seems to go amiss, feel free to contact our Technical Lead: carl [at] openplanetsfoundation [dot] org
Software
- JHOVE
Bespoke PDF module used by the digital preservation (DP) community.
- Apache Tika
Open Source characterisation / content extraction tool.
- Apache PDFBox
The Open Source PDF parsing library that powers Apache Tika.
- pdfeh
A wrapper around the PDFBox preflight functionality.
- pdf-preflight
A Ruby pre-flight project on GitHub.
Ideas
Some ideas for tools that would be useful to build during the Hackathon.
- Create a scalable test of whether a PDF file can be opened by Acrobat Reader, using the PDF Library (which, I guess, is not open source)
- Create a scalable comparison workflow by converting both PDF files (original and new representation) to images and comparing them, e.g. with the matchbox tool, to see whether there are visible differences
- Idea from Andres/Slub about PDF repair + QA: save all the PDF objects (e.g. streams, hashmaps, strings, floats) in a list and record the MD5 checksum. Repair the structures of the PDF and put all the objects back. The MD5 checksum should not have changed.
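As a starting point for the repair + QA idea, here is a minimal Python sketch of the checksum comparison only. It assumes the PDF objects have already been extracted as byte strings by some parser; the function names and the sample payloads are hypothetical and not taken from any of the tools listed above. It also assumes that the repair step may reorder objects, so it compares per-object MD5 digests as a multiset rather than hashing the whole list in order.

```python
import hashlib
from collections import Counter


def object_digests(objects):
    """Return a Counter of MD5 hex digests, one per object payload (bytes)."""
    return Counter(hashlib.md5(obj).hexdigest() for obj in objects)


def objects_preserved(before, after):
    """True if both lists hold the same payloads, ignoring their order."""
    return object_digests(before) == object_digests(after)


# Hypothetical payloads standing in for extracted PDF objects.
original = [b"stream-data", b"<< /Type /Page >>", b"(a string)", b"3.14"]
# Same objects after a structural repair that changed their order.
repaired = [b"<< /Type /Page >>", b"stream-data", b"3.14", b"(a string)"]
# One object was altered during repair.
damaged = [b"stream-data", b"<< /Type /Page >>", b"(a strinG)", b"3.14"]

print(objects_preserved(original, repaired))  # True
print(objects_preserved(original, damaged))   # False
```

If the object order is expected to stay fixed, a single MD5 over the concatenated payloads would be a stricter alternative check.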