h1. Resources

h2. GoPortis GitHub project

https://github.com/friesey/PdfEventPrep

h2. Installation of VirtualBox

To check that your VirtualBox and Vagrant installs are working, open a terminal (command line interface) in an empty directory and type:

{code}vagrant init ubuntu/precise32{code}


You should see:
{code}A `Vagrantfile` has been placed in this directory. You are now
ready to `vagrant up` your first virtual environment! Please read
the comments in the Vagrantfile as well as documentation on
`vagrantup.com` for more information on using Vagrant.{code}

If so, try typing:

{code}vagrant up{code}

This should start up a virtual machine image. It will take a minute or two, and the output should begin:

{code}Bringing machine 'default' up with 'virtualbox' provider...
==> default: Importing base box 'ubuntu/precise32'...{code}

Once it has finished, test that it worked by typing:

{code}vagrant ssh
ls /vagrant{code}

This should give the output:

{code}vagrant@vagrant-ubuntu-precise-32:~$ ls /vagrant
Vagrantfile
vagrant@vagrant-ubuntu-precise-32:~${code}

If that's the case, tidy up by typing:

{code}exit
vagrant halt
vagrant destroy{code}
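The manual check above can also be scripted. The following is a minimal sketch, not an official tool: the function name {{smoke_test}} and the {{run}} parameter are illustrative assumptions, and it relies on {{vagrant ssh -c}} to run a single command in the guest and on {{vagrant destroy -f}} to skip the confirmation prompt. It assumes it is started in an empty directory, as in the walkthrough.

```python
import subprocess

# Commands from the walkthrough above, in order. `vagrant ssh -c` runs one
# command in the guest instead of opening an interactive shell.
STEPS = [
    ["vagrant", "init", "ubuntu/precise32"],
    ["vagrant", "up"],
    ["vagrant", "ssh", "-c", "ls /vagrant"],
]

# Tidy-up commands; these run even if one of the steps above fails.
CLEANUP = [
    ["vagrant", "halt"],
    ["vagrant", "destroy", "-f"],  # -f skips the confirmation prompt
]


def smoke_test(run=subprocess.run):
    """Return True if the guest lists the Vagrantfile in /vagrant.

    `run` is a hypothetical hook (defaults to subprocess.run) so the
    function can be exercised without a real Vagrant install.
    """
    try:
        output = ""
        for command in STEPS:
            # check=True raises CalledProcessError if a step exits non-zero.
            result = run(command, capture_output=True, text=True, check=True)
            output = result.stdout  # keep only the output of the last step
        return "Vagrantfile" in output
    finally:
        for command in CLEANUP:
            run(command, capture_output=True, text=True, check=False)
```

Calling {{smoke_test()}} from a Python prompt in an empty directory should return {{True}} on a working install; a failing step raises an exception after the tidy-up commands have run.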

If anything seems to go amiss, feel free to contact our Technical Lead: carl [at] openplanetsfoundation [dot] org

h2. Software

* [JHOVE|https://github.com/gmcgath/jhove] Includes a bespoke PDF module used by the digital preservation (DP) community.
* [Apache Tika|https://tika.apache.org/] Open source characterisation and content extraction tool.
* [Apache PDFBox|http://pdfbox.apache.org] The open source PDF parsing library that powers [Apache Tika|https://tika.apache.org/].
* [pdfeh|https://github.com/openplanets/pdfeh] A wrapper around the PDFBox preflight functionality.
* [pdf-preflight|https://github.com/yob/pdf-preflight] A Ruby preflight project on GitHub.

h1. Ideas

Some ideas for tools that could be useful to build during the Hackathon:

* Create a scalable test of whether a PDF file can be opened by Acrobat Reader, using the Adobe [PDF Library|http://www.adobe.com/devnet/pdf/library.html] (which is probably not open source).
* Create a scalable comparison workflow: convert both PDF files (the original and the new representation) to images and compare them, e.g. with the matchbox tool, to check for visible differences.
* Idea from Andres/Slub on PDF repair + QA: save all the PDF objects (e.g. streams, hashmaps, strings, floats) in a list and record their MD5 checksums. Repair the structure of the PDF and put all the objects back. The MD5 checksums should not have changed.
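The checksum comparison in the repair + QA idea could be sketched as follows. This is only an illustration under stated assumptions: the objects are assumed to have already been extracted from the PDF as raw bytes (the extraction and repair steps are not shown), one MD5 digest is kept per object, and the function names {{object_digests}} and {{changed_objects}} are made up for this example.

```python
import hashlib


def object_digests(objects):
    """Return the MD5 hex digest of each PDF object, given as bytes."""
    return [hashlib.md5(obj).hexdigest() for obj in objects]


def changed_objects(before, after):
    """Return the indices of objects whose digest differs after the repair.

    `before` and `after` are lists of bytes in the same object order.
    An empty result means the repair left every object untouched.
    """
    old, new = object_digests(before), object_digests(after)
    if len(old) != len(new):
        raise ValueError("object count changed during repair")
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
```

For example, {{changed_objects(objects_before, objects_after)}} returning an empty list would indicate that the repair only touched the file structure and not the object contents.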