
| *One line summary* | Check that the METS, OCR, JPEG2000 masters and the PDFs are consistent \\ |
| *Detailed description* | As shown in the diagram below, check the image and ALTO file information declared in the METS against the actual files stored in separate Zip files. Also check the number of pages in the PDF file against the number of files in the images Zip file and in the ALTO Zip file. Report any mismatches. \\ |
| *Solution champion* | [~yf508]\\ |
| *Git link* | [|] |
| *Group Evaluation Notes* | * Excellent write up on wiki of the issue and solution
* Thorough consistency check across the collection
* Issues with non-safe characters in METS
* Would be useful to generate machine readable outputs (XML)
* Much of it configurable in terms of file structures but could do more in this area. Could improve on this in a future version
* Could add extra intelligence to identify file formats and then apply appropriate cross checking. Might require complex workflow - possible Taverna/SCAPE follow up? |
| *Detailed Evaluation* \\ | * How well does the solution meet your issue? The solution cross checks the METS of the .TIFF and the PDF so it meets the initial requirements of the issue. \\
* What more would you like the solution to do? A possible future development is the generation of a machine readable output. \\
* Do you think you can implement the solution in your organisation?
* What further investigation/development/testing would be required before implementation at your organisation?
* Are there any process, workflow or technical obstacles to implementation? We believe that, subject to the usual validation checks, it can be implemented in the institution. \\
* Summarise the benefits to your organisation that the solution could provide? With over 150 000 books digitised in the collection, the solution, especially if it generated machine readable output, would allow the files to be checked without someone having to check each one manually. \\
* What potential exists to apply the solution elsewhere? The potential exists to apply the principle in the solution to other migration file combinations. \\ |
| *Tool* (link) | |

The solution has been developed in Java. It integrates several existing Java libraries, e.g. PDFBox, Apache Commons Compress and dom4j. Thanks also to Carl for sharing Zip processing code.
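
The cross-check described above can be sketched as two set differences plus a count comparison. This is a minimal illustration only, not the tool's actual code: it uses the JDK's `java.util.zip` rather than Apache Commons Compress, and the class and method names (`ConsistencySketch`, `zipEntryNames`, `onlyIn`, `countsMatch`) and the sample file names are hypothetical. Extracting file references from the METS and counting the PDF pages are left out. The sketch assumes those values are already available and that the names declared in the METS are directly comparable to the Zip entry names.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical sketch of the consistency check; not the original tool's code.
class ConsistencySketch {

    /** Returns the names of all non-directory entries in a Zip archive. */
    static Set<String> zipEntryNames(Path zip) throws IOException {
        Set<String> names = new TreeSet<>();
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    /** Returns the elements of a that are missing from b (set difference). */
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> diff = new TreeSet<>(a);
        diff.removeAll(b);
        return diff;
    }

    /** True when the PDF page count equals both the image and ALTO file counts. */
    static boolean countsMatch(int pdfPages, int images, int altos) {
        return pdfPages == images && images == altos;
    }

    public static void main(String[] args) {
        // Sample data standing in for names parsed from a METS file and a Zip file.
        Set<String> inMets = new TreeSet<>(Arrays.asList("0001.jp2", "0002.jp2", "0003.jp2"));
        Set<String> inZip = new TreeSet<>(Arrays.asList("0002.jp2", "0003.jp2", "0004.jp2"));

        System.out.println("Following images only found in METS, not in ZIP: " + onlyIn(inMets, inZip));
        System.out.println("Following images only found in ZIP, not in METS: " + onlyIn(inZip, inMets));

        // Assumed order of the three counts: PDF pages, images, ALTO files.
        int pdfPages = 22, images = 22, altos = 19;
        if (!countsMatch(pdfPages, images, altos)) {
            System.out.println("Mismatch found between PDF and images/ALTOs: "
                    + pdfPages + "," + images + "," + altos);
        }
    }
}
```

In a real run, `zipEntryNames` would supply the Zip side of each comparison, and the same `onlyIn` call would be repeated for the ALTO files.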

The user needs to define the list of METS files to be processed in a config file, e.g.


If no problem is found, the report looks like:

{code}
No problem found with image files.
No problem found with ALTO files.
No problem found in PDF file - all associated images/ALTOs found.
{code}

Otherwise, it looks like:

{code}
Following images only found in METS, not in ZIP:
Following images only found in ZIP, not in METS:
Following ALTO files only found in METS, not in ZIP:
Following ALTO files only found in ZIP, not in METS:
Mismatch found between PDF and images/ALTOs: 22,22,19
{code}

{*}Issue:* [AQuA:Inconsistencies between metadata and content]