Check that the METS, OCR, JPEG2000 masters and the PDFs are consistent
Detailed description
As shown in the diagram below, the solution checks the image and ALTO file information defined in the METS against the actual files stored in separate Zip files. It also checks the number of pages in the PDF file against the number of files in the images Zip file and in the ALTO Zip file. Any mismatches are reported.
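The cross-check described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the class and method names (`ConsistencyCheck`, `zipEntryNames`, `compare`, `comparePageCount`) are hypothetical, it uses the JDK's `java.util.zip` rather than Apache Commons Compress, and it assumes that the file names referenced in the METS match the Zip entry names exactly. The PDF page count is passed in as a plain integer; in the real tool it would presumably be obtained with PDFBox.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper class; names are illustrative only.
final class ConsistencyCheck {

    /** Returns the names of all non-directory entries in a Zip file, sorted. */
    static Set<String> zipEntryNames(Path zip) throws IOException {
        Set<String> names = new TreeSet<>();
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    /**
     * Compares the file names listed in the METS with the names found in a Zip.
     * Assumes both sides use exactly the same naming (no path prefixes, no escaping).
     */
    static List<String> compare(String label, Set<String> inMets, Set<String> inZip) {
        List<String> problems = new ArrayList<>();
        // Sorted copies keep the report order deterministic.
        for (String name : new TreeSet<>(inMets)) {
            if (!inZip.contains(name)) {
                problems.add(label + ": listed in METS but missing from Zip: " + name);
            }
        }
        for (String name : new TreeSet<>(inZip)) {
            if (!inMets.contains(name)) {
                problems.add(label + ": present in Zip but not listed in METS: " + name);
            }
        }
        return problems;
    }

    /** Compares the PDF page count with the number of image and ALTO files. */
    static List<String> comparePageCount(int pdfPages, int imageCount, int altoCount) {
        List<String> problems = new ArrayList<>();
        if (pdfPages != imageCount) {
            problems.add("PDF has " + pdfPages + " pages but images Zip has " + imageCount + " files");
        }
        if (pdfPages != altoCount) {
            problems.add("PDF has " + pdfPages + " pages but ALTO Zip has " + altoCount + " files");
        }
        return problems;
    }
}
```

A caller would collect the lists returned by `compare` (once for the images Zip, once for the ALTO Zip) and by `comparePageCount`, and report the book as consistent only if all of them are empty.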
Excellent write up on wiki of the issue and solution
Thorough consistency check across the collection
Issues with non-safe characters in METS
Would be useful to generate machine readable outputs (XML)
Much of it is configurable in terms of file structures, but more could be done in this area. This could be improved in a future version
Could add extra intelligence to identify file formats and then apply the appropriate cross-checking. This might require a complex workflow; a possible Taverna/SCAPE follow-up?
Detailed Evaluation
How well does the solution meet your issue? The solution cross checks the METS of the .TIFF and the PDF so it meets the initial requirements of the issue.
What more would you like the solution to do? A possible future development is the generation of a machine readable output.
Do you think you can implement the solution in your organisation? And
What further investigation/development/testing would be required before implementation at your organisation?
Are there any process, workflow or technical obstacles to implementation? We believe that, subject to the usual validation checks, it can be implemented in the institution.
Summarise the benefits to your organisation that the solution could provide? With over 150 000 books digitised in the collection, the solution, especially if it generated a machine-readable output, would allow the files to be checked without someone having to check each one manually.
What potential exists to apply the solution elsewhere? The potential exists to apply the principle in the solution to other migration file combinations.
Tool (link)
The solution has been developed in Java. It makes use of several existing Java libraries, e.g. PDFBox, Apache Commons Compress and dom4j. Thanks also to Carl for sharing Zip processing code.
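To give an idea of the METS side of the check, the sketch below collects the file locations referenced by `FLocat` elements in a METS document. It is an illustration only and not taken from the tool: the class and method names (`MetsFileRefs`, `fileRefs`) are hypothetical, and it uses the JDK's built-in DOM parser instead of dom4j so that it runs without extra dependencies. It assumes the files are referenced through `xlink:href` attributes in the standard METS and XLink namespaces.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper class; names are illustrative only.
final class MetsFileRefs {
    static final String METS_NS = "http://www.loc.gov/METS/";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    /** Returns the xlink:href value of every mets:FLocat element, in document order. */
    static List<String> fileRefs(InputStream metsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace awareness is required for the *NS lookups below.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(metsXml);

            NodeList locations = doc.getElementsByTagNameNS(METS_NS, "FLocat");
            List<String> refs = new ArrayList<>();
            for (int i = 0; i < locations.getLength(); i++) {
                Element location = (Element) locations.item(i);
                // getAttributeNS returns an empty string when the attribute is absent.
                String href = location.getAttributeNS(XLINK_NS, "href");
                if (!href.isEmpty()) {
                    refs.add(href);
                }
            }
            return refs;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot parse METS document", e);
        }
    }
}
```

The returned list is in the order the files appear in the METS and can be turned into a set of names for comparison with the entries of the corresponding Zip file.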
The user needs to define the list of METS files to be processed in a config file, e.g.