Check consistency between metadata and content

One line summary Check that the METS, OCR, JPEG2000 masters and the PDFs are consistent
Detailed description As shown in the diagram below, check the image and ALTO file information declared in the METS against the actual files stored in separate Zip files. Also check the number of pages in the PDF file against the number of files in the images Zip file and in the ALTO Zip file respectively. Report any mismatches.
Solution champion Frank Feng
Git link
Group Evaluation Notes
  • Excellent write up on wiki of the issue and solution
  • Thorough consistency check across the collection
  • Issues with non-safe characters in METS
  • Would be useful to generate machine readable outputs (XML)
  • Much of it configurable in terms of file structures but could do more in this area. Could improve on this in a future version
  • Could add extra intelligence to identify file formats and then apply appropriate cross checking. Might require complex workflow - possible Taverna/SCAPE follow up?
Detailed Evaluation
  • How well does the solution meet your issue? The solution cross-checks the METS of the .TIFF and the PDF, so it meets the initial requirements of the issue.
  • What more would you like the solution to do? A possible future development is the generation of a machine readable output.
  • Do you think you can implement the solution in your organisation? And
  • What further investigation/development/testing would be required before implementation at your organisation?
  • Are there any process, workflow or technical obstacles to implementation? We believe that, subject to the usual validation checks, it can be implemented in the institution.
  • Summarise the benefits to your organisation that the solution could provide? With over 150 000 books digitised in the collection, the solution, especially if it generated a machine readable output, would allow the files to be checked without someone having to check each one manually.
  • What potential exists to apply the solution elsewhere? The potential exists to apply the principle in the solution to other migration file combinations.
Tool (link)  

The solution has been developed in Java. It uses several existing Java libraries, e.g. PDFBox, Apache Commons Compress and dom4j. Thanks also to Carl for sharing his Zip processing code.
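The cross check described in the detailed description could be sketched roughly as follows. This is not the tool's actual code: the class and method names are hypothetical, and to stay self-contained the sketch uses only JDK classes (java.util.zip and the built-in DOM parser) rather than Apache Commons Compress and dom4j. It assumes that the METS references each file through a mets:FLocat element with an xlink:href attribute, and that files are matched by base name. The PDF page count is passed in as a plain number; with PDFBox it could be obtained from PDDocument.getNumberOfPages().

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only.
final class ConsistencyCheck {

    static final String METS_NS = "http://www.loc.gov/METS/";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    // Base names of the files referenced by mets:FLocat/@xlink:href
    // whose name ends with the given suffix (e.g. ".jp2" or ".xml").
    static Set<String> filesDeclaredInMets(InputStream mets, String suffix) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(mets);
        NodeList locations = doc.getElementsByTagNameNS(METS_NS, "FLocat");
        Set<String> names = new TreeSet<>();
        for (int i = 0; i < locations.getLength(); i++) {
            String href = ((Element) locations.item(i)).getAttributeNS(XLINK_NS, "href");
            String name = baseName(href);
            if (name.toLowerCase(Locale.ROOT).endsWith(suffix.toLowerCase(Locale.ROOT))) {
                names.add(name);
            }
        }
        return names;
    }

    // Base names of all non-directory entries in a Zip stream.
    static Set<String> filesInZip(InputStream zip) throws IOException {
        Set<String> names = new TreeSet<>();
        try (ZipInputStream in = new ZipInputStream(zip)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    names.add(baseName(entry.getName()));
                }
            }
        }
        return names;
    }

    static String baseName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Returns one message per mismatch; an empty list means "consistent".
    static List<String> compare(Set<String> declared, Set<String> actual,
                                int pdfPages, String label) {
        List<String> problems = new ArrayList<>();
        for (String name : declared) {
            if (!actual.contains(name)) {
                problems.add(label + ": declared in METS but missing from Zip: " + name);
            }
        }
        for (String name : actual) {
            if (!declared.contains(name)) {
                problems.add(label + ": present in Zip but not declared in METS: " + name);
            }
        }
        if (pdfPages != actual.size()) {
            problems.add(label + ": PDF has " + pdfPages + " pages but Zip has "
                    + actual.size() + " files");
        }
        return problems;
    }
}
```

In this sketch the same compare call would be made twice per book, once for the images Zip and once for the ALTO Zip, and the resulting messages collected into the report.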

The user needs to define the list of METS files to be processed in a config file, e.g.

If no problems are found, the report looks like:

Otherwise, it looks like:

Issue: Inconsistencies between metadata and content
