Detailed description
The French National Library has begun creating EPUBs from its digitized books. These books are also available as images (TIFF masters, JPEG dissemination copies) and increasingly as full text through OCR, in a local version of the Alto format.
Would there be a way to check the EPUBs against the digitized books to make sure no image or text is missing?
Would that be productive given the processes of our contractors, who apparently derive EPUBs from OCR in some capacity?

Conversely, would we be able to tell from the EPUB that no page is missing in the digitized book, for instance by using text analysis of the EPUB content?
This question possibly also concerns other institutions trying to make their heritage content more readily available to mobile users, as well as publishers repackaging their books for new modes of distribution.
Context
So far, BnF has commissioned EPUBs created from previously digitized books that it wished to enrich. Soon, it will start having EPUBs produced in an integrated workflow starting with the original book, including digitization, OCR, EPUB creation and XHTML table of contents creation.

From what we know of our contractors' workflow, they derive both OCR and EPUB from the scanned images.
At some point, the workflow bifurcates:
  • on one side, the OCRed text is corrected for better spelling and reformatted into chapters,
  • on the other, the Alto OCR files are created in a different way, page by page, with a decomposition into "blocks" that correspond to the position of the elements on the scanned pages.
    The EPUB and Alto files thus each have different enrichments and corrections compared to the TIFF images. How can we make sure that the content remains the same throughout the different representations of the book?
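
One possible way to approach this comparison is sketched below. It is only a rough illustration under several assumptions: that the local Alto variant stores recognized words in the `CONTENT` attribute of `String` elements (as standard ALTO does), that the EPUB text lives in `.xhtml` or `.html` files inside the archive, and that a simple word-overlap ratio is an acceptable signal. The function names (`alto_words`, `epub_words`, `coverage`) are hypothetical. Because the EPUB text has been corrected, a perfect match is not expected; a very low score for a page would only suggest that its text may be missing from the EPUB.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter
from html.parser import HTMLParser

WORD_RE = re.compile(r"\w+")


def alto_words(alto_xml: bytes) -> Counter:
    """Count the words found in the CONTENT attribute of ALTO String elements."""
    counts = Counter()
    for elem in ET.fromstring(alto_xml).iter():
        # Ignore the namespace so that a local ALTO variant is still matched.
        if elem.tag.rsplit("}", 1)[-1] == "String":
            text = elem.get("CONTENT", "")
            counts.update(w.lower() for w in WORD_RE.findall(text))
    return counts


class _TextExtractor(HTMLParser):
    """Collect the character data of an (X)HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def epub_words(epub_bytes: bytes) -> Counter:
    """Count the words found in every (X)HTML file of an EPUB archive."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextExtractor()
                parser.feed(zf.read(name).decode("utf-8", errors="replace"))
                parser.close()
                text = " ".join(parser.chunks)
                counts.update(w.lower() for w in WORD_RE.findall(text))
    return counts


def coverage(alto_pages: dict, epub_bytes: bytes) -> dict:
    """For each ALTO page, return the share of its word occurrences
    that can also be found somewhere in the EPUB (between 0.0 and 1.0)."""
    available = epub_words(epub_bytes)
    report = {}
    for page_id, alto_xml in alto_pages.items():
        words = alto_words(alto_xml)
        total = sum(words.values())
        matched = sum((words & available).values())  # multiset intersection
        report[page_id] = matched / total if total else 1.0
    return report
```

Here `alto_pages` would map a page identifier to the raw bytes of its Alto file. Pages whose score falls well below the others are candidates for manual inspection; the threshold would have to be tuned on real data, since spelling corrections and hyphenation lower the score even when nothing is missing.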

    (Potential extra information to collect from the EPUB files:
    We have requested that, in the near future, the images in the EPUB be named after the serial number of the TIFF file from which they are derived.
    We have also considered adding the Alto coordinates of the image blocks to the image names, since in Alto the blocks on the page are tagged. We could then potentially make sure that the "illustration" blocks are all present in the EPUB, once the "GraphicalElement" blocks have been properly excluded.)
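
As an illustration of what such a check could look like, here is a minimal sketch. It assumes that illustration blocks are tagged as `Illustration` elements in the Alto files (as in standard ALTO), and that each EPUB image file name contains the fixed-width serial number of its source TIFF. That naming convention is the one requested above, but its exact form is not yet known, so the matching rule and the function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg")


def count_illustrations(alto_xml: bytes) -> int:
    """Count Illustration blocks in one ALTO page (GraphicalElement is ignored)."""
    return sum(
        1
        for elem in ET.fromstring(alto_xml).iter()
        if elem.tag.rsplit("}", 1)[-1] == "Illustration"
    )


def epub_image_names(epub_bytes: bytes) -> list:
    """List the image files contained in an EPUB archive."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return [n for n in zf.namelist() if n.lower().endswith(IMAGE_EXTENSIONS)]


def missing_illustrations(alto_by_serial: dict, epub_bytes: bytes) -> dict:
    """Return {TIFF serial: number of illustrations not found in the EPUB}.

    Assumption: an EPUB image belongs to a page when its file name
    contains that page's TIFF serial number.
    """
    names = [n.rsplit("/", 1)[-1] for n in epub_image_names(epub_bytes)]
    shortfall = {}
    for serial, alto_xml in alto_by_serial.items():
        expected = count_illustrations(alto_xml)
        found = sum(1 for n in names if serial in n)
        if found < expected:
            shortfall[serial] = expected - found
    return shortfall
```

An empty result would mean that every page has at least as many EPUB images as Alto illustration blocks. If block coordinates were also added to the image names, the match could be made block by block rather than by simple counting.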
Datasets
The first 159 EPUBs created are currently downloadable at (subject to change as the interface evolves).

All have image versions; some have been OCRed. TIFFs and Alto files might be available from BnF for a specific project.
