Check content of e-pub against digitized book

Title
Check content of e-pub against digitized book
Detailed description The French National Library has begun creating EPUBs from its digitized books. They are also available as images (TIFF masters, JPEG dissemination) and increasingly as full text through OCR, in a local version of the Alto format: http://bibnum.bnf.fr/ns/alto_prod.xsd
Would there be a way to check the EPUBs against the digitized books to make sure no image or text is missing?
Would that be productive given the processes of our contractors, who apparently derive EPUBs from OCR in some capacity?

Conversely, would we be able to tell from the EPUB that no page is missing in the digitized book, for instance through text analysis of the EPUB content?
Issue champion Louise Fauduet
Other interested parties
Possibly other institutions trying to make their heritage content more readily available to mobile users, as well as publishers repackaging their books for new modes of distribution.
Possible Solution approaches
Context So far, BnF has commissioned EPUBs created from previously digitized books that it wished to enrich. Soon, it will start having EPUBs produced in an integrated workflow that starts with the original book and includes digitization, OCR, EPUB creation and XHTML table of contents creation.

From what we know of our contractors' workflow, they derive both the OCR and the EPUB from the scanned images.
At some point, the workflow splits into two branches:
  • in one, the OCR text is corrected for spelling and reformatted into chapters for the EPUB,
  • in the other, the Alto OCR files are created page by page, with the content broken down into "blocks" that correspond to the position of the elements on the scanned pages.
    The EPUB and Alto files thus each have different enrichments and corrections compared to the TIFF images. How can we make sure that the content remains the same throughout the different representations of the book?
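One way to approach this question is a rough word-overlap check between each Alto page and the whole EPUB text. The sketch below is only an illustration under several assumptions: Alto words are read from the `CONTENT` attribute of `String` elements regardless of namespace, the EPUB chapters are well-formed XHTML without undeclared entities, and the function names and the 0.8 threshold are hypothetical. Because the EPUB text has been corrected and the Alto text has not, a perfect match is not expected; only pages with very low overlap are flagged.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def alto_words(alto_xml):
    """Collect the words of one Alto page from its String/@CONTENT values."""
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        # Compare local names so the check works with any Alto namespace.
        if el.tag.rsplit("}", 1)[-1] == "String":
            words.extend(tokenize(el.get("CONTENT", "")))
    return words


def xhtml_words(xhtml):
    """Collect the words of one EPUB chapter (assumed well-formed XHTML)."""
    root = ET.fromstring(xhtml)
    return tokenize(" ".join(root.itertext()))


def page_coverage(page_words, epub_counts):
    """Fraction of a page's words that occur somewhere in the EPUB."""
    if not page_words:
        return 1.0  # a blank page cannot be "missing" text
    found = sum(1 for w in page_words if epub_counts[w] > 0)
    return found / len(page_words)


def suspicious_pages(alto_pages, epub_chapters, threshold=0.8):
    """Return the names of Alto pages whose text is poorly covered by the EPUB.

    alto_pages: dict mapping a page name to its Alto XML string.
    epub_chapters: list of XHTML strings taken from the EPUB.
    threshold: hypothetical cut-off, to be tuned on real data.
    """
    epub_counts = Counter()
    for chapter in epub_chapters:
        epub_counts.update(xhtml_words(chapter))
    return [
        name
        for name, xml in alto_pages.items()
        if page_coverage(alto_words(xml), epub_counts) < threshold
    ]
```

A bag-of-words comparison ignores word order, so it would not notice pages that are present but out of sequence. It is meant as a cheap first filter before any finer alignment.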


    (Potential extra information to collect from the EPUB files:
    We have requested that, in the near future, the images in the EPUB be named after the serial number of the TIFF file from which they are derived.
    We have also considered adding the Alto coordinates of the image blocks to the image names, since in Alto the blocks on the page are tagged. We could then potentially make sure that the "illustration" blocks are all present in the EPUB, once the "GraphicalElement" blocks have been properly excluded.)
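If the naming convention described above were in place, the illustration check could be sketched as follows. Everything about the convention is an assumption here: the code supposes that the first run of digits in an image file name is the TIFF serial number, that Alto pages are keyed by that same number, and that illustrations appear as `Illustration` elements (with `GraphicalElement` simply ignored). The function names are hypothetical.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg")


def epub_image_serials(epub_file, pattern=r"(\d+)"):
    """Return the serial numbers found in the image file names of an EPUB.

    epub_file: path or file-like object for the EPUB (a ZIP archive).
    pattern: hypothetical regex; its first group must capture the serial.
    """
    with zipfile.ZipFile(epub_file) as archive:
        names = [
            n for n in archive.namelist()
            if n.lower().endswith(IMAGE_EXTENSIONS)
        ]
    serials = set()
    for name in names:
        # Keep only the file name without directory or extension.
        stem = name.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        match = re.search(pattern, stem)
        if match:
            serials.add(int(match.group(1)))
    return serials


def pages_with_illustrations(alto_pages):
    """Return the serials of Alto pages containing an Illustration block.

    alto_pages: dict mapping a TIFF serial number to its Alto XML string.
    GraphicalElement blocks are not counted.
    """
    result = set()
    for serial, xml in alto_pages.items():
        root = ET.fromstring(xml)
        if any(el.tag.rsplit("}", 1)[-1] == "Illustration" for el in root.iter()):
            result.add(serial)
    return result


def missing_illustrations(epub_file, alto_pages):
    """List pages that have illustrations in Alto but no image in the EPUB."""
    return sorted(pages_with_illustrations(alto_pages) - epub_image_serials(epub_file))
```

This only checks that each illustrated page contributes at least one image. Counting individual blocks per page would require the block coordinates or identifiers in the image names as well.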
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice.
Datasets The first 159 EPUBs created are currently downloadable at http://gallica.bnf.fr/ebooks?&lang=EN (subject to change as the interface evolves).

All have image versions; some have been OCRed. TIFFs and Alto files might be available from BnF for a specific project.
Solutions Reference to the appropriate Solution page(s), by hyperlink.
Labels: opf_montpellier, issue, qa