||Check content of EPUB against digitized book|
|Detailed description|| The French National Library has begun creating EPUBs from its digitized books. They are also available as images (TIFF masters, JPEG dissemination) and increasingly as full text through OCR, in a local version of the Alto format: http://bibnum.bnf.fr/ns/alto_prod.xsd
Would there be a way to check the EPUBs against the digitized books to make sure no image or text is missing?
Would that be productive given the processes of our contractors, who apparently derive EPUBs from OCR in some capacity?
Conversely, would we be able to tell from the EPUB that no page is missing in the digitized book, for instance through text analysis of the EPUB content?
|Issue champion||Louise Fauduet|
| Other interested parties
|| Possibly other institutions trying to make their heritage content more readily available to mobile users, as well as publishers repackaging their books for new modes of distribution.
|Possible Solution approaches||
|Context|| So far, BnF has commissioned EPUBs created from previously digitized books that it wished to enrich. Soon, it will start having EPUBs produced in an integrated workflow that starts with the original book and includes digitization, OCR, EPUB creation and creation of an XHTML table of contents.
From what we know of our contractors' workflow, they derive both the OCR and the EPUB from the scanned images.
At some point, the workflow bifurcates
|Lessons Learned|| Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice.
|Datasets|| The first 159 EPUBs created are currently downloadable at http://gallica.bnf.fr/ebooks?&lang=EN (subject to change as the interface evolves).
All have image versions, and some have been OCRed. TIFFs and Alto files might be available from BnF for a specific project.
|Solutions||Reference to the appropriate Solution page(s), by hyperlink.|
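One possible way to approach the first question in the detailed description (is any text from the digitized book missing in the EPUB?) is to compare the words of each Alto page against the words found anywhere in the EPUB. The following is only a minimal sketch under several assumptions, not a tested solution: it assumes that the BnF Alto variant stores words in the `CONTENT` attribute of `String` elements, as standard ALTO does, that the EPUB content documents are well-formed XHTML without undeclared entities, and that one Alto file corresponds to one page. The function names and the 0.8 threshold are hypothetical and would need tuning on real data.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip the XML namespace so namespaced and plain element names both match."""
    return tag.rsplit("}", 1)[-1]


def tokens(text):
    """Lowercased word tokens; punctuation is ignored."""
    return re.findall(r"\w+", text.lower())


def alto_page_words(alto_xml):
    """Collect the words of one page (assumption: String/@CONTENT holds them)."""
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        if local(el.tag) == "String" and el.get("CONTENT"):
            words.extend(tokens(el.get("CONTENT")))
    return words


def epub_words(epub_file):
    """Count every word in the (X)HTML documents packed inside the EPUB zip."""
    counts = Counter()
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                root = ET.fromstring(archive.read(name))
                counts.update(tokens(" ".join(root.itertext())))
    return counts


def page_coverage(page_words, epub_counts):
    """Share of a page's OCR words that also occur somewhere in the EPUB."""
    if not page_words:
        return 1.0  # blank or image-only page: there is no text to look for
    found = sum(1 for word in page_words if word in epub_counts)
    return found / len(page_words)


def suspicious_pages(alto_pages, epub_file, threshold=0.8):
    """Return (page_id, coverage) for pages whose text is mostly absent from the EPUB.

    alto_pages maps a page identifier to the Alto XML of that page.
    """
    epub_counts = epub_words(epub_file)
    report = []
    for page_id, alto_xml in alto_pages.items():
        ratio = page_coverage(alto_page_words(alto_xml), epub_counts)
        if ratio < threshold:
            report.append((page_id, ratio))
    return report
```

A page flagged by `suspicious_pages` is only a candidate for manual review. Words hyphenated across lines in the OCR, corrections made by the contractor during EPUB creation, and running headers dropped on purpose will all lower the coverage of a page that is in fact present. Because the contractors appear to derive both the OCR and the EPUB from the same scanned images, this check can show text lost between OCR and EPUB, but not pages that were never scanned.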