
Status

Active

This scenario is related to success story: http://wiki.opf-labs.org/display/SP/Detecting+duplicates+on+large+collections+of+digitized+book+pages

Contact

Sven Schlarb

User Story

As a cultural heritage institution, we need a digital preservation system that can identify books within a large digital book collection that contain duplicated book pages and inform us of the pages within those books that are duplicate images.

User Requirements/Components

  1. We need to be able to identify duplicate images
    1. 'Duplicate' needs defining:
      1. Images with different scales are NOT duplicates
      2. Images that have been rotated by any amount, but are otherwise the same ARE duplicates
      3. A similarity metric would perhaps help identify 'fuzzy' duplication
      4. Pages that have artifacts such as hands or folded corners, but are otherwise the same ARE duplicates
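The definition above can be made concrete with a small sketch. This is not how Matchbox works; it is a deliberately naive pixel-difference metric, written only to show how the rules interact. Everything in it is an assumption made for illustration: images are plain nested lists of grayscale values (0-255), the names `similarity`, `best_similarity` and `is_duplicate` are hypothetical, the 0.95 threshold is arbitrary, and rotation is handled only in 90-degree steps rather than "by any amount".

```python
from typing import List

# Assumption: an image is a row-major grid of grayscale values in 0..255.
Image = List[List[int]]


def rotate90(img: Image) -> Image:
    """Rotate an image clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]


def similarity(a: Image, b: Image) -> float:
    """Return 1.0 for identical images, falling towards 0.0 as they differ.

    Images with different dimensions score 0.0, which encodes the rule
    that images with different scales are NOT duplicates.
    """
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return 0.0
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return 1.0 - total / (255.0 * len(a) * len(a[0]))


def best_similarity(a: Image, b: Image) -> float:
    """Best score over the four 90-degree rotations of b."""
    best = 0.0
    candidate = b
    for _ in range(4):
        best = max(best, similarity(a, candidate))
        candidate = rotate90(candidate)
    return best


def is_duplicate(a: Image, b: Image, threshold: float = 0.95) -> bool:
    """Fuzzy duplicate test: small artifacts are tolerated up to the threshold."""
    return best_similarity(a, b) >= threshold
```

A threshold below 1.0 is what allows pages with small artifacts (a hand, a folded corner) to still count as duplicates. A real implementation would need features that survive arbitrary rotation, which a per-pixel comparison does not.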

Experiments

Create experiments as child pages and they should appear automatically here

Executing Matchbox over large scale collection of digitized books to find duplicates in one book (SS)

Executing Matchbox to find duplicates in different representations of the same book, e.g. to identify identical copies and to evaluate whether a newer copy is of better quality than the old one (SS)

Data: ONB digitized books. Not sharable, but some subsets may be.
Workflow: No workflow yet; packages are needed first.
Issues: Missing packages make Matchbox difficult to install; Matchbox performance.

Developer Notes

User Story

As a customer of digitization agencies that hold the high-quality masters, I need a digital preservation system that will enable me to compare digitized images of books and the associated OCR provided by an agency with the digitized images and OCR it provided previously. The aim is to ensure that the new versions are improvements on the previous copies and can therefore replace the older versions as my preservation copy. We do not have access to the high-quality master copies, but we do hold the physical material.

User Requirements/Components

  1. We need to be able to compare digitized images for similarity
  2. We need to be able to assess whether or not an image's quality has improved.
    1. How? Less noise? Higher resolution but otherwise identical (so here a duplicate IS a scaled copy, provided the new one is upscaled?)
  3. We need to determine the similarity (distance) between two OCR outputs.
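One common way to express the distance between two OCR outputs is the Levenshtein (edit) distance, normalised by text length. The sketch below is an illustration under that assumption, not a method prescribed by this scenario; the function names `levenshtein` and `ocr_similarity` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def ocr_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two OCR texts are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Note that a distance only says how far apart two OCR outputs are. Deciding which of the two is the better one needs additional evidence, for example a comparison against a transcription of the physical material.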

Related Documents
