Finding duplicate images

One line summary How do we ensure that duplicate data is not archived?
Detailed description Duplicate images and data can exist for various reasons. Images may be scanned twice or duplicated inadvertently during processing, or the original archive may itself include duplicate documents. How do we weed these out of our digital archive?
Issue champion Toby Atkin-Wright
Possible approaches Currently the Brightsolid project ensures that each issue date for each newspaper is unique, so if the metadata is correct, there should be no duplicates.
It also checks that each of the delivered JP2 and ALTO files has a unique SHA256 fingerprint.
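The fingerprint check described above can be illustrated with a minimal sketch. This is not the Brightsolid project's actual code; the function names, the directory layout, and the assumption that ALTO files carry an `.xml` extension are all hypothetical. It groups files by SHA256 digest and reports any digest shared by more than one file.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(root, suffixes=(".jp2", ".xml")):
    """Map each digest seen more than once to the list of files sharing it.

    Assumption: delivered images end in .jp2 and ALTO files end in .xml.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            by_digest[sha256_of_file(path)].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Note that a check of this kind only catches byte-identical files. Two scans of the same page will almost always differ at the byte level, so they would pass it.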
Suggested enhancements include using fuzzy comparison of OCR text to match any pages that appear to have similar content. This could be applied just to the headlines throughout the newspaper issues. (The headlines are all manually QCed after OCR, so are the best quality data in the pages.)
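One way the headline comparison might work is sketched below, using the `difflib` module from the Python standard library. The input structure (a mapping from page identifier to that page's headlines), the function names, and the 0.9 threshold are assumptions made for illustration; a real threshold would need tuning against known duplicates.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalise(text):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(cleaned.split())


def page_signature(headlines):
    """Join a page's headlines into one normalised string for comparison."""
    return normalise(" ".join(headlines))


def similar_pages(pages, threshold=0.9):
    """Return (page_a, page_b, score) for page pairs whose headlines look alike.

    pages: hypothetical mapping of page identifier -> list of headline strings.
    threshold: assumed cut-off on the SequenceMatcher ratio (0.0 to 1.0).
    """
    signatures = {page_id: page_signature(h) for page_id, h in pages.items()}
    matches = []
    for (id_a, sig_a), (id_b, sig_b) in combinations(sorted(signatures.items()), 2):
        score = SequenceMatcher(None, sig_a, sig_b).ratio()
        if score >= threshold:
            matches.append((id_a, id_b, score))
    return matches
```

Comparing every pair of pages is quadratic in the number of pages, so for a large collection the candidates would need narrowing first, for example by comparing only pages from the same newspaper title.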
Context  
AQuA Solutions Perceptual Image Diff comparison
java image blocks comparison
ssdeep for duplicate image detection
Collections Brightsolid digitisation of British Library newspapers
Labels: image, issue, duplication