| *One line summary* | How do we ensure that duplicate data is not archived? |
| *Detailed description* | Duplicate images and data can exist for various reasons: images may be scanned twice or duplicated inadvertently during processing, or the original archive may include duplicate documents. How do we weed these out of our digital archive? |
| *Issue champion* | Toby Atkin-Wright |
| *Possible approaches* | Currently the Brightsolid project ensures that each issue date for each newspaper is unique, so if the metadata is correct, there should be no duplicates. \\
It also checks that each delivered JP2 and ALTO file has a unique SHA256 fingerprint. \\
Suggested enhancements include fuzzy matching of OCR text to compare page content and flag any pages that appear to have similar content. This could be applied just to the headlines throughout the newspaper issues, since the headlines are all manually QCed after OCR and so are the best quality data in the pages. |
| *Context* | |
| *AQuA Solutions* | [AQuA:Perceptual Image Diff comparison]\\
[AQuA:java image blocks comparison]\\
[AQuA:ssdeep for duplicate image detection]\\ |
| *Collections* | [Brightsolid digitisation of British Library newspapers] |
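
The two approaches described under *Possible approaches* (exact SHA256 fingerprints for delivered files, and fuzzy comparison of headline text) could be sketched roughly as follows. This is a minimal illustration and not the Brightsolid implementation: the function names, the in-memory dictionaries of file contents and headlines, and the 0.9 similarity threshold are all assumptions chosen for the example, and the fuzzy comparison uses Python's standard `difflib` rather than any particular OCR tooling.

```python
import hashlib
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def exact_duplicates(files):
    """Group file names whose contents share a SHA-256 digest.

    `files` is a hypothetical mapping of file name -> bytes content.
    Returns a list of name groups, one per duplicated digest.
    """
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


def normalise(headline):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(headline.lower().split())


def similar_headlines(headlines, threshold=0.9):
    """Return pairs of page ids whose headlines are nearly identical.

    `headlines` is a hypothetical mapping of page id -> headline text.
    The threshold is an assumed value and would need tuning on real data.
    """
    pairs = []
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(headlines.items()), 2):
        ratio = SequenceMatcher(None, normalise(text_a), normalise(text_b)).ratio()
        if ratio >= threshold:
            pairs.append((id_a, id_b))
    return pairs
```

For example, `exact_duplicates({"a.jp2": b"x", "b.jp2": b"x", "c.jp2": b"y"})` groups `a.jp2` with `b.jp2`, while `similar_headlines` would pair two pages whose headlines differ only by a single misread character. The pairwise comparison is quadratic in the number of headlines, so a real collection would likely need to compare only within a title or date range.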