Deduplication

Title
Deduplication
Detailed description Collection owners need a way to easily identify duplicates in a collection. Duplicates are a common and seemingly simple issue, but the fact that the problem is rarely cracked illustrates its complexity. In a collection of several hundred items it may be possible to identify duplicates manually, but in a collection of thousands, how will we know what is there?
Issue champion Jodie Double
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets
Possible Solution approaches Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list
Context Prior to establishing workflows and processes, images from special collections were generated and stored on a shared network drive. The store grew organically and, because of its complexity, objects were re-scanned multiple times. Now that the service is moving to a CMS connected to the repository, cleaning up the store is urgent, but doing it manually will be time-consuming; the alternative is simply to re-scan the files and start over. We want to know what is on the drive and the quality of the information (the latter will be solved by another tool).
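As an illustration of the "what is on the drive" question, the following is a minimal sketch, not a proposed solution from this page. It assumes that a first pass only needs to find byte-identical copies, and the function names (file_digest, find_exact_duplicates) are hypothetical. It groups files by size and then by SHA-256 checksum. Note the limitation: an object that was re-scanned will normally produce a different file, so re-scans and edited versions would not be caught by this approach and would need some form of image similarity comparison instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root):
    """Return groups (lists of paths) of byte-identical files under root."""
    # Cheap first pass: files of different sizes cannot be identical.
    by_size = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)

    # Second pass: hash only the files that share a size with another file.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for p in paths:
            by_digest[file_digest(p)].append(p)

    # Keep only checksums that occur more than once.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Calling find_exact_duplicates on the root of the shared drive would return one list of paths per set of identical files, which could then be reviewed before anything is deleted.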
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Datasets LAVC audio
Leeds image duplicates and versions
Solutions Reference to the appropriate Solution page(s), by hyperlink
Labels: issue, york_hackathon, duplication