Detailed description Collection owners need a way to easily identify duplicates in a collection. Duplicates are a common and seemingly simple issue, but the fact that the problem is rarely cracked illustrates its complexity. In a collection of several hundred items it may be possible to identify duplicates manually, but in a collection of thousands, how will we know what is there?
Issue champion Jodie Double
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets
Possible Solution approaches Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list
Context Prior to establishing workflows and processes, images from special collections were generated and stored on a shared network drive. The files grew organically, and due to the complexity of the stored objects, items were re-scanned multiple times. Now that the service is moving to a CMS connected to the repository, cleaning up the store is urgent. Doing so manually would be time consuming; the alternative is simply to re-scan the files and start over. We want to know what is on the drive and the quality of the information (the latter will be addressed by another tool).
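One low-cost first pass at the drive audit described above (a sketch, not part of any recorded Solution for this Issue) is checksum comparison: walk the share, hash every file, and report any group of paths sharing a digest. The `find_duplicates` helper below is hypothetical; note that it only catches bit-identical copies, so separate re-scans of the same physical object would still need visual or perceptual-hash comparison.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group files under `root` by SHA-256 digest.

    Returns a dict mapping digest -> list of paths, keeping only
    digests shared by two or more files (exact byte-level duplicates).
    """
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large scans don't exhaust memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        by_hash[h.hexdigest()].append(path)
    return {d: paths for d, paths in by_hash.items() if len(paths) > 1}
```

Run against the network share, the resulting groups give a candidate list for manual review before any decision to clean up or re-scan.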
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Datasets LAVC audio
Leeds image duplicates and versions
Solutions Reference to the appropriate Solution page(s), by hyperlink
Labels: issue, york_hackathon, duplication