Identifying Aggregations of Duplicates in a Dataset
I am an archivist and my collection consists of multiple computers or handheld media. I would like to determine whether there are duplicates and the degree of duplication across the media in the collection. The degree of duplication is important for the following reasons:
- My archive has limited staff resources available to process this born-digital archive. Knowing the degree of duplication across the media in the collection will allow me to determine where to target those limited resources.
- The collection creator wants to sell or give his/her collection to my archive. How much of what they are offering is duplicative?
- I'm a researcher and I have limited time to work with a dataset. What is unique and what is duplicative?
Note: File deduplication is a well-known space-saving technique in production use by many storage tools and as stand-alone utilities. This case differs in that we are interested in identifying clusters of duplicated files for appraisal, prioritization, or identification of relationships between groups of materials. None of the deduplication tools we identified visualize the locations and prevalence of duplication. The visualization is the heart of this use case.
Other interested parties
Digital Archivists, researchers, curatorial staff negotiating acquisitions
Possible Solution approaches
- Identification of duplication:
- compare filenames, parent folder names, and sizes for duplicates
- fuzzy hashes (to also catch close matches and file versions)
- treemap visualization of the degree of duplication across the datasets using degrees of shading
- Venn-diagram of aggregates
- (update 9:45am June 5) We have had difficulty implementing a visualization and have decided, as an intermediate step, to deliver a .csv output that lists each directory, its file count, and its duplicate count. This can serve as a source set for testing other visualizations. See the Solutions list below.
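A minimal sketch of that intermediate .csv deliverable, assuming SHA-256 checksums and a single collection root directory (the function name and column headers are illustrative, not part of any existing tool):

```python
import csv
import hashlib
from collections import Counter
from pathlib import Path

def file_hash(path, chunk_size=65536):
    """SHA-256 of a file, read in chunks to handle large media files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def duplication_report(root, out_csv):
    """Write one CSV row per directory: path, file count, duplicate count.

    A file counts as a duplicate if its checksum appears more than once
    anywhere under root, i.e. it belongs to a duplicate cluster.
    """
    root = Path(root)
    hashes = {p: file_hash(p) for p in root.rglob("*") if p.is_file()}
    counts = Counter(hashes.values())
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["directory", "file_count", "duplicate_count"])
        for d in sorted({p.parent for p in hashes}):
            files = [p for p in hashes if p.parent == d]
            dupes = [p for p in files if counts[hashes[p]] > 1]
            writer.writerow([str(d), len(files), len(dupes)])
```

Counting cluster membership globally (rather than per directory) is what lets the report show duplication *across* media: a directory whose files all reappear elsewhere will show a duplicate count equal to its file count.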
Possible Treemap tools:
- DrasticData (limited hierarchy available; won't support a deep directory tree)
- TreeMapper (commercial tool)
- D3.js (example only displays the top level)
Draft base workflow:
- Generate checksum list of files
- Process checksum list into JSON file-tree structure with node variables:
- load into visualization
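The first two steps of the draft workflow above could be sketched as follows, assuming the checksum list is available as (path, checksum) pairs and that the target is the nested {"name", "children"} JSON shape that D3-style treemaps commonly consume (the function name and leaf fields are assumptions):

```python
def checksums_to_tree(checksum_rows):
    """Fold (path, checksum) rows into a nested file-tree dict.

    Each directory becomes {"name": ..., "children": [...]}; each file
    becomes a leaf carrying its checksum and a unit "value" so that a
    treemap can size cells by file count.
    """
    root = {"name": "root", "children": []}
    for path, checksum in checksum_rows:
        node = root
        parts = path.strip("/").split("/")
        for part in parts[:-1]:
            # Reuse an existing directory node or create it.
            child = next((c for c in node["children"] if c["name"] == part), None)
            if child is None:
                child = {"name": part, "children": []}
                node["children"].append(child)
            node = child
        node["children"].append({"name": parts[-1], "checksum": checksum, "value": 1})
    return root
```

The resulting dict can be serialized with `json.dump` and loaded by the visualization layer; a duplicate count per node could be added in a second pass by tallying repeated checksums, mirroring the .csv deliverable.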
Will allow archivists and collection curators to determine the degree of duplication within and across datasets (computers, handheld media) and generate reports on duplication.
Visualization of the duplication across the dataset will be useful for an archivist in determining how to apply their limited processing resources.
Researchers might be interested in the degree of duplication.
Notes on lessons learned from tackling this issue that might inform digital preservation best practice
Environmental Artists Datasets