Identifying Aggregations of Duplicates in a Dataset

Detailed description
I am an archivist and my collection consists of multiple computers or handheld media. I would like to determine whether there are duplicates, and the degree of duplication, across the media in the collection. The degree of duplication is important for the following reasons:

  • My archive has limited staff resources available to process this born-digital archive. Knowing the degree of duplication across the media in the collection will allow me to determine where I should target those resources.
  • The collection creator wants to sell or give his/her collection to my archive.  How much of what they are offering is duplicative?
  • I'm a researcher and I have limited time to work with a dataset.  What is unique and what is duplicative?

Note: File deduplication is a well-known space-saving technique, used in production by many storage tools and available as stand-alone utilities. This case differs in that we are interested in identifying clusters of duplicated files for appraisal, prioritization, or identifying relationships between groups of materials. None of the deduplication tools we identified visualized the locations and prevalence of duplication. The visualization is the heart of this use case.

Issue champions

Heather Gendron
Seth Shaw
Meg Tuomala
Michael Olson

Other interested parties
Digital Archivists, researchers, curatorial staff negotiating acquisitions

Possible Solution approaches

  • Identification of duplication:
    1. checksums
    2. compare filenames, parent folder names, and sizes for duplicates
    3. fuzzy hashes (also catches close matches and versions of files)
  • Visualization
    • treemap visualization of the degree of duplication across the datasets, using degrees of shading
    • Venn diagram of aggregates
    • directory tree / node map
    • (update 9:45am June 5) We have had difficulty implementing a visualization and have decided, as an intermediate step, to deliver a .csv output that lists the directories, file count, and duplicate count. This can serve as a source set for testing other visualizations. See the Solutions list below.
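
As an illustration of the checksum approach and the intermediate .csv deliverable, here is a minimal Python sketch. It is not the code the group produced. The function names, the choice of SHA-256, and the column names (directory, file_count, dup_count) are assumptions made for this example. A file is counted as a duplicate when its checksum occurs more than once anywhere under the scanned root.

```python
import csv
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_index(root):
    """Map each checksum to the list of file paths under root that share it."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index[sha256_of(path)].append(path)
    return index


def directory_report(index):
    """Return {directory: [file_count, dup_count]} for directly contained files."""
    report = defaultdict(lambda: [0, 0])
    for paths in index.values():
        for path in paths:
            counts = report[os.path.dirname(path)]
            counts[0] += 1
            if len(paths) > 1:  # checksum seen more than once: a duplicate
                counts[1] += 1
    return dict(report)


def write_report(report, csv_path):
    """Write one CSV row per directory (hypothetical column names)."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["directory", "file_count", "dup_count"])
        for directory in sorted(report):
            writer.writerow([directory, *report[directory]])
```

Calling write_report(directory_report(checksum_index("/path/to/media")), "duplicates.csv") would produce the listing; both paths here are placeholders. Checksums only catch byte-identical files, so near-duplicates and edited versions would still need the filename comparison or fuzzy-hash approaches listed above.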

Possible Treemap tools:

Draft base workflow:

  1. Generate checksum list of files
  2. Process checksum list into JSON file-tree structure with node variables:
    • name
    • file-count
    • dup-count
    • dup-locations[]
  3. Load into visualization
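
Step 2 of the workflow above could be sketched as follows. This is a hedged illustration, not the group's implementation. It assumes the checksum list is already available as a mapping from checksum to relative, "/"-separated file paths, and it adds a children list to each node (not in the draft's node variables) so the tree can nest. Counts cover only the files directly inside each directory; a visualization layer could sum them upward if needed.

```python
import json
import posixpath


def new_node(name):
    """Create a directory node with the node variables from the draft workflow."""
    return {"name": name, "file-count": 0, "dup-count": 0,
            "dup-locations": [], "children": []}


def build_tree(index, root_name="collection"):
    """Build a nested directory tree from {checksum: [relative paths]}."""
    root = new_node(root_name)
    nodes = {"": root}  # directory path -> node; "" is the root

    def node_for(directory):
        # Create the node (and any missing ancestors) on first use.
        if directory not in nodes:
            parent, name = posixpath.split(directory)
            node = new_node(name)
            node_for(parent)["children"].append(node)
            nodes[directory] = node
        return nodes[directory]

    for paths in index.values():
        directories = [posixpath.dirname(p) for p in paths]
        for directory in directories:
            node = node_for(directory)
            node["file-count"] += 1
            if len(paths) > 1:
                node["dup-count"] += 1
                # Record the other directories that hold a copy of this file.
                for other in directories:
                    if other != directory and other not in node["dup-locations"]:
                        node["dup-locations"].append(other)
    return root


if __name__ == "__main__":
    # Hypothetical input: "c1" is shared by two files, "c2" is unique.
    sample = {"c1": ["a/1.txt", "b/sub/3.txt"], "c2": ["a/2.txt"]}
    print(json.dumps(build_tree(sample), indent=2))
```

The resulting JSON can then be handed to whichever treemap or node-map tool is chosen for step 3.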


This will allow archivists and collection curators to determine the degree of duplication within and across datasets (computers, handheld media) and to generate reports on duplication.

A visualization of the duplication across the dataset will help an archivist decide how to apply limited processing resources.

Researchers might be interested in the degree of duplication.

Lessons Learned
Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice

Environmental Artists Datasets

Program on Public Life administrative records and director email

CSV listing of Aggregations of Duplicates in a Dataset
