CSV listing of Aggregations of Duplicates in a Dataset

Title
CSV listing of Aggregations of Duplicates in a Dataset

Detailed description
This solution uses a Python (3.3) script to process a tab-delimited checksum file (one checksum, a tab, then the file path per line) and produce a CSV file listing every directory found, with a file count and a duplicate file count for each. This lets a user see at a glance where high concentrations of duplicates exist. A sketch of the approach appears below.
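
For illustration, here is a minimal sketch of the approach in Python. It is not the checksums2dups script itself (linked below) and makes two assumptions worth checking against the real tool: every copy of a repeated checksum is counted as a duplicate (counting n-1 copies instead is a one-line change), and counts are rolled up into every ancestor directory so aggregations are visible at any level.

# Sketch: aggregate duplicate counts per directory from a
# tab-delimited checksum manifest (checksum<TAB>path per line).
# The real checksums2dups script may differ in column order and
# roll-up behaviour.
import csv
import sys
import posixpath  # manifest paths assumed to use forward slashes
from collections import Counter

def build_report(manifest_path, report_path):
    entries = []      # (checksum, path) pairs in manifest order
    seen = Counter()  # occurrences of each checksum
    with open(manifest_path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if '\t' not in line:
                continue  # skip blank or malformed lines
            checksum, path = line.split('\t', 1)
            entries.append((checksum, path))
            seen[checksum] += 1

    files = Counter()  # file count per directory, incl. ancestors
    dups = Counter()   # duplicate count per directory
    for checksum, path in entries:
        is_dup = seen[checksum] > 1
        d = posixpath.dirname(path)
        while d:
            files[d] += 1
            if is_dup:
                dups[d] += 1
            parent = posixpath.dirname(d)
            if parent == d:  # reached the root ('/')
                break
            d = parent

    with open(report_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(['directory', 'count', 'dups'])
        for d in sorted(files):
            writer.writerow([d, files[d], dups[d]])

if __name__ == '__main__':
    build_report(sys.argv[1], sys.argv[2])

Invoked as, e.g., python checksum_dups.py manifest.tsv report.csv (a hypothetical filename; the real script's command line may differ).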

Tip: Open the CSV in a spreadsheet and add a column that divides the duplicate count by the file count (e.g., =C2/B2 if column B holds the count and column C the dups, formatted as a percentage) to see the degree of duplication for each directory.

Note: The script started as an attempt to create the JSON data needed to feed a TreeMap visualization, but was reduced in scope to this solution to fit time constraints.

Example: The Environmental Artists Datasets use case includes an iMac drive. We created a checksum manifest using Jacksum, modified the manifest to be tab-delimited, and then processed it with the script to produce the CSV file described above. The checksum manifest and resulting file can be found in iMac_dup.zip.
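
Jacksum's exact output format depends on the options used, so the clean-up step can be as simple as the following hypothetical helper. It assumes each manifest line starts with the checksum, followed by whitespace and then the path; if your Jacksum output includes extra columns (e.g., a file size), adjust the split accordingly.

# Sketch: rewrite a checksum manifest as strict checksum<TAB>path
# lines for the script above. Assumes checksum first, path last.
import sys

def to_tab_delimited(src, dest):
    with open(src, encoding='utf-8') as fin, \
         open(dest, 'w', encoding='utf-8') as fout:
        for line in fin:
            line = line.rstrip('\n')
            if not line:
                continue
            # split on the first whitespace run; the remainder is
            # treated as the path, so paths may contain spaces
            checksum, path = line.split(None, 1)
            fout.write('%s\t%s\n' % (checksum, path))

if __name__ == '__main__':
    to_tab_delimited(sys.argv[1], sys.argv[2])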

Solution Champion
Seth Shaw

Corresponding Issue(s)
Identifying Aggregations of Duplicates in a Dataset

Tool/code link
checksums2dups on GitHub

Tool Registry Link
Add an entry to the OPF Tool Registry, and provide a link to it here.

Evaluation
Any notes or links on how the solution performed.

Labels:
chapel_hill, solution, de-duplication