CSV listing of Aggregations of Duplicates in a Dataset
This solution uses a Python 3.3 script to process a tab-delimited checksum file (each line holding a checksum, a tab, and a file path) and create a CSV file that lists every directory found, together with its file count and duplicate file count. This lets a user see at a glance where high concentrations of duplicates exist.
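The core of the approach can be sketched as follows. This is a minimal illustration, not the actual checksums2dups script: it assumes the input format described above (checksum, tab, path per line), counts how often each checksum occurs, and treats any file whose checksum occurs more than once as a duplicate when tallying per-directory totals. The function and column names are chosen for illustration only.

```python
import csv
import os
from collections import Counter


def summarize_duplicates(checksum_tsv, out_csv):
    """Read a tab-delimited manifest (checksum<TAB>path per line) and
    write a CSV of per-directory file counts and duplicate counts."""
    # First pass: record every entry and count occurrences of each checksum.
    entries = []
    occurrences = Counter()
    with open(checksum_tsv, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            checksum, path = line.split("\t", 1)
            entries.append((checksum, path))
            occurrences[checksum] += 1

    # Second pass: aggregate per directory. A file counts as a duplicate
    # if its checksum appears more than once anywhere in the manifest.
    files = Counter()
    dups = Counter()
    for checksum, path in entries:
        directory = os.path.dirname(path)
        files[directory] += 1
        if occurrences[checksum] > 1:
            dups[directory] += 1

    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["directory", "file_count", "duplicate_count"])
        for directory in sorted(files):
            writer.writerow([directory, files[directory], dups[directory]])
```

Note that a two-pass design is needed because a file cannot be classified as a duplicate until the whole manifest has been read.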
Tip: Open the CSV in a spreadsheet and add a column that divides the duplicate count by the file count (a formula of the form =dups field / count field) to show the percentage of duplication in each directory.
Note: The script began as an attempt to generate the JSON data needed to feed a TreeMap visualization, but its scope was reduced to the present solution to fit time constraints.
Example: The Environmental Artists Datasets use case includes an iMac drive. We created a checksum manifest using Jacksum, modified the manifest to be tab-delimited, and then processed it to produce the CSV file described above. The checksum manifest and the result file can be found in iMac_dup.zip.
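The "modified the manifest to be tab-delimited" step can be done in a few lines of Python. This is a hedged sketch, not the conversion actually used for the iMac dataset: it assumes each manifest line begins with the checksum, followed by whitespace and the file path (which may itself contain spaces). The function name is illustrative.

```python
def manifest_to_tsv(manifest_path, tsv_path):
    """Convert a whitespace-separated checksum manifest to the
    tab-delimited checksum<TAB>path format expected by the script.

    Assumes each line starts with the checksum, followed by whitespace
    and the path; splitting only once preserves spaces in the path.
    """
    with open(manifest_path, encoding="utf-8") as src, \
         open(tsv_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue
            checksum, path = line.split(None, 1)
            dst.write(checksum + "\t" + path + "\n")
```

Splitting with a maxsplit of 1 is the key detail: it keeps file paths containing spaces intact in the second column.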
Identifying Aggregations of Duplicates in a Dataset
checksums2dups on GitHub
Tool Registry Link