Using Perl scripts to find duplicates and report on the content of the collection.
Perl was used to write scripts that use the metadata extracted with Apache Tika to help locate duplicates and different versions of the same document:
- Searching for keywords in filenames - this can help find duplicates or different versions of the same document across the collection, speeding up appraisal. The search produces a list of the filepath of each matching file; the archivist can then work through this list and check whether each file should be retained or destroyed. This greatly speeds up appraisal work: to date, finding duplicates has largely been done from memory, with archivists having to open every folder or even every file. The filepaths can also reveal where whole folders have potentially been duplicated. The archivists do need a list of search terms, but this can be compiled through an initial survey of the collection.
- Searching for keywords in the text - once we have an idea of the important terms for the collection, a script can search for those terms within the files, producing a list of all matching files with their filepaths.
- Using checksums - checksums are normally used to detect whether a file has been changed, but we have used them to find exact duplicates in the collection by comparing file content. The script produces a list of all files sharing the same checksum, with their filepaths, so archivists can locate the duplicates, check them and remove them. There is potential to extend this task so that confirmed duplicates are deleted automatically, but at this stage it was decided that files should be checked manually.
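As an illustration of the filename search, here is a minimal Perl sketch. The term list (`report`, `minutes`, `draft`) is a hypothetical example of what an initial survey might produce, and the `filename_matches` helper is ours for illustration; the project's actual scripts may differ:

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical search terms compiled from an initial survey of the collection.
my @terms = ('report', 'minutes', 'draft');

# Returns true if a filename contains any of the search terms (case-insensitive).
sub filename_matches {
    my ($name, $terms) = @_;
    my $lc = lc $name;
    return grep { index($lc, lc $_) >= 0 } @$terms;
}

# Walk the collection root given on the command line and print the full
# path of every file whose name matches a term, for the archivist to review.
if (my $root = shift @ARGV) {
    find(sub {
        print "$File::Find::name\n" if -f && filename_matches($_, \@terms);
    }, $root);
}
```

Run as `perl find_by_name.pl /path/to/collection` to get the review list on standard output.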
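The full-text keyword search can be sketched the same way. This version reads plain-text files directly for simplicity, whereas in our workflow the text came from the Apache Tika extraction step; the example terms and the `terms_in_text` helper are assumptions for illustration:

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical terms identified as important for the collection.
my @terms = ('annual general meeting', 'pension scheme');

# Returns the list of search terms found in a block of text (case-insensitive).
sub terms_in_text {
    my ($text, $terms) = @_;
    my $lc = lc $text;
    return grep { index($lc, lc $_) >= 0 } @$terms;
}

# Walk the collection root and report each matching file's path together
# with the terms it contains.
if (my $root = shift @ARGV) {
    find(sub {
        return unless -f && -T;          # plain-text files only
        open my $fh, '<', $_ or return;
        local $/;                         # slurp the whole file
        my @hits = terms_in_text(scalar <$fh>, \@terms);
        print "$File::Find::name: @hits\n" if @hits;
    }, $root);
}
```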
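The checksum grouping described above can be sketched as follows, using MD5 from Perl's core `Digest::MD5` module (any checksum algorithm would serve; MD5 here is one possible choice, not necessarily the one the project used). Files are grouped by checksum and only groups with more than one path are printed, giving the archivist a list of exact duplicates to check:

```perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Compute the MD5 checksum of a file's content.
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    return Digest::MD5->new->addfile($fh)->hexdigest;
}

# Group every file under the root by checksum and print each group that
# contains more than one filepath - these are exact duplicates.
if (my $root = shift @ARGV) {
    my %by_sum;
    find({ no_chdir => 1, wanted => sub {
        push @{ $by_sum{ file_md5($_) } }, $_ if -f;
    }}, $root);
    for my $sum (sort keys %by_sum) {
        my $paths = $by_sum{$sum};
        next unless @$paths > 1;
        print "$sum\n";
        print "  $_\n" for @$paths;
    }
}
```

Because deletion is deliberately left to the archivist, the script only reports; nothing is removed automatically.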
Produce a report summarising collection metadata and content
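One possible shape for such a report, sketched as an assumption rather than the project's actual reporting script: tally the file count and total size per file extension across the collection, which gives a quick content profile for appraisal planning:

```perl
use strict;
use warnings;
use File::Find;

# Tally file count and total bytes per extension under the given root.
sub summarise {
    my ($root) = @_;
    my %stats;
    find({ no_chdir => 1, wanted => sub {
        return unless -f;
        my ($ext) = $_ =~ /\.([^.\/]+)$/;
        $ext = defined $ext ? lc $ext : '(none)';
        $stats{$ext}{count}++;
        $stats{$ext}{bytes} += -s $_;
    }}, $root);
    return \%stats;
}

# Print the summary as a simple fixed-width table.
if (my $root = shift @ARGV) {
    my $stats = summarise($root);
    printf "%-12s %8s %14s\n", 'Extension', 'Files', 'Bytes';
    for my $ext (sort keys %$stats) {
        printf "%-12s %8d %14d\n", $ext,
            $stats->{$ext}{count}, $stats->{$ext}{bytes};
    }
}
```

A richer report would fold in the Tika metadata (format, author, dates) rather than just extensions; this sketch shows only the overall structure.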
Tool Registry Link