Establishing procedures for the effective management of personal collections.
There are a number of issues that make the digital preservation of the collections of individuals particularly difficult. Individuals do not tend to manage their own records in a structured way as organisations do. Typical issues include:
- Massive duplication - both exact duplicates and numerous drafts of documents where there is very little difference in the content
- Descriptive metadata is minimal
- There is little or no sensible organisation of the collection
- Third party copyright is difficult to ascertain
- It is difficult to locate documents that need to be closed under Data Protection legislation
- It is difficult to locate files that are not worthy of permanent preservation and need to be removed
These issues are endemic to personal collections irrespective of format, but the scale of electronic collections and the way they are managed make it difficult to sort and catalogue a collection and to find out about its content (including names, subjects, access rights, etc.). It is therefore difficult for archivists to make the collection accessible to users in a usable way with any confidence. The archive service needs to develop workflows that use available toolkits to automate a number of tasks: selecting files for permanent preservation, creating descriptive metadata, removing duplicates, and creating subjects, names and keywords.
Other interested parties
Possible solution approaches
- Short term investment in staffing to free up time to develop policies and procedures
- Investment in staff development – currently staff do not have the skills or knowledge to utilise technical solutions. Training is required
- Collaboration with IT – knowledge and skills gaps can be partly solved in a cost effective way by working together
- Collaboration with the local HEI and professional community – knowledge and skills gaps can be solved in a cost effective way by working together
- Use toolkits including Tika (with customised wrapper) to extract metadata and text content
- Use Perl to write scripts - the extracted metadata files will be used to compare file names directly (to find duplicates and files with similar file names), to compare checksums, and to look for popular terms within files
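The file name comparison in the last item can be sketched as follows. This is a minimal illustration in Python rather than the Perl used in the project, and it is not the project's actual script. The function names and the list of draft markers ("draft", "final", "copy", "v2", "(1)") are assumptions chosen for the example; a real collection would need its own list.

```python
import re
from collections import defaultdict
from pathlib import PurePath


def normalise_name(path):
    """Reduce a file name to a key that ignores common draft markers."""
    stem = PurePath(path).stem.lower()
    stem = stem.replace("_", " ").replace("-", " ")
    # Hypothetical draft markers; adjust to suit the collection.
    stem = re.sub(r"\b(draft|final|copy|v\d+)\b", " ", stem)
    # Drop counters such as "(1)" added by operating systems on copy.
    stem = re.sub(r"\(\d+\)", " ", stem)
    return " ".join(stem.split())


def group_similar_names(paths):
    """Return {normalised name: [paths]} for names shared by 2+ files."""
    groups = defaultdict(list)
    for path in paths:
        key = normalise_name(path)
        if key:  # skip names that consist only of draft markers
            groups[key].append(path)
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}
```

The result is only a list of candidates for the archivist to check by hand: files with the same normalised name may still differ in content, and the extension is deliberately ignored so that, for example, a Word file and its PDF export are listed together.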
The collection is one of more than 130 paper and electronic deposited collections managed by a small archive team, which is part of the Library team at a medium sized university. It is a specialist library and the archive is the main repository in the UK for its subject area. The archive accepts c. 5-10 new collections a year, ranging in size, and is increasingly approached to accept either hybrid or completely electronic collections. Permanent staffing is 2.6 FTE and the section is responsible for preservation, cataloguing, collection development, outreach, fundraising and digitisation for archives, as well as records management for the organisation. As a result the section requires a systematic workflow to ensure it is capable of accepting electronic archive collections and making them accessible with a small staff. The dataset was accepted in January 2011, before any procedures were in place to accept electronic records. The level of work required to sort, catalogue and make the collection accessible is unsustainable in the long term. We need to automate as much of the sorting work as possible.
Apache Tika is a useful tool for extracting descriptive metadata that can help with cataloguing the collection, particularly for creating collection level descriptions. The extracted metadata can be compiled into an HTML report including details on total extent, covering dates, popular terms (by creating word clouds), names of individuals and organisations, and formats. Because the report lists files underneath the names of individuals and organisations, Archivists can easily locate these records. It has also become apparent that, although we can automate the extraction of this information from the collection, we are reliant on record creators to record this information at the point of record creation.
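As a rough illustration of the aggregation step, the sketch below summarises metadata that has already been extracted, one record per file. It is written in Python and does not call Tika itself. The record keys (`content_type`, `created`, `text`), the stop word list and the function name are assumptions made for the example, not Tika's own field names, and dates are assumed to be ISO 8601 strings so that they sort correctly as text.

```python
import re
from collections import Counter

# A deliberately tiny stop word list, for illustration only.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "from"}


def summarise(records, top_n=5):
    """Build collection level figures from per-file metadata records."""
    formats = Counter(record["content_type"] for record in records)
    dates = sorted(record["created"] for record in records if record.get("created"))
    terms = Counter()
    for record in records:
        words = re.findall(r"[a-z]+", (record.get("text") or "").lower())
        terms.update(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return {
        "total_files": len(records),
        "formats": dict(formats),
        # Earliest and latest recorded dates; None if no file has a date.
        "covering_dates": (dates[0], dates[-1]) if dates else None,
        "popular_terms": terms.most_common(top_n),
    }
```

The returned dictionary could then be rendered as an HTML report or fed to a word cloud generator. Files with no recorded date are left out of the covering dates, which is one place where the reliance on record creators shows up in practice.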
Perl is a useful tool for reporting on the contents of the collection. Finding duplicates (both exact copies and versions of the same document) can be done by using scripts and checksums. This will aid the appraisal and sorting of collections by listing the filepaths of files that are exactly the same or appear to be very similar, which the Archivist can then check manually.
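Finding exact duplicates with checksums can be sketched as below. The example is in Python rather than Perl and is an illustration of the general technique, not the project's script. SHA-256 is an assumption, since the source does not say which checksum algorithm was used, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Return lists of file paths under root that share a checksum."""
    by_checksum = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_checksum[file_checksum(path)].append(str(path))
    return [paths for paths in by_checksum.values() if len(paths) > 1]
```

Each inner list holds the filepaths of files with byte-for-byte identical content, whatever their names. Drafts that differ by even one character will not match, so they still have to be found through file name comparison and manual checking.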
History Workshop Journal - Digital Archive Deposit
Extracting and aggregating metadata with Apache Tika
Using Perl to write scripts for reporting on the content of the collection