Produce a report summarising collection metadata and content
Detailed description
Bishopsgate Library has received several digital archive deposits comprising large numbers of files in various formats and with little organisation. When receiving such deposits, it would be useful to be able to produce a report which aggregates and summarises the metadata (e.g. file formats, dates, authors) and content (e.g. common keywords, number of pages) of the files. This information could be used both to produce top-level catalogue records and to inform preservation decisions. It could also be used to identify potential issues, such as duplicate files and problematic file formats.
Any solution should preferably be cross-platform and easy to use, with a simple CLI or GUI front end. The report produced should be comprehensible to non-technical staff.
Issue champion
Thom Carter
Other interested parties
Rebecca Webster
Possible solution approaches
- Test for duplicate files, e.g. by comparing checksums (see the sketch after this list)
- Extract metadata and textual content of files
- Analyse metadata and content across the dataset and produce a report
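A minimal sketch of the duplicate-file check, assuming the deposit sits under a single directory passed on the command line: it walks the tree with Perl's core File::Find and Digest::SHA modules and flags any files whose SHA-256 checksums match an earlier file.

```perl
#!/usr/bin/perl
# Rough duplicate check: walk a deposit directory and report files whose
# SHA-256 checksums collide with a file seen earlier. Both modules are core Perl.
use strict;
use warnings;
use File::Find;
use Digest::SHA;

my $root = shift @ARGV or die "Usage: $0 <deposit-directory>\n";
my %seen;    # checksum => first path seen with that checksum

find({ no_chdir => 1, wanted => sub {
    return unless -f $_;                # files only
    my $sha = Digest::SHA->new(256);
    $sha->addfile($_, "b");             # stream the file from disk in binary mode
    my $digest = $sha->hexdigest;
    if (exists $seen{$digest}) {
        print "DUPLICATE: $_ == $seen{$digest}\n";
    } else {
        $seen{$digest} = $_;
    }
}}, $root);
```

Checksum comparison only catches byte-for-byte duplicates; near-duplicates (for example the same document saved in two formats) would still need the content-level analysis described in the other approaches.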
Context
Bishopsgate Library has substantial digital collections, which are continually being added to through both an ongoing digitisation programme and regular deposits of born-digital content. The library does not yet have a formal digital preservation strategy in place, or dedicated storage arrangements for its digital collections.
Lessons Learned
- Apache Tika is a useful tool for extracting metadata and text content from files (see the sketch after this list)
- Perl is a powerful tool for creating reports on the contents of collections
- It is difficult to test large text documents for similarity with any degree of accuracy
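As an illustration of the Tika lesson above, the sketch below shells out to the Tika app's command-line interface for each file and tallies the Content-Type values seen across a deposit. The location of tika-app.jar, java being on the PATH, and a Unix-like shell for the stderr redirect are all assumptions; the -m flag is Tika's metadata-only output.

```perl
#!/usr/bin/perl
# Count file formats across a deposit by asking the Tika app for metadata.
use strict;
use warnings;
use File::Find;

my $tika = 'tika-app.jar';              # assumed path to the Tika CLI jar
my $root = shift @ARGV or die "Usage: $0 <deposit-directory>\n";
my %formats;                            # Content-Type => number of files

find({ no_chdir => 1, wanted => sub {
    return unless -f $_;
    # -m prints the file's metadata as "Key: value" lines on standard output.
    # (Filenames containing shell metacharacters would need proper quoting.)
    my @metadata = qx(java -jar $tika -m "$_" 2>/dev/null);
    for my $line (@metadata) {
        $formats{$1}++ if $line =~ /^Content-Type:\s*(.+?)\s*$/;
    }
}}, $root);

print "$formats{$_}\t$_\n"
    for sort { $formats{$b} <=> $formats{$a} } keys %formats;
```

Starting a JVM per file is slow on large deposits; Tika also offers batch and server modes that avoid that, but the above keeps the sketch simple.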
Datasets
History Workshop Journal Archive - Digital Archive Deposit
Solutions
Extracting and aggregating metadata with Tika
Using Perl to write scripts to find duplicates and keywords
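A rough sketch of the keyword side of that solution, assuming the textual content has already been extracted to .txt files (for example with the Tika app's -t option): it counts word frequencies across the extracted text, skips a small, purely illustrative stop-word list, and prints the most common terms as a tab-separated report.

```perl
#!/usr/bin/perl
# Crude keyword report over a directory of previously extracted .txt files.
use strict;
use warnings;
use File::Find;

my $root = shift @ARGV or die "Usage: $0 <text-directory>\n";
my %count;
my %stop = map { $_ => 1 }
    qw(the a an and or of to in is it for on with that this as be are was);

find({ no_chdir => 1, wanted => sub {
    return unless -f $_ && /\.txt$/i;
    open my $fh, '<', $_ or return;     # skip unreadable files
    while (my $line = <$fh>) {
        for my $word ($line =~ /([a-z]{4,})/gi) {   # words of four letters or more
            $word = lc $word;
            $count{$word}++ unless $stop{$word};
        }
    }
    close $fh;
}}, $root);

# Print the 25 most frequent terms as a simple tab-separated report.
my @top = (sort { $count{$b} <=> $count{$a} } keys %count)[0 .. 24];
printf "%s\t%d\n", $_, $count{$_} for grep { defined } @top;
```

The resulting term list is only a starting point for catalogue keywords; a fuller report would combine it with the aggregated Tika metadata and the duplicate listing.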