Produce a report summarising collection metadata and content

Detailed description
Bishopsgate Library has received several digital archive deposits comprising large numbers of files in various formats and with little organisation. When receiving such deposits, it would be useful to be able to produce a report which aggregates and summarises the metadata (e.g. file formats, dates, authors) and content (e.g. common keywords, number of pages) of the files. This information could be used both to produce top-level catalogue records and to inform preservation decisions. It could also be used to identify potential issues, such as duplicate files and problematic file formats.

Any solution would preferably be cross-platform and easy to use, with a simple CLI or GUI front end. The report produced should be comprehensible to non-technical staff.

Issue champion
Thom Carter

Other interested parties
Rebecca Webster

Possible solution approaches

  • Test for duplicate files
  • Extract metadata and textual content of files
  • Analyse metadata and content across the dataset and produce a report
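The first and third steps above can be sketched in a few lines. The following is a minimal illustration only, not the solution developed for this issue: it assumes that byte-identical files count as duplicates (compared by SHA-256 checksum) and that the file extension is an acceptable stand-in for file format. The function names `sha256_of`, `summarise` and `format_report` are hypothetical. Proper format identification and metadata extraction would need a dedicated tool such as Apache Tika.

```python
import hashlib
from collections import Counter, defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def summarise(root):
    """Walk `root` and return (extension counts, groups of identical files)."""
    extensions = Counter()
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Assumption: the extension approximates the file format.
            extensions[path.suffix.lower() or "(none)"] += 1
            by_digest[sha256_of(path)].append(path)
    # Only checksums shared by two or more files indicate duplicates.
    duplicates = [group for group in by_digest.values() if len(group) > 1]
    return extensions, duplicates


def format_report(extensions, duplicates):
    """Render the summary as plain text for non-technical readers."""
    lines = ["Files by extension:"]
    for ext, count in extensions.most_common():
        lines.append(f"  {ext}: {count}")
    lines.append(f"Duplicate groups: {len(duplicates)}")
    for group in duplicates:
        lines.append("  " + ", ".join(str(p) for p in group))
    return "\n".join(lines)
```

Calling `print(format_report(*summarise("path/to/deposit")))` would print the extension counts followed by each group of identical files. Checksums only catch exact copies; files that differ by a single byte are reported as distinct.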

Bishopsgate Library has substantial digital collections, which are continually being added to through both an ongoing digitisation programme and regular deposits of born-digital content. The library does not yet have a formal digital preservation strategy in place, or dedicated storage arrangements for its digital collections.

Lessons learned

  • Apache Tika is a useful tool for extracting metadata and text content from files
  • Perl is a powerful tool for creating reports on the contents of collections
  • It is difficult to test large text documents for similarity with any degree of accuracy
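To illustrate the last point, one common way to compare texts is Jaccard similarity over word "shingles" (overlapping runs of words). The sketch below is an assumption-laden example, not the Perl approach used for this issue; the function names `shingles` and `jaccard` and the default shingle size of 3 are illustrative choices.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Texts shorter than one shingle are treated as a single unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b, size=3):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)
```

Even in this simple form the result depends heavily on the shingle size and on whatever threshold is chosen to call two documents "similar", and comparing every pair of documents grows quadratically with the size of the collection.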

History Workshop Journal Archive - Digital Archive Deposit

Extracting and aggregating metadata with Tika
Using Perl to write scripts to find duplicates and find keywords
