Title
Content identification and categorisation.
Detailed description
No consistent naming conventions have been applied to the dataset, and the majority of files have no meaningful titles. In some cases names have been given at folder level but not at file level. Copyright ownership of the images cannot be ascertained until they are identified. This makes appraisal and re-use of the images very difficult and time-consuming. We would like a tool that could help with appraisal by identifying what type of document is in the image, e.g. map, photograph or written document. It would need to cope with large sets of data and be relatively simple to operate.
Issue champion
Cassandra Johnson
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets.
Possible Solution approaches
- pattern recognition software
- FITS for metadata extraction and file format evaluation
Maurice de Rooij: There are several online services that offer an API, web service or management tool for recognising the content of images.
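As a rough illustration of the file format evaluation step, the sketch below counts the image formats found under a folder by inspecting the first bytes of each file. It is not FITS and does not reproduce what FITS reports; it only shows the kind of first-pass inventory that could be run over a large, poorly named collection. The function names `sniff_format` and `inventory_formats` and the short signature table are assumptions made for this example, and the table is deliberately incomplete.

```python
from collections import Counter
from pathlib import Path

# Leading "magic" bytes for a few common image formats.
# Illustrative only: this table is an assumption and far from exhaustive.
MAGIC_SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"II*\x00": "TIFF",   # little-endian TIFF
    b"MM\x00*": "TIFF",   # big-endian TIFF
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"BM": "BMP",
}


def sniff_format(header: bytes) -> str:
    """Return a format name based on the leading bytes, or "unknown"."""
    for signature, name in MAGIC_SIGNATURES.items():
        if header.startswith(signature):
            return name
    return "unknown"


def inventory_formats(root: str) -> Counter:
    """Walk a folder tree and count files per sniffed format."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Only the first few bytes are read, so large images stay cheap.
            with path.open("rb") as handle:
                counts[sniff_format(handle.read(16))] += 1
    return counts
```

Calling `inventory_formats` on the top-level folder of the collection would return a count per format, which ignores file names and extensions entirely. Identifying the type of document shown in each image (map, photograph, written document) is a separate problem that this sketch does not address.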
Context
Dorset History Centre (DHC) has digitised parts of its analogue collections over time in a very inconsistent manner, using various equipment and no standard guidelines on metadata or formats. We want to avoid re-digitising material that has already been digitised, but we don't know what we already have. Automating the assessment of the collections is preferable to having a member of staff go through each of the 20,000 images individually.
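One small part of finding out what is already held could be automated by grouping files that are exact copies of each other. The sketch below is a minimal example under that assumption, using SHA-256 checksums from the Python standard library; the function names `file_digest` and `find_duplicates` are hypothetical. It only detects byte-identical files, so two separate scans of the same original would not be matched.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading large images fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are byte-identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Keep only checksums shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each list returned by `find_duplicates` is a set of files that a member of staff would only need to appraise once.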
Lessons Learned
Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Datasets
Dorset History Centre collection of digitised images
Solutions
Image content identification and categorisation solution
File Format Identification and Metadata Extraction using FITS