Identification of file formats with incorrect file extensions

Title Identification of file formats with incorrect file extensions
Detailed description Electronic documents and image files are deposited on a variety of media, including floppy discs, CD-Rs and memory sticks. During copying, file extensions can be lost, or a full stop in a file name can cause everything after it to be read as the file extension. Either way the correct file association is lost and the files become unreadable. It is time-consuming to establish whether an unreadable extension belongs to a genuine but unusual file type or is simply incorrect. The process currently used to identify file types and access content is slow and hit-and-miss, and it may affect the authenticity and integrity of the files.
Issue champion Hannah Green
Other interested parties Rebecca Nielsen 
Richard Freeston
Possible Solution approaches
  • use Apache Tika to identify file types and extract metadata
  • develop a script that runs over a whole directory of files at once, so that large quantities of files can be identified quickly
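The second approach can be sketched roughly as follows. This is a minimal illustration, not the Tika-based solution itself: it stands in for Tika with a tiny, hand-picked table of well-known leading byte signatures, and the names SIGNATURES, detect_by_signature and identify_directory are hypothetical. It opens files read-only, so the originals are not altered, and it reports each file's current extension next to the format suggested by its content, which makes lost or misread extensions easy to spot.

```python
import os

# Illustrative stand-in for Tika: a few well-known leading byte signatures.
# This table is an assumption for the sketch and is far from complete.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),
    (b"MM\x00*", "TIFF"),
    (b"PK\x03\x04", "ZIP container (also used by DOCX, ODT and similar)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy Office)"),
    (b"{\\rtf", "RTF"),
]


def detect_by_signature(path):
    """Return a format name from the file's leading bytes, or None if unrecognised."""
    with open(path, "rb") as handle:  # read-only: the file is never modified
        header = handle.read(16)
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


def identify_directory(root, detect=detect_by_signature):
    """Walk root and return (relative path, extension, detected format) rows.

    The detect argument is any callable taking a file path, so a
    Tika-backed detector could be passed in instead of the default.
    """
    rows = []
    for folder, _subdirs, files in os.walk(root):
        for filename in sorted(files):
            full = os.path.join(folder, filename)
            # splitext treats everything after the last full stop as the
            # extension, which reproduces the misreading described above.
            extension = os.path.splitext(filename)[1].lower()
            rows.append((os.path.relpath(full, root), extension, detect(full)))
    return rows
```

Calling identify_directory on the top-level folder of a deposit returns one row per file. A row whose extension is empty or unfamiliar but whose detected format is known is a candidate for a lost or misread extension, while a row with no detected format still needs manual investigation or a fuller signature set.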
Context Details of the institutional context to the Issue. (May be expanded at a later date)
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Datasets Seven Stories author & illustrator files
Solutions Tika Batch File Identification
Labels issue, spruce, spruce_glasgow, identification, unknown_file_formats