Matching equivalent files of different formats
We need to tie together different files across directory structures according to filenames and expected migration formats. We would like to do this on existing data and also be able to use the tool on future archives.
This could run in either a Unix or Windows environment and should produce a report that can be visually checked, as not all file relationships will be programmatically identifiable. An interface that identifies orphaned files and allows relationships to be made manually would therefore be a plus. The final relationships could then be exported as a text file or similar, which we would then import into our collections management system.
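As a rough illustration of the export step, the sketch below writes relationships and orphaned files to a CSV report that could be checked by eye before import. The `write_report` function, the column names, and the shape of the `links` and `orphans` dictionaries are all assumptions made for this example; they are not taken from ReAct or from any collections management system.

```python
import csv
import io


def write_report(links, orphans, out):
    """Write matched, missing and orphaned files as CSV rows.

    links:   {original_path: {role: [related_path, ...]}}  (hypothetical shape)
    orphans: {role: [path_with_no_original, ...]}          (hypothetical shape)
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["original", "role", "related", "status"])
    for original, related in sorted(links.items()):
        for role, paths in sorted(related.items()):
            if paths:
                for path in paths:
                    writer.writerow([original, role, path, "matched"])
            else:
                # No copy found for this role: flag it for manual review
                writer.writerow([original, role, "", "missing"])
    for role, paths in sorted(orphans.items()):
        for path in paths:
            writer.writerow(["", role, path, "orphan"])


# Example with made-up paths
buf = io.StringIO()
write_report(
    {"a/img1.jpg": {"preservation": ["a/img1.tif"], "dissemination": []}},
    {"preservation": ["a/stray.tif"]},
    buf,
)
report_lines = buf.getvalue().splitlines()
```

Rows marked `missing` or `orphan` are the ones a person would need to resolve by hand before the file is imported.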
We currently have the ReAct tool, which was started at a Mashup and completed with the help of a SPRUCE award. It would be good to extend this tool to do more robust relationship identification: currently it matches on filenames, the user has to specify the file extensions to match, and it has problems with non-unique filenames across directories.
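One possible way to approach the non-unique filename problem is to match on the path relative to the collection root rather than on the bare filename, and to flag any names that recur in several directories. The sketch below assumes POSIX-style relative paths; `ambiguous_names` and `match_key` are hypothetical helpers and do not describe how ReAct currently works.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def ambiguous_names(relative_paths):
    """Return bare filenames that occur in more than one directory."""
    dirs_by_name = defaultdict(set)
    for rel in relative_paths:
        p = PurePosixPath(rel)
        dirs_by_name[p.name.lower()].add(str(p.parent))
    return {
        name: sorted(dirs)
        for name, dirs in dirs_by_name.items()
        if len(dirs) > 1
    }


def match_key(relative_path):
    """Key a file by its relative path minus extension, case-insensitively."""
    p = PurePosixPath(relative_path)
    return str(p.with_suffix("")).lower()
```

With this keying, `site_a/plan01.jpg` and `site_b/plan01.jpg` get different keys and are no longer confused, while the ambiguity check still lists them for visual confirmation.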
This is an example of some of the file structures and the file formats that can be involved:
These three folders should contain different versions/formats of the same files taken from the 'original' directory, migrated and placed in preservation and dissemination directories.
The tool should analyse the directories, show each original file in the original directory, and link it to its alternative copies in the preservation and dissemination directories.
Files generally retain the same file names and underlying directory structure (after being grouped by file type) and we can supply a list of usual file conversions from delivery to preservation and dissemination file formats.
These range from the very simple single files:
JPG -> TIF (preservation)
JPG -> JPG (dissemination)
to the more complex multiple file groups:
.shp, .shx, .dbf, .sbn and .sbx, .fbn and .fbx, .ain and .aih, .prj and .xml -> .GML & .XSD (preservation)
.shp, .shx, .dbf, .sbn and .sbx, .fbn and .fbx, .ain and .aih, .prj and .xml -> ZIP (dissemination)
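The conversions listed above could drive the matching directly. The following is a minimal sketch, assuming each file is given as a path relative to its own top-level directory (original, preservation or dissemination) and that copies keep the same relative path and base name. The `EXPECTED` table covers only the two examples given here, and the function names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Illustrative conversion table built from the two examples above:
# primary original extension -> extensions expected in each target directory
EXPECTED = {
    "preservation": {".jpg": {".tif"}, ".shp": {".gml", ".xsd"}},
    "dissemination": {".jpg": {".jpg"}, ".shp": {".zip"}},
}


def group_by_stem(relative_paths):
    """Group extensions under the lower-cased relative path minus extension."""
    groups = defaultdict(set)
    for rel in relative_paths:
        p = PurePosixPath(rel.lower())
        groups[str(p.with_suffix(""))].add(p.suffix)
    return groups


def link_versions(original, preservation, dissemination):
    """Link each original object to its copies; return (links, orphans)."""
    orig = group_by_stem(original)
    targets = {
        "preservation": group_by_stem(preservation),
        "dissemination": group_by_stem(dissemination),
    }
    links = {}
    for stem, exts in orig.items():
        # A multi-file shapefile group is identified by its .shp member
        primary = ".shp" if ".shp" in exts else sorted(exts)[0]
        links[stem] = {}
        for role, groups in targets.items():
            wanted = EXPECTED[role].get(primary, set())
            links[stem][role] = sorted(groups.get(stem, set()) & wanted)
    # Stems present in a target directory but absent from the originals
    orphans = {
        role: sorted(set(groups) - set(orig))
        for role, groups in targets.items()
    }
    return links, orphans
```

An empty list for a role means no expected copy was found, and anything in `orphans` has no corresponding original; both would need visual checking. Names such as `roads.shp.xml` would not group with `roads.shp` under this simple keying and would need special handling.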
The aim is to produce metadata that ties together all the different versions of a single received object. When we come to migrate files (as we are now doing with CAD files), we will then know exactly which files are equivalent copies and therefore which ones need migrating, and can avoid running multiple migrations on what is essentially the same data.
Other interested parties
Peter Cliff, Graham Seaman