Matching equivalent files of different formats

Version 7 by Jo Gilham
on Jul 16, 2013 18:15.

compared with
Current by Jo Gilham
on Jul 16, 2013 18:32.


*Detailed description*

We need a tool that can tie together these different files across the directory structures according to filenames and expected migration formats. We would like to do this on existing data and also be able to use the tool on future archives.

This could happen in either a Unix or Windows environment and should produce a report that can be visually checked, as not all file relationships will be programmatically identifiable. An interface that identified orphaned files and allowed relationships to be made by hand would therefore be a plus. The final relationships could then be exported as a text file or similar, which we would then import into our collections management system.
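As a rough illustration of the report and export step, the sketch below splits candidate matches into related sets and orphans, then writes the related sets as CSV. The function names, the `group,path` column layout and the sample paths are all hypothetical, and the format our collections management system would actually import is not specified here.

```python
import csv
import io


def build_report(paths_by_key):
    """Split match groups into related sets and orphaned files.

    paths_by_key maps a (hypothetical) match key to the list of
    file paths that share that key.
    """
    related = {k: sorted(v) for k, v in paths_by_key.items() if len(v) > 1}
    orphans = sorted(v[0] for v in paths_by_key.values() if len(v) == 1)
    return related, orphans


def export_relationships(related, out):
    """Write one CSV row per file: the group key, then the file path."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["group", "path"])  # assumed column layout
    for key in sorted(related):
        for path in related[key]:
            writer.writerow([key, path])


# Illustrative data only; these paths are invented for the example.
groups = {
    "site1/plan": ["dwg/site1/plan.dwg", "dxf/site1/plan.dxf"],
    "site1/notes": ["doc/site1/notes.doc"],
}
related, orphans = build_report(groups)
buffer = io.StringIO()
export_relationships(related, buffer)
print(buffer.getvalue())
print("Orphans to check by hand:", orphans)
```

Files that end up alone in a group are listed as orphans rather than discarded, so they can be reviewed and related manually.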

We currently have the ReAct tool, which was started at a similar mashup and completed with the help of a SPRUCE award. It would be good to extend this tool to do more robust relationship identification. Currently it matches on filenames, the user specifies the file extensions to match, and it has problems with filenames that are not unique across directories.
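One possible way around non-unique filenames is to match on the path below the format directory rather than on the bare filename. The sketch below is not how ReAct works. It assumes each file sits under a top-level directory named for its format (for example `dwg/` or `dxf/`) and that the directory structure beneath it is preserved; `match_key` and `group_by_key` are hypothetical names.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def match_key(path):
    """Return the path below the top-level format directory, minus extension."""
    parts = PurePosixPath(path).parts
    if len(parts) < 2:
        # No format directory to strip, so fall back to the bare stem.
        return PurePosixPath(path).stem
    relative = PurePosixPath(*parts[1:])
    return str(relative.with_suffix(""))


def group_by_key(paths):
    """Group file paths that share the same match key."""
    groups = defaultdict(list)
    for path in paths:
        groups[match_key(path)].append(path)
    return dict(groups)


# Two different files called "plan.dwg" no longer collide.
example = group_by_key([
    "dwg/site1/plan.dwg",
    "dxf/site1/plan.dxf",
    "dwg/site2/plan.dwg",
])
for key, members in sorted(example.items()):
    print(key, members)
```

Keying on the relative path keeps `site1/plan` and `site2/plan` apart while still pairing the DWG and DXF copies of the same drawing.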

This is an example of some of the file structures and the file formats that can be involved:



Files generally retain the same file names and underlying directory structure (after being grouped by file type) and we can supply a list of usual file conversions from delivery to preservation and dissemination file formats.

.shp, .shx, .dbf, .sbn and .sbx, .fbn and .fbx, .ain and .aih, .prj and .xml *\-> ZIP* (dissemination)
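A conversion rule like this one, where many component files map to a single dissemination file, could be expressed as a lookup. The sketch below assumes the ZIP keeps the shapefile's base name and that metadata may be named like `roads.shp.xml`; both are assumptions, and the function names are hypothetical. Directories are ignored here for brevity.

```python
from pathlib import PurePosixPath

# Component extensions listed above for one shapefile dataset.
SHAPEFILE_EXTENSIONS = {
    ".shp", ".shx", ".dbf", ".sbn", ".sbx",
    ".fbn", ".fbx", ".ain", ".aih", ".prj", ".xml",
}


def expected_zip_name(path):
    """Return the ZIP name a shapefile component should map to, or None."""
    p = PurePosixPath(path)
    suffix = p.suffix.lower()
    if suffix not in SHAPEFILE_EXTENSIONS:
        return None
    stem = p.stem
    # Assumption: metadata may be named "name.shp.xml"; drop the inner ".shp".
    if suffix == ".xml" and stem.lower().endswith(".shp"):
        stem = stem[:-4]
    return stem + ".zip"


def components_for_zip(paths):
    """Group shapefile component paths under their expected ZIP name."""
    groups = {}
    for path in paths:
        name = expected_zip_name(path)
        if name is not None:
            groups.setdefault(name, []).append(path)
    return groups


print(components_for_zip([
    "shp/roads.shp", "shp/roads.dbf", "shp/roads.shp.xml", "shp/readme.txt",
]))
```

Because `.xml` and `.dbf` are also used outside shapefiles, a fuller version would probably only apply this rule when a matching `.shp` file is present in the same directory.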

Ideally the solution would be Windows based, but it could be Unix based.
The aim is to produce metadata that ties together all the different versions of a single received object. When we come to migrate files (as we are now doing with CAD files), we will then know exactly which files are equivalent copies and therefore which ones we need to migrate, so that we avoid doing multiple file migrations on what is essentially the same data.

*Issue champion*