User Story

To ensure the long-term survival of a research dataset, the copy we hold must be manageable, contain the relevant data and be in a format that promotes digital preservation. To this end we need a digital preservation system that can extract data from disparate tabular data sources, compile that data into a single output file in a preservation format and verify that the relevant data is present.

The dataset in question is a copy of the UK electoral register, which is deposited at The British Library. The data is submitted by various local authorities in a variety of formats (CSV, XLS, PDF, DOC, ...). For long-term preservation we would like to hold a normalised copy of this data.

User Requirements/Components

  1. We need a tool that can identify and extract the relevant columns from a tabular dataset and store them in a preservation format
    1. MUST support reading text-based comma-separated files
    2. MAY support formats with different (user-specified) separators, such as Tab
    3. MUST support reading MS Excel documents
    4. MAY support other source formats as required
    5. MUST support using column headings to identify relevant content
    6. MAY support a user-specified map to identify relevant columns
    7. MUST output relevant data - details of suitable output format TBD - possibly MIXED, perhaps just a CSV
    8. SHOULD be capable of detecting duplicate rows within a data source and removing them
    9. MUST be repeatable across legacy and new datasets
  2. We need a tool that can verify that the relevant columns and their data exist in the preservation format
    1. MUST ensure that all data identified as relevant by 1 is correctly migrated to (exists in) the output format
    2. (This tool could be the final step in the migration tool (1))
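
As an illustration only, components 1 and 2 could be sketched along the following lines. This is a minimal sketch, not a proposed implementation: it assumes Python and its standard csv module, handles delimited text sources only (MS Excel input would need an additional reader, not shown), and assumes CSV as the output format, which is still TBD above. The function names extract_columns, write_csv and verify, the heading map and the sample rows are all hypothetical.

```python
import csv
import io


def extract_columns(source, wanted, delimiter=",", heading_map=None):
    """Return de-duplicated rows holding only the wanted columns.

    source      -- an open text file (or any iterable of lines)
    wanted      -- output column names, in output order
    delimiter   -- field separator used by the source (requirement 1.2)
    heading_map -- optional {source heading: output name} map (requirement 1.6)
    """
    heading_map = heading_map or {}
    reader = csv.DictReader(source, delimiter=delimiter)
    # Translate each source heading to its output name (unchanged if unmapped).
    renamed = {h: heading_map.get(h, h) for h in (reader.fieldnames or [])}
    missing = [name for name in wanted if name not in renamed.values()]
    if missing:
        raise ValueError(f"source lacks relevant columns: {missing}")

    seen = set()
    rows = []
    for record in reader:
        normalised = {renamed[h]: v for h, v in record.items() if h in renamed}
        # Short rows yield None for absent fields; store those as empty strings.
        row = tuple(normalised[name] or "" for name in wanted)
        if row not in seen:  # drop exact duplicate rows (requirement 1.8)
            seen.add(row)
            rows.append(row)
    return rows


def write_csv(rows, wanted, target):
    """Write a heading line followed by the extracted rows as CSV."""
    writer = csv.writer(target, lineterminator="\n")
    writer.writerow(wanted)
    writer.writerows(rows)


def verify(rows, wanted, written):
    """Check that the output holds exactly the extracted headings and rows."""
    reader = csv.reader(written)
    header = next(reader, None)
    if header != list(wanted):
        return False
    return [tuple(r) for r in reader] == rows


if __name__ == "__main__":
    # Made-up, tab-separated sample data; not taken from any real register.
    sample = (
        "Name\tAddress\tNotes\n"
        "A Voter\t1 High St\tx\n"
        "A Voter\t1 High St\ty\n"
        "B Voter\t2 Low Rd\tz\n"
    )
    columns = ["name", "address"]
    mapping = {"Name": "name", "Address": "address"}
    rows = extract_columns(io.StringIO(sample), columns, "\t", mapping)
    output = io.StringIO()
    write_csv(rows, columns, output)
    output.seek(0)
    print(len(rows), "rows written; verified:", verify(rows, columns, output))
```

In this sketch duplicates are judged on the relevant columns only, so two source rows that differ only in an irrelevant column collapse into one; whether that is the desired reading of requirement 1.8 would need to be agreed. Running verify as the last step of the extraction corresponds to the option noted in 2.2.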



Additional information

A blog post that contains more information about this work can be found here:

Developer Notes

It is possible such a tool would also be very helpful during the final selection process by a researcher:

for example. Granted, this is a pre-preservation activity, but it should make preservation easier.

