h2. User Story

To ensure the long-term survival of a research dataset, the copy we hold must be manageable, contain the relevant data and be in a format that promotes digital preservation. To this end we need a digital preservation system that can extract data from disparate tabular data sources, compile that data into a single output file in a preservation format and verify that the relevant data is present.

The dataset in question is a copy of the UK electoral register, which is deposited at The British Library. The data is submitted by various local authorities in a variety of formats (CSV, XLS, PDF, DOC, ...). For long-term preservation we would like to hold a normalised copy of this data.

h2. User Requirements/Components

# We need a tool that can identify the relevant columns in a tabular dataset, extract them and store them in a preservation format
## MUST support reading text-based comma-separated files
## MAY support formats with different (user-specified) separators, such as Tab
## MUST support reading MS Excel documents
## MAY support other source formats as required
## MUST support using column headings to identify relevant content
## MAY support a user-specified map to identify relevant columns
## MUST output relevant data - details of a suitable output format are TBD (possibly MIXED, perhaps just CSV)
## SHOULD be capable of detecting and removing duplicate rows within a data source
## MUST be repeatable across legacy and new datasets
# We need a tool that can verify that the relevant columns and their data exist in the preservation format
## MUST ensure that all data identified as relevant by tool 1 is correctly migrated to (i.e. exists in) the output format
## (This tool could be the final step of the migration tool described in 1)
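
The two requirements above can be illustrated with a minimal sketch. It is not a definitive implementation and makes several assumptions: the output format is plain CSV (the real format is still TBD), the relevant headings in @RELEVANT@ are hypothetical placeholders, headings are matched case-insensitively, and the function names @extract_relevant@, @write_csv@ and @verify@ are invented for illustration. Only delimited text input is covered; Excel and other source formats would need a separate reader.

```python
import csv
import io

# Hypothetical list of relevant headings; the real list is not defined in this page.
RELEVANT = ["name", "address"]


def extract_relevant(source, relevant, delimiter=","):
    """Return (header, rows) holding only the relevant columns, duplicates removed."""
    reader = csv.reader(source, delimiter=delimiter)
    header = next(reader)
    # Map each normalised heading to its column position.
    index = {h.strip().lower(): i for i, h in enumerate(header)}
    missing = [c for c in relevant if c not in index]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    cols = [index[c] for c in relevant]
    seen = set()
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        # Short rows are padded with empty strings rather than rejected.
        row = tuple(record[i].strip() if i < len(record) else "" for i in cols)
        if row not in seen:  # keep the first occurrence of each distinct row
            seen.add(row)
            rows.append(row)
    return list(relevant), rows


def write_csv(target, header, rows):
    """Write the normalised data as CSV (assumed output format)."""
    writer = csv.writer(target, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


def verify(source, output, relevant, delimiter=","):
    """Check that the output holds exactly the distinct relevant rows of the source."""
    _, expected = extract_relevant(source, relevant, delimiter)
    reader = csv.reader(output)
    out_header = next(reader)
    actual = {tuple(r) for r in reader if r}
    return out_header == list(relevant) and set(expected) == actual


# Usage with a small tab-separated sample (made-up data).
raw = "Name\tAddress\tPhone\nAnn\t1 High St\t123\nAnn\t1 High St\t123\nBob\t2 Low Rd\t456\n"
header, rows = extract_relevant(io.StringIO(raw), RELEVANT, delimiter="\t")
out = io.StringIO()
write_csv(out, header, rows)
ok = verify(io.StringIO(raw), io.StringIO(out.getvalue()), RELEVANT, delimiter="\t")
```

Running verification as a second pass over the source, rather than trusting the in-memory rows, matches the idea that tool 2 could be the final step of tool 1 while still checking the file that was actually written.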

h2. Experiments

h2. Additional information

A blog post that contains more information about this work can be found here: []

h2. Developer Notes

It is possible such a tool would also be very helpful during the final selection process by a researcher:


for example. This is a pre-preservation use, granted, but it should make preservation easier.