
h2. Status
{warning:title=Stopped}

h2. Contact

William Palmer, BL (william (.) palmer (@) bl (.) uk)

h2. User Story

To ensure the long-term survival of a research dataset, we need to ensure that the copy we hold is manageable, contains the relevant data, and is in a format that promotes digital preservation. To this end, we need a digital preservation system that can extract data from disparate tabular data sources, compile that data into a single output file in a preservation format, and verify that the relevant data is present.

The dataset in question is a copy of the UK electoral register, which is deposited at The British Library. The data is submitted by various local authorities in a variety of formats (CSV, XLS, PDF, DOC, ...). For long-term preservation we would like to hold a normalised copy of this data.



h2. User Requirements/Components

# We need a tool that can identify and extract the relevant columns from a tabular dataset and store them in a preservation format
## MUST support reading text-based comma-separated files
## MAY support formats with different (user-specified) separators, such as Tab
## MUST support reading MS Excel documents
## MAY support other source formats as required
## MUST support using column headings to identify relevant content
## MAY support a user-specified map to identify relevant columns
## MUST output the relevant data (details of a suitable output format are TBD: possibly MIXED, perhaps just CSV)
## SHOULD be capable of detecting duplicate rows within a data source and removing them
## MUST be repeatable across legacy and new datasets
# We need a tool that can verify that the relevant columns and their data exist in the preservation format
## MUST ensure that all data identified as relevant by tool 1 is correctly migrated to (i.e. exists in) the output format
## (This tool could be the final step of the migration tool described in 1)
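The requirements above can be illustrated with a minimal Python sketch. It is not the tool developed for this scenario. It assumes text-based delimited input, a plain CSV output (the output format is still TBD above), and a hypothetical list of relevant headings (`RELEVANT_COLUMNS`); the function names are invented for illustration. Reading MS Excel documents is outside the sketch. Duplicate detection is assumed here to mean rows that are identical across every source column.

```python
import csv
import io

# Hypothetical headings; the real list of relevant columns is not given here.
RELEVANT_COLUMNS = ["Name", "Address"]


def extract_relevant(source, relevant, delimiter=","):
    """Read a delimited text source and return the relevant columns of each
    distinct source row, in source order."""
    reader = csv.DictReader(source, delimiter=delimiter)
    fieldnames = reader.fieldnames or []
    missing = [c for c in relevant if c not in fieldnames]
    if missing:
        raise ValueError(f"source lacks relevant columns: {missing}")
    seen = set()
    rows = []
    for record in reader:
        # A row is a duplicate only if every source column matches.
        key = tuple(record[f] or "" for f in fieldnames)
        if key in seen:
            continue
        seen.add(key)
        rows.append(tuple(record[c] or "" for c in relevant))
    return rows


def write_normalised(rows, relevant, target):
    """Write the header and the extracted rows as comma-separated text."""
    writer = csv.writer(target, lineterminator="\n")
    writer.writerow(relevant)
    writer.writerows(rows)


def verify(rows, relevant, written):
    """Re-read the output and check that the header and every row match."""
    reader = csv.reader(written)
    header = next(reader, None)
    return header == list(relevant) and [tuple(r) for r in reader] == rows


# Usage with an in-memory, tab-separated sample containing one duplicate row.
sample = (
    "Name\tWard\tAddress\n"
    "A. Smith\tNorth\t1 High St\n"
    "A. Smith\tNorth\t1 High St\n"
    "B. Jones\tSouth\t2 Low Rd\n"
)
rows = extract_relevant(io.StringIO(sample), RELEVANT_COLUMNS, delimiter="\t")
out = io.StringIO()
write_normalised(rows, RELEVANT_COLUMNS, out)
ok = verify(rows, RELEVANT_COLUMNS, io.StringIO(out.getvalue()))
```

Running verification as the last step of the same script mirrors the note that the verification tool could be the final step of the migration tool. A fuller version would compare the output against the original source independently rather than against the in-memory rows.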


h2. Experiments

_Create experiments as child pages and they should appear automatically here_
{pageTree:[email protected]}


h2. Additional information

A blog post that contains more information about this work can be found here: [http://www.openplanetsfoundation.org/blogs/2013-03-01-tabular-data-normalisation-tool]


h2. Developer Notes

It is possible that such a tool would also be very helpful to a researcher during the final selection process, as described at [http://www.lib.cam.ac.uk/dataman/pages/selection.html], for example. That is a pre-preservation activity, granted, but it should make preservation easier.