William Palmer, BL (william (.) palmer (@) bl (.) uk)
To ensure the long-term survival of a research dataset, the copy we hold must be manageable, contain the relevant data and be in a format that promotes digital preservation. To this end we need a digital preservation system that can extract data from disparate tabular data sources, compile that data into a single output file in a preservation format and verify that the relevant data is present.
The dataset in question is a copy of the UK electoral register, which is deposited at The British Library. The data is submitted by various local authorities in a variety of formats (CSV, XLS, PDF, DOC, ...). For long-term preservation we would like to hold a normalised copy of this data.
- We need a tool that can identify the relevant columns in a tabular dataset, extract them and store them in a preservation format
- MUST support reading text-based comma-separated files
- MAY support formats with different (user-specified) separators, such as tab
- MUST support reading MS Excel documents
- MAY support other source formats as required
- MUST support using column headings to identify relevant content
- MAY support a user-specified map to identify relevant columns
- MUST output the relevant data (details of a suitable output format TBD: possibly MIXED, perhaps just CSV)
- SHOULD be capable of detecting duplicate rows within a data source and removing them
- MUST be repeatable across legacy and new datasets
- We need a tool that can verify that the relevant columns and their data exist in the preservation format
- MUST ensure that all data identified as relevant by the migration tool (1) is correctly migrated to (exists in) the output format
- (This tool could be the final step in the migration tool (1))
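The two requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool itself: it handles delimited text only (Excel and other source formats would need a third-party reader), it assumes CSV as the output format since that is still TBD, and it treats two rows as duplicates when their relevant columns match. The names COLUMN_MAP, extract_relevant, write_normalised and verify, and the column headings used, are hypothetical.

```python
import csv
import io

# Hypothetical user-specified map: output column -> accepted source headings.
COLUMN_MAP = {
    "name": {"name", "elector name"},
    "address": {"address", "property address"},
}


def extract_relevant(source, column_map, delimiter=","):
    """Return de-duplicated rows holding only the relevant columns."""
    reader = csv.DictReader(source, delimiter=delimiter)
    # Match source headings to output columns, ignoring case and padding.
    selected = {}
    for heading in reader.fieldnames or []:
        for target, aliases in column_map.items():
            if heading.strip().lower() in aliases:
                selected[target] = heading
    missing = set(column_map) - set(selected)
    if missing:
        raise ValueError(f"no source column found for: {sorted(missing)}")

    rows, seen = [], set()
    for record in reader:
        # Short rows yield None for absent fields; treat those as empty.
        row = tuple((record[selected[t]] or "").strip() for t in column_map)
        if row not in seen:  # drop duplicate rows, keeping the first seen
            seen.add(row)
            rows.append(row)
    return rows


def write_normalised(rows, column_map, out):
    """Write the extracted rows as CSV (assumed output format)."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(list(column_map))
    writer.writerows(rows)


def verify(rows, column_map, normalised):
    """Check that every extracted row exists in the normalised output."""
    reader = csv.reader(normalised)
    if next(reader, None) != list(column_map):
        return False
    written = {tuple(r) for r in reader}
    return all(row in written for row in rows)


# Usage with made-up sample data: extract, write, then verify.
source = io.StringIO(
    "Elector Name,Property Address,Notes\n"
    "A. Smith,1 High St,x\n"
    "A. Smith,1 High St,y\n"
    "B. Jones,2 Low Rd,z\n"
)
rows = extract_relevant(source, COLUMN_MAP)
out = io.StringIO()
write_normalised(rows, COLUMN_MAP, out)
out.seek(0)
ok = verify(rows, COLUMN_MAP, out)
```

Running verification as the last step of the migration, as suggested in the final bullet, keeps the check repeatable across legacy and new datasets because it uses the same column map as the extraction.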
A blog post that contains more information about this work can be found here: http://www.openplanetsfoundation.org/blogs/2013-03-01-tabular-data-normalisation-tool
It is possible such a tool would also be very helpful during the final selection process by a researcher:
for example. Pre-preservation, granted, but should make preservation easier.