The dataset in question is a copy of the UK electoral register, deposited at The British Library. The data are submitted by the various local authorities in a variety of formats (CSV, XLS, PDF, DOC, ...). For long-term preservation we would like to hold a normalised copy of this data.
BL Hadoop Platform: http://wiki.opf-labs.org/display/SP/BL+Hadoop+Platform
Code implementation is here: https://github.com/openplanets/tabular-data-normaliser
A detailed description of the workflow is available in this blog post: http://www.openplanetsfoundation.org/blogs/2013-03-01-tabular-data-normalisation-tool
The code makes use of MapReduce/Hadoop: normalisation of an input file occurs in the map phase, and collation of the results occurs in the reduce phase.
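Conceptually, the map/reduce split can be sketched as follows. This is a minimal Python sketch, not the actual implementation linked above; the column names and normalisation rules are purely illustrative stand-ins for whatever the properties file defines.

```python
import csv
import io

# Hypothetical normalisation rules, playing the role of the properties file:
# input column name -> (canonical field name, cleaning function).
RULES = {
    "Surname": ("surname", str.strip),
    "FORENAME": ("forename", lambda s: s.strip().title()),
    "Postcode": ("postcode", lambda s: s.replace(" ", "").upper()),
}

def map_phase(raw_csv):
    """Normalise each record of one input file (the map step)."""
    for row in csv.DictReader(io.StringIO(raw_csv)):
        yield {canon: clean(row[col]) for col, (canon, clean) in RULES.items()}

def reduce_phase(records):
    """Collate the normalised records into one output table (the reduce step)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["surname", "forename", "postcode"])
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)
    return out.getvalue()

raw = "Surname,FORENAME,Postcode\nsmith , alice ,sw1a 1aa\n"
print(reduce_phase(map_phase(raw)))
```

In the real tool the map runs once per input file across the cluster, so the per-file normalisation parallelises naturally, while the single reduce gathers all normalised records into one output.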
The code was assessed as generic enough to be used on other datasets by running it against another test set of data and checking the outputs. The only change required was a new version of the normalisation properties file appropriate to the input data.
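To adapt the tool to a new dataset, only a properties file of this general shape would need rewriting. The key names below are hypothetical, for illustration only; consult the repository linked above for the actual keys the tool expects.

```properties
# Illustrative normalisation properties for one dataset (hypothetical keys).
# Map input column headings to canonical output field names.
column.Surname=surname
column.FORENAME=forename
column.Postcode=postcode
# Per-field cleaning behaviour.
clean.forename=titlecase
clean.postcode=uppercase,stripspaces
```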