
h2. Dataset

The dataset is a copy of the UK electoral register, which is deposited at The British Library. The data is submitted by local authorities in a variety of formats (CSV, XLS, PDF, DOC, ...). For long-term preservation we would like to hold a normalised copy of this data.

h2. Platform

BL Hadoop Platform: [|SP:BL Hadoop Platform]

h2. Workflow

Code implementation is here: []

A detailed description of the workflow is available in this blog post: []

The code uses MapReduce on Hadoop: each input file is normalised in the map phase, and the results are collated in the reduce phase.
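To illustrate the split between the two phases, here is a minimal in-memory sketch in Python. It is not the project code and does not use Hadoop; the record layout (surname, forename, postcode), the choice of postcode as the key, and the function names {{map_record}} and {{reduce_records}} are all assumptions made for the example.

```python
from collections import defaultdict


def map_record(raw_line):
    """Map step: normalise one raw comma-separated line into a (key, value) pair."""
    # Hypothetical layout: surname, forename, postcode.
    surname, forename, postcode = [field.strip() for field in raw_line.split(",")]
    key = postcode.upper().replace(" ", "")  # assumed grouping key
    value = f"{surname.title()}, {forename.title()}"
    return key, value


def reduce_records(pairs):
    """Reduce step: collate the normalised values under their shared key."""
    collated = defaultdict(list)
    for key, value in pairs:
        collated[key].append(value)
    return {key: sorted(values) for key, values in collated.items()}


# Fictional input lines with inconsistent case and spacing.
raw = ["smith, john , ab1 2cd", "JONES,MARY,AB12CD", "brown,ann,ef3 4gh"]
result = reduce_records(map_record(line) for line in raw)
```

In a real Hadoop job the framework would handle the grouping by key between the two phases; the dictionary in {{reduce_records}} stands in for that shuffle step.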

h2. Requirements/Evaluation Criteria/Conditions of Satisfaction

The code was assessed as generic enough to be used on other datasets: it was run against a second test dataset and the outputs were checked. The only change required was a new version of the normalisation properties file suited to that input data.
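The idea of swapping only the properties file can be sketched as follows. This is an illustration under assumptions, not the actual file format: the {{normalised field = source column}} layout, the column names, and the helpers {{load_properties}} and {{normalise_row}} are hypothetical.

```python
def load_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def normalise_row(row, props):
    """Build a normalised record by looking up each mapped source column."""
    return {target: row[source].strip() for target, source in props.items()}


# Hypothetical mappings for two differently laid out datasets.
DATASET_A = """
# normalised field = source column
surname = Last Name
postcode = Post Code
"""
DATASET_B = """
surname = SURNAME
postcode = PCODE
"""

row_a = {"Last Name": " Smith ", "Post Code": "AB1 2CD"}
row_b = {"SURNAME": "Smith", "PCODE": "AB1 2CD ", "EXTRA": "x"}

# The same code handles both inputs; only the mapping text differs.
out_a = normalise_row(row_a, load_properties(DATASET_A))
out_b = normalise_row(row_b, load_properties(DATASET_B))
```

Keeping the dataset-specific details in configuration is what lets the normalisation code itself stay unchanged between datasets.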