
Dataset

The dataset in question is a copy of the UK electoral register, which is deposited at The British Library. The data is submitted by various local authorities in a variety of formats (CSV, XLS, PDF, DOC, ...). For long-term preservation we would like to hold a normalised copy of this data.

Platform

BL Hadoop Platform: http://wiki.opf-labs.org/display/SP/BL+Hadoop+Platform

Workflow

The code implementation is here: https://github.com/openplanets/tabular-data-normaliser

A detailed description of the workflow is available in this blog post: http://www.openplanetsfoundation.org/blogs/2013-03-01-tabular-data-normalisation-tool

The code makes use of MapReduce/Hadoop: normalisation of an input file occurs in the map phase, while collation of results occurs in the reduce phase.
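The actual mapper and reducer live in the linked repository; as a minimal sketch of the map/reduce split described above (the record format, normalisation rules, and function names here are assumptions for illustration, not the repository's API), the two phases can be mimicked in plain Python:

```python
from collections import defaultdict

def map_normalise(filename, rows):
    """Map phase (sketch): normalise each raw record from one input file.

    Emits (key, normalised_row) pairs; here the key is simply the source
    filename, so all rows from one file collate together in the reduce.
    """
    for name, postcode in rows:
        # Hypothetical normalisation rules: trim whitespace,
        # title-case names, uppercase postcodes.
        yield filename, (name.strip().title(), postcode.strip().upper())

def reduce_collate(pairs):
    """Reduce phase (sketch): collate normalised rows per key."""
    collated = defaultdict(list)
    for key, value in pairs:
        collated[key].append(value)
    return dict(collated)

pairs = list(map_normalise("ward1.csv",
                           [(" alice smith ", "sw1a 1aa"),
                            ("BOB JONES", "ec1a 1bb")]))
result = reduce_collate(pairs)
```

In the real Hadoop job the framework handles the shuffle between the two phases; the sketch simply feeds the map output straight into the reduce to show the division of labour.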

Requirements/Evaluation Criteria/Conditions of Satisfaction

An assessment was made that the code was generic enough to be used on other datasets, by running it against another test dataset and checking the outputs. The only required change was a new version of the normalisation properties file relevant to the input data.
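Swapping datasets therefore only means supplying a new properties file. A minimal sketch of how a properties file might drive normalisation by mapping source column headers to canonical ones (the file format, keys, and function names are invented for illustration and are not the tool's actual configuration):

```python
def load_normalisation_properties(text):
    """Parse a key=value properties file mapping source column
    headers to canonical names (format assumed for illustration)."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        mapping[key.strip()] = value.strip()
    return mapping

def normalise_header(header, mapping):
    """Rename recognised columns; leave unknown ones untouched."""
    return [mapping.get(col, col) for col in header]

# Hypothetical properties for one local authority's column names.
props = """# maps local-authority headers to canonical names
Surname=surname
Family Name=surname
Post Code=postcode
"""
mapping = load_normalisation_properties(props)
header = normalise_header(["Family Name", "Post Code", "Ward"], mapping)
```

Under this scheme, adapting the tool to a new dataset is a matter of writing a new mapping file rather than changing code, which is consistent with the assessment above.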
