William Palmer, BL


Name and link to existing dataset with additional notes if required.

A set of geospatial data files (~1.4TB) from Ordnance Survey (GB & NI). The main file types are GML and NTF, but several other file types are included.

File sizes vary significantly within the dataset; the mean file sizes for the four main file types (by total size) are:

Filetype Mean size
GML (Gzipped)


BL Hadoop Platform


The workflow is implemented as a native MapReduce program (GeoLintHadoop), based on previous Flint/DRMLint work.

GeoLintHadoop is responsible for copying each file from HDFS to a local temporary directory for processing. This is necessary because GeoLint uses the GDAL/OGR JNI libraries to read NTF files, and those libraries require the file to be available on the local filesystem.

To reduce processing time, GeoLintHadoop generates a series of checksums (cksum CRC, CRC32, MD5, SHA-1 and SHA-256) while it copies the data, rather than in a separate pass over the file.
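The copy-and-checksum step can be illustrated with a minimal sketch. This is not the GeoLintHadoop code: the class and method names are hypothetical, it reads from a plain InputStream rather than HDFS, and it omits the cksum CRC, which uses a different CRC variant from java.util.zip.CRC32. It only shows the idea of updating several digests from the same buffer during a single copy.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical helper, for illustration only.
final class ChecksumCopy {

    /** Copies in to out and returns hex checksums computed during the same pass. */
    static Map<String, String> copyWithChecksums(InputStream in, OutputStream out)
            throws IOException, NoSuchAlgorithmException {
        String[] algorithms = {"MD5", "SHA-1", "SHA-256"};
        MessageDigest[] digests = new MessageDigest[algorithms.length];
        for (int i = 0; i < algorithms.length; i++) {
            digests[i] = MessageDigest.getInstance(algorithms[i]);
        }
        CRC32 crc32 = new CRC32();

        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);      // copy the chunk
            crc32.update(buffer, 0, read);   // and feed the same chunk to every checksum
            for (MessageDigest digest : digests) {
                digest.update(buffer, 0, read);
            }
        }

        Map<String, String> result = new LinkedHashMap<>();
        result.put("CRC32", String.format("%08x", crc32.getValue()));
        for (int i = 0; i < algorithms.length; i++) {
            result.put(algorithms[i], toHex(digests[i].digest()));
        }
        return result;
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "example".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream copy = new ByteArrayOutputStream();
        Map<String, String> sums = copyWithChecksums(new ByteArrayInputStream(data), copy);
        sums.forEach((name, value) -> System.out.println(name + " " + value));
    }
}
```

Because every digest is updated from the buffer that is being written out, the source file is read only once, regardless of how many checksums are requested.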

Once the copy is complete, GeoLintHadoop calls GeoLint to process the file. The following steps are performed:

  1. The file is identified using Apache Tika with a custom-mimetypes.xml, included in GeoLint, that defines geospatial file types
  2. If the file is a GML file it is validated against the Ordnance Survey XML Schema (
  3. If the file is an NTF file it is validated using internal code and using the GDAL/OGR library (
  4. The resulting XML is passed to the Reducer for collation
  5. Any Exceptions are also reported in the Reducer outputs
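The schema validation in step 2 can be sketched with the standard javax.xml.validation API. This is an assumption about the general approach, not the GeoLint implementation: the class name, the toy schema and the in-memory strings are hypothetical stand-ins, and the real workflow would validate files against the Ordnance Survey schema instead.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only.
final class SchemaCheck {

    // Toy schema used for the example; NOT the Ordnance Survey schema.
    static final String TOY_XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"feature\"><xs:complexType><xs:sequence>"
      + "<xs:element name=\"id\" type=\"xs:integer\"/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns null when the document is valid, otherwise the validator's error message. */
    static String validate(String xml, String xsd) throws IOException {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(validate("<feature><id>42</id></feature>", TOY_XSD));
        System.out.println(validate("<feature><id>abc</id></feature>", TOY_XSD));
    }
}
```

Returning the error message rather than throwing lets the caller record a validation failure as part of the per-file report, in the same spirit as the exceptions that are passed on to the Reducer.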

The XML output from GeoLint can be used for post-processing. The output for 1.03TB of data is a ~1.6GB XML file.

Checksum manifests can be verified against this output, and statistics can be derived from it, such as MIME type, extension, or the number of files of a particular type.
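As an example of manifest verification, the sketch below compares an md5sum-style manifest with checksums reported for each file. It assumes the reported checksums have already been extracted from the GeoLint XML into a map of path to digest; the XML layout is not described here, so that extraction step is left out, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class ManifestCheck {

    /** Parses md5sum-style lines: hex digest, whitespace, optional '*', then the path. */
    static Map<String, String> parseManifest(List<String> lines) {
        Map<String, String> expected = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length < 2) {
                continue; // skip malformed lines
            }
            String path = parts[1].startsWith("*") ? parts[1].substring(1) : parts[1];
            expected.put(path, parts[0].toLowerCase(Locale.ROOT));
        }
        return expected;
    }

    /** Returns the manifest paths whose reported digest is missing or different. */
    static List<String> findMismatches(Map<String, String> expected, Map<String, String> reported) {
        List<String> mismatches = new ArrayList<>();
        for (Map.Entry<String, String> entry : expected.entrySet()) {
            String actual = reported.get(entry.getKey());
            if (actual == null || !actual.equalsIgnoreCase(entry.getValue())) {
                mismatches.add(entry.getKey());
            }
        }
        return mismatches;
    }
}
```

An empty result from findMismatches means every file listed in the manifest was reported with a matching digest.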

GeoLint code is here:

Requirements and Policies

Policy statements that relate to this experiment and any evaluation criteria taken from SCAPE metrics

ReliableAndStableAssessment = Is the code reliable and robust and does it handle errors sensibly with good reporting?
NumberOfFailedFiles = 0


Links to results of the experiment using the evaluation template.

