
Investigator(s)

William Palmer, BL

Dataset

Name and link to existing dataset with additional notes if required.

A ~1.4TB set of geospatial data files from Ordnance Survey (GB & NI). The main data types are GML and NTF, but several other file types are included.

File sizes within the dataset vary significantly; the mean file sizes for the four main file types (by total size) are:

Filetype         Mean size
NTF              214KB
GML (Gzipped)    3MB
TIF              7.2MB
ISO              3113MB

Platform

BL Hadoop Platform

Workflow

The workflow is implemented as a native MapReduce program (GeoLintHadoop), based on previous Flint/DRMLint work.

GeoLintHadoop is responsible for copying each file from HDFS to a local temporary directory for processing. This is necessary because GeoLint reads NTF files through the GDAL/OGR JNI libraries, which require the file to be available on the local filesystem.

To reduce processing time, GeoLintHadoop generates a series of checksums (cksum CRC, CRC32, MD5, SHA-1 and SHA-256) while it copies the data, rather than reading the file again afterwards.
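
The copy-and-checksum step described above can be sketched as follows. This is a minimal Python illustration of the single-pass idea, not GeoLintHadoop's actual Java code: the function name `copy_with_checksums` and the chunk size are assumptions, and the POSIX cksum CRC is left out because the Python standard library does not provide it.

```python
import hashlib
import io
import zlib


def copy_with_checksums(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst, feeding each chunk to every checksum as it passes."""
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    crc = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        crc = zlib.crc32(chunk, crc)  # running CRC32 over the data seen so far
        for digest in digests.values():
            digest.update(chunk)
    result = {name: digest.hexdigest() for name, digest in digests.items()}
    result["crc32"] = format(crc & 0xFFFFFFFF, "08x")
    return result


# Usage with in-memory streams standing in for the HDFS input and local output.
payload = b"example NTF payload"
source = io.BytesIO(payload)
target = io.BytesIO()
sums = copy_with_checksums(source, target)
```

Because every digest is updated from the same chunk that is written out, the data is read only once regardless of how many checksums are requested.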

Once this is complete GeoLintHadoop calls GeoLint to process the file.  The following steps are performed:

  1. The file is identified using Apache Tika with a custom-mimetypes.xml, included in GeoLint, that relates specifically to geospatial files
  2. If the file is a GML file it is validated against the Ordnance Survey XML Schema (http://www.ordnancesurvey.co.uk/xml/schema/)
  3. If the file is an NTF file it is validated using internal code and using the GDAL/OGR library (http://www.gdal.org/drv_ntf.html)
  4. The resulting XML is passed to the Reducer for collation
  5. Any Exceptions are also reported in the Reducer outputs
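
The steps above amount to choosing a validator from the identified type and turning any exception into part of the reported output. A rough Python sketch of that pattern is given below. It is not GeoLint's code: the mimetype strings, the `lint` function and the placeholder validators are all hypothetical, and the real identification and validation are done by Tika, the Ordnance Survey schema and GDAL/OGR.

```python
def validate_gml(path):
    """Placeholder for validation against the Ordnance Survey XML Schema."""
    return "gml-schema-check"


def validate_ntf(path):
    """Placeholder for the internal and GDAL/OGR-based NTF checks."""
    return "ntf-check"


# Hypothetical mimetype strings; the real ones are defined in custom-mimetypes.xml.
VALIDATORS = {
    "application/x-gml": validate_gml,
    "application/x-ntf": validate_ntf,
}


def lint(path, mimetype):
    """Return one record per file; errors are captured rather than raised."""
    record = {"path": path, "mimetype": mimetype, "validation": None, "error": None}
    validator = VALIDATORS.get(mimetype)
    if validator is None:
        return record  # identified only; no validation step applies
    try:
        record["validation"] = validator(path)
    except Exception as exc:
        # Keep the failure in the record so it can be collated with the results.
        record["error"] = f"{type(exc).__name__}: {exc}"
    return record
```

Returning a record even on failure means a single bad file does not stop the job, and the failure can be counted alongside the successful results.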

The XML output from GeoLint can be used for post-processing. The output for 1.03TB of data is a ~1.6GB XML file.

Checksum manifests can be verified against this output, and statistics can be derived from it, such as the mimetypes and extensions present and the number of files of a particular type.
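
As an illustration of this kind of post-processing, the Python sketch below streams through an XML report, counts files per mimetype and checks MD5 values against a manifest. The report layout is an assumption made for the example (a `file` element with `name`, `mimetype` and `md5` attributes); the actual GeoLint output format is not described here and may differ. Streaming with `iterparse` is used because the report is too large to load comfortably in one go.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def summarise(xml_source, manifest):
    """Count mimetypes and list files whose MD5 differs from the manifest."""
    counts = Counter()
    mismatches = []
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "file":  # hypothetical element name
            continue
        counts[elem.get("mimetype")] += 1
        name = elem.get("name")
        expected = manifest.get(name)
        if expected is not None and expected != elem.get("md5"):
            mismatches.append(name)
        elem.clear()  # drop the element's attributes and children once used
    return counts, mismatches


# Usage with a tiny in-memory report and a two-entry manifest.
sample = (
    b'<results>'
    b'<file name="a.ntf" mimetype="application/x-ntf" md5="aa"/>'
    b'<file name="b.gz" mimetype="application/gzip" md5="bb"/>'
    b'<file name="c.ntf" mimetype="application/x-ntf" md5="cc"/>'
    b'</results>'
)
counts, mismatches = summarise(io.BytesIO(sample), {"a.ntf": "aa", "c.ntf": "zz"})
```

Files absent from the manifest are counted but not treated as mismatches; whether that is the right policy depends on how the manifest was produced.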

GeoLint code is here: https://github.com/bl-dpt/geolint

Requirements and Policies

Policy statements that relate to this experiment and any evaluation criteria taken from SCAPE metrics

ReliableAndStableAssessment = Is the code reliable and robust and does it handle errors sensibly with good reporting?
NumberOfFailedFiles = 0

Evaluations

Links to results of the experiment using the evaluation template.
