William Palmer, BL
Name and link to existing dataset with additional notes if required.
A ~1.4TB set of geospatial data files from Ordnance Survey (GB & NI). Main data types are GML and NTF, but several other file types are included.
File sizes within the dataset vary significantly; the mean file sizes for the four main file types (by total size) are:
| Filetype | Mean size |
| GML (Gzipped)
The workflow is implemented as a native MapReduce program (GeoLintHadoop), based on previous Flint/DRMLint work.
GeoLintHadoop is responsible for retrieving each file from HDFS to a local temporary directory for processing. This is necessary because GeoLint uses the GDAL/OGR JNI libraries to read through NTF files, and those libraries require the file to be available on the local filesystem.
To reduce overall processing time, GeoLintHadoop generates a series of checksums (cksum CRC, CRC32, MD5, SHA-1 and SHA-256) while it copies the data, rather than in a separate pass over the file.
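The copy-and-checksum step can be illustrated with a minimal sketch. This is not the GeoLintHadoop code (which is a Hadoop MapReduce program); it is a Python illustration of the general technique, and the function name `copy_with_checksums` and the chunk size are assumptions made for the example. The POSIX cksum CRC is left out to keep the sketch short.

```python
import hashlib
import io
import zlib


def copy_with_checksums(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks, feeding each chunk to every checksum.

    Hypothetical helper for illustration only. src and dst are binary
    file-like objects; the data is read exactly once.
    """
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    crc32 = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)                  # the copy itself
        crc32 = zlib.crc32(chunk, crc32)  # running CRC32
        for digest in digests.values():
            digest.update(chunk)          # running MD5, SHA-1, SHA-256
    result = {name: digest.hexdigest() for name, digest in digests.items()}
    result["crc32"] = format(crc32 & 0xFFFFFFFF, "08x")
    return result


# In-memory streams stand in for the HDFS input and the local temporary file.
source = io.BytesIO(b"example NTF or GML bytes")
target = io.BytesIO()
checksums = copy_with_checksums(source, target, chunk_size=8)
```

Because every checksum is updated from the same chunk that is written out, the cost of checksumming is mostly CPU time rather than additional reads of the file.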
Once the copy is complete, GeoLintHadoop calls GeoLint to process the file. The following steps are performed:
- The file is identified using Apache Tika with a custom-mimetypes.xml, included in GeoLint, that specifically covers geospatial file types
- If the file is a GML file it is validated against the Ordnance Survey XML Schema (http://www.ordnancesurvey.co.uk/xml/schema/)
- If the file is an NTF file it is validated using both internal code and the GDAL/OGR library (http://www.gdal.org/drv_ntf.html)
- The resulting XML is passed to the Reducer for collation
- Any Exceptions are also reported in the Reducer outputs
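The control flow of the steps above can be sketched as follows. This is an illustration under stated assumptions, not GeoLint's implementation: identification here is a crude extension check standing in for Apache Tika, the validators are placeholder callables rather than real schema or GDAL/OGR checks, and all names (`identify`, `process_file`, the record fields) are hypothetical.

```python
def identify(filename):
    """Stand-in for Tika detection: guess a type from the extension only."""
    lower = filename.lower()
    if lower.endswith(".gz"):
        lower = lower[:-3]  # look through a gzip wrapper
    if lower.endswith(".gml"):
        return "gml"
    if lower.endswith(".ntf"):
        return "ntf"
    return "other"


def process_file(filename, validators):
    """Identify a file, run the matching validator, and capture any exception."""
    record = {"file": filename, "type": identify(filename),
              "valid": None, "error": None}
    validator = validators.get(record["type"])
    if validator is None:
        return record  # no validation defined for this type
    try:
        record["valid"] = bool(validator(filename))
    except Exception as exc:
        # Report the failure in the output instead of aborting the whole job.
        record["error"] = f"{type(exc).__name__}: {exc}"
    return record


def failing_validator(filename):
    """Placeholder validator that always fails, to show error reporting."""
    raise ValueError("unreadable record")


# Placeholder validators; real ones would check the XML Schema or call GDAL/OGR.
validators = {"gml": lambda name: True, "ntf": failing_validator}
records = [process_file(name, validators)
           for name in ("tile1.gml.gz", "tile2.ntf", "readme.txt")]
```

Each record corresponds loosely to the per-file result that is passed on for collation; capturing the exception in the record means a single unreadable file does not stop the rest of the run.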
The XML output from GeoLint can be used for post-processing. The output for 1.03TB of data is a ~1.6GB XML file.
For example, checksum manifests can be verified against the output, or statistics can be derived from it, such as the distribution of MIME types and extensions or the number of files of a particular type.
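As a sketch of this kind of post-processing, the snippet below counts files per MIME type by streaming through an XML document. The element name `file`, the attribute name `mimetype` and the sample values are assumptions made for the example; the actual GeoLint output schema is not described here and may differ. Streaming is used because an output file of this size would be costly to load as a single tree.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_by_attribute(xml_source, element="file", attribute="mimetype"):
    """Count occurrences of an attribute value across matching elements.

    The element and attribute names are hypothetical defaults, not the
    real GeoLint schema. xml_source is a path or a binary file object.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == element:
            counts[elem.get(attribute, "unknown")] += 1
            elem.clear()  # drop the element's contents to limit memory use
    return counts


# Tiny made-up sample standing in for the real output file.
sample = io.BytesIO(
    b'<results>'
    b'<file name="a.gml.gz" mimetype="example/gml"/>'
    b'<file name="b.gml.gz" mimetype="example/gml"/>'
    b'<file name="c.ntf" mimetype="example/ntf"/>'
    b'</results>'
)
counts = count_by_attribute(sample)
```

The same loop could collect extensions or checksums instead by changing the `attribute` argument, provided the output records carry that information in a comparable form.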
GeoLint code is here: https://github.com/bl-dpt/geolint
Policy statements that relate to this experiment and any evaluation criteria taken from SCAPE metrics
ReliableAndStableAssessment = Is the code reliable and robust and does it handle errors sensibly with good reporting?
NumberOfFailedFiles = 0
Links to results of the experiment using the evaluation template.