

William Palmer @ BL dotUK

Evaluation points

Assessment of measurable points
Metric                        | Metric baseline (14th August 2014)   | Metric goal | 8th August 2014
TotalRuntime                  | 22:31:31 (hh:mm:ss)                  |             | 08:32:53 (hh:mm:ss)
TotalObjects                  | 2,602,737 [1]                        |             | 2,602,737 [1]
NumberOfObjectsPerHour        | 115547                               |             | 304482
ThroughputGbytesPerHour       | 46.83                                |             | 123.4 [2]
ReliableAndStableAssessment   | TRUE                                 | TRUE        | TRUE
NumberOfFailedFiles           | 0                                    | 0           | 0
NumberOfFailedFilesAcceptable | TRUE                                 | -           | TRUE
Notes                         | [4]                                  |             | [3]
Configuration                 | Using 8 simultaneous map slots/nodes |             | Using 28 simultaneous map slots/nodes

Note 1: This is a subset of the main geospatial dataset, totaling approximately 1055GB.  These files took 24h37m to copy from a NAS into HDFS.

Note 2: This is 35MB/s, below the benchmarked maximum I/O rate of 74-146MB/s (see Benchmarking Hadoop installations).

Note 3: Problems were detected in 2371 GML and NTF files.  2214 of these were false positives, and GeoLint has subsequently been modified to parse those files correctly.  Of the remaining 157 files, some were GML files that failed validation and four were NTF files that need to be checked.  This is a positive result: 157 files to review instead of 2.6 million.  Some of the GML files may share the same issues.

Note 4: The software was modified to reduce false positives for this run.  The same 157 files as in Note 3 were identified, along with 602 of the previous 2214 "false positives" that require further investigation.
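
The derived figures in the table and notes above can be cross-checked with a few lines of arithmetic. The sketch below is not part of the original evaluation: the function names are illustrative, and the MB/s conversion assumes 1 GB = 1024 MB, which the page does not state.

```python
from datetime import timedelta

TOTAL_OBJECTS = 2_602_737   # TotalObjects row of the table
FLAGGED_FILES = 2371        # Note 3: files in which problems were detected
FALSE_POSITIVES = 2214      # Note 3: of which false positives


def runtime_hours(hms: str) -> float:
    """Convert an hh:mm:ss runtime string into fractional hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return timedelta(hours=h, minutes=m, seconds=s).total_seconds() / 3600


def objects_per_hour(hms: str, total_objects: int = TOTAL_OBJECTS) -> float:
    """NumberOfObjectsPerHour = TotalObjects / TotalRuntime."""
    return total_objects / runtime_hours(hms)


def gb_per_hour_to_mb_per_s(gb_per_hour: float, mb_per_gb: int = 1024) -> float:
    """Convert a GB/h throughput to MB/s (assumption: 1 GB = 1024 MB)."""
    return gb_per_hour * mb_per_gb / 3600


def implied_dataset_gb(gb_per_hour: float, hms: str) -> float:
    """Data volume implied by a throughput figure and a runtime."""
    return gb_per_hour * runtime_hours(hms)


# Files left to review after discarding the false positives (Note 3).
files_to_review = FLAGGED_FILES - FALSE_POSITIVES

if __name__ == "__main__":
    runs = (("8 map slots", "22:31:31", 46.83), ("28 map slots", "08:32:53", 123.4))
    for label, hms, gb_h in runs:
        print(label,
              f"objects/h={objects_per_hour(hms):.1f}",
              f"MB/s={gb_per_hour_to_mb_per_s(gb_h):.1f}",
              f"implied GB={implied_dataset_gb(gb_h, hms):.0f}")
    print("files to review:", files_to_review)
```

With the table's values, both runs come out within one object per hour of the NumberOfObjectsPerHour row, 123.4 GB/h converts to roughly 35 MB/s, and both throughput figures imply a dataset of about 1055 GB, in line with Notes 1 and 2.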

Note: Metrics must be registered in the metrics catalogue

Assessment of non-measurable points

For some evaluation points it makes most sense to give a textual description/explanation

Technical details

Remember to include relevant information, links, versions about workflow, tools, APIs (e.g. Taverna, command line, Hadoop, links to MyExperiment, link to tools or SCAPE name, links to distinct versions of specific components/tools in the component registry)

The full paths to the input files are listed in a text file in HDFS.  2000 lines are passed to each mapper.
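
The batching described above can be sketched as follows, assuming the list file holds one path per line. This is an illustration in Python, not the experiment's code, and `split_paths` and `map_task_count` are hypothetical names. In Hadoop itself, this kind of fixed-line split is commonly obtained with NLineInputFormat, though the page does not say which mechanism was used.

```python
from itertools import islice
from math import ceil

LINES_PER_MAPPER = 2000  # from the text: 2000 lines are passed to each mapper


def split_paths(paths, lines_per_split=LINES_PER_MAPPER):
    """Yield successive batches of at most `lines_per_split` paths."""
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, lines_per_split))
        if not batch:
            return
        yield batch


def map_task_count(total_lines, lines_per_split=LINES_PER_MAPPER):
    """Number of batches (map tasks) needed to cover `total_lines` lines."""
    return ceil(total_lines / lines_per_split)


if __name__ == "__main__":
    # Hypothetical paths standing in for the real list file in HDFS.
    example = (f"/data/example/file-{i}.gml" for i in range(4500))
    print([len(batch) for batch in split_paths(example)])
    print(map_task_count(2_602_737))
```

Under the one-path-per-line assumption, the 2,602,737 objects would give 1302 batches, the last holding 737 paths.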

Evaluation notes

Could be such things as identified issues, workarounds, data preparation, if not already included above

An attempt was made to use SequenceFiles to ensure data locality.  An issue caused by the variation in file sizes (see NTF/ISO sizes) was encountered and resolved.  However, creating the SequenceFile ran into heap size issues/exceptions after a very long execution.  As creating the SequenceFile took significantly longer than simply copying the data into HDFS and processing it, that approach was abandoned.  Additionally, the data will not be stored in SequenceFiles for long-term preservation.


This experiment uses a heterogeneous dataset of many different file types.  All files were identified (mimetype) and checksummed, and the NTF and GML files were also validated.  The final runtime of about eight and a half hours is reasonable for performing all of those tasks.  However, the files are not stored permanently in HDFS, and copying them into HDFS took more than 24 hours (see Note 1).  It is worth noting that although a number of the files are small (~214kb average), we were still not hitting the I/O limits of the cluster (see Note 2 above).  Adding mappers does not increase throughput linearly, although wall-clock time still decreases: 8 mappers at 46.84GB/h would scale to 163.94GB/h with 28 mappers under linear growth, whereas the recorded figure was 123.4GB/h.
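
The comparison with linear scaling can be made explicit. The sketch below is illustrative only (the function names are not from the original) and takes its inputs from the table: 46.83 GB/h on 8 map slots and 123.4 GB/h on 28.

```python
def linear_projection(base_rate: float, base_slots: int, new_slots: int) -> float:
    """Throughput expected at `new_slots` if it grew linearly with slot count."""
    return base_rate * new_slots / base_slots


def scaling_efficiency(measured_rate: float, base_rate: float,
                       base_slots: int, new_slots: int) -> float:
    """Measured throughput as a fraction of the linear projection."""
    return measured_rate / linear_projection(base_rate, base_slots, new_slots)


if __name__ == "__main__":
    projected = linear_projection(46.83, 8, 28)
    efficiency = scaling_efficiency(123.4, 46.83, 8, 28)
    print(f"linear projection: {projected:.1f} GB/h")
    print(f"scaling efficiency: {efficiency:.0%}")
```

With these inputs the projection is about 164 GB/h, and the measured throughput is roughly 75% of it.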
