William Palmer @ BL dotUK
||Metric||Metric baseline (14th August 2014)||Metric goal (8th August 2014)||
|TotalRuntime|22:31:31 (hh:mm:ss)|08:32:53 (hh:mm:ss)|
|TotalObjects|2,602,737|2,602,737|
|ThroughputGbytesPerHour|46.83|123.4|
| |Using 8 simultaneous map slots/nodes|Using 28 simultaneous map slots/nodes|
Note 1: This is a subset of the main geospatial dataset, totaling approximately 1055GB. These files took 24h37m to copy from a NAS into HDFS.
Note 2: This is approximately 35MB/s, below the benchmarked maximum I/O rate of 74-146MB/s (see Benchmarking Hadoop installations).
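The 35MB/s figure can be cross-checked from the table above: roughly 1055GB processed within the goal run's 08:32:53 runtime. A quick sanity check (assuming 1GB = 1024MB):

```python
# Cross-check the effective throughput in Note 2 from the reported figures.
dataset_gb = 1055                      # approximate dataset size (Note 1)
runtime_s = 8 * 3600 + 32 * 60 + 53    # goal TotalRuntime 08:32:53, in seconds

mb_per_s = dataset_gb * 1024 / runtime_s
print(round(mb_per_s, 1))  # ~35.1 MB/s, below the 74-146 MB/s benchmark
```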
Note 3: Problems were detected in 2371 GML and NTF files, 2214 of which were false positives; GeoLint has subsequently been modified to correctly parse those files. Of the remaining 157 files, some GML files failed validation and four NTF files need to be checked. This is a positive result: 157 files to review instead of 2.6 million, and some of the GML files may share the same issues.
Note 4: The software was modified to reduce false positives for this run. The same 157 files as in Note 3 were identified, along with 602 of the previous 2214 "false positives" that require further investigation.
The full paths to the input files are listed in a text file in HDFS; 2000 lines (file paths) are passed to each mapper.
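The splitting behaviour described above (what Hadoop's NLineInputFormat provides) can be sketched in plain Python. The `chunk_paths` helper and the toy path list are illustrative only, not part of the actual workflow:

```python
# Sketch: a path-list file is divided among mappers, each mapper receiving
# a fixed number of lines (2000 in this experiment).
def chunk_paths(paths, lines_per_mapper=2000):
    """Yield successive slices of the path list, one slice per mapper."""
    for i in range(0, len(paths), lines_per_mapper):
        yield paths[i:i + lines_per_mapper]

paths = [f"/data/geo/file_{i}.ntf" for i in range(5000)]  # toy example
splits = list(chunk_paths(paths))
print(len(splits), [len(s) for s in splits])  # 3 splits: 2000, 2000, 1000

# At the experiment's scale, 2,602,737 objects need ceil(n / 2000) mappers:
n_mappers = -(-2_602_737 // 2000)
print(n_mappers)  # 1302 map tasks
```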
An attempt was made to use SequenceFiles to ensure data locality. An issue caused by the variation in file sizes (see NTF/ISO sizes) was encountered and resolved, but creating the SequenceFile then failed with heap size exceptions after a very long execution. As the creation time for the SequenceFile was significantly longer than simply copying the data into HDFS and processing it there, that approach was abandoned. Additionally, the data will not be stored in SequenceFiles for long-term preservation.
This experiment has a heterogeneous dataset of many different file types. All files were identified (mimetype) and checksummed, and the NTF and GML files were also validated. The final runtime of eight and a half hours is a reasonable time for performing all those tasks. However, the files are not stored permanently in HDFS, and it took 24 hours to copy them in. It is worth noting that although many of the files are small (~214KB average), we were still not hitting the I/O limits of the cluster - see Note 2 above. With more mappers processing data we do not see linear growth, although wall-clock time still decreases (8 map slots @ 46.83GB/h would scale to ~163.9GB/h with linear growth, instead of the recorded 123.4GB/h).
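The sub-linear scaling claim works out as follows, assuming throughput would scale in proportion to map slots:

```python
# Check the scaling figures: baseline vs goal run from the table above.
baseline_slots, baseline_gb_h = 8, 46.83   # baseline: 8 map slots
goal_slots, goal_gb_h = 28, 123.4          # goal run: 28 map slots

linear_gb_h = baseline_gb_h * goal_slots / baseline_slots
efficiency = goal_gb_h / linear_gb_h
print(round(linear_gb_h, 1))   # ~163.9 GB/h if scaling were linear
print(round(efficiency, 2))    # ~0.75 of linear scaling achieved
```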