William Palmer, British Library
|Metric|| Metric goal
||Early Jan 2014 (Metric baseline)||Late Jan 2014 (incl CloseShield patch)|| July 2014
||5319641.0 || 1923994 
||object = file in arc|
|NumberOfObjectsPerHour||457.8||823.9||266.57||object = arc file|
| Options enabled:
|| This test run was identification (i.e. mimetype detection only)
||This test run was identification (i.e. mimetype detection only)||This test run was identification and characterisation (i.e. metadata extraction)|
The late January test followed work to increase the speed of Nanite. The July test extended that work, adding characterisation and a series of other options and outputs.
 This run took longer than the previous one because it also produced characterisation metadata for each file, whereas earlier tests produced only a mime type.
 In total, 93,271,038 map input records were processed
 In total, 104,256,430 map input records were processed (for the same ARC input files). The higher record count is most likely due to a version upgrade of the record readers used by Nanite, but this requires further investigation. See the results in the table below.
 Note that this throughput is measured against the compressed sizes of the ARC files.
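The figures quoted in this evaluation can be compared with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are made up for this example, it assumes the two January runs processed the same input (so runtime scales inversely with throughput), and it assumes the 93,271,038 record count belongs to an earlier run than the 104,256,430 count.

```python
# Figures copied from the results table and notes in this evaluation.
ARC_FILES_PER_HOUR = {
    "early_jan_2014": 457.8,   # identification only (metric baseline)
    "late_jan_2014": 823.9,    # identification only, incl. CloseShield patch
    "july_2014": 266.57,       # identification and characterisation
}
RECORDS_EARLIER_RUN = 93_271_038   # assumption: count from an earlier run
RECORDS_LATER_RUN = 104_256_430    # reported for the same ARC input files


def runtime_saving(old_rate: float, new_rate: float) -> float:
    """Fraction of runtime saved on identical input when throughput
    rises from old_rate to new_rate (runtime ~ 1 / throughput)."""
    return 1.0 - old_rate / new_rate


def relative_increase(old: float, new: float) -> float:
    """Relative growth from old to new, as a fraction of old."""
    return (new - old) / old


jan_saving = runtime_saving(
    ARC_FILES_PER_HOUR["early_jan_2014"], ARC_FILES_PER_HOUR["late_jan_2014"]
)
july_slowdown = ARC_FILES_PER_HOUR["late_jan_2014"] / ARC_FILES_PER_HOUR["july_2014"]
extra_records = RECORDS_LATER_RUN - RECORDS_EARLIER_RUN
record_increase = relative_increase(RECORDS_EARLIER_RUN, RECORDS_LATER_RUN)

print(f"Runtime saved, early Jan -> late Jan: {jan_saving:.1%}")
print(f"July run slower than late Jan by a factor of {july_slowdown:.2f}")
print(f"Additional map input records: {extra_records:,} ({record_increase:.1%})")
```

Under these assumptions the late January run saves roughly 44% of the baseline runtime, and the later run sees 10,985,392 more map input records than the earlier one.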
For the July 2014 characterisation run, the following exceptions/errors were recorded in the logs:
| Potentially malformed records
| Java (non-IO)Exceptions thrown
| Java IOExceptions thrown
For some evaluation points it makes most sense to give a textual description/explanation
Please include a note about goals-objectives omitted, and why.
Remember to include relevant information, links, versions about workflow, tools, APIs (e.g. Taverna, command line, Hadoop, links to MyExperiment, link to tools or SCAPE name, links to distinct versions of specific components/tools in the component registry)
January evaluations: the patches are now upstream: https://github.com/openplanets/nanite/releases/tag/nanite-126.96.36.199 (nanite-hadoop needs to be re-enabled). This code is for the late January test (includes the CloseShield patch).
July evaluations: code is here: https://github.com/openplanets/nanite/tree/7ed4d5536c42ff77367a10fb9671d5fab2a6935d
Could be such things as identified issues, workarounds, data preparation, if not already included above
Some WARC files were truncated or zero length, probably due to problems when they were copied. A small Hadoop program was written to identify these files so that they could be excluded from the full runs. The check runs very quickly and can be chained before the FormatProfiler MapReduce program in Nanite, but it was turned off and is not included in these evaluation runs because the problematic files had already been identified. (See: https://github.com/willp-bl/nanite/tree/master/nanite-hadoop/src/main/java/uk/bl/wap/hadoop/gzchecker)
For the characterisation run in July 2014, more files were processed from within the ARC files than in previous evaluation runs. This is most likely due to an upgrade of dependencies, for example the warc-hadoop-recordreaders, amongst others. However, a number of exceptions/errors remained, and this issue should be investigated so that we can ensure all files within the ARC files are processed in future.
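The kind of check described above can be sketched locally. The example below is not the gzchecker Hadoop program linked above; it is a minimal standalone illustration that assumes the (W)ARC files are gzip-compressed, and the function name `check_gzip` and its return values are hypothetical.

```python
import gzip
import os
import zlib


def check_gzip(path: str, chunk_size: int = 1 << 20) -> str:
    """Classify a gzip-compressed file as 'ok', 'empty' or 'corrupt'.

    Hypothetical helper for illustration only; it reads the whole
    stream once and discards the decompressed data.
    """
    # A zero-length file decompresses to nothing without raising,
    # so it has to be detected explicitly.
    if os.path.getsize(path) == 0:
        return "empty"
    try:
        with gzip.open(path, "rb") as stream:
            while stream.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: the stream ended before the end-of-stream marker (truncation).
        # OSError (incl. gzip.BadGzipFile) and zlib.error: invalid or damaged data.
        return "corrupt"
    return "ok"
```

Files reported as `empty` or `corrupt` would then be left out of the input list for the full run.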
The decision to use and develop Nanite further for this experiment has proved to be a sound one. Nanite benefits greatly from being tightly coupled with Hadoop and from using pure-Java libraries, so no external applications are called. After the runtime was initially reduced by almost 50%, further work was undertaken to add full characterisation of the input files, which proved to be very performant and compared favourably to other methods of characterisation at scale. Nanite is a good base for future work on gleaning more information from web archives and can easily be extended further. An example of this is the c3po-compatible outputs for exploring the characterisation information of one's archives. Additional options for storing files that Tika cannot process are already included and will potentially be useful for improving Tika.
One of our web archive collections totals 30TB of compressed (W)ARC files, and using Nanite to characterise that data on the same test cluster would be expected to take 68 days, which is acceptable.
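The extrapolation above is a simple linear scaling, which can be written out as a short sketch. The throughput used for the 68-day estimate is not stated here, so the example derives the rate implied by the two quoted figures instead. It assumes decimal units (1 TB = 1000 GB) and that throughput stays constant as the collection grows; the function names are illustrative.

```python
def implied_throughput_gb_per_hour(dataset_tb: float, days: float) -> float:
    """Sustained rate needed to process dataset_tb in the given number of days."""
    return dataset_tb * 1000 / (days * 24)


def estimated_days(dataset_tb: float, throughput_gb_per_hour: float) -> float:
    """Days needed to process dataset_tb at a constant throughput."""
    hours = dataset_tb * 1000 / throughput_gb_per_hour
    return hours / 24


# 30 TB of compressed (W)ARC files in 68 days, as quoted above.
implied = implied_throughput_gb_per_hour(30, 68)
print(f"Implied throughput: {implied:.1f} GB of compressed input per hour")
print(f"Round trip: {estimated_days(30, implied):.0f} days")
```

Under these assumptions the quoted figures correspond to about 18.4 GB of compressed input per hour on the test cluster.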