
Evaluator(s)

William Palmer, British Library

Evaluation points

Assessment of measurable points
Metric | Metric goal | Early Jan 2014 (metric baseline) | Late Jan 2014 (incl. CloseShield patch) | July 2014
TotalRuntime | - | 31:33:00 | 17:32:00 | 54:11:15 [0]
TotalSize | - | 1024 GB | 1024 GB | 1024 GB
NumberOfObjectsPerHour (object = file in ARC) | - | 2956292.8 [1] | 5319641.0 [1] | 1923994 [2]
NumberOfObjectsPerHour (object = ARC file) | - | 457.8 | 823.9 | 266.57
ThroughputGbytesPerHour | - | 32.4 [3] | 58.4 [3] | 18.89 [3]
ReliableAndStableAssessment | TRUE | TRUE | TRUE | TRUE
NumberOfFailedFiles | 0 | 0 | 0 | 0
NumberOfFailedFilesAcceptable | - | TRUE | TRUE | TRUE
Options enabled | - | - | - | INCLUDE_SERVERTYPE, USE_DROID, USE_TIKADETECT, USE_TIKAPARSER, GENERATE_C3PO_ZIP
Test scope | - | identification only (i.e. mimetype detection) | identification only (i.e. mimetype detection) | identification and characterisation (i.e. metadata extraction)

The test in late January followed work to increase the speed of Nanite. The test in July extended that work, adding characterisation and a series of other options and outputs.

[0] This run was longer than the previous ones because it also produced characterisation metadata for each file, unlike the earlier tests, which produced only a mimetype.

[1] In total, 93,271,038 map input records were processed

[2] In total, 104,256,430 map input records were processed (for the same ARC input files). This increase is most likely due to a version upgrade of the record readers used by Nanite, but requires further investigation. See the results in the table below.

[3] Note that this is throughput relative to the compressed sizes of the ARC files.
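
As a sanity check (assuming these rates are simply the totals divided by TotalRuntime): for the July run, 1024 GB / 54.19 hours ≈ 18.9 GB/hour and 104,256,430 records / 54.19 hours ≈ 1,923,994 objects/hour, both consistent with the table above.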

For the July 2014 characterisation run, the following exceptions/errors were recorded in the logs:

Type | Count
Potentially malformed records | 3
Java (non-IO) exceptions thrown | 4779
Java IOExceptions thrown | 1865
Assessment of non-measurable points

For some evaluation points it makes most sense to provide a textual description/explanation.

Please include a note about any goals/objectives omitted, and why.

Technical details

Remember to include relevant information, links and versions for workflows, tools and APIs (e.g. Taverna, command line, Hadoop, links to MyExperiment, links to tools or their SCAPE names, links to distinct versions of specific components/tools in the component registry).

Jan evaluations: the patches are now upstream: https://github.com/openplanets/nanite/releases/tag/nanite-1.0.72.2 (nanite-hadoop needs to be re-enabled). This code is for the late January test (includes the CloseShield patch).

July evaluations: code is here: https://github.com/openplanets/nanite/tree/7ed4d5536c42ff77367a10fb9671d5fab2a6935d

Evaluation notes

These could be such things as identified issues, workarounds, or data preparation, if not already included above.

Some WARC files were truncated or zero length, probably due to issues when they were copied. A small Hadoop program was written to identify these files so they could be excluded from the full runs; the check itself runs very quickly. It can be chained before the FormatProfiler MapReduce program in Nanite, but it is turned off and not included in these evaluation runs, as the problematic files had already been identified and its runtime is very short (see: https://github.com/willp-bl/nanite/tree/master/nanite-hadoop/src/main/java/uk/bl/wap/hadoop/gzchecker). A stand-alone sketch of this kind of check is shown below.
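
For illustration only (this is not the actual gzchecker code, and the class and method names are invented for the example), a minimal stand-alone sketch of the check: it flags files that are zero length or whose gzip stream cannot be read through to the end, which is how a truncated (W)ARC.gz typically manifests.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Illustrative stand-alone check (not the actual gzchecker Hadoop job): flag
// (W)ARC.gz files that are zero length or cannot be decompressed to the end.
public class GzSanityCheck {

    static boolean looksTruncated(Path path) {
        try {
            if (Files.size(path) == 0) {
                return true; // zero-length file, certainly unusable
            }
            byte[] buffer = new byte[8192];
            try (InputStream raw = Files.newInputStream(path);
                 GZIPInputStream gz = new GZIPInputStream(raw)) {
                while (gz.read(buffer) != -1) {
                    // drain the stream; a truncated gzip member throws an IOException here
                }
            }
            return false; // decompressed cleanly
        } catch (IOException e) {
            return true; // bad header, truncated stream, or unreadable file
        }
    }

    public static void main(String[] args) {
        for (String arg : args) {
            Path path = Paths.get(arg);
            System.out.println(path + "\t" + (looksTruncated(path) ? "EXCLUDE" : "OK"));
        }
    }
}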

For the characterisation run in July 2014, more files were processed from within the ARC files than in previous evaluation runs. This is most likely due to an upgrade of dependencies, for example the warc-hadoop-recordreaders, amongst others. However, a number of exceptions/errors remained, and this issue should be looked into so that we can ensure all files within the ARC files are processed in future.

Conclusion

The decision to use and further develop Nanite for this experiment has proved to be a sound one. Nanite benefits greatly from being tightly coupled with Hadoop and from using pure-Java libraries, so no external applications are called. After the runtime was initially reduced by almost 50%, further work added full characterisation of the input files, which proved to be very performant and compared favourably with other methods of characterisation at scale. Nanite is a good base for future work on gleaning more information from web archives and can easily be extended further; an example of this is the c3po-compatible output for exploring the characterisation information of one's archives. Additional options for storing files that Tika cannot process are already included and will potentially be useful for improving Tika.

One of our web archive collections totals 30TB of compressed (W)ARC files, and using Nanite to characterise that data on the same test cluster would be expected to take 68 days, which is acceptable.
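
As a rough cross-check of that estimate, assuming the July characterisation throughput of 18.89 GB/hour holds at that scale: 30 TB ≈ 30,720 GB, and 30,720 GB / 18.89 GB/hour ≈ 1,626 hours ≈ 68 days.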
