Investigator(s)
William Palmer, British Library
Dataset
BL Web Archive SCAPE Testbed Dataset
Platform
Workflow
The workflow is implemented as a native Java/Hadoop application called Nanite, which was originally developed within SCAPE and has since been developed further. Nanite uses Tika and Droid and operates directly on the content of arc/warc files using a RecordReader.
Nanite code is here: https://github.com/openplanets/nanite
The arc/warc files are held in HDFS
- Nanite gives one arc/warc file to a mapper, which then executes map methods on the contents using an arc/warc RecordReader.
- The Mapper currently processes each file/record from the arc/warc as follows:
  - identify using Tika
  - identify using Droid
  - characterize using a process-isolated Tika: https://github.com/willp-bl/ProcessIsolatedTika
    - output can be stored in a c3po-compatible zip (one per input arc/warc)
  - file extension extracted from URI (if available)
  - content-type given by the original web server (if available)
- a full list of options can be seen here: https://github.com/openplanets/nanite/blob/master/nanite-hadoop/src/main/resources/FormatProfiler.properties
- The Reducer sorts the output and provides a count for each occurrence of the same set of information
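The map and reduce steps above can be sketched in plain Java. This is not Nanite's code and does not use Hadoop: the class name FormatProfileSketch, its methods, the tab-separated key layout and the sample values are all assumptions made for illustration. It shows one way to take the extension from a URI, join the per-record values into a single key, and then sort and count identical keys as the Reducer is described as doing.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch only: names and key layout are not taken from Nanite.
class FormatProfileSketch {

    /** Returns the lower-cased extension of the URI path, or "" if there is none. */
    static String extensionOf(String uri) {
        String path;
        try {
            path = new URI(uri).getPath();
        } catch (URISyntaxException e) {
            return ""; // malformed URI: treat the extension as unavailable
        }
        if (path == null) {
            return ""; // opaque URIs (e.g. mailto:) have no path
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot <= 0 || dot == name.length() - 1) {
            return ""; // no dot, a leading dot only, or a trailing dot
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Replaces a missing value with a fixed marker so the key stays aligned. */
    static String orNone(String value) {
        return (value == null || value.isEmpty()) ? "none" : value;
    }

    /** Map step: join the values recorded for one record into a single key. */
    static String mapKey(String serverType, String tikaType, String droidType, String uri) {
        return String.join("\t",
                orNone(serverType), orNone(tikaType), orNone(droidType),
                orNone(extensionOf(uri)));
    }

    /** Reduce step: count identical keys; TreeMap keeps the output sorted. */
    static Map<String, Long> reduce(List<String> keys) {
        Map<String, Long> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample values are made up for the example.
        List<String> keys = Arrays.asList(
                mapKey("text/html", "text/html", "fmt/471", "http://example.org/index.html"),
                mapKey("text/html", "text/html", "fmt/471", "http://example.org/about.html"),
                mapKey(null, "application/pdf", "fmt/18", "http://example.org/docs/report.PDF?v=2"));
        reduce(keys).forEach((key, count) -> System.out.println(key + "\t" + count));
    }
}
```

In a real Hadoop job the key would be emitted by the Mapper and the framework would group and sort keys before the Reducer sums them; the TreeMap here only stands in for that behaviour.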
Requirements and Policies
ReliableAndStableAssessment = Is the code reliable and robust and does it handle errors sensibly with good reporting?
NumberOfFailedFiles = 0
Evaluations
Page:
EVAL-BL-WCT-01