h2. Investigator(s)

William Palmer, British Library

h2. Dataset

[SP:BL Web Archive SCAPE Testbed Dataset]

h2. Platform

[SP:BL Hadoop Platform]

h2. Workflow

The workflow is implemented as a native Java/Hadoop application called Nanite, which was originally developed within SCAPE and has seen further development since. Nanite uses Apache Tika and DROID, and operates directly on the contents of ARC/WARC files using a custom RecordReader; a simplified mapper skeleton is sketched below.

Nanite code is here: [https://github.com/openplanets/nanite]
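
For orientation, here is a minimal sketch of the record-at-a-time mapper shape described above. It assumes a custom InputFormat/RecordReader that presents one (W)ARC record per map() call; the class and type names are illustrative only, not Nanite's actual API.

{code:java}
// Illustrative sketch, not Nanite's real classes: a Hadoop mapper that
// receives one (W)ARC record per map() call from a custom RecordReader.
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FormatProfilerMapper
        extends Mapper<Text, BytesWritable, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(Text recordUri, BytesWritable payload, Context context)
            throws IOException, InterruptedException {
        // One whole ARC/WARC file is assigned to this mapper; the
        // RecordReader iterates over its records, so map() sees one
        // record (URI plus payload) at a time.
        String key = recordUri.toString(); // placeholder: the real key is
                                           // built from identifying the payload
        context.write(new Text(key), ONE);
    }
}
{code}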

The ARC/WARC files are held in HDFS.
* Nanite assigns one ARC/WARC file to each mapper, which iterates over the file's contents using an ARC/WARC RecordReader and executes the map method on each record.
* The mapper currently processes each file/record from the ARC/WARC as follows (see the identification sketch after this list):
** identify using Tika
** identify using DROID
** characterize using a process-isolated Tika [https://github.com/willp-bl/ProcessIsolatedTika]
*** output can be stored in a c3po-compatible zip (one per input ARC/WARC)
** extract the file extension from the URI (if available)
** record the Content-Type given by the original web server (if available)
** a full list of options can be seen here: [https://github.com/openplanets/nanite/blob/master/nanite-hadoop/src/main/resources/FormatProfiler.properties]
* The reducer sorts the output and provides a count for each distinct set of identification results (see the reducer sketch after this list).
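
The sketch below illustrates the per-record identification steps and the counting reducer. The Tika facade call (org.apache.tika.Tika.detect) is real API; the DROID invocation is elided because it is more involved, and the helper names here are hypothetical rather than Nanite's own.

{code:java}
// Hypothetical helper names; only the Tika facade call is real API.
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.tika.Tika;

class RecordIdentifier {
    private final Tika tika = new Tika();

    /** Builds the composite key that the reducer counts. */
    String identify(InputStream payload, String uri, String serverContentType)
            throws IOException {
        String tikaType = tika.detect(payload);      // identify using Tika
        String droidType = "(droid elided)";         // identify using DROID (API omitted here)
        String ext = extensionOf(uri);               // file extension from the URI
        String serverType = serverContentType == null
                ? "unknown" : serverContentType;     // Content-Type from the web server
        return tikaType + "\t" + droidType + "\t" + ext + "\t" + serverType;
    }

    private static String extensionOf(String uri) {
        int dot = uri.lastIndexOf('.');
        return (dot >= 0 && dot > uri.lastIndexOf('/')) ? uri.substring(dot + 1) : "";
    }
}

// The framework delivers the keys sorted; the reducer then counts the
// occurrences of each distinct set of identification results.
class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) sum += v.get();
        context.write(key, new LongWritable(sum));
    }
}
{code}

In the real application the key can also reflect the process-isolated Tika characterization and the other options configured in FormatProfiler.properties, linked above.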

h2. Requirements and Policies

ReliableAndStableAssessment = Is the code reliable and robust, and does it handle errors sensibly with good reporting?
NumberOfFailedFiles = 0

h2. Evaluations

{pageTree:[email protected]}