h2. Investigator(s)

Sven Schlarb

h2. Dataset

[Austrian National Library - Web Archive|SP:Austrian National Library - Web Archive]

h2. Platform

[SP:ONB Hadoop Platform]

h2. Purpose of the Experiment

The purpose of the experiment is to evaluate the performance of characterising web archive data, available in the form of ARC container files, using [FITS|] (File Information Tool Set) executed by [ToMaR|], with subsequent ingest of the results into MongoDB in order to make them available to the [SCAPE profiling tool C3PO|].

The SCAPE Execution Platform leverages the functionality of existing command line applications by using [ToMaR|], a Hadoop-based application which, amongst other things, allows command line applications to be executed in a distributed way on a computer cluster.

[FITS|] (File Information Tool Set) produces “normalised” output of various file format identification and characterisation tools. In order to use this tool with web archive data, the files contained in ARC or WARC container files must first be unpacked. In a second step, the FITS file format characterisation process can then be applied to the individual files using ToMaR.

h2. Workflow

To iterate over all items inside the ARC.GZ files, the native Java MapReduce program uses a custom RecordReader based on the Hadoop 0.20 API. The custom RecordReader enables the program to read the ARC.GZ files natively and iterate over the archive file record by record (content file by content file). Each record is processed by a single map method call to detect its MIME type.
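The per-record MIME detection of the map step can be illustrated with a plain Java sketch. This is not the experiment's actual code: the real job uses the Hadoop 0.20 API with a custom ARC RecordReader, and the use of {{URLConnection.guessContentTypeFromName}} here is an illustrative stand-in for the actual detection mechanism.

```java
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: what each map() call does per ARC record — take one record
// (content file) and emit its detected MIME type. A hypothetical
// stand-in for the Hadoop job described in the text.
public class MimeMapSketch {

    // Guess the MIME type from the record's file name; a real job
    // would typically inspect the record payload as well.
    static String detectMime(String recordName) {
        String mime = URLConnection.guessContentTypeFromName(recordName);
        return mime != null ? mime : "application/octet-stream";
    }

    // One "map call" per record, collecting (record, MIME type) pairs.
    static Map<String, String> mapRecords(List<String> recordNames) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : recordNames) {
            result.put(name, detectMime(name));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mapRecords(List.of("index.html", "logo.png")));
    }
}
```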

The workflow below is an integrated example of using several SCAPE outcomes in order to create a profile of web archive content. It shows the complete process from unpacking a web archive container file to viewing aggregated statistics about the individual files it contains using the [SCAPE profiling tool C3PO|]:

_Figure 1: Web Archive FITS Characterisation using ToMaR, available on myExperiment._

The inputs in this workflow are defined as follows:
* “c3po_collection_name”: The name of the [C3PO|] collection to be created.
* “hdfs_input_path”: A Hadoop Distributed File System (HDFS) path to a directory containing textfile(s) with absolute HDFS paths to ARC files.
* “num_files_per_invocation”: The number of items to be processed per [FITS|] invocation.
* “fits_local_tmp_dir”: The local directory where the [FITS|] output XML files will be stored.
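For illustration, a textfile under “hdfs_input_path” might look like the fragment below, with one absolute HDFS path to an ARC file per line. The paths shown are hypothetical examples, not paths from the actual dataset.

```text
hdfs:///user/onb/webarchive/arcs/archive-000001.arc.gz
hdfs:///user/onb/webarchive/arcs/archive-000002.arc.gz
```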

The workflow uses [a Map-only Hadoop job|] to unpack the ARC container files into HDFS and to create input files which can subsequently be used by [ToMaR|]. After merging the Mapper output files into a single file (MergeTomarInput), the [FITS|] characterisation process is launched by [ToMaR|] as a MapReduce job. [ToMaR|] uses an [XML tool specification|] document which defines the inputs, outputs and execution of the tool. The [tool specification document for FITS|] used in this experiment defines two operations: one for single file invocation, and the other one for directory invocation.
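The idea of the two operations can be sketched as follows. This fragment does not reproduce the exact SCAPE toolspec schema; the element names and command patterns are illustrative assumptions, though FITS does accept {{-i}}/{{-o}} arguments on its command line.

```xml
<!-- Rough sketch of a ToMaR toolspec for FITS; element names are
     illustrative, not the exact SCAPE toolspec schema. -->
<tool name="fits">
  <operations>
    <!-- Operation 1: characterise a single file per invocation -->
    <operation name="characterise-file">
      <command>fits.sh -i ${input} -o ${output}</command>
    </operation>
    <!-- Operation 2: characterise all files in a directory within
         one JVM invocation, amortising the start-up cost -->
    <operation name="characterise-dir">
      <command>fits.sh -i ${inputdir} -o ${outputdir}</command>
    </operation>
  </operations>
</tool>
```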

[FITS|] comes with a command line interface that allows a single file to be used as input to produce the [FITS|] XML characterisation result. But if the tool were started from the command line for each individual file in a large web archive, the start-up time of [FITS|], including its sub-processes, would accumulate and result in poor performance. It therefore comes in handy that [FITS|] allows a directory to be specified, which is traversed recursively so that each file is processed in the same JVM context. [ToMaR|] permits making use of this functionality by defining an operation which processes a set of input files and produces a set of output files.
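The batching idea boils down to partitioning the list of unpacked files into chunks of “num_files_per_invocation”, so that each FITS invocation processes a whole chunk in one JVM context. A minimal Java sketch of that partitioning (class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a list of unpacked files into batches so that each
// FITS/ToMaR invocation processes num_files_per_invocation files in
// one JVM context, instead of paying the start-up cost per file.
public class BatchSketch {

    static List<List<String>> batches(List<String> files, int perInvocation) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < files.size(); i += perInvocation) {
            out.add(files.subList(i, Math.min(i + perInvocation, files.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> files = List.of("a.html", "b.png", "c.pdf", "d.css", "e.js");
        // 5 files in batches of 2 -> 3 invocations instead of 5
        System.out.println(batches(files, 2).size());
    }
}
```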

The question of how many files should be processed per [FITS|] invocation was addressed by setting up a [Taverna|] experiment like the one shown below.


_Figure 2: Wrapper workflow to produce a test series._

The workflow presented above is embedded in a new workflow in order to generate a test series. A list of 40 values for the “num_files_per_invocation” parameter, ranging from 10 to 400 files per invocation in steps of 10, is given as input. [Taverna|] then automatically iterates over the list of input values, combining them as a cross product and launching 40 runs of the embedded workflow.
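The test series input can be reproduced programmatically; the sketch below simply generates the list of 40 parameter values that Taverna iterates over (the method name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Generate the parameter values 10, 20, ..., 400 used as input to
// the "num_files_per_invocation" port of the wrapper workflow.
public class ParamSeries {

    static List<Integer> series(int from, int to, int step) {
        List<Integer> values = new ArrayList<>();
        for (int v = from; v <= to; v += step) {
            values.add(v);
        }
        return values;
    }

    public static void main(String[] args) {
        List<Integer> values = series(10, 400, 10);
        System.out.println(values.size()); // one workflow run per value
    }
}
```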

Five ARC container files with a total size of 481 Megabytes, containing 42,223 individual files, were used as input for this experiment. The 40 workflow cycles completed in around 24 hours and led to the result shown in Figure 3. !execution_time_vs_number_of_files_per_invokation.png|border=1,width=503,height=253!
_Figure 3: Execution time vs. number of files processed per invocation._

The experiment shows a range of parameter values for which the execution time stabilises at about 30 minutes. Additionally, the evolution of the execution time of the average and worst performing tasks is illustrated in Figure 4 and can be taken into consideration when choosing the right parameter value. !average_and_worst_performing_tasks.png|border=1!
_Figure 4: Average and worst performing tasks._

Based on this preparatory experiment, the parameter controlling the number of input lines (i.e. files) processed per ToMaR invocation was set to 250.
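With this setting, the experiment's 42,223 files translate into a ceiling division to obtain the number of FITS invocations (ToMaR map tasks), each processing at most 250 files. A quick check (the class and method names are illustrative):

```java
// Ceiling division: number of ToMaR/FITS invocations needed when each
// invocation processes up to filesPerInvocation files.
public class InvocationCount {

    static int invocations(int totalFiles, int filesPerInvocation) {
        return (totalFiles + filesPerInvocation - 1) / filesPerInvocation;
    }

    public static void main(String[] args) {
        // 42,223 files at 250 files per invocation -> 169 invocations
        System.out.println(invocations(42223, 250));
    }
}
```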