
Investigator(s)

Sven Schlarb

Dataset

Austrian National Library - Web Archive

Platform

ONB Hadoop Platform

Purpose of the Experiment

The purpose of the experiment is to evaluate the performance of characterising web archive data, available in the form of ARC container files, using FITS (File Information Tool Set) executed by ToMaR, with subsequent ingest into MongoDB in order to make the results available to the SCAPE profiling tool C3PO.

The SCAPE Execution Platform leverages the functionality of existing command line applications by using ToMaR, a Hadoop-based application which, amongst other things, allows command line tools to be executed in a distributed way on a computer cluster.

FITS (File Information Tool Set) produces “normalised” output from various file format identification and characterisation tools. In order to use this tool with web archive data, it is necessary to unpack the files contained in ARC or WARC container files in a first step. In a second step, the FITS file format characterisation process can then be applied to the individual files using ToMaR.

Workflow

To run over all items inside the ARC.GZ files, the native Java MapReduce program uses a custom RecordReader based on the Hadoop 0.20 API. The custom RecordReader enables the program to read the ARC.GZ files natively and iterate over the archive file record by record (content file by content file). Each record is processed by a single map method call to detect its MIME type.
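
As an illustration of this approach (not the original ONB source code), a map method along the following lines could detect the MIME type of each record delivered by the custom RecordReader. The value class ArcRecordValue, its methods getPayloadStream() and getUrl(), and the use of Apache Tika for detection are assumptions made for the sketch only.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.tika.Tika;

    // Sketch of the map step: one call per ARC record, emitting (MIME type, 1)
    // so that a reducer can aggregate a MIME type histogram.
    // "ArcRecordValue" stands in for the value type produced by the custom
    // RecordReader; it is a placeholder, not the actual ONB class name.
    public class MimeTypeMapper
            extends Mapper<LongWritable, ArcRecordValue, Text, LongWritable> {

        private final Tika tika = new Tika();              // reused across records
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, ArcRecordValue record, Context context)
                throws IOException, InterruptedException {
            // Detect the MIME type from the record payload and its URL/file name.
            String mimeType = tika.detect(record.getPayloadStream(), record.getUrl());
            context.write(new Text(mimeType), ONE);
        }
    }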

The workflow below is an integrated example of using several SCAPE outcomes in order to create a profile of web archive content. It shows the complete process from unpacking a web archive container file to viewing aggregated statistics about the individual files it contains using the SCAPE profiling tool C3PO:


Figure 1: Web Archive FITS Characterisation using ToMaR, available on myExperiment: www.myexperiment.org/workflows/3933

The inputs in this workflow are defined as follows:

  • “c3po_collection_name”: The name of the C3PO collection to be created.
  • “hdfs_input_path”: A Hadoop Distributed File System (HDFS) path to a directory containing text file(s) with absolute HDFS paths to ARC files.
  • “num_files_per_invocation”: The number of items to be processed per FITS invocation.
  • “fits_local_tmp_dir”: The local directory where the FITS output XML files will be stored.

The workflow uses a Map-only Hadoop job to unpack the ARC container files into HDFS and creates input files which can subsequently be used by ToMaR. After merging the Mapper output files into one single file (MergeTomarInput), the FITS characterisation process is launched by ToMaR as a MapReduce job. ToMaR uses an XML tool specification document which defines the inputs, outputs and execution of the tool. The tool specification document for FITS used in this experiment defines two operations, one for single-file invocation and one for directory invocation.
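
The MergeTomarInput step essentially concatenates the per-mapper output files into one control file for ToMaR. A minimal sketch of how such a merge could be done on HDFS is shown below; the paths are placeholders and the actual workflow component may be implemented differently.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Sketch of a MergeTomarInput-style step: concatenate the part files
    // written by the unpacking job into a single ToMaR input file on HDFS.
    // Both paths are placeholders, not the paths used in the experiment.
    public class MergeTomarInput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path mapperOutputDir = new Path("/user/onb/tomar-input-parts");
            Path mergedFile = new Path("/user/onb/tomar-input.txt");
            // copyMerge concatenates all files in the source directory into one
            // target file, appending the given string (here a newline) after each part.
            FileUtil.copyMerge(fs, mapperOutputDir, fs, mergedFile,
                    false /* keep the part files */, conf, "\n");
        }
    }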

FITS comes with a command line interface that allows a single file to be used as input to produce the FITS XML characterisation result. But if the tool were started from the command line for each individual file in a large web archive, the start-up time of FITS including its sub-processes would accumulate and result in poor performance. It is therefore convenient that FITS allows the definition of a directory which is traversed recursively to process each file in the same JVM context. ToMaR permits making use of this functionality by defining an operation which processes a set of input files and produces a set of output files.
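
To illustrate why the directory operation avoids the repeated start-up cost, the sketch below reuses a single FITS instance, and hence one JVM and one set of initialised sub-tools, for every file in a directory. It uses the public FITS Java API (edu.harvard.hul.ois.fits) purely to demonstrate the principle; the directory paths are placeholders, and in the experiment ToMaR drives FITS via its command line interface rather than through this API.

    import java.io.File;
    import edu.harvard.hul.ois.fits.Fits;
    import edu.harvard.hul.ois.fits.FitsOutput;

    // Sketch: characterise every file in a directory with one FITS instance,
    // so that FITS and its bundled tools are initialised only once per JVM.
    // Directory paths are placeholders.
    public class FitsDirectoryExample {
        public static void main(String[] args) throws Exception {
            Fits fits = new Fits();                        // expensive: done once
            File inputDir = new File("/tmp/unpacked-arc-records");
            File outputDir = new File("/tmp/fits-output");
            outputDir.mkdirs();
            for (File f : inputDir.listFiles()) {
                if (!f.isFile()) {
                    continue;                              // skip sub-directories
                }
                FitsOutput result = fits.examine(f);       // cheap per-file call
                result.saveToDisk(new File(outputDir, f.getName() + ".fits.xml").getPath());
            }
        }
    }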

The question of how many files should be processed per FITS invocation was addressed by setting up a Taverna experiment like the one shown below.

Figure 2: Wrapper workflow to produce a test series.

The workflow presented above is embedded in a new workflow in order to generate a test series. A list of 40 values for the number of files to be processed per invocation, ranging from 10 to 400 in steps of 10, is given as input to the “num_files_per_invocation” parameter. Taverna then automatically iterates over this list, combining the input values as a cross product and launching 40 runs of the embedded workflow.

Five ARC container files with a total size of 481 megabytes, containing 42,223 individual files, were used as input for this experiment. The 40 workflow runs were completed in around 24 hours and led to the result shown in figure 3.
Figure 3: Execution time vs. number of files processed per invocation.

The experiment shows a range of values for which the execution time stabilises at about 30 minutes. Additionally, the evolution of the execution time of the average and worst-performing tasks is illustrated in figure 4 and can be taken into consideration when choosing the parameter value.
Figure 4: Average and worst performing tasks.

Based on this preparatory experiment, the parameter controlling the number of files processed per ToMaR invocation was set to 250.
