For local testing, two sample Zip container files from the Govdocs1 corpus have been made available on the virtual machine.
Download Cipex, a Java application which can be used as a local java application or as a Hadoop job to identify files packaged in container files.
To execute the on the command line type:
The output is written to STDOUT and returns the identification properties in tabular form:
In the following, a performance comparison of the local execution and SCAPE instance execution at the Austrian National Library (1 controller, 5 nodes) will be demonstrated.
You can follow the experiments by following the same steps on the pseudo-distributed Hadoop on your VM.
For this demonstration 10 ZIP files of the Govdocs1 corpus have ben downloaded from:
These zip files contain a total of 9865 files and sum up to a total size of about 4,6 Gigabytes.
Execution time: 5 minutes and 7 seconds (real 5m7.343s, user 4m58.830s, sys 0m17.010s).
In order to test the same process on the Hadoop cluster, first we load the Zip files into HDFS (assuming the Zip files are stored in the directory govdocs1sample):
When starting the application as a Hadoop job, the -d parameter does not indicate the container files directory, but it must be an HDFS directory containing (the) text file(s) listing HDFS paths to container files.
For example, let us assume a textfile named zipfileshdfspaths.txt listing HDFS paths to the ZIP container file we just loaded into HDFS:
In the case of a short demo, this text file size is usually smaller than the default split size (64MB) and Hadoop would process all paths listed in one input text file in one single map task (means: one processor core of a machine!), like illustrated below:
This means: Pay attention to the split size which indicates how the input data is divided into data chunks!
In our case, this means that we have to configure the maximum number of records per tasks. In case it is not possible to configure the maximum number of records using a parameter (like on Cloudera 3u4), Hadoop can be forced to create a specific number of records per task by splitting the input text file.
Using the following command we split the input text file into 10 files each containing one single path to a zip file (-l means 1 line per file):
By that way we force Hadoop to process the records in parallel:
The splitted files are loadedinto HDFS:
The Hadoop job is executed by setting the Hadoop job flag -m where -d points to the HDFS directory containing the textfiles which list paths to the container files:
The Hadoop job processed 10 map tasks (one line per splitted file/one zip file per task) and 1 reduce task and it took about 35 seconds according to Hadoop Admin (real 0m45.404s, user 0m15.490s, sys 0m1.800s).
The Hadoop job then produces the output data similar to the local execution output table:
The cipex tool Hadoop job implements a Mapper and a Reducer (see inner classes ContainerItemIdentificationMapper and ContainerItemIdentificationReducer of the Cipex class), but it does not make use of the MapReduce programming paradigm. In this Reducer implementation the key-value pair is taken from the Mapper and forwarded without any processing.
Based on the output of the Cipex tool, the following question should be answered:
"For how many of the total number of 9865 files do the different tools agree with each other about the mimetype assignment?"
There are many approaches to answer this question using MapReduce, in the following two MapReduce jobs are proposed to answer this question.
The first job is implemented by the class eu.scape_project.tb.cipex_analyse.CipexToolsAgreeOnMimeType of the cipex-analyse tool.
divides the input lines:
and writes key-value pairs out:
The Reducer gets the key-value pairs from the Mapper and counts where the tools agree regarding the mime type identification of files:
The hadoop job execution is started with the following command (in one line):
and creates the result (snippet):
Note that this time the fully qualified classname of the main class is given after the jar file name because no main class is defined in the manifest and the package contains different main classes which can be executed as Hadoop jobs.
The final Hadoop job (see eu.scape_project.tb.cipex_analyse.CipexCountAgreeOnMimeType), is executed by the command (in one line):
whch creates the summary: