Skip to end of metadata
Go to start of metadata
You are viewing an old version of this page. View the current version. Compare with Current  |   View Page History

TB.WP1: Web Content Testbed - Next steps

ONB Plans for the next few months (04.2012)

Order and setup a local Hadoop cluster with 5 Nodes and 2 Controllers

What we are planning to do with it:
1) Analysis of existing web content meta data

  • We will receive it in txt format from our long term archiving team
  • The file contains a line per object - holding meta data info received via HTTP-get.
  • That is useful training to learn to know how to write map/reduce programs and how to handle data within HDFS
  • We will write map/reduce code for statistical analysis on that data

2) ARC.GZ content characterization like we did it in the first year - but with Hadoop and HDFS

  • We will learn how to work with Hadoop with huge amount of data.
  • We will learn how to use characterization tools best on Hadoop (API vs. Command line etc....) - very PC.CC related
  • We will write map/reduce code for statistical analysis on that data

We will need to strongly connect with PC.CC and PT. Because WE are the connection or integration point between the characterization components and the platform on which these components need to run on. Being the interface is an extremely important role. If that fails - nothing will work at the end of the day.

Please add your thoughts and ideas

Which of your planned work should be treated by the web content test bed TB.WP1 too?
Integration topics?
Additional scenarios?

Please add your thoughts and ideas in free form and lets discuss the notes on our next TB.WP1 call!

Some personal notes on a basic Hadoop experiment using web archive meta data

The experiment was using text files (each line stores the meta data of one object) produced by a web crawler. During the experiment, java code extracts the mime type and counts all occurrences.
The experiment has been done in 3 setups on a 1,2GB test sample:

  1. Hadoop map/reduce application with separate input txt files (5035 separate file in HDFS)
  2. Hadoop map/reduce application with ONE single input txt file (1 file in HDFS [concatenation of the above 5035 files])
  3. Java application (no Hadoop) with ONE single input txt file


1. 5035 separate file in HDFS:
    Processing TIME: 1hour24min20sec (84min20sec)
    MAPs total: 5035
2. 1 concatenated file in HDFS:
    Processing TIME: 1min45sec (one minute forty five seconds)
    MAPs total: 20
3. No Hadoop:
    Processing TIME: 33seconds (thirty three seconds)

When working with Hadoop, you need to care about your input files. => BIG files enable Hadoop to perform much, much, much better. If you are working with small files, Hadoop needs to create a MAP task for each single file! On big files (assumable > Block Size) Hadoop can split MAP much better!

Then, tried to configure the job with “FileInputFormat.setMinInputSplitSize”
In the given scenario, this decreases the number of required maps from 20 to 3 and increases the performance from 1min44sec to 1min22sec. This setting can be applied to the job configuration the java code.
Looking at the results, it seems that you need at least 4 or 5 physical node machines to get at a level where you can generate performance advantages with the cluster compared to a single well equipped machine (no map/reduce cluster).

Ran that on a single 8 Core node too: Think about the main focus of your hadoop jobs. Are they I/O or processing centric. I/O centric tasks (like file identification) might perform better on multiple relatively small worker nodes with a relatively low (e.g. 4) number of CPU cores - but each with its own physical disk sub system. Compared to big worker nodes with a large (e.g. 8) number of CPU cores – but a shared disk subsystem. In the second case, the disk sub system might become a bottleneck due to the amount of requested data by the large number of cores.

Further more, a test with the Unix FILE tool, called from the map method of a Hadoop job, has been performed. Performance seems to be poor due to the generated overhead.
Need to think about how tools should be used in that environment. Tool API (e.g. Tika) usage in the map/reduce code might be the better choice. But that has to be the scope of further experimentation.....

Hadoop natively supports reading from compressed input files. But these are not splitable in most cases (like gz) - which might reduce performance because:

  • If working with one, big compressed text file --> the compressed file will be processed by a single mapper (not splitable => no parallelism).
  • If a series of txt files are stored within one compressed container, each file (which might be very small) will be processed by its own mapper (too much job creation overhead).

Both cases will prevent Hadoop from splitting the input appropriately. But in some cases that might be useful and there are alternatives(having other challenges).

Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.