Skip to end of metadata
Go to start of metadata

TB.WP1: Web Content Testbed - notes and next steps

ONB Plans for the next few months (04.2012)

Order and setup a local Hadoop cluster with 5 Nodes and 2 Controllers

What we are planning to do with it:
1) Analysis of existing web content meta data (Link to scenario)

  • We will receive it in txt format from our long term archiving team
  • The file contains a line per object - holding meta data info received via HTTP-get.
  • That is useful training to learn to know how to write map/reduce programs and how to handle data within HDFS
  • We will write map/reduce code for statistical analysis on that data

2) ARC.GZ content characterization like we did it in the first year - but with Hadoop and HDFS (Link to scenario)

  • We will learn how to work with Hadoop with huge amount of data.
  • We will learn how to use characterization tools best on Hadoop (API vs. Command line etc....) - very PC.CC related
  • We will write map/reduce code for statistical analysis on that data

We will need to strongly connect with PC.CC and PT. Because WE are the connection or integration point between the characterization components and the platform on which these components need to run on. Being the interface is an extremely important role. If that fails - nothing will work at the end of the day.

Please add your thoughts and ideas

Which of your planned work should be treated by the web content test bed TB.WP1 too?
Integration topics?
Additional scenarios?

Please add your thoughts and ideas in free form and lets discuss the notes on our next TB.WP1 call!

Some personal notes on a basic Hadoop experiment using web archive meta data

The experiment was using text files (each line stores the meta data of one object) produced by a web crawler. During the experiment, java code extracts the mime type and counts all occurrences.
The experiment has been done in 4 setups on a 1,2GB test sample:

  1. Hadoop map/reduce application with separate input txt files (5035 separate file in HDFS)
  2. Hadoop map/reduce application with ONE single input txt file (1 file in HDFS [concatenation of the above 5035 files])
  3. Java application (no Hadoop) with ONE single input txt file
  4. PIG implementation with ONE single input txt file (1 file in HDFS [concatenation of the above 5035 files])


1. 5035 separate file in HDFS:
    Processing TIME: 1hour24min20sec (84min20sec)
    MAPs total: 5035
2. JAVA map/reduce - 1 concatenated file in HDFS:
    Processing TIME: 1min45sec (one minute forty five seconds)
    MAPs total: 20
3. No Hadoop:
    Processing TIME: 33seconds (thirty three seconds)
4. PIG (0.10.0) map/reduce - 1 concatenated file in HDFS:
    Processing TIME: 3min23sec (three minutes twenty three seconds)
    MAPs total: 20

When working with Hadoop, you need to care about your input files. => BIG files enable Hadoop to perform much, much, much better. If you are working with small files, Hadoop needs to create a MAP task for each single file! On big files (assumable > Block Size) Hadoop can split MAP much better!

Then, tried to configure the job with “FileInputFormat.setMinInputSplitSize”
In the given scenario, this decreases the number of required maps from 20 to 3 and increases the performance from 1min44sec to 1min22sec. This setting can be applied to the job configuration the java code.
Looking at the results, it seems that you need at least 4 or 5 physical node machines to get at a level where you can generate performance advantages with the cluster compared to a single well equipped machine (no map/reduce cluster).

Ran that on a single 8 Core node too: Think about the main focus of your hadoop jobs. Are they I/O or processing centric. I/O centric tasks (like file identification) might perform better on multiple relatively small worker nodes with a relatively low (e.g. 4) number of CPU cores - but each with its own physical disk sub system. Compared to big worker nodes with a large (e.g. 8) number of CPU cores – but a shared disk subsystem. In the second case, the disk sub system might become a bottleneck due to the amount of requested data by the large number of cores.

Further more, a test with the Unix FILE tool, called from the map method of a Hadoop job, has been performed. Performance seems to be poor due to the generated overhead.
Need to think about how tools should be used in that environment. Tool API (e.g. Tika) usage in the map/reduce code might be the better choice. But that has to be the scope of further experimentation.....

Hadoop natively supports reading from compressed input files. But these are not splitable in most cases (like gz) - which might reduce performance because:

  • If working with one, big compressed text file --> the compressed file will be processed by a single mapper (not splitable => no parallelism).
  • If a series of txt files are stored within one compressed container, each file (which might be very small) will be processed by its own mapper (too much job creation overhead).

Both cases will prevent Hadoop from splitting the input appropriately.

On a 8 node basic experimental cluster (poor hardware, 100Mbit connection):

Hadoop: 1,2 GB - 1 single txt files containing web archive meta data

  • Processing TIME: 1min34sec (one min thirty four seconds)
  • MAP total: 20

Comment: This is not really faster than on the pseudo distributed cluster (with one node). BUT...

Hadoop: 3,6 GB - 3 txt files containing web archive meta data

  • Processing TIME: 1min47sec (one min forty seven seconds)
  • MAP total: 60

Comment: ...tripling the input data from 1,2GB to 3,6 GB increases the processing time only by 13 seconds (14%).

Hadoop: 197 GB -  txt web archive meta data

  • Processing TIME: 35min34sec (thirty five min thirty four seconds)
  • MAP total: 3167
  • REDUCE total: 25

Very interesting to compare the results of the 3 experiments running on the same cluster setup with the same java map/reduce code:

            GB Input    Seconds     GB / min

Test1:      1,2         94               0,76

Test2:      3,6         107             2,01

Test3:      197         2135          5,54

The performance gain is only generated over the amount of processed data.

The bigger the data, the less the influence of the job starting overhead. Very interesting to see that reflected on the own experiments - and to see the amount of the influence on performance.

The 197 GB experiment described above, is an comparable equivalent to a task we regularly also perform in a "traditional way" (as a java program running on a 4 core HT server). Usually this task takes around 6 hours and 45 minutes to complete in the traditional way. Compared to the 36 minutes of the above mentioned experiment, that is an excellent value for our "poor hardware cluster".

Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.