
TB.WP1: Web Content Testbed - Next steps
ONB plans for the next few months (04.2012)
Order and set up a local Hadoop cluster with 5 nodes and 2 controllers
What we are planning to do with it:
1) Analysis of existing web content metadata
- We will receive it in txt format from our long-term archiving team
- The file contains one line per object, holding the metadata received via HTTP GET.
- This is useful training for learning how to write map/reduce programs and how to handle data within HDFS
- We will write map/reduce code for statistical analysis on that data
2) ARC.GZ content characterization as we did in the first year, but with Hadoop and HDFS
- We will learn how to work with huge amounts of data on Hadoop.
- We will learn how best to use characterization tools on Hadoop (API vs. command line, etc.); this is closely related to PC.CC
- We will write map/reduce code for statistical analysis on that data
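As an illustration of the kind of statistical map/reduce job planned in step 1, the following is a minimal sketch only. It assumes, purely for illustration, that each metadata line is tab-separated with the URL in the first column and the HTTP Content-Type in the second; the real layout of the file from the archiving team may differ. The function names (map_line, reduce_counts, run_job) are hypothetical, and the job is simulated locally in plain Python without a Hadoop cluster.

```python
from itertools import groupby


def map_line(line):
    """Map step: emit (mime_type, 1) for one metadata line.

    ASSUMPTION: tab-separated line, URL in column 0, Content-Type in column 1.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return []  # skip lines that do not match the assumed layout
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = fields[1].split(";")[0].strip().lower()
    if not mime:
        return []
    return [(mime, 1)]


def reduce_counts(key, values):
    """Reduce step: sum all counts emitted for one key."""
    return key, sum(values)


def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce locally on an iterable of lines."""
    pairs = [kv for line in lines for kv in map_line(line)]
    pairs.sort(key=lambda kv: kv[0])  # stands in for Hadoop's shuffle/sort phase
    return dict(
        reduce_counts(key, (count for _, count in group))
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    )


if __name__ == "__main__":
    sample = [
        "http://example.org/a\ttext/html; charset=utf-8",
        "http://example.org/b\ttext/html",
        "http://example.org/c\timage/png",
    ]
    for mime, count in sorted(run_job(sample).items()):
        print(mime, count)
```

The same map and reduce logic could later be moved onto the cluster, for example as two Hadoop Streaming scripts or as a Java job; which option fits best is one of the things the steps above are meant to find out.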
We will need to work closely with PC.CC and PT, because we are the integration point between the characterization components and the platform on which these components need to run. Being this interface is an extremely important role: if it fails, nothing will work at the end of the day.