
TB.WP1: Web Content Testbed - Next steps

ONB plans for the next few months (04.2012)

Order and set up a local Hadoop cluster with 5 nodes and 2 controllers

What we are planning to do with it:
1) Analysis of existing web content metadata

  • We will receive it as a text file from our long-term archiving team.
  • The file contains one line per object, holding metadata obtained via HTTP GET.
  • This is useful training for learning how to write map/reduce programs and how to handle data within HDFS.
  • We will write map/reduce code for statistical analysis of that data.

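As a first sketch of such a statistical map/reduce job, the following shows a Hadoop Streaming style mapper and reducer in Python that count objects per MIME type. Note the assumptions: the exact layout of the metadata file is not yet known, so the "Content-Type:" field pattern here is hypothetical, and the function names are ours.

```python
#!/usr/bin/env python
"""Sketch of a Hadoop Streaming job counting objects per MIME type.

Assumption: each input line describes one object and contains a
"Content-Type: <mime>" field somewhere in it. The real export format
may differ -- adjust the regex accordingly.
"""
import re
from itertools import groupby

CONTENT_TYPE = re.compile(r"Content-Type:\s*([\w/.+-]+)", re.IGNORECASE)

def mapper(lines):
    """Emit (mime_type, 1) for every metadata line that has a content type."""
    for line in lines:
        match = CONTENT_TYPE.search(line)
        if match:
            yield match.group(1).lower(), 1

def reducer(pairs):
    """Sum the counts per MIME type.

    Hadoop delivers mapper output to the reducer sorted by key, so
    consecutive pairs with the same key can be grouped and summed.
    """
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)
```

In a real run, the two functions would sit in separate `mapper.py` / `reducer.py` scripts reading from stdin and writing tab-separated key/value lines to stdout, launched via the Hadoop Streaming jar; writing the same logic against the Java MapReduce API is the alternative we want to compare it with.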
2) ARC.GZ content characterization as we did in the first year - but with Hadoop and HDFS

  • We will learn how to work with Hadoop on huge amounts of data.
  • We will learn how best to use characterization tools on Hadoop (API vs. command line, etc.) - closely related to PC.CC.
  • We will write map/reduce code for statistical analysis of that data.

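To make the record-level input of such a characterization job concrete, here is a minimal sketch of parsing one ARC v1 record header line (the space-separated URL, IP address, archive date, content type, and length fields). It deliberately omits what a full ARC.GZ reader would also need: gzip decompression, the leading version block, and skipping `length` bytes of payload after each header.

```python
from collections import namedtuple

# Fields of a v1 ARC record header line (space-separated):
#   URL IP-address Archive-date Content-type Archive-length
ArcHeader = namedtuple("ArcHeader", "url ip date mime length")

def parse_arc_header(line):
    """Parse one ARC v1 record header line into its five fields.

    Sketch only: a real reader must first decompress the .gz stream,
    skip the file's version block, and after each header consume
    `length` bytes of record payload before the next header starts.
    """
    url, ip, date, mime, length = line.strip().split(" ")
    return ArcHeader(url, ip, date, mime, int(length))
```

A map/reduce characterization job would then feed each record's payload to a characterization tool and emit, for example, (mime, 1) or (tool verdict, 1) pairs for statistical aggregation, much like the metadata job in point 1.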
We will need to work closely with PC.CC and PT, because we are the connection and integration point between the characterization components and the platform these components need to run on. Being that interface is an extremely important role: if it fails, nothing will work at the end of the day.

Please add your thoughts and ideas

Please add your thoughts and ideas in free form, and let's discuss the notes on our next TB.WP1 call!
