- Introduction to Hadoop (video) (PDF Slides)
- See here for more, especially the training ones at the end of the page.
- Get the VMware or VirtualBox disk image from Andy.
- Fire it up (username/password: cloudera/cloudera).
- Double click on 'Link to Getting Familiar with Hadoop'.
- Follow the instructions (although please skip the 'update exercises' bit).
- NOTE: HADOOP_HOME is not set by default. It should be set to /usr/lib/hadoop, e.g. by running export HADOOP_HOME=/usr/lib/hadoop in your shell.
- Fire up Eclipse.
- Walk through the example section of this tutorial using the Average Word Length code as a base, aiming to run it on the Shakespeare file you used in the previous exercise.
- See http://stackoverflow.com/questions/2627389/how-to-learn-using-hadoop for some more pointers.
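Before opening the Java code in Eclipse, it can help to see the map and reduce steps of an average word length job in isolation. The following is a minimal local sketch in plain Python, not the exercise code itself. It assumes the job keys each word on its first letter and averages the word lengths per letter, which may differ from what the tutorial's version does; the function names map_word_lengths, reduce_average and run_job are made up for illustration.

```python
import string
from collections import defaultdict


def map_word_lengths(line):
    """Map step: emit (first letter, word length) for each word in a line."""
    for token in line.split():
        word = token.strip(string.punctuation)  # drop surrounding punctuation
        if word and word[0].isalpha():
            yield word[0].lower(), len(word)


def reduce_average(letter, lengths):
    """Reduce step: average all lengths that share the same key."""
    lengths = list(lengths)
    return letter, sum(lengths) / len(lengths)


def run_job(lines):
    """Simulate the shuffle locally: group map output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for letter, length in map_word_lengths(line):
            grouped[letter].append(length)
    return dict(reduce_average(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    for letter, avg in run_job(["To be, or not to be", "that is the question"]).items():
        print(letter, avg)
```

On a real cluster the grouping done in run_job is handled by the framework's shuffle phase, so only the map and reduce functions correspond to code you would write.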
Here are a few ideas for more advanced things to do.
- Fire up Firefox and have a look at the Hue, HBase Master, NameNode Status and JobTracker Status pages in the bookmarks bar.
- Modify the code from exercise 2 to parse the MIME types from the sample web crawler log supplied in ~/scape/sample.log and produce a format profile.
- Generate a sequence file, perhaps using forqlift, and do something clever like run DROID 6 on it.
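The format profile idea above is essentially a word count keyed on MIME type. The sketch below shows that logic locally in Python under loud assumptions: the layout of ~/scape/sample.log is not described here, so the code assumes whitespace-separated fields with the MIME type in column index 6. MIME_FIELD, map_mime and format_profile are hypothetical names; check the real log and adjust the column before relying on this.

```python
from collections import Counter

# ASSUMPTION: the MIME type is the 7th whitespace-separated field (index 6).
# The actual layout of sample.log may differ; change this to match.
MIME_FIELD = 6


def map_mime(line, field=MIME_FIELD):
    """Map step: emit (mime type, 1) for each log line that has the field."""
    parts = line.split()
    if len(parts) > field:
        # Drop parameters such as ";charset=utf-8" and normalise case.
        yield parts[field].split(";")[0].lower(), 1


def format_profile(lines, field=MIME_FIELD):
    """Reduce step: sum the counts per MIME type, most frequent first."""
    counts = Counter()
    for line in lines:
        for mime, one in map_mime(line, field):
            counts[mime] += one
    return counts.most_common()


if __name__ == "__main__":
    import sys
    # Usage (path is an example): python profile.py ~/scape/sample.log
    with open(sys.argv[1]) as log:
        for mime, count in format_profile(log):
            print(count, mime)
```

Lines with too few fields are skipped silently, which is a design choice for messy crawl logs; counting them under a separate key would make data quality problems more visible.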
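We need something like HBase because HDFS does not cope well with large numbers of 'small' files, i.e. files much smaller than the HDFS block size. Every file and block is tracked in NameNode memory, so millions of small files are expensive to manage. See http://www.cloudera.com/blog/2009/02/the-small-files-problem/ for more information and some alternative solutions.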
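To get a feel for the scale of the small files problem, here is a rough back-of-the-envelope sketch in Python. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per file or block object, and a 64 MB block size; both values are assumptions to adjust for your cluster, and the function names are made up for illustration.

```python
# ASSUMPTIONS: rule-of-thumb figures, not measurements from this cluster.
BYTES_PER_OBJECT = 150            # approx. NameNode memory per file/block object
BLOCK_SIZE = 64 * 1024 ** 2       # assumed HDFS block size (64 MB)


def namenode_objects(file_count, file_size, block_size=BLOCK_SIZE):
    """Objects the NameNode tracks: one per file plus one per block."""
    blocks_per_file = -(-file_size // block_size)  # integer ceiling division
    return file_count * (1 + blocks_per_file)


def namenode_bytes(file_count, file_size, block_size=BLOCK_SIZE):
    """Estimated NameNode memory in bytes for the given set of files."""
    return namenode_objects(file_count, file_size, block_size) * BYTES_PER_OBJECT


if __name__ == "__main__":
    small = namenode_bytes(10_000_000, 100 * 1024)   # 10 million 100 KB files
    packed = namenode_bytes(954, 1024 ** 3)          # similar data in 1 GB files
    print("small files :", small, "bytes")
    print("packed files:", packed, "bytes")
```

Under these assumptions ten million 100 KB files cost on the order of gigabytes of NameNode memory, while roughly the same volume of data packed into 1 GB files costs a few megabytes, which is why packing small files into larger containers or a store such as HBase is attractive.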