
Initially, we will set up a running Hadoop instance inside a virtual machine on your local laptop. This allows you to easily experiment with Hadoop, execute MapReduce jobs, and test workflows.

h3. Starting the Virtual Machine

Install the VirtualBox virtualization software package on your laptop. Obtain the SCAPE Training Virtual Machine files and store them on your laptop. Then start VirtualBox, open the SCAPE Training Virtual Machine, and start it.

h3. Logging into the Virtual Machine

You should now see the login screen of an Ubuntu Linux installation. Log into the virtual machine with the username _bob_ and the password _alice_. In order to work with Hadoop, you will need a command-line terminal and a web browser; use the icons next to the _System_ menu in the bottom-left corner to start these applications.

h3. Starting Hadoop

Execute the following commands within a terminal window:

$ cd /usr/local/hadoop

$ bin/
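The name of the start script is cut off above. In Hadoop 1.x distributions the daemons are typically started with the combined start-all.sh script; check the bin/ directory of your installation, as this is an assumption about your setup:

```shell
# Assumption: standard Hadoop 1.x layout. start-all.sh launches the HDFS
# daemons (NameNode, SecondaryNameNode, DataNode) and the MapReduce
# daemons (JobTracker, TaskTracker) in one go.
bin/start-all.sh
```

The matching stop-all.sh script can be used to shut the daemons down again later.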

Verify that Hadoop is running successfully. First, use the _jps_ command to determine which Java processes are running:

$ jps

You should see the following processes (the process IDs will differ):

2113 NameNode
2293 DataNode
2474 SecondaryNameNode
2548 JobTracker
2724 TaskTracker
2933 Jps

If any of these processes are missing, you will need to review the logs. By default, logs are located under:

Look through the log files of the _NameNode_ and the _JobTracker_ to ensure that HDFS and MapReduce have started successfully. In particular, look for exceptions that may have occurred during start-up.
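The process check above can also be scripted. A minimal sketch, assuming the daemon names listed above (the check_daemons helper is hypothetical, not part of Hadoop):

```shell
# Sketch: report which of the expected Hadoop 1.x daemons are missing
# from a jps listing passed in as text.
check_daemons() {
  jps_output="$1"
  missing=""
  for d in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
    # grep -w matches whole words only, so "NameNode" does not
    # accidentally match the "SecondaryNameNode" line.
    printf '%s\n' "$jps_output" | grep -qw "$d" || missing="$missing $d"
  done
  if [ -z "$missing" ]; then
    echo "all daemons running"
  else
    echo "missing:$missing"
  fi
}
```

In practice you would call it as check_daemons "$(jps)"; it prints either "all daemons running" or the names of the missing processes.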

h3. Run your first Job

We will run the word count example using an arbitrary text file as input. For the wordcount example you will need a plain text file large enough to be interesting. For this purpose, you can download plain-text books from Project Gutenberg to the local file system (an example book can also be found in the Downloads folder).

Hadoop also creates a home directory for every user on its distributed file system (HDFS). This directory is typically used to store (larger amounts of) input data for a MapReduce job. Access permissions can be set similarly to Unix file systems (see also []).
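Permissions can be inspected and changed with HDFS commands modeled on their Unix counterparts; a short sketch (the paths and mode are examples for this VM, not requirements):

```shell
# Show permissions, owner and group of the HDFS user directories.
hadoop dfs -ls /user

# Example: make /user/bob world-readable but writable only by bob.
hadoop dfs -chmod 755 /user/bob
```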

Now let's copy the book onto the HDFS file system using the Hadoop [] command. In the command below, /home/bob/sampleBook.txt must be replaced with the path to your input file; /user/bob/sampleBook.txt specifies the destination path on HDFS.

$ hadoop dfs -copyFromLocal /home/bob/sampleBook.txt /user/bob/sampleBook.txt

To display the contents of your home directory on HDFS, use the following command:

$ hadoop dfs -ls /user/bob

To display the contents of a file on HDFS, use this command:

$ hadoop dfs -cat /user/bob/sampleBook.txt

After copying the book(s) to HDFS, execute the wordcount example:

$ hadoop jar /usr/local/hadoop/hadoop-examples-1.2.1.jar wordcount /user/bob/sampleBook.txt /user/bob/books-output

You will see the job running now; the results should appear in /user/bob/books-output. To download the results file, execute:

$ hadoop dfs -copyToLocal /user/bob/books-output/part-r-00000 output.txt
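Each line of the part-r-00000 output is a tab-separated word/count pair. The most frequent words can be extracted with standard Unix tools; the inline sample data below stands in for the downloaded output.txt:

```shell
# Sort "word<TAB>count" lines by count (numeric, descending) and keep the
# three most frequent. Replace the printf sample with: cat output.txt
printf 'the\t120\nand\t75\nhadoop\t42\nof\t90\n' | sort -k2,2nr | head -n 3
# Prints:
# the   120
# of    90
# and   75
```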

h3. Monitoring HDFS/Jobs

Hadoop includes a useful set of management web interfaces that allow you to monitor jobs, check log files, browse the file system, and more. Let's take a moment to examine two of them.

The JobTracker page shows information about the cluster, running map and reduce tasks, node status, and running and completed jobs; you can also drill down into the status of individual nodes and tasks. The monitoring site can be found at [http://localhost:50030/jobtracker.jsp]

The NameNode page allows you to browse the HDFS file system, view nodes, look at space consumption, and check logs. The DFSHealth site can be found at [http://localhost:50070/dfshealth.jsp]

h3. Hadoop Command Reference

The File System Shell Guide provides a full reference of HDFS shell commands:


The Commands Guide provides a full list of Hadoop commands required to execute and monitor the system as well as individual jobs:


h3. Using the Guidelines

Using the Hadoop guidelines, you can find out many details of your Hadoop cluster. Try to accomplish the following tasks:

* What is the configured capacity of your HDFS, how many data nodes are connected, and how many blocks have corrupt replicas?
* Execute the MapReduce Grep example program against the previously used input files. The Grep program is documented here: []
* Start a Hadoop job from the command line and, in another terminal window, monitor the default job queue and obtain the job identifier of your job.
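For the last task, the Hadoop 1.x command line offers job and queue listing commands; a sketch (run these against your own cluster, as job IDs will differ):

```shell
# List running jobs; prints job identifiers of the form job_<timestamp>_<seq>.
hadoop job -list

# Show the jobs scheduled in the default queue.
hadoop queue -info default -showJobs

# Detailed status of one job; replace <job-id> with an ID from -list.
hadoop job -status <job-id>
```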