Initially, we will set up a running Hadoop instance inside a virtual machine on your local laptop. This allows you to easily experiment with Hadoop, execute MapReduce jobs, and test workflows.
Install the VirtualBox virtualization software package on your laptop. Obtain the SCAPE Training Virtual Machine files and store them on your laptop. Start VirtualBox, open the SCAPE Training Virtual Machine, and start the virtual machine.
You should now see the login screen of an Ubuntu Linux installation. Log into the virtual machine using the username bob and the password alice. In order to work with Hadoop, you will need a command-line terminal and a web browser; use the corresponding icons next to the System menu in the bottom-left corner to start these applications.
Execute the following command within a terminal window:
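On a Hadoop 1.x installation such as the one on the training VM, all daemons are usually started with the start-all.sh script; the script's location on this particular VM is an assumption, so adjust the path if necessary:

```shell
# Start the HDFS and MapReduce daemons (Hadoop 1.x).
# If start-all.sh is not on your PATH, it is typically found
# in the bin directory of the Hadoop installation.
start-all.sh
```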
Make sure Hadoop is running successfully. First, use the jps command to determine which Java processes are running:
You should see the following processes: (ignore the process IDs)
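For a Hadoop 1.x pseudo-distributed setup, jps typically reports the five Hadoop daemons plus jps itself; the process IDs below are placeholders and will differ on your machine:

```shell
jps
# Typical output (process IDs will vary):
#   2287 NameNode
#   2451 DataNode
#   2608 SecondaryNameNode
#   2693 JobTracker
#   2857 TaskTracker
#   3021 Jps
```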
If any processes are missing, you will need to review the logs. By default, logs are located under:
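On many Hadoop 1.x installations the logs live under $HADOOP_HOME/logs, or under /var/log/hadoop for packaged installations; both paths are assumptions, so check where Hadoop is installed on the VM:

```shell
# Common default log locations (assumptions; verify on your system)
ls "$HADOOP_HOME/logs"
# or, for packaged installations:
ls /var/log/hadoop
```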
Look through the log files of the NameNode and the JobTracker to ensure that HDFS and MapReduce have started successfully. In particular, look for exceptions that might have occurred during start-up.
We will run the word count example using an arbitrary text file as input. For the wordcount example you will need a plain text file large enough to be interesting. For this purpose, you can download plain text books from Project Gutenberg to the local file system (an example book can also be found in the Downloads folder).
Hadoop also creates a home directory for every user on its distributed file system (HDFS). This directory is typically used to store (larger amounts of) input data for a MapReduce job. Access permissions can be set similarly to Unix file systems (see also http://hadoop.apache.org/docs/r1.2.1/file_system_shell.html).
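As a brief sketch, assuming the user bob and the home directory /user/bob used throughout this tutorial, permissions can be inspected and changed with Unix-style options:

```shell
# List your HDFS home directory, including owner and mode bits
hadoop dfs -ls /user/bob
# Restrict the directory to its owner, as on a Unix file system
hadoop dfs -chmod 700 /user/bob
```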
Now let's copy the books onto the HDFS file system using the Hadoop dfs command (see http://hadoop.apache.org/docs/stable/file_system_shell.html). In the command below, /home/bob/sampleBook.txt must be replaced with the path to your input file; /user/bob/sampleBook.txt specifies the destination path on HDFS.
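Using the local and HDFS paths from the text, the copy looks like this:

```shell
# Copy the sample book from the local file system onto HDFS
hadoop dfs -copyFromLocal /home/bob/sampleBook.txt /user/bob/sampleBook.txt
```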
To display the contents of your home directory on HDFS use the following command:
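Assuming the HDFS home directory /user/bob from above:

```shell
# List the contents of your HDFS home directory
hadoop dfs -ls /user/bob
```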
To display the content of a file on HDFS use this command:
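For example, to print the book copied earlier (this can produce a lot of output for large files):

```shell
# Print the contents of a file stored on HDFS
hadoop dfs -cat /user/bob/sampleBook.txt
```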
After copying the book(s) to HDFS, execute the wordcount example:
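The wordcount program ships with Hadoop in an examples jar. A sketch of the invocation follows; the jar name and location (hadoop-examples.jar under /usr/lib/hadoop) are assumptions that vary between distributions, and note that the output directory must not exist before the job runs:

```shell
# Run the wordcount example; the jar path is an assumption and
# differs between Hadoop distributions
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount \
    /user/bob/sampleBook.txt /user/bob/books-output
```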
You will see the job running now; the results of the run should be in /user/bob/books-output in our case. To download the results file, execute:
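A sketch of downloading the results, assuming the output path /user/bob/books-output from above; the exact names of the part files depend on the Hadoop version:

```shell
# Download the job output from HDFS to the local file system
hadoop dfs -copyToLocal /user/bob/books-output /home/bob/books-output
# The aggregated word counts are in the part-* file(s)
head /home/bob/books-output/part-*
```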
Hadoop includes a useful set of management web interfaces that allow you to monitor jobs, check log files, browse the file system, and more. Let's take a moment to examine two of them.
The JobTracker page shows information about the cluster, running map and reduce tasks, node status, and running and completed jobs; you can also drill down into the status of specific nodes and tasks. The monitoring site can be found at http://localhost:50030/jobtracker.jsp
The NameNode page allows you to browse the HDFS file system, view nodes, look at space consumption, and check logs. The DFSHealth site can be found at http://localhost:50070/dfshealth.jsp
The File System Shell Guide provides a full reference of HDFS shell commands:
The Commands Guide provides a full list of Hadoop commands required to execute and monitor the system as well as individual jobs:
Using the Hadoop guidelines, you can find out many details of your Hadoop cluster. Try to accomplish the following tasks:
What is the configured capacity of your HDFS, how many data nodes are connected, and how many blocks have corrupted replicas?
Execute the MapReduce Grep Example program against the previously used input files. The Grep program is documented here: http://wiki.apache.org/hadoop/Grep
Start a Hadoop job from the command-line and, in another terminal window, monitor the default job queue and obtain the job identifier of your job.