Hands-on Exercise 4: Running WordCount on a Cluster

Content

  1. Run the Hadoop Distributed File System (HDFS) and the Map-Reduce daemon (JobTracker)
  2. Execute Hadoop-WordCount
  3. Monitoring Hadoop
  4. Check the result
  5. Finish the Map-Reduce process

1. Run the Hadoop Distributed File System (HDFS) and the Map-Reduce daemon (JobTracker)

Under the Hadoop framework directory (~/hadoop-0.20.2), type these commands to format the HDFS and start the HDFS daemons (the NameNode and DataNodes):

cd ~/hadoop-0.20.2/bin
./hadoop namenode -format

./start-dfs.sh
# check the logs for any errors

cat ../logs/hadoop-<username>-namenode-<nodeid>.log

NOTE: If you encounter the "java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion" error, run "rm -rf /tmp/*" and repeat the steps from the HDFS configuration onwards.

The start-up time of each node may differ, so please use a web browser (IE, Firefox, Safari, etc.) or a text-based browser (lynx) to check the HeadNode status. By default, the HDFS can be monitored on port 50070. For FutureGrid machines, the public hostname will be sXr.idp.sdsc.futuregrid.org or iXr.idp.iu.futuregrid.org, where X is your node number.

http://<public_ip_OR_public_hostname>:50070/
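
If a browser is not handy, you can also check the DataNode status from the command line. The dfsadmin report lists each DataNode along with its capacity and state:

cd ~/hadoop-0.20.2/bin
./hadoop dfsadmin -report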

When the file system is operational, try a few HDFS commands. For example:

cd ~/hadoop-0.20.2/bin

./hadoop fs -ls /
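
A few more commands worth trying; the directory name "test" here is just an example:

./hadoop fs -mkdir test
./hadoop fs -ls /
./hadoop fs -rmr test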

Remember, all DataNodes must be in ready status before you start the MapReduce framework. The next step is to start the Map-Reduce daemon:

cd ~/hadoop-0.20.2/bin
./start-mapred.sh
# check the logs for any errors

cat ../logs/hadoop-<username>-tasktracker-<nodeid>.log

Again, please make sure all the TaskTrackers are running by checking the master node on port 50030:

http://<public_ip_OR_public_hostname>:50030/
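
As a command-line alternative, you can query the JobTracker directly; even an empty job list confirms the daemon is answering:

cd ~/hadoop-0.20.2/bin
./hadoop job -list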

2. Execute Hadoop-WordCount

First, we need to upload the input files (any text-format files) into the Hadoop distributed file system:

cd ~/hadoop-0.20.2/bin

./hadoop fs -put ~/Hadoop-WordCount/input/ input
./hadoop fs -ls input

Here, "~/Hadoop-WordCount/input/" is the local directory where the inputs are stored. The second "input" represents the remote destination directory on the HDFS.


After uploading the inputs into HDFS, run the WordCount program from the previous exercise with the following commands. We assume you have already compiled the WordCount program in the earlier standalone Hadoop exercise.

cd ~/hadoop-0.20.2/bin
./hadoop jar wordcount.jar WordCount input output
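
If you still need to build the jar, the steps below are a sketch based on the standard Hadoop 0.20 WordCount build; they assume the source file lives at ~/Hadoop-WordCount/WordCount.java, as in the standalone exercise:

cd ~/Hadoop-WordCount
mkdir -p wordcount_classes
# compile against the core jar shipped with the 0.20.2 distribution
javac -classpath ~/hadoop-0.20.2/hadoop-0.20.2-core.jar -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes .
# copy the jar next to the hadoop script so the command above can find it
cp wordcount.jar ~/hadoop-0.20.2/bin/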

If Hadoop is running correctly, it will print progress messages similar to the following:

10/07/15 15:40:09 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
10/07/15 15:40:09 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/07/15 15:40:10 INFO mapred.FileInputFormat: Total input paths to process : 1
10/07/15 15:40:10 INFO mapred.JobClient: Running job: job_local_0001
10/07/15 15:40:10 INFO mapred.FileInputFormat: Total input paths to process : 1
10/07/15 15:40:10 INFO mapred.MapTask: numReduceTasks: 1
10/07/15 15:40:10 INFO mapred.MapTask: io.sort.mb = 100

3. Monitoring Hadoop

We can also monitor the job status using the browser based Hadoop monitoring console.

http://<public_ip_OR_public_hostname>:50030/

4. Check the result

After the job finishes, use these commands to check the output:

cd ~/hadoop-0.20.2/bin
./hadoop fs -ls output
./hadoop fs -cat output/*

Here, "output" is the remote directory where the result stored.

Then, the result will look like this :

you."   15
you; 1
you? 2
you?" 23
young 42
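
If you want to keep a local copy of the result, you can pull it out of HDFS; the local destination path here is just an example:

cd ~/hadoop-0.20.2/bin
./hadoop fs -get output ~/wordcount-output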

5. Finish the Map-Reduce process

After the job finishes, use this command to stop the HDFS and Map-Reduce daemons:

cd ~/hadoop-0.20.2/bin
./stop-all.sh
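
To verify that the daemons are gone, you can list the running Java processes with jps (a standard JDK tool); after a clean shutdown only Jps itself should remain:

jps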
