Hands-on Exercise 2: Running WordCount in Standalone Hadoop

Content

  1. Execute Standalone Hadoop-WordCount
  2. Adding a Combiner

1. Execute Standalone Hadoop-WordCount

The simplest way to get started is to run WordCount in standalone mode. First, a standalone version of Hadoop (which requires no environment configuration) is needed; then download the program and unzip it to a location of your choice, e.g. "~/Hadoop-WordCount". If you are using a FutureGrid machine, all of the packages have already been installed in your home directory.

Assume that:

  * the standalone Hadoop distribution is installed at ~/hadoop-0.20.2-standalone
  * the WordCount program and its input are at ~/Hadoop-WordCount

Next, enter the following commands to run the program:

cd ~/hadoop-0.20.2-standalone/bin
cp ~/Hadoop-WordCount/wordcount.jar ~/hadoop-0.20.2-standalone/bin
# make sure wordcount.jar is now in ~/hadoop-0.20.2-standalone/bin
./hadoop jar wordcount.jar WordCount ~/Hadoop-WordCount/input ~/Hadoop-WordCount/output
cd ~/Hadoop-WordCount/
cat output/part-r-00000
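
Each line of the output file contains a word and its count, separated by a tab. For example (illustrative only; the actual words and counts depend on your input files):

Hadoop	2
hello	3
world	1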

2. Adding a Combiner

A combiner is a reduce-type function that runs on the map side. Instead of writing each map key-value pair directly to the intermediate output, Hadoop collects the pairs into lists, and the combiner performs local aggregation on them. This cuts down the amount of data transferred from the Mapper to the Reducer and thereby speeds up the job.

public static void main(String[] args) throws Exception {
    // ....
    // uncomment this line
    job.setCombinerClass(Reduce.class);
    // ....
}
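
Reusing the Reduce class as the combiner works here because summing word counts is associative and commutative, so partial sums computed on the map side are safe. For reference, here is a minimal sketch of what such a reducer looks like in the Hadoop 0.20 mapreduce API (names follow the common WordCount example; your WordCount.java may differ in detail):

// in WordCount.java (imports shown for completeness)
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the partial counts for this word. Because addition is
        // associative and commutative, the same code can safely run
        // as a combiner on the map side before the final reduce.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}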
To enable the combiner, clean the previous build, edit WordCount.java, and rebuild:

cd ~/Hadoop-WordCount/
./clean.sh
ls
# you should see only build.sh, clean.sh, input, and WordCount.java

vi WordCount.java # uncomment the line shown above (line number 67) and save (press ESC, then :wq!)
./build.sh

cd ~/hadoop-0.20.2-standalone/bin
cp ~/Hadoop-WordCount/wordcount.jar ~/hadoop-0.20.2-standalone/bin

# Run the WordCount Program
./hadoop jar wordcount.jar WordCount ~/Hadoop-WordCount/input ~/Hadoop-WordCount/output-combiner

# Check the result
cd ~/Hadoop-WordCount/
cat output-combiner/part-r-00000
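
The combiner changes where aggregation happens, not the final counts, so the two runs should produce identical results. One way to check, assuming both output directories from the runs above still exist:

diff output/part-r-00000 output-combiner/part-r-00000
# no output means the two result files are identical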

The complete source code and executables can be downloaded here.

Prev: Exercise 1: How to Write a Hadoop-WordCount

Next: Exercise 3: Setting up an Apache Hadoop Cluster