Hands-on Exercise 1: How to Write Hadoop WordCount

Content

  1. Overview
  2. Write a Mapper
  3. Write a Reducer
  4. Set Up the Main Program
  5. Compile the program
  6. Download source code

1. Overview

WordCount is a simple program which counts the number of occurrences of each word in a given text input data set. WordCount fits very well with the MapReduce programming model, making it a great example for understanding the Hadoop MapReduce programming style. Our implementation consists of three main parts: a Mapper, a Reducer, and a main program that configures and runs the job.

You can also download the complete source code and executable here.

2. Write a Mapper

A Mapper overrides the map function of the class org.apache.hadoop.mapreduce.Mapper, which receives <key, value> pairs as input. A Mapper implementation emits output <key, value> pairs using the provided Context.

The input value of the WordCount map task is a line of text from the input data file; the key is the byte offset at which that line starts, i.e. <byte_offset, line_of_text>. For each word in the line of text, the map task outputs the pair <word, one>.

Pseudo-code

void map(key, value) {
    for each word x in value:
        output.collect(x, 1);
}

Detailed implementation

public static class Map
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1); // reusable output value
  private Text word = new Text();                            // reusable output key

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString()); // split line into tokens
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());  // set the current word as the output key
      context.write(word, one);   // emit the pair <word, 1>
    }
  }
}
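
For example, if the mapper receives the line "to be or not to be", it emits one pair per token:

<to, 1>, <be, 1>, <or, 1>, <not, 1>, <to, 1>, <be, 1>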

3. Write a Reducer

A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the WordCount program sums up the occurrences of each word and outputs pairs of the form <word, occurrence>.

Pseudo-code

void reduce(keyword, <list of values>) {
    sum = 0;
    for each x in <list of values>:
        sum += x;
    final_output.collect(keyword, sum);
}

Detailed implementation

public static class Reduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable(); // reusable output value

  // implement the reduce function
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0; // initialize the sum for each keyword
    for (IntWritable val : values) {
      sum += val.get(); // add up the counts emitted for this word
    }
    result.set(sum);
    context.write(key, result); // emit the pair <word, number of occurrences>
  }
}
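
Continuing this example, the framework groups the intermediate pairs by key before the reduce phase, so the reducer receives <to, [1, 1]>, <be, [1, 1]>, <or, [1]>, <not, [1]> and emits <to, 2>, <be, 2>, <or, 1>, <not, 1>.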

4. Set Up the Main Program


The main program configures and runs the MapReduce job. We use it to perform the basic configuration: the job name, the Mapper and Reducer classes, the output key and value types, and the HDFS paths of the input and output data.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // parse the command-line arguments
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    // create a job with the name "word count"
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // set the HDFS paths of the input data and the output directory
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // wait until the job completes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
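
For reference, the snippets above use Hadoop's newer org.apache.hadoop.mapreduce API; the complete WordCount.java resolves with the following imports:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;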

5. Compile the program

Enter the following commands in a bash shell, from the directory where WordCount.java is located:

cd ~/Hadoop-WordCount
cat WordCount.java
cat build.sh
./build.sh
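
The exact contents of build.sh come with the download; as a rough sketch, an equivalent script (assuming a Hadoop version that provides the hadoop classpath subcommand) might look like:

#!/bin/bash
# compile WordCount.java against the Hadoop libraries and package it into a jar
mkdir -p build
javac -classpath "$(hadoop classpath)" -d build WordCount.java
jar -cvf wordcount.jar -C build/ .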

Make sure that the JAVA_HOME path is set correctly in both conf/hadoop-env.sh and .bashrc.
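
For example, you might add a line like the following to .bashrc (the path shown is hypothetical; use the JDK location on your own system):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64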

6. Download source code

You can download the complete source code and executable here.

Next: Exercise 2: Running WordCount in a Standalone Hadoop