Pig Word Count Tutorial

SALSA Group
PTI Indiana University
June 29th 2012

Contents

1. Introduction

2. Prerequisite

3. Running Pig

4. Pig Word Count

1. Introduction

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. This tutorial shows you how to run Pig scripts in local mode and MapReduce mode.

2. Prerequisite

Pig works on Linux systems, and you need the Java, Hadoop, and Pig packages to run Pig scripts. Python and JavaScript are optional components needed only for Pig's advanced features, such as writing user-defined functions (UDFs).

Mandatory:

1) Java 1.6 - http://java.sun.com/javase/downloads/index.jsp
2) Hadoop 0.20.2 - http://hadoop.apache.org/common/releases.html
3) Pig 0.8.1, 0.9.0, or 0.10.0 - http://pig.apache.org/releases.html#Download

Optional:

1) Python 2.5 - http://jython.org/downloads.html
2) JavaScript 1.7 - https://developer.mozilla.org/en/Rhino_downloads_archive

Configure:

1) Add /pig-0.10.0/bin to your PATH. Use export (bash, sh) or setenv (csh).

For example: $ export PATH=/<my-path-to-pig>/pig-0.10.0/bin:$PATH 

2) Test the Pig installation with this simple command:

$ pig -help
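Putting the configuration steps together, a minimal bash setup might look like the sketch below. The Java location is a placeholder for your own installation; Pig's launcher expects JAVA_HOME to point to the root of your Java installation.

    # placeholder locations; replace with your own Java and Pig install paths
    $ export JAVA_HOME=/usr/lib/jvm/java-6-sun
    $ export PATH=/<my-path-to-pig>/pig-0.10.0/bin:$PATH

    # quick checks: print usage, then open the Grunt shell in local mode (type 'quit' to exit)
    $ pig -help
    $ pig -x local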

3. Running Pig

You can run Pig (execute Pig Latin statements and Pig commands) using various modes.

                     Local Mode    MapReduce Mode
   Interactive Mode  Yes           Yes
   Batch Mode        Yes           Yes
  • Pig has two execution modes:
  • Local Mode – To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag.

    Sample: (pig -x local)

  • MapReduce Mode – To run Pig in mapreduce mode, you need access to a Hadoop cluster and an HDFS installation. Specify mapreduce mode using the -x flag.

    Sample: (pig -x mapreduce)

  • Pig also has two invocation modes:
  • Interactive Mode - You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell with the "pig" command (as shown below), then enter your Pig Latin statements and Pig commands interactively at the command line.

    Sample:

    grunt> messages = load 'lines' using PigStorage(':');
    grunt> outputs = foreach messages generate $0 as ID;
    grunt> dump outputs;

  • Batch Mode - You can run Pig in batch mode using Pig scripts and the "pig" command (in local or mapreduce mode).

    Sample:

    $ pig -x local test.pig
    $ pig -x mapreduce test.pig
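As a concrete illustration, the statements from the interactive example above can be collected into a script file and run in batch mode. Below is a minimal sketch of such a test.pig; the input file 'lines' (any colon-delimited text file in the directory where you launch Pig) and the output directory './test-output' are placeholders chosen for this example.

    -- test.pig: batch version of the interactive example above
    -- assumes a colon-delimited text file named 'lines' in the working directory
    messages = load 'lines' using PigStorage(':');
    outputs = foreach messages generate $0 as ID;
    -- in batch mode, store the result instead of dumping it to the console
    store outputs into './test-output';

Running "$ pig -x local test.pig" then writes the first field of each input line into the ./test-output directory.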

4. Pig Word Count

We show an example of the classic word count application using Pig Latin.

1. Pig Word Count Download Package

2. Pig Word Count Scripts

   A = load './input.txt';
   B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
   C = group B by word;
   D = foreach C generate COUNT(B), group;
   store D into './wordcount';
3. Running Pig Word Count Script

   local mode:

   bin/pig -x local wordcount.pig

   mapreduce mode (copy the input file into HDFS first, so that the script's './input.txt' path resolves):

   hadoop dfs -copyFromLocal input.txt ./input.txt
   bin/pig -x mapreduce wordcount.pig
4. Programming Word Count using Pig

   Here are the major steps used to develop the Pig word count application.

   LOAD loads data from the file system.

   LOAD 'data' [USING function] [AS schema];

   Sample:
   records = load 'student.txt' as (name:chararray, age:int, gpa:double);

   FOREACH generates data transformations based on columns of data.

   alias = FOREACH { gen_blk | nested_gen_blk } [AS schema];

   Sample:
   words = foreach lines generate flatten(TOKENIZE((chararray)$0)) as word;

   Sometimes we want to eliminate nesting. This can be accomplished via the FLATTEN keyword, as in the sample above, which turns the bag of tokens produced by TOKENIZE into one tuple per word.

   The GROUP operator groups together tuples that have the same group key (key field).

   alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression …] [USING 'collected'] [PARALLEL n];

   Sample:
   word_groups = group words by word;

   The COUNT function computes the number of elements in a bag.

   COUNT(expression)

   Sample:
   D = foreach C generate COUNT(B), group;

   Together, these steps generate parallel executable tasks that can be distributed across multiple machines in a Hadoop cluster to count the number of words in a text file.
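To make the data flow concrete, here is a small illustrative run in local mode. The contents of input.txt and the resulting counts are made-up examples, and the exact part-* file names in the output directory depend on your Hadoop version.

    $ cat input.txt                    # hypothetical three-word input
    hello world hello
    $ bin/pig -x local wordcount.pig   # B: one tuple per word; C: groups by word; D: (count, word) pairs
    $ cat ./wordcount/part*            # tab-separated output; row order is not guaranteed
    2       hello
    1       world

In mapreduce mode the ./wordcount directory is created in HDFS instead, and its contents can be read with "hadoop dfs -cat wordcount/part*".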

References:

http://pig.apache.org/docs/r0.10.0/start.html
http://pig.apache.org/docs/r0.7.0/tutorial.html
http://hortonworks.com/