Hands-on Exercise 2: Setting up an Apache Hadoop Cluster

In this section, we will deploy a simple Hadoop cluster environment. Here, we are going to set up a fully distributed Hadoop installation on a Linux shared file system. Hadoop can also be deployed on an ordinary, independent file system on each node, but in that case the changes must be pushed to every node whenever the package is modified (as sketched below).
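
For example, without a shared file system the Hadoop directory would have to be re-synchronized to every node after each change. A minimal sketch of such a push, assuming a hypothetical node list and installation path (not part of this exercise), might look like this:

# Sketch only: push the Hadoop directory to each worker after a change.
# "node1 node2 node3" and /opt/hadoop-0.20.2 are example names, not part of this exercise.
for node in node1 node2 node3; do
   rsync -a /opt/hadoop-0.20.2/ $node:/opt/hadoop-0.20.2/
done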

Please download Hadoop 0.20.2 here.

Contents

  1. Create unique job directories for HDFS and MapReduce Daemon
  2. Configuring HDFS in a cluster
  3. Configuring Hadoop MapReduce Daemon in a cluster

1. Create unique job directories for HDFS and MapReduce Daemon

First, create the following job directories in a local disk partition to store HDFS and MapReduce local data. If a "hadoop-test" directory already exists, create a directory with a different name (e.g. hadoop-test1) and substitute that name for "hadoop-test" in all the following commands.

cd /tmp
mkdir hadoop-test
cd hadoop-test
mkdir data local name
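
Equivalently, the three directories can be created and checked in one step; this is just a convenience, and the commands above give the same result:

# Same result as the commands above, in one line
mkdir -p /tmp/hadoop-test/data /tmp/hadoop-test/local /tmp/hadoop-test/name
# Verify that the three directories exist
ls /tmp/hadoop-test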

2. Configuring HDFS in a cluster

Then, we need to change the following configuration files under the Hadoop home directory (in the conf/ directory) on the MasterNode:

If JAVA_HOME has not been set in conf/hadoop-env.sh, please add the installed Java path to it:

# export JAVA_HOME=${Your JAVA HOME PATH}
export JAVA_HOME=/etc/java/jdk1.6.0_12

Note for Mac OS users

The JAVA_HOME setting is different on Mac OS; please use the following line:

# export JAVA_HOME=${Your JAVA HOME PATH}
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home
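
If you are not sure where Java is installed, the following commands can help locate it (the paths they print will differ from the examples above):

# Check which Java is on the PATH and its version
which java
java -version
# On Mac OS, this prints the Java home directory
/usr/libexec/java_home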

In conf/masters, place the IP of the MasterNode/NameNode on a single line. In this example, we use 149.79.89.113. (localhost will work for the current hands-on as our Hadoop cluster has only a single node.)

149.79.89.113

In conf/slaves, add the IPs of all Workers/Slaves, one per line. In this example, we use 149.79.89.113, 149.79.89.114, 149.79.89.115, and 149.79.89.116. (localhost will work for the current hands-on as our Hadoop cluster has only a single node.)

149.79.89.113
149.79.89.114
149.79.89.115
149.79.89.116

Within conf/core-site.xml, we need to override the name of the default file system, "fs.default.name". A URI or IP address with a port number is needed (localhost will work for the current hands-on as our Hadoop cluster has only a single node).

Between <configuration> and </configuration>, add:

<property>
   <name>fs.default.name</name>
   <!-- URL of MasterNode/NameNode, e.g. hdfs://localhost:9000/-->
   <value>hdfs://149.79.89.113:9000/</value>
</property>
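
For reference, a complete conf/core-site.xml assembled from the fragment above would look roughly like this; the same <configuration> wrapper applies to the other *-site.xml files edited below:

<?xml version="1.0"?>
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://149.79.89.113:9000/</value>
   </property>
</configuration>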

Within conf/hdfs-site.xml, we need to set where on the local file system the DFS NameNode should store the name table (fsimage) and where the DFS DataNodes should store the blocks (the actual data).

Between <configuration> and </configuration>, add:

<property>
   <name>dfs.name.dir</name>
   <!-- Path to store namespace and transaction logs, e.g. /tmp/hadoop-test/name-->
   <value>/tmp/hadoop-test/name</value>
</property>
<property>
   <name>dfs.data.dir</name>
   <!-- Path to store the data blocks, e.g. /tmp/hadoop-test/data-->
   <value>/tmp/hadoop-test/data</value>
</property>
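
The name and data paths above are the directories created in step 1; a quick check that they exist on the node before starting HDFS might look like:

# These should list the directories created in step 1
ls -ld /tmp/hadoop-test/name
ls -ld /tmp/hadoop-test/data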

3. Configuring Hadoop MapReduce Daemon in a cluster

In addition, we need to modify conf/mapred-site.xml to set the host and port of the JobTracker:

Between <configuration> and </configuration>, add:

<property>
   <name>mapred.job.tracker</name>
   <!-- IP/Hostname:Port for Hadoop JobTracker, e.g. localhost:9001 -->
   <value>149.79.89.113:9001</value>
</property>
<property>
   <name>mapred.local.dir</name>
   <!-- local directory where MapReduce stores intermediate data -->
   <value>/tmp/hadoop-test/local</value>
</property>
<property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <!-- maximum map tasks per node; set it to the number of CPU cores -->
   <value>8</value>
</property>
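
To pick a value for mapred.tasktracker.map.tasks.maximum, the number of CPU cores on a node can be checked with standard commands (shown for Linux and Mac OS):

# Number of CPU cores on Linux
grep -c ^processor /proc/cpuinfo
# Number of CPU cores on Mac OS
sysctl -n hw.ncpu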

Prev: Exercise 1: Blast installation

Next: Exercise 3: Running Hadoop-Blast in Distributed Hadoop