Hands-on Exercise 3: Running Hadoop-Blast in Distributed Hadoop

Contents

  1. Run Hadoop Distributed File System (HDFS) and Map-Reduce daemon (JobTracker)
  2. Prepare for Hadoop-Blast
  3. Execute Hadoop-Blast
  4. Monitor Hadoop during a running job
  5. Finish the Map-Reduce process

1. Run Hadoop Distributed File System (HDFS) and Map-Reduce daemon (JobTracker)

Under the Hadoop framework directory (on the FutureGrid machines, it is "~/hadoop-0.20.2/bin"), type these commands to format HDFS and start the HDFS daemons (NameNode and DataNodes):

cd ~/hadoop-0.20.2/bin
./hadoop namenode -format
./start-dfs.sh

The start-up time of each node may differ, so please verify the cluster status with a web browser (IE, Firefox, Safari, etc.), a text-based browser such as lynx, or by checking the head node logs. By default, HDFS can be monitored on port 50070. On FutureGrid machines, the public hostname will be sXr.idp.sdsc.futuregrid.org or iXr.idp.iu.futuregrid.org, where X is your node number.

http://<public_ip_OR_public_hostname>:50070/ 

OR

# check the namenode logs for any errors
cd ~/hadoop-0.20.2/bin

cat ../logs/hadoop-<username>-namenode-<nodeid>.log
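
Alternatively, the same information is available from the command line through the dfsadmin report subcommand, which prints capacity figures and the list of live DataNodes:

cd ~/hadoop-0.20.2/bin
# summarizes HDFS capacity and lists live/dead DataNodes
./hadoop dfsadmin -report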

Remember, all DataNodes must reach the ready state; otherwise, overall performance will suffer. The next step is to start the Map-Reduce daemons:

cd ~/hadoop-0.20.2/bin
./start-mapred.sh

Again, please make sure all the TaskTrackers are ready to serve by checking the master node on port 50030:

http://<public_ip_OR_public_hostname>:50030/

OR

# check the tasktracker logs for any errors
cd ~/hadoop-0.20.2/bin

cat ../logs/hadoop-<username>-tasktracker-<nodeid>.log
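
If only a terminal is available, the JobTracker status page can also be dumped as plain text with lynx (assuming lynx is installed and the command is run on the master node):

# renders the JobTracker web page to the terminal
lynx -dump http://localhost:50030/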

2. Prepare for Hadoop-Blast

Once the HDFS and Map-Reduce daemons are running, the Hadoop-Blast program is ready to use. The program is already stored in $HADOOP_HOME/apps/Hadoop-Blast, or it can be downloaded here and unzipped to another location of your choice.

First, we need to deploy the input files and the Blast program and database archive (BlastProgramAndDB.tar.gz) into the distributed file system. Here, $BLAST_HOME must be set in your .bashrc. On the FutureGrid machines, the Blast archive is located in the /usr/local/Blast ($BLAST_HOME) directory:

cd ~/hadoop-0.20.2/bin
./hadoop fs -put ~/hadoop-0.20.2/apps/Hadoop-Blast/input HDFS_blast_input
./hadoop fs -ls HDFS_blast_input
./hadoop fs -copyFromLocal $BLAST_HOME/BlastProgramAndDB.tar.gz BlastProgramAndDB.tar.gz
./hadoop fs -ls BlastProgramAndDB.tar.gz
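
If the -copyFromLocal step cannot find the archive, the usual cause is an unset $BLAST_HOME; a quick sanity check (using the FutureGrid path mentioned above):

# should print /usr/local/Blast on the FutureGrid machines
echo $BLAST_HOME
# confirms the archive exists locally before copying it into HDFS
ls -lh $BLAST_HOME/BlastProgramAndDB.tar.gz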

3. Execute Hadoop-Blast

After deploying the required inputs into HDFS, run the Hadoop-Blast program with the following commands:

cd ~/hadoop-0.20.2/bin
./hadoop jar ~/hadoop-0.20.2/apps/Hadoop-Blast/executable/blast-hadoop.jar \
    BlastProgramAndDB.tar.gz bin/blastx /tmp/hadoop-test/ db nr \
    HDFS_blast_input HDFS_blast_output \
    '-query #_INPUTFILE_# -outfmt 6 -seg no -out #_OUTPUTFILE_#'

Here is a description of the above command:

$HADOOP_HOME/bin/hadoop jar Executable BlastProgramAndDB_on_HDFS bin/blastx Local_Work_DIR db nr HDFS_Input_DIR Unique_HDFS_Output_DIR '-query #_INPUTFILE_# -outfmt 6 -seg no -out #_OUTPUTFILE_#'

Parameter                  Description
Executable                 Full path of the Hadoop-Blast jar program, e.g. $HADOOP_HOME/apps/Hadoop-Blast/executable/blast-hadoop.jar
BlastProgramAndDB_on_HDFS  Name of the Blast program and database archive on HDFS, e.g. BlastProgramAndDB.tar.gz
bin/blastx                 Path of the Blast binary inside the extracted archive
Local_Work_DIR             Local directory for storing the temporary output of the Blast program, e.g. /tmp/hadoop-test/
db nr                      Directory and name of the Blast database inside the extracted archive (here the NCBI nr database)
HDFS_Input_DIR             HDFS directory where the input files are stored, e.g. HDFS_blast_input
Unique_HDFS_Output_DIR     An HDFS directory that does not yet exist, used for storing the output files, e.g. HDFS_blast_output (Hadoop will not write into an existing output directory)
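
The quoted string at the end is the Blast command-line template: for each input file, Hadoop-Blast substitutes #_INPUTFILE_# and #_OUTPUTFILE_# with per-task local paths under Local_Work_DIR before invoking blastx. As an illustration only (the actual temporary paths are generated by Hadoop at run time), a map task handling pre_1.fa would execute something equivalent to:

# illustrative expansion of the command template for one input file
bin/blastx -query /tmp/hadoop-test/pre_1.fa -outfmt 6 -seg no -out /tmp/hadoop-test/pre_1.fa.out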

If Hadoop is running correctly, it will print job progress messages similar to the following:

10/07/15 15:40:09 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
10/07/15 15:40:09 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/07/15 15:40:10 INFO mapred.FileInputFormat: Total input paths to process : 1
10/07/15 15:40:10 INFO mapred.JobClient: Running job: job_local_0001
10/07/15 15:40:10 INFO mapred.FileInputFormat: Total input paths to process : 1
10/07/15 15:40:10 INFO mapred.MapTask: numReduceTasks: 1
10/07/15 15:40:10 INFO mapred.MapTask: io.sort.mb = 100

4. Monitor Hadoop during a running job

While the job is running, or after it has finished, we can monitor its details with a web browser. The default JobTracker port is 50030.

http://<public_ip_OR_public_hostname>:50030/

In addition, all the outputs will be stored in the HDFS output directory (e.g. HDFS_blast_output):

cd ~/hadoop-0.20.2/bin
./hadoop fs -ls HDFS_blast_output
./hadoop fs -cat HDFS_blast_output/pre_1.fa

BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297695302|ref|XP_002824885.1| 100.00 11 0 0 3 35 12 22 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297677746|ref|XP_002816750.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297677738|ref|XP_002816709.1| 100.00 11 0 0 3 35 18 28 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297677736|ref|XP_002816708.1| 100.00 11 0 0 3 35 13 23 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297662912|ref|XP_002809930.1| 100.00 11 0 0 3 35 13 23 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297290467|ref|XP_002803717.1| 100.00 11 0 0 3 35 29 39 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297269450|ref|XP_002799874.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297269448|ref|XP_002799873.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297269446|ref|XP_002799872.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297269444|ref|XP_002799871.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297269442|ref|XP_002799870.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297269440|ref|XP_002799869.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|296409582|gb|ADH15624.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|296482166|gb|DAA24281.1| 100.00 11 0 0 3 35 36 46 7.0 27.7
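
To collect the per-file results into a single local file, the fs -getmerge subcommand can be used (~/blast_results.txt below is an arbitrary destination):

cd ~/hadoop-0.20.2/bin
# concatenates every file under the HDFS output directory into one local file
./hadoop fs -getmerge HDFS_blast_output ~/blast_results.txt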

5. Finish the Map-Reduce process

After the job finishes, please use the following commands to stop the HDFS and Map-Reduce daemons:

cd ~/hadoop-0.20.2/bin
./stop-all.sh
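
To confirm that all daemons have exited, the JDK's jps tool can be run on each node (assuming a full JDK is installed):

# NameNode, DataNode, JobTracker and TaskTracker should no longer be listed
jps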

Prev: Exercise 2: Setting up an Apache Hadoop Cluster for Hadoop-Blast

Next: Exercise 4: Programming the Hadoop-Blast