Hands-on Exercise 3: Running Hadoop-Blast in Distributed Hadoop
Content
- Run Hadoop Distributed File System (HDFS) and Map-Reduce daemon (JobTracker)
- Prepare for Hadoop-Blast
- Execute Hadoop-Blast
- Monitoring Hadoop during a running Job
- Finish the Map-Reduce process
1. Run Hadoop Distributed File System (HDFS) and Map-Reduce daemon (JobTracker)
From the Hadoop bin directory (on a FutureGrid machine, "~/hadoop-0.20.2/bin"), run these commands to format HDFS and start the HDFS daemons (NameNode and DataNodes):
cd ~/hadoop-0.20.2/bin
./hadoop namenode -format
./start-dfs.sh
Nodes may take different amounts of time to start, so confirm that HDFS is up before continuing: use a web browser (IE, Firefox, Safari, etc.), a text-mode Linux browser (lynx), or check the head node logs. By default, HDFS can be monitored on port 50070. On a FutureGrid machine, the public hostname is sXr.idp.sdsc.futuregrid.org:50070 or iXr.idp.iu.futuregrid.org:50070, where X is your node number.
http://<public_ip_OR_public_hostname>:50070/
OR
# check the namenode logs for any errors
cd ~/hadoop-0.20.2/bin
cat ../logs/hadoop-<username>-namenode-<nodeid>.log
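Reading a whole log with cat can be tedious on a busy node. As a minimal sketch, the hypothetical helper below (it is not part of Hadoop) prints only the lines logged at ERROR or FATAL level. It assumes the default log layout, in which the level appears as a separate word on each line.

```shell
# scan_log_errors is a hypothetical helper, not a Hadoop command.
# It prints the lines of a log file whose level is ERROR or FATAL and,
# like grep, returns non-zero when no such line exists.
scan_log_errors() {
    grep -E ' (ERROR|FATAL) ' "$1"
}

# Example, using the same placeholders as the command above:
# scan_log_errors ../logs/hadoop-<username>-namenode-<nodeid>.log
```

An empty result only means that no ERROR or FATAL line was logged; it does not prove that every DataNode has joined the cluster.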
Remember that all DataNodes must be in the ready state; otherwise, overall performance will suffer. The next step is to start the Map-Reduce daemons:
cd ~/hadoop-0.20.2/bin
./start-mapred.sh
Again, make sure all the TaskTrackers are ready to serve by checking the master node on port 50030:
http://<public_ip_OR_public_hostname>:50030/
OR
# check the tasktracker logs for any errors
cd ~/hadoop-0.20.2/bin
cat ../logs/hadoop-<username>-tasktracker-<nodeid>.log
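Because the daemons start at different speeds, it can help to poll a web interface until it answers instead of reloading the page by hand. The sketch below defines a hypothetical `wait_until` helper (not a Hadoop command) that retries any command a fixed number of times. The curl call in the closing comment assumes that curl is installed on the node.

```shell
# wait_until TRIES DELAY COMMAND...
# Hypothetical helper: runs COMMAND until it succeeds, at most TRIES
# times, sleeping DELAY seconds between attempts.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$attempt" -lt "$tries" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Example: try the JobTracker page up to 30 times, 2 seconds apart.
# wait_until 30 2 curl -sf "http://<public_ip_OR_public_hostname>:50030/"
```

A responding page only shows that the daemon's web interface is up; the number of nodes listed on that page still needs to be checked by eye.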
2. Prepare for Hadoop-Blast
Once the HDFS and Map-Reduce daemons are ready, the Hadoop-Blast program can be run. The program is already stored in $HADOOP_HOME/apps/Hadoop-Blast, or it can be downloaded and unzipped to another location of your choice.
First, deploy the input files and the Blast program and database archive (BlastProgramAndDB.tar.gz) into the distributed file system. $BLAST_HOME must be set in .bashrc. On a FutureGrid machine, the Blast archive is located in the /usr/local/Blast ($BLAST_HOME) directory:
cd ~/hadoop-0.20.2/bin
./hadoop fs -put ~/hadoop-0.20.2/apps/Hadoop-Blast/input HDFS_blast_input
./hadoop fs -ls HDFS_blast_input
./hadoop fs -copyFromLocal $BLAST_HOME/BlastProgramAndDB.tar.gz BlastProgramAndDB.tar.gz
./hadoop fs -ls BlastProgramAndDB.tar.gz
- Line 2 pushes all the Blast input files (FASTA-formatted queries) from local disk into the HDFS directory "HDFS_blast_input".
- Line 3 lists the pushed files in the HDFS directory "HDFS_blast_input".
- Line 4 copies the Blast program and database archive (BlastProgramAndDB.tar.gz) from $BLAST_HOME to HDFS, where it will later be used as a distributed cache.
- Line 5 double-checks that the Blast program and database archive "BlastProgramAndDB.tar.gz" is on HDFS.
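Since the copy in Line 4 depends on $BLAST_HOME being set, a quick check before running the commands can save a failed upload. The following is a sketch only: `check_blast_archive` is a hypothetical helper name, and the archive name is taken from the commands above.

```shell
# check_blast_archive is a hypothetical helper, not part of Hadoop-Blast.
# It succeeds only if BLAST_HOME is set and the archive exists there.
check_blast_archive() {
    if [ -z "${BLAST_HOME:-}" ]; then
        echo "BLAST_HOME is not set; add it to .bashrc" >&2
        return 1
    fi
    if [ ! -f "$BLAST_HOME/BlastProgramAndDB.tar.gz" ]; then
        echo "BlastProgramAndDB.tar.gz not found in $BLAST_HOME" >&2
        return 1
    fi
    echo "Found $BLAST_HOME/BlastProgramAndDB.tar.gz"
}

# Example: run the check, then upload only if it passes.
# check_blast_archive && ./hadoop fs -copyFromLocal "$BLAST_HOME/BlastProgramAndDB.tar.gz" BlastProgramAndDB.tar.gz
```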
3. Execute Hadoop-Blast
After deploying the required inputs into HDFS, run the Hadoop-Blast program with the following commands:
cd ~/hadoop-0.20.2/bin
./hadoop jar ~/hadoop-0.20.2/apps/Hadoop-Blast/executable/blast-hadoop.jar BlastProgramAndDB.tar.gz bin/blastx /tmp/hadoop-test/ db nr HDFS_blast_input HDFS_blast_output '-query #_INPUTFILE_# -outfmt 6 -seg no -out #_OUTPUTFILE_#'
Here is the description of the above command:
$HADOOP_HOME/bin/hadoop jar Executable BlastProgramAndDB_on_HDFS bin/blastx Local_Work_DIR db nr HDFS_Input_DIR Unique_HDFS_Output_DIR '-query #_INPUTFILE_# -outfmt 6 -seg no -out #_OUTPUTFILE_#'
| Parameter | Description |
|---|---|
| Executable | The full path of the Hadoop-Blast Jar program, e.g. $HADOOP_HOME/apps/Hadoop-Blast/executable/blast-hadoop.jar |
| BlastProgramAndDB_on_HDFS | The archive name of Blast Program and Database on HDFS, e.g. BlastProgramAndDB.tar.gz |
| Local_Work_DIR | The local directory for storing temporary output of Blast Program, e.g. /tmp/hadoop-test/ |
| HDFS_Input_DIR | The HDFS directory where the input files are stored, e.g. HDFS_blast_input |
| Unique_HDFS_Output_DIR | A previously unused HDFS directory for storing the output files, e.g. HDFS_blast_output |
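To make the mapping between the table and the command concrete, the sketch below assembles the same command line from named shell variables. The variable names and the `blast_command` function are illustrative only and are not part of Hadoop-Blast. The function prints the command rather than running it, so the result can be inspected first.

```shell
# Illustrative variable names; the values are the ones used in this exercise.
EXECUTABLE="$HOME/hadoop-0.20.2/apps/Hadoop-Blast/executable/blast-hadoop.jar"
BLAST_ARCHIVE="BlastProgramAndDB.tar.gz"
LOCAL_WORK_DIR="/tmp/hadoop-test/"
HDFS_INPUT_DIR="HDFS_blast_input"
HDFS_OUTPUT_DIR="HDFS_blast_output"
BLAST_ARGS='-query #_INPUTFILE_# -outfmt 6 -seg no -out #_OUTPUTFILE_#'

# blast_command prints the full command line without executing it.
blast_command() {
    printf "./hadoop jar %s %s bin/blastx %s db nr %s %s '%s'\n" \
        "$EXECUTABLE" "$BLAST_ARCHIVE" "$LOCAL_WORK_DIR" \
        "$HDFS_INPUT_DIR" "$HDFS_OUTPUT_DIR" "$BLAST_ARGS"
}

blast_command
```

Changing HDFS_OUTPUT_DIR between runs is the simplest way to satisfy the requirement that the output directory has not been used before.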
If Hadoop is running correctly, it prints messages similar to the following:
10/07/15 15:40:09 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
10/07/15 15:40:09 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/07/15 15:40:10 INFO mapred.FileInputFormat: Total input paths to process : 1
10/07/15 15:40:10 INFO mapred.JobClient: Running job: job_local_0001
10/07/15 15:40:10 INFO mapred.FileInputFormat: Total input paths to process : 1
10/07/15 15:40:10 INFO mapred.MapTask: numReduceTasks: 1
10/07/15 15:40:10 INFO mapred.MapTask: io.sort.mb = 100
4. Monitoring Hadoop during a running Job
While the job is running or after it has finished, the job details can be monitored with a web browser. The default JobTracker port is 50030.
http://<public_ip_OR_public_hostname>:50030/
In addition, all outputs are stored in the HDFS output directory (e.g. HDFS_blast_output):
cd ~/hadoop-0.20.2/bin
./hadoop fs -ls HDFS_blast_output
./hadoop fs -cat HDFS_blast_output/pre_1.fa
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297695302|ref|XP_002824885.1| 100.00 11 0 0 3 35 12 22 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297677746|ref|XP_002816750.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297677738|ref|XP_002816709.1| 100.00 11 0 0 3 35 18 28 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297677736|ref|XP_002816708.1| 100.00 11 0 0 3 35 13 23 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297662912|ref|XP_002809930.1| 100.00 11 0 0 3 35 13 23 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297290467|ref|XP_002803717.1| 100.00 11 0 0 3 35 29 39 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297269450|ref|XP_002799874.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297269448|ref|XP_002799873.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297269446|ref|XP_002799872.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297269444|ref|XP_002799871.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297269442|ref|XP_002799870.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|297269440|ref|XP_002799869.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|296409582|gb|ADH15624.1| 100.00 11 0 0 3 35 11 21 7.0 27.7
BG3:2_30MNAAAXX:7:1:981:1318/1 gi|296482166|gb|DAA24281.1| 100.00 11 0 0 3 35 36 46 7.0 27.7
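The output above is in the tabular format requested with -outfmt 6, whose twelve default columns are query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, and bit score. As a rough sketch, the hypothetical `hits_per_query` helper below (not part of Hadoop-Blast) counts how many hits each query produced.

```shell
# hits_per_query is a hypothetical helper. It reads -outfmt 6 lines on
# standard input and prints "<query id> <number of hits>" for each query.
# Lines that do not have exactly 12 fields are ignored.
hits_per_query() {
    awk 'NF == 12 { count[$1]++ } END { for (q in count) print q, count[q] }' | sort
}

# Example:
# ./hadoop fs -cat HDFS_blast_output/pre_1.fa | hits_per_query
```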
5. Finish the Map-Reduce process
After the job has finished, use the following commands to stop the HDFS and Map-Reduce daemons:
cd ~/hadoop-0.20.2/bin
./stop-all.sh
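To confirm that the daemons really stopped, the process list printed by the JDK's jps tool can be filtered for Hadoop daemon names. This is a sketch under the assumption that jps is on the PATH; `remaining_daemons` is a hypothetical helper, and jps only reports Java processes of the current user on the node where it is run.

```shell
# remaining_daemons is a hypothetical helper. It reads "PID Name" lines,
# as printed by jps, and prints the names of Hadoop daemons that are
# still listed. Like grep, it returns non-zero when it finds none.
remaining_daemons() {
    awk '$2 ~ /^(NameNode|SecondaryNameNode|DataNode|JobTracker|TaskTracker)$/ { print $2; found = 1 }
         END { exit (found ? 0 : 1) }'
}

# Example: prints nothing once all daemons on this node have stopped.
# jps | remaining_daemons
```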