Hands-on 1 Run User-defined Hadoop/Twister Applications on a Sandbox Standalone Machine

 

SALSA Group
PTI Indiana University
July 13th 2012

1. Introduction

In this hands-on, you will learn to write your own SalsaDPI configuration file and run it locally on your laptop. Note that the same Java executable will be used to run in the cloud environment in Hands-on 2.

Prerequisite:

Hands-on 1 guideline:

----------Important update--------------------

Bug: We found a bug in the provided jar executable /root/salsaDPI/salsaDPI.jar. The bug shows up only if your configuration file is not set up correctly: the program runs until it reaches an ssh/sftp connection that copies files (keys, binaries, inputs, etc.) from the local machine to local/FutureGrid VMs, and then the Java program may terminate without any warning.

Solution: Please download the latest salsaDPI.jar and put it back in /root/salsaDPI inside the VirtualBox image. The following Linux commands download it inside the VirtualBox image:

$ cd ~/salsaDPI/ 
$ wget http://salsahpc.indiana.edu/ScienceCloud/apps/salsaDPI/salsaDPI.jar

You could also use a shared folder to copy it to /root/salsaDPI/ inside the VirtualBox machine.
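
After replacing the jar, it can be useful to confirm that the download is a complete archive before running it. The following is a minimal Python sketch, not part of SalsaDPI. It assumes the Driver class used later in this hands-on (cgl.salsa.salsadpi.Driver) is stored in the jar under the standard Java path cgl/salsa/salsadpi/Driver.class; the function name jar_has_driver is made up for this illustration.

```python
import zipfile

# Assumed archive entry for the class cgl.salsa.salsadpi.Driver
# (standard Java package-to-path layout; not confirmed by the SalsaDPI docs).
DRIVER_ENTRY = "cgl/salsa/salsadpi/Driver.class"


def jar_has_driver(jar_path, entry=DRIVER_ENTRY):
    """Return True if jar_path is a readable zip archive that contains entry."""
    # is_zipfile() returns False for missing, truncated, or non-zip files.
    if not zipfile.is_zipfile(jar_path):
        return False
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, jar_has_driver("/root/salsaDPI/salsaDPI.jar") returning False would suggest that the download was interrupted or that the file was copied to the wrong place.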

2. Test the chef-solo command

If you are using the pre-packaged VirtualBox image, first check that the SalsaDPI package is located under /root/salsaDPI/:

root@ubuntu:~# cd salsaDPI/

root@ubuntu:~/salsaDPI# ls -l
total 1688
drwx------ 2 727168 4002 4096 2012-07-16 22:27 apps
drwx------ 4 727168 4002 4096 2012-07-16 23:14 cloud
drwxr-xr-x 3 root root 4096 2012-07-19 20:39 hadoopCloud-0.0.1
drwx------ 2 727168 4002 4096 2012-07-16 22:30 input
-rw------- 1 727168 4002 0 2012-07-16 23:43 README
-rwxrwxr-x 1 root root 1689558 2012-07-21 22:38 salsaDPI.jar
drwx------ 4 727168 4002 4096 2012-07-16 22:59 sandbox
-rwxrwxr-x 1 root root 251 2012-07-19 20:40 solo.rb
drwxr-xr-x 3 root root 4096 2012-07-19 20:39 twisterCloud-0.0.1

Secondly, within the SalsaDPI package, check the solo.rb file, which tells Chef where its caches and cookbooks/recipes are stored. Note that you will need to change the string /u/johnny/, wherever it appears, to your own location. In this example, the Linux vi editor is used.

root@ubuntu:~/salsaDPI# cat solo.rb
cache_type "BasicFile"
cache_options({ :path => "/root/.chef/cache/checksums", :skip_expires => true })
cookbook_path [ "/root/.chef/cookbooks/"]
role_path "/root/.chef/roles"
data_bag_path "/root/.chef/data_bags"
file_cache_path "/root/.chef/cache"

Make sure your /root/salsaDPI/solo.rb file has the same content. Then test it by running the chef-solo command with the test recipe sandboxTest. It will generate the file /tmp/<username>_test:

root@ubuntu:~$ chef-solo -c solo.rb -r http://129.79.49.248/chef-solo.tar.gz -o recipe[sandboxTest]
[2012-07-17T03:01:57-04:00] INFO: *** Chef 0.10.10 ***
[2012-07-17T03:01:58-04:00] WARN: Run List override has been provided.
[2012-07-17T03:01:58-04:00] WARN: Original Run List: []
[2012-07-17T03:01:58-04:00] WARN: Overridden Run List: [recipe[sandboxTest]]
[2012-07-17T03:01:58-04:00] INFO: Run List is [recipe[sandboxTest]]
[2012-07-17T03:01:58-04:00] INFO: Run List expands to [sandboxTest]
[2012-07-17T03:01:58-04:00] INFO: Starting Chef Run for salsahpc.indiana.edu
[2012-07-17T03:01:58-04:00] INFO: Running start handlers
[2012-07-17T03:01:58-04:00] INFO: Start handlers complete.
[2012-07-17T03:01:58-04:00] INFO: Processing file[/tmp/root_test] action create (sandboxTest::default line 12)
[2012-07-17T03:01:58-04:00] INFO: Chef Run complete in 0.023855 seconds
[2012-07-17T03:01:58-04:00] INFO: Running report handlers
[2012-07-17T03:01:58-04:00] INFO: Report handlers complete

# check the /tmp/<username>_test file
root@ubuntu:~$ cat /tmp/root_test
This is a test of using chef-solo command with the permission of root.

If you see any error message, such as an ssh error, chef-solo.tar.gz failing to download, or the ~/.chef/ directory not existing, please see the FAQ section of this page.
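
One of the errors mentioned above is a missing ~/.chef/ directory. As a quick check before running chef-solo, the sketch below pulls every quoted absolute path out of the solo.rb text and reports the ones that do not exist. It is only an illustration in Python: the function names are hypothetical, it does not use Chef itself, and whether chef-solo creates some of these directories on its own is not covered here.

```python
import os
import re


def solo_rb_paths(text):
    """Return every absolute path that appears inside double quotes in solo.rb text."""
    # Matches "/root/.chef/roles" but not "BasicFile" (no leading slash).
    return re.findall(r'"(/[^"]*)"', text)


def missing_paths(text, exists=os.path.exists):
    """Return the quoted paths from solo.rb that do not exist on this machine."""
    return [path for path in solo_rb_paths(text) if not exists(path)]
```

A typical use would be missing_paths(open("/root/salsaDPI/solo.rb").read()); an empty list means every directory named in the file is present.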

3. Write your SalsaDPI configuration file for Hadoop/Twister applications

Once the chef-solo command is working, you can modify the SalsaDPI configuration to run your compiled Hadoop and Twister applications. In this section, we use and modify a JSON-format template file, sandbox_hadoopTemplate.json. This configuration must be filled in correctly with the full path of your compiled jar executable, your program inputs, and your general program arguments.

root@ubuntu:~$ vi salsaDPI/sandbox/templates/sandbox_hadoopTemplate.json

Example of Hadoop WordCount program

The listing below shows sandbox_hadoopTemplate.json before modification. The fields to be replaced by the user are the #_..._# placeholder values in the applicationParameters object.

NOTE: this is an unfinished template that needs to be filled in:

{  // Useful general variables of programArgs for applicationParameters object
// #_JAR_#, #_JOB_ID_#,
// #_HDFS_INPUTDIR_#, #_HDFS_OUTPUTDIR_#,
// #_TWISTER_INPUTDIR_#, #_TWISTER_OUTPUTDIR_#, #_TWISTER_PARTITION_FILE_#, #_BINARY_DEPENDENCY_#

// 'mode':'sandbox', | 'mode':'cloud',
'mode':'sandbox',
// chef-solo related parameters
'chef':{'chefSoloRecipeUrls':'http://129.79.49.248/chef-solo.tar.gz',
'chefSoloConfFilePath':'/root/salsaDPI/solo.rb'},

// ssh passwordless related parameters
'ssh':{'SSHLoginUsername':'root',
'SSHPrivateKeyPath':'/root/.ssh/id_rsa' },

// runtime softwares such as recipe[hadoopSandbox] or recipe[twisterSandbox]
'softwareRecipes':['recipe[hadoopSandbox]'], // please don't change this line

// user-defined application related parameters
'applicationParameters':{'applicationType':'Hadoop',
'localPathOfProgramBinary':'#_Full_Path_to_Program_Jar_File_#',
'localPathOfProgramInput':'#_Full_Path_to_Program_Input_#',
'localPathOfBinaryDependency':'#_Full_Path_to_Program_Dependency_#',
'programExecuteLocation':'#_Path_to_Execution_Bin_Dir_#',
'programArgs':'#_Program_Args_#'}
}

Description of sandbox configuration file:

Parameter                    Description
mode                         Execution mode. Options: sandbox or cloud.
chef                         A JSON object that contains sandbox mode information: chefSoloRecipeUrls and chefSoloConfFilePath.
chefSoloRecipeUrls           URL of the online Chef recipe package, e.g. http://129.79.49.248/chef-solo.tar.gz
chefSoloConfFilePath         Sandbox mode configuration file, which contains the user-level cache location, cookbook locations, and other settings, e.g. /root/salsaDPI/sandbox/solo.rb
ssh                          A JSON object that contains ssh information: SSHLoginUsername and SSHPrivateKeyPath.
SSHLoginUsername             ssh login username. Normally the same as the working username in sandbox mode.
SSHPrivateKeyPath            Full path to the ssh private key.
softwareRecipes              Runtime software, such as Hadoop and Twister, that will be installed on the working machine(s). Current options: recipe[hadoopSandbox], recipe[twisterSandbox], recipe[hadoopCloud], and recipe[twisterCloud].
applicationParameters        A JSON object that contains the user-defined application's information.
applicationType              Type of user-defined application. Options: Hadoop or Twister.
localPathOfProgramBinary     Full path of the user-defined Hadoop or Twister compiled jar executable on the working machine.
localPathOfProgramInput      Full path of the user-defined input file on the working machine, normally a plaintext or *.tar.gz file.
localPathOfBinaryDependency  Full path of the user-defined program dependency file on the working machine, such as the Twister Kmeans initial cluster file.
programExecuteLocation       Path to the Twister program execution script, relative to the Twister package, such as samples/wordcount/bin or samples/kmeans/bin.
twisterInputFilesPreFix      Twister input file prefix. In the provided package, the prefix is wc_data for Twister WordCount and km_data for Twister Kmeans.
programArgs                  User-defined program execution command.
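
Note that the template is not strict JSON: it uses // comments and single-quoted strings, so a standard JSON parser will reject it. How salsaDPI.jar itself reads the file is not described on this page. If you want to inspect a configuration from a script, the following Python sketch (with hypothetical function names) converts the relaxed syntax to strict JSON first. It skips // only outside strings, so URLs such as http://... and apostrophes inside comments are handled.

```python
import json


def relaxed_to_json(text):
    """Convert the template's relaxed syntax (// comments, single quotes) to strict JSON."""
    out = []
    i = 0
    quote = None  # quote character of the string we are currently inside, if any
    while i < len(text):
        ch = text[i]
        if quote:
            if ch == "\\" and i + 1 < len(text):
                nxt = text[i + 1]
                # \' is not a legal JSON escape, so emit a plain apostrophe.
                out.append("'" if nxt == "'" else ch + nxt)
                i += 2
                continue
            if ch == quote:
                out.append('"')  # close the string with a double quote
                quote = None
            elif ch == '"':
                out.append('\\"')  # bare double quote inside a single-quoted string
            else:
                out.append(ch)
        elif ch in "'\"":
            quote = ch
            out.append('"')  # open every string with a double quote
        elif text.startswith("//", i):
            # Comment outside a string: skip to the end of the line.
            end = text.find("\n", i)
            i = len(text) if end == -1 else end
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def load_config(text):
    """Parse a SalsaDPI-style configuration text into a Python dict."""
    return json.loads(relaxed_to_json(text))
```

With this, load_config(open(path).read())["applicationParameters"] gives the application settings as an ordinary dictionary.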

In addition, so that a general user program argument can be executed in a dynamic environment, the SalsaDPI framework provides a set of general variables for filling in programArgs. A detailed description is given in the following table.

Description of the general variables for programArgs in the applicationParameters object. SalsaDPI replaces them when the program is scheduled on the working node:

Variable                    Description
#_JAR_#                     The user-defined jar file name.
#_JOB_ID_#                  The job ID. Normally it is also the default program output directory name on the remote worker node.
#_HDFS_INPUTDIR_#           HDFS input directory name of a Hadoop-type application.
#_HDFS_OUTPUTDIR_#          HDFS output directory name of a Hadoop-type application.
#_TWISTER_INPUTDIR_#        Input directory of a Twister-type application on the working node.
#_TWISTER_OUTPUTDIR_#       Output directory of a Twister-type application on the working node.
#_TWISTER_PARTITION_FILE_#  Partition file of a Twister-type application.
#_BINARY_DEPENDENCY_#       Full path to a Twister-type program dependency, mainly used for the Twister Kmeans initial cluster file.
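
To make the replacement concrete, the sketch below shows the kind of substitution described above, written in Python. It is not SalsaDPI's implementation: the function name and the example values are assumptions, and the real values are chosen by SalsaDPI at scheduling time.

```python
import re


def expand_program_args(program_args, values):
    """Replace each #_NAME_# token with values['NAME'].

    A token with no entry in values raises KeyError, which makes typos visible.
    """
    def lookup(match):
        return values[match.group(1)]

    # NAME is made of capital letters and underscores, e.g. JAR or HDFS_INPUTDIR.
    return re.sub(r"#_([A-Z_]+?)_#", lookup, program_args)
```

For instance, with the hypothetical values {"JAR": "hadoopWordCount.jar", "HDFS_INPUTDIR": "in", "HDFS_OUTPUTDIR": "out"}, the string 'bin/hadoop jar #_JAR_# #_HDFS_INPUTDIR_# #_HDFS_OUTPUTDIR_#' expands to 'bin/hadoop jar hadoopWordCount.jar in out'.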

After modifying sandbox_hadoopTemplate.json, we have the following configuration file for sandbox Hadoop WordCount.

{  // Useful general variables of programArgs for applicationParameters object
// #_JAR_#, #_JOB_ID_#,
// #_HDFS_INPUTDIR_#, #_HDFS_OUTPUTDIR_#,
// #_TWISTER_INPUTDIR_#, #_TWISTER_OUTPUTDIR_#, #_TWISTER_PARTITION_FILE_#, #_BINARY_DEPENDENCY_#

// 'mode':'sandbox', | 'mode':'cloud',
'mode':'sandbox',
// chef-solo related parameters
'chef':{'chefSoloRecipeUrls':'http://129.79.49.248/chef-solo.tar.gz',
'chefSoloConfFilePath':'/root/salsaDPI/solo.rb'},

// ssh passwordless related parameters
'ssh':{'SSHLoginUsername':'root',
'SSHPrivateKeyPath':'/root/.ssh/id_rsa' },

// runtime softwares such as recipe[hadoopSandbox] or recipe[twisterSandbox]
'softwareRecipes':['recipe[hadoopSandbox]'], // please don't change this line

// user-defined application related parameters
'applicationParameters':{'applicationType':'Hadoop',
'localPathOfProgramBinary':'/root/salsaDPI/apps/hadoopWordCount.jar',
'localPathOfProgramInput':'/root/salsaDPI/input/hadoopWordCountInput.txt',
'localPathOfBinaryDependency':'',
'programExecuteLocation':'',
'programArgs':'bin/hadoop jar #_JAR_# #_HDFS_INPUTDIR_# #_HDFS_OUTPUTDIR_#'}
}
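
Because a wrongly filled configuration can make the Java program stop without a warning, it may help to check the applicationParameters object before running it. The following Python sketch is an illustration only. It assumes the object has already been loaded into a dictionary, treats the eight general variables listed earlier as legitimate, and flags any other #_..._# token as a template field that was never filled in. The function name is hypothetical.

```python
import os
import re

# General variables that SalsaDPI replaces itself (from the table in this section).
RUNTIME_VARIABLES = {
    "#_JAR_#", "#_JOB_ID_#", "#_HDFS_INPUTDIR_#", "#_HDFS_OUTPUTDIR_#",
    "#_TWISTER_INPUTDIR_#", "#_TWISTER_OUTPUTDIR_#",
    "#_TWISTER_PARTITION_FILE_#", "#_BINARY_DEPENDENCY_#",
}
LOCAL_PATH_KEYS = (
    "localPathOfProgramBinary",
    "localPathOfProgramInput",
    "localPathOfBinaryDependency",
)


def check_application_parameters(params, exists=os.path.exists):
    """Return a list of human-readable problems found in applicationParameters."""
    problems = []
    for key, value in params.items():
        for token in re.findall(r"#_\w+?_#", str(value)):
            if token not in RUNTIME_VARIABLES:
                problems.append(f"{key}: unfilled placeholder {token}")
    for key in LOCAL_PATH_KEYS:
        path = params.get(key, "")
        # Empty values are allowed; placeholders were already reported above.
        if path and "#_" not in path and not exists(path):
            problems.append(f"{key}: no such file {path}")
    return problems
```

An empty result for the WordCount configuration above means that every template field was filled in and that the jar and input file are where the configuration says they are.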

 

4. Execute SalsaDPI with a user-defined application

Execute the salsaDPI jar executable to run the sandbox Hadoop WordCount:

root@ubuntu:~$ cd salsaDPI
root@ubuntu:salsaDPI$ cp sandbox/templates/sandbox_hadoopTemplate.json sandbox/templates/sandbox_hadoopWordCount.json
root@ubuntu:salsaDPI$ java -cp salsaDPI.jar cgl.salsa.salsadpi.Driver sandbox/templates/sandbox_hadoopWordCount.json

After the program finishes running, the output is copied to the working directory under <workingDir>/salsaDPI_output/<job_uuid>/output/*:

root@ubuntu:salsaDPI$ ls -l salsaDPI_output/1322fb55-650e-4f6b-8f90-45f7418bda08/output/ 

-rw-r--r-- 1 root root 1396 Jul 12 20:00 1322fb55-650e-4f6b-8f90-45f7418bda08.txt
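
Since every run gets a new job UUID, finding the latest output by hand can be tedious. The sketch below, in Python, looks for the most recently modified job directory under salsaDPI_output and lists its output files. It relies only on the directory layout shown above; the function name is hypothetical, and using modification time to pick the "latest" job is an assumption.

```python
from pathlib import Path


def latest_job_outputs(working_dir="."):
    """Return (job_id, sorted output files) for the newest job, or None if there is none."""
    root = Path(working_dir) / "salsaDPI_output"
    if not root.is_dir():
        return None
    # A job directory is any entry that contains an output/ subdirectory.
    jobs = [d for d in root.iterdir() if (d / "output").is_dir()]
    if not jobs:
        return None
    newest = max(jobs, key=lambda d: d.stat().st_mtime)
    return newest.name, sorted((newest / "output").iterdir())
```

Calling latest_job_outputs("/root/salsaDPI") after the run above would be expected to return the job UUID together with the path of its .txt result file.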

5. Home Exercise: Hadoop/Twister Kmeans

Based on the above example, please try to use the provided Hadoop template sandbox_hadoopTemplate.json and Twister template sandbox_twisterTemplate.json, modify them, and run a sandbox-mode Hadoop/Twister Kmeans.

The main difference when writing a Hadoop/Twister Kmeans configuration file lies in the applicationParameters object. We provide the following hints:

Example solution for sandbox Hadoop Kmeans

{  // Useful general variables of programArgs for applicationParameters object
// #_JAR_#, #_JOB_ID_#,
// #_HDFS_INPUTDIR_#, #_HDFS_OUTPUTDIR_#,
// #_TWISTER_INPUTDIR_#, #_TWISTER_OUTPUTDIR_#, #_TWISTER_PARTITION_FILE_#, #_BINARY_DEPENDENCY_#

// 'mode':'sandbox', | 'mode':'cloud',
'mode':'sandbox',
// chef-solo related parameters
'chef':{'chefSoloRecipeUrls':'http://129.79.49.248/chef-solo.tar.gz',
'chefSoloConfFilePath':'/root/salsaDPI/solo.rb'},

// ssh passwordless related parameters
'ssh':{'SSHLoginUsername':'root',
'SSHPrivateKeyPath':'/root/.ssh/id_rsa' },

// runtime softwares such as recipe[hadoopSandbox] or recipe[twisterSandbox]
'softwareRecipes':['recipe[hadoopSandbox]'], // please don't change this line


'applicationParameters': {
'applicationType':'Hadoop',
'localPathOfProgramBinary':'/root/salsaDPI/apps/hadoopKmeans.jar',
'localPathOfProgramInput':'',
'localPathOfProgramDB':'',
'localPathOfBinaryDependency':'',
'programExecuteLocation':'',
'programArgs':'bin/hadoop jar #_JAR_# 500 10 8 3 #_JOB_ID_# > ~/#_JOB_ID_#/#_JOB_ID_#.txt' }
}  

Example solution for sandbox Twister Kmeans

{  // Useful general variables of programArgs for applicationParameters object
// #_JAR_#, #_JOB_ID_#,
// #_HDFS_INPUTDIR_#, #_HDFS_OUTPUTDIR_#,
// #_TWISTER_INPUTDIR_#, #_TWISTER_OUTPUTDIR_#, #_TWISTER_PARTITION_FILE_#, #_BINARY_DEPENDENCY_#

// 'mode':'sandbox', | 'mode':'cloud',
'mode':'sandbox',
// chef-solo related parameters
'chef':{'chefSoloRecipeUrls':'http://129.79.49.248/chef-solo.tar.gz',
'chefSoloConfFilePath':'/root/salsaDPI/solo.rb'},

// ssh passwordless related parameters
'ssh':{'SSHLoginUsername':'root',
'SSHPrivateKeyPath':'/root/.ssh/id_rsa' },

// runtime softwares such as recipe[hadoopSandbox] or recipe[twisterSandbox]
'softwareRecipes':['recipe[twisterSandbox]'], // please don't change this line


'applicationParameters': {
'applicationType':'Twister',
'localPathOfProgramBinary':'/root/salsaDPI/apps/Twister-Kmeans-0.9.jar', 
'localPathOfProgramInput':'/root/salsaDPI/input/twisterKmeansInput.tar.gz', 
'localPathOfBinaryDependency':'/root/salsaDPI/input/twisterKmeans_init_clusters.txt', 
'programExecuteLocation':'samples/kmeans/bin',
'twisterInputFilesPreFix':'km_data', 
'programArgs':'./run_kmeans.sh #_BINARY_DEPENDENCY_# 80 #_TWISTER_PARTITION_FILE_# > #_TWISTER_OUTPUTDIR_#/#_JOB_ID_#.txt' } 
}
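
Since the exercise mostly comes down to changing applicationParameters, it can help to see exactly which keys differ between two configurations. The following Python sketch compares two such objects once they are loaded as dictionaries. It is a generic helper with a hypothetical name, not something shipped with SalsaDPI.

```python
def diff_parameters(base, other):
    """Return {key: (base_value, other_value)} for every key whose value differs.

    A key that exists on only one side is reported with None for the other side.
    """
    keys = list(base) + [k for k in other if k not in base]
    return {k: (base.get(k), other.get(k)) for k in keys if base.get(k) != other.get(k)}
```

Comparing the WordCount parameters with the Twister Kmeans parameters, for example, would report applicationType as ('Hadoop', 'Twister') and twisterInputFilesPreFix as (None, 'km_data'), along with the changed paths and programArgs.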

Next, you can go to the FutureGrid Eucalyptus preparation.

FAQ

Please see FAQ if you have any problem.
