Hands-on 2 Run User-defined Hadoop/Twister Applications on FutureGrid Eucalyptus

 

SALSA Group
PTI Indiana University
July 13th 2012

1. Introduction

In this hands-on, you will learn to write your own SalsaDPI configuration file and run it from your laptop machine. Note that the same Java executable is used to run on the cloud environment in Hands-on 2.

Prerequisite:

Hands-on 2 guideline:

----------Important update--------------------

Bug: We found a bug in the provided jar executable /root/salsaDPI/salsaDPI.jar. The bug only shows up if your configuration file is not set up correctly: the program runs until it reaches an ssh/sftp connection that copies files (keys, binaries, inputs, etc.) from the local machine to the local/FutureGrid VMs, and the Java program may then terminate without any warning.

Solution: Please download the latest salsaDPI.jar and put it back in /root/salsaDPI inside the VirtualBox image. The following Linux commands download it inside the VirtualBox image:

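$ cd ~/salsaDPI/
$ mv salsaDPI.jar salsaDPI.jar.old   # keep the old jar; otherwise wget saves the download as salsaDPI.jar.1
$ wget http://salsahpc.indiana.edu/ScienceCloud/apps/salsaDPI/salsaDPI.jar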

You can also use a shared folder to copy it to /root/salsaDPI/ inside the VirtualBox machine.

2. FutureGrid Eucalyptus preparation

Check the downloaded FutureGrid Eucalyptus files (eucarc and VM ssh private key) within your VirtualBox image

We assume that you have downloaded the FutureGrid-related files, a eucarc file and an ssh private key (e.g. johnny.pem), to your working machine.

root@ubuntu:~$ ls -l 
-rw------- 1 johnny johnny 983 2012-07-14 18:04 eucarc
-rw------- 1 johnny johnny 1751 2012-07-14 18:04 johnny.pem
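
OpenSSH generally refuses a private key that other users can read, so it may be worth confirming that the files keep the owner-only permissions shown in the listing above. The following is a minimal Python sketch; the function name is our own and is not part of SalsaDPI, and the file names are simply the ones from the listing.

```python
import os
import stat

def is_owner_only(path):
    """Return True if 'path' grants no permissions to group or others."""
    mode = os.stat(path).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0

if __name__ == "__main__":
    # File names taken from the listing above; adjust to your own key name.
    for name in ("eucarc", "johnny.pem"):
        path = os.path.expanduser("~/" + name)
        if not os.path.exists(path):
            print(name, "not found")
        elif is_owner_only(path):
            print(name, "is owner-only")
        else:
            print(name, "is too open (consider chmod 600)")
```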

3. Write your SalsaDPI configuration file for Hadoop/Twister applications

Once all the prerequisites are set up correctly, you can modify the SalsaDPI configuration to run your compiled Hadoop and Twister applications on the FutureGrid cloud. In this section, we use and modify a JSON-format template file, cloud_twisterTemplate.json. This configuration must be correctly filled in with the full path of your compiled jar executable, your program inputs, and your general program arguments.

root@ubuntu:~$ vi salsaDPI/cloud/templates/cloud_twisterTemplate.json

Example of Twister WordCount program

The following shows cloud_twisterTemplate.json before modification; the placeholder values (of the form '#_..._#') are to be replaced by the user.

NOTE THAT this is an unfinished template that needs to be filled in:

{  // Useful general variables of programArgs for applicationParameters object
// #_JAR_#, #_JOB_ID_#,
// #_HDFS_INPUTDIR_#, #_HDFS_OUTPUTDIR_#,
// #_TWISTER_INPUTDIR_#, #_TWISTER_OUTPUTDIR_#, #_TWISTER_PARTITION_FILE_#, #_BINARY_DEPENDENCY_#

// 'mode':'sandbox', | 'mode':'cloud',
'mode':'cloud',

// euca cloud parameters
'eucaInfo':{'eucarcFilePath':'#_Full_Path_to_eucarc_File_#',
'eucaImageEmi':'emi-A8F63C29',
'eucaSSHPublicKey':'#_Euca_Keypair_PublicKeyName_#',
'eucaVmType':'m1.small',
'amountOfInstances':2},


// ssh passwordless related parameters
'ssh':{'SSHLoginUsername':'root',
'SSHPrivateKeyPath':'#_Full_Path_to_ssh_Privatekey_File_#' },

// runtime softwares such as recipe[hadoopSandbox], recipe[twisterSandbox], recipe[hadoopCloud], and recipe[twisterCloud]
'softwareRecipes':['recipe[twisterCloud]'],

// user-defined application related parameters
'applicationParameters':{'applicationType':'Twister',
'localPathOfProgramBinary':'#_Full_Path_to_Program_Jar_File_#',
'localPathOfProgramInput':'#_Full_Path_to_Program_Input_#',
'localPathOfBinaryDependency':'#_Full_Path_to_Program_Dependency_#',
'programExecuteLocation':'#_Path_to_Execution_Bin_Dir_#',

'twisterInputFilesPreFix':'#_Twister_Inputs_Prefix_#',
'programArgs':'#_Program_Args_#'}
}

Description of the configuration file parameters:

Parameter                     Description
mode                          Execution mode; options: sandbox or cloud
eucaInfo                      A JSON object that contains the Eucalyptus-related information for cloud mode: 'eucarcFilePath', 'eucaImageEmi', 'eucaSSHPublicKey', 'eucaVmType', and 'amountOfInstances'
eucarcFilePath                Full path to the downloaded eucarc file
eucaImageEmi                  Eucalyptus VM image registered on FutureGrid, e.g. emi-A8F63C29
eucaSSHPublicKey              Eucalyptus public key name (which you set up during the FutureGrid Eucalyptus setup)
eucaVmType                    Eucalyptus VM type, e.g. m1.small
amountOfInstances             Number of instances for this job, e.g. 2
ssh                           A JSON object that contains the ssh information: SSHLoginUsername and SSHPrivateKeyPath
SSHLoginUsername              ssh login username; for cloud mode, it must be root
SSHPrivateKeyPath             Full path to the ssh private key used to log in to the VM
softwareRecipes               Runtime software, such as Hadoop and Twister, that will be installed on the working machine(s). Current options: recipe[hadoopSandbox], recipe[twisterSandbox], recipe[hadoopCloud], and recipe[twisterCloud]
applicationParameters         A JSON object that contains the user-defined application's information
applicationType               Type of user-defined application; options: Hadoop or Twister
localPathOfProgramBinary      Full path of the user-defined Hadoop or Twister compiled jar executable on the working machine
localPathOfProgramInput       Full path of the user-defined input file on the working machine, normally a plaintext or a *.tar.gz file
localPathOfBinaryDependency   Full path of the user-defined program dependency file on the working machine, such as the Twister Kmeans initial cluster file
programExecuteLocation        Path to the Twister program execution script, relative to the Twister package, such as samples/wordcount/bin or samples/kmeans/bin
twisterInputFilesPreFix       Twister input files prefix. Refer to the provided package: for Twister WordCount the file prefix is wc_data, for Twister Kmeans it is km_data.
programArgs                   User-defined program execution command
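
Because a misconfigured file can make the driver stop without a warning, it may help to check the file against the table above before running anything. The sketch below is only an illustration under stated assumptions: it is not how SalsaDPI itself parses the file, it assumes that '//' starts a comment and never appears inside a value, and the set of keys treated as required is our own reading of the table. The names load_config and missing_keys are hypothetical.

```python
import ast
import re

# Keys treated as required here; this choice is an assumption based on the table.
REQUIRED = {
    "ssh": ["SSHLoginUsername", "SSHPrivateKeyPath"],
    "applicationParameters": [
        "applicationType", "localPathOfProgramBinary", "localPathOfProgramInput",
        "localPathOfBinaryDependency", "programExecuteLocation", "programArgs",
    ],
}
EUCA_KEYS = ["eucarcFilePath", "eucaImageEmi", "eucaSSHPublicKey",
             "eucaVmType", "amountOfInstances"]

def load_config(text):
    """Parse single-quoted, '//'-commented config text into a Python dict."""
    without_comments = re.sub(r"//[^\n]*", "", text)
    return ast.literal_eval(without_comments.strip())

def missing_keys(config):
    """Return the 'section.key' names from the table that are absent."""
    required = dict(REQUIRED)
    if config.get("mode") == "cloud":
        required["eucaInfo"] = EUCA_KEYS  # only needed in cloud mode
    missing = [key for key in ("mode", "softwareRecipes") if key not in config]
    for section, keys in required.items():
        body = config.get(section)
        if not isinstance(body, dict):
            missing.append(section)
            continue
        missing.extend(section + "." + key for key in keys if key not in body)
    return missing
```

For example, missing_keys(load_config(open(path).read())) returns an empty list when every key listed above is present.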

In addition, so that a general user program argument can be executed in a dynamic environment, the SalsaDPI framework provides a general-variables interface for filling in programArgs. A detailed description is given in the following table.

Description of the general variables for programArgs in the applicationParameters object. SalsaDPI replaces them when the program is scheduled on the working node:

Variable                      Description
#_JAR_#                       The user-defined jar file name
#_JOB_ID_#                    The job id; normally, it is the default program output directory name on the remote worker node
#_HDFS_INPUTDIR_#             Hadoop-type application's HDFS input directory name
#_HDFS_OUTPUTDIR_#            Hadoop-type application's HDFS output directory name
#_TWISTER_INPUTDIR_#          Twister-type application's input directory on the working node
#_TWISTER_OUTPUTDIR_#         Twister-type application's output directory on the working node
#_TWISTER_PARTITION_FILE_#    Twister-type application's partition file
#_BINARY_DEPENDENCY_#         Full path to the Twister-type program dependency; mainly used for the Twister Kmeans initial cluster file
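
To make the substitution concrete, here is a small Python sketch of how such variables could be expanded in a programArgs string. It only illustrates the idea and is not SalsaDPI's actual implementation; the function name expand_program_args and the example paths are hypothetical, since the real values are chosen by SalsaDPI at scheduling time.

```python
import re

def expand_program_args(program_args, values):
    """Replace each #_NAME_# token with values['NAME'] in a single pass.

    Tokens whose name is not in 'values' are left unchanged.
    """
    def substitute(match):
        return values.get(match.group(1), match.group(0))
    return re.sub(r"#_([A-Z_]+?)_#", substitute, program_args)

if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    values = {
        "TWISTER_PARTITION_FILE": "/tmp/example/wc.pf",
        "TWISTER_OUTPUTDIR": "/tmp/example/output",
    }
    args = "./run_wc.sh #_TWISTER_PARTITION_FILE_# #_TWISTER_OUTPUTDIR_#/wc.out 4 1"
    print(expand_program_args(args, values))
```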

After modifying cloud_twisterTemplate.json, we have the following configuration file for the cloud Twister WordCount.

{  // Useful general variables of programArgs for applicationParameters object
// #_JAR_#, #_JOB_ID_#,
// #_HDFS_INPUTDIR_#, #_HDFS_OUTPUTDIR_#,
// #_TWISTER_INPUTDIR_#, #_TWISTER_OUTPUTDIR_#, #_TWISTER_PARTITION_FILE_#, #_BINARY_DEPENDENCY_#

// 'mode':'sandbox', | 'mode':'cloud',
'mode':'cloud',

// euca cloud parameters
'eucaInfo':{'eucarcFilePath':'/root/eucarc',
'eucaImageEmi':'emi-A8F63C29',
'eucaSSHPublicKey':'johnny',
'eucaVmType':'m1.small',
'amountOfInstances':2},


// ssh passwordless related parameters
'ssh':{'SSHLoginUsername':'root',
'SSHPrivateKeyPath':'/root/johnny.pem' },

// runtime softwares such as recipe[hadoopSandbox], recipe[twisterSandbox], recipe[hadoopCloud], and recipe[twisterCloud]
'softwareRecipes':['recipe[twisterCloud]'],

// user-defined application related parameters
'applicationParameters':{'applicationType':'Twister',
'localPathOfProgramBinary':'/root/salsaDPI/apps/Twister-WordCount-0.9.jar',
'localPathOfProgramInput':'/root/salsaDPI/input/twisterWordCountInput.tar.gz',
'localPathOfBinaryDependency':'',
'programExecuteLocation':'samples/wordcount/bin',

'twisterInputFilesPreFix':'wc_data',
'programArgs':'./run_wc.sh #_TWISTER_PARTITION_FILE_# #_TWISTER_OUTPUTDIR_#/wc.out 4 1'}
}
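
A common mistake is to leave one of the template's '#_..._#' placeholders unfilled. The following Python sketch, under the assumption that the configuration has already been parsed into a Python dict, reports every '#_..._#' token that is not one of the general variables listed in this section. The function name unfilled_placeholders is our own, not part of SalsaDPI.

```python
import re

# The general variables documented in this section; anything else of the
# form #_..._# is assumed to be a template placeholder left unfilled.
GENERAL_VARIABLES = {
    "JAR", "JOB_ID", "HDFS_INPUTDIR", "HDFS_OUTPUTDIR",
    "TWISTER_INPUTDIR", "TWISTER_OUTPUTDIR",
    "TWISTER_PARTITION_FILE", "BINARY_DEPENDENCY",
}

def unfilled_placeholders(value, path=""):
    """Return (location, token) pairs for tokens that are not general variables."""
    found = []
    if isinstance(value, dict):
        for key, child in value.items():
            found += unfilled_placeholders(child, path + "/" + key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found += unfilled_placeholders(child, "%s[%d]" % (path, index))
    elif isinstance(value, str):
        for name in re.findall(r"#_([A-Za-z_]+?)_#", value):
            if name not in GENERAL_VARIABLES:
                found.append((path, "#_" + name + "_#"))
    return found
```

An empty result means that only general variables remain, as in the finished WordCount configuration above.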

4. Execute SalsaDPI with a user-defined application

Execute the salsaDPI jar executable to run the cloud Twister WordCount:

root@ubuntu:~$ cd salsaDPI
root@ubuntu:salsaDPI$ cp cloud/templates/cloud_twisterTemplate.json cloud/templates/cloud_twisterWordCount.json
root@ubuntu:salsaDPI$ java -cp salsaDPI.jar cgl.salsa.salsadpi.Driver cloud/templates/cloud_twisterWordCount.json
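
If you prefer to launch the driver from a script, the command above can be wrapped as in the following Python sketch. It only checks that the jar and the configuration file exist before starting Java and then reports the exit code; whether the driver returns a nonzero code on failure is not documented here, so treat that value as a hint only. The function names are our own.

```python
import os
import subprocess
import sys

def driver_command(config_path, jar="salsaDPI.jar"):
    """Build the java command line shown above for a given configuration file."""
    return ["java", "-cp", jar, "cgl.salsa.salsadpi.Driver", config_path]

def run_driver(config_path, jar="salsaDPI.jar"):
    """Run the driver from the current directory and return its exit code."""
    for required in (jar, config_path):
        if not os.path.isfile(required):
            raise FileNotFoundError(required)
    return subprocess.call(driver_command(config_path, jar))

if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: run from the salsaDPI directory, passing the configuration path.
    print("driver exited with code", run_driver(sys.argv[1]))
```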

After the program finishes running, the output is copied to the working directory under <workingDir>/salsaDPI_output/<job_uuid>/output/*:

root@ubuntu:salsaDPI$ ls -l salsaDPI_output/1322fb55-650e-4f6b-8f90-45f7418bda08/output/ 

-rw-r--r-- 1 johnny johnny 1396 Jul 12 20:00 1322fb55-650e-4f6b-8f90-45f7418bda08.txt
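
Since each run gets its own job UUID, finding the newest output directory by hand can be tedious. The following Python sketch picks the most recently modified job directory under salsaDPI_output, assuming the <job_uuid>/output layout described above; the function name latest_job_output is hypothetical.

```python
import os

def latest_job_output(output_root="salsaDPI_output"):
    """Return the 'output' directory of the most recently modified job, or None."""
    if not os.path.isdir(output_root):
        return None
    jobs = [os.path.join(output_root, name) for name in os.listdir(output_root)]
    # Keep only job directories that actually contain an 'output' subdirectory.
    jobs = [job for job in jobs if os.path.isdir(os.path.join(job, "output"))]
    if not jobs:
        return None
    newest = max(jobs, key=os.path.getmtime)
    return os.path.join(newest, "output")

if __name__ == "__main__":
    print(latest_job_output())
```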

5. Home Exercise: Hadoop/Twister Kmeans

Based on the above example, please try to use the provided Hadoop template cloud_hadoopTemplate.json and Twister template cloud_twisterTemplate.json: modify them and schedule a sandbox/cloud mode Hadoop/Twister Kmeans job.

The main difference when writing a Hadoop/Twister Kmeans configuration file is the applicationParameters object. We provide the following hints:

Hints for Hadoop Kmeans

'applicationParameters': { 
'applicationType':'Hadoop',
'localPathOfProgramBinary':'#_Path_HadoopKmeans_Jar_#',
'localPathOfProgramInput':'',
'localPathOfProgramDB':'',
'localPathOfBinaryDependency':'',
'programExecuteLocation':'',
'programArgs':'bin/hadoop jar #_JAR_# 500 10 8 3 #_JOB_ID_# > ~/#_JOB_ID_#/#_JOB_ID_#.txt'
}  

Hints for Twister Kmeans

'applicationParameters': {
'applicationType':'Twister',
'localPathOfProgramBinary':'#_FullPath_To_TwisterKmeans_JAR_#', 
'localPathOfProgramInput':'#_FullPath_To_TwisterKmeans_Inputs_GZ_File_#', 
'localPathOfBinaryDependency':'#_FullPath_To_TwisterKmeans_InitClusterFile_#', 
'programExecuteLocation':'samples/kmeans/bin',
'twisterInputFilesPreFix':'km_data', 
'programArgs':'./run_kmeans.sh #_BINARY_DEPENDENCY_# 80 #_TWISTER_PARTITION_FILE_# > #_TWISTER_OUTPUTDIR_#/#_JOB_ID_#.txt'
}

FAQ

Please see the FAQ if you have any problems.
