Loading CSV files into HBase table
 

SALSA Group
PTI Indiana University
July 17th 2012

Contents

  1. Introduction
  2. Prerequisites
  3. HBase MapReduce Program
  4. Run Sample Program

1. Introduction

This tutorial shows how to load a CSV file into an HBase table with the HBase MapReduce API.
A CSV (comma-separated values) file is a common file format in many fields, such as Flow Cytometry in bioinformatics.

2. Prerequisites

Ubuntu VM environment. Download the following VM image and launch it in VirtualBox:
    http://salsahpc.indiana.edu/ScienceCloud/apps/salsaDPI/virtualbox/chef_ubuntu.ova

Sample CSV file. input.csv is included in the hbasetutorial package:
    http://156.56.93.128/PBMS/doc/hbasetutorial.tar
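The mapper described in the next section expects each line of input.csv to contain exactly four comma-separated fields: row key, column family, column qualifier, and cell value. A hypothetical line in that format (the actual contents of input.csv may differ) would look like:

    sample-001,f1,fsc,532.7
    sample-002,f1,fsc,498.2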

3. HBase MapReduce Program

The main entry point of the program parses the command-line arguments, checks that an input path and a table name were supplied, and submits the job:

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Wrong number of arguments: " + otherArgs.length);
            System.err.println("Usage: " + NAME + " <input> <tablename>");
            System.exit(-1);
        }
        Job job = configureJob(conf, otherArgs);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

configureJob() sets up the HBase MapReduce job: it reads the CSV file with TextInputFormat, runs the Uploader mapper, and writes directly to the table with no reduce phase:

    public static Job configureJob(Configuration conf, String[] args)
            throws IOException {
        Path inputPath = new Path(args[0]);
        String tableName = args[1];
        Job job = new Job(conf, NAME + "_" + tableName);
        job.setJarByClass(Uploader.class);
        FileInputFormat.setInputPaths(job, inputPath);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(Uploader.class);
        // No reducers. Just write straight to table. Call initTableReducerJob
        // because it sets up the TableOutputFormat.
        TableMapReduceUtil.initTableReducerJob(tableName, null, job);
        job.setNumReduceTasks(0);
        return job;
    }

The map task parses each CSV line into a row key, column family, qualifier, and value, and emits a Put for the table:

    public void map(LongWritable key, Text line, Context context) throws IOException {
        // Each map() is a single line, where the key is the line number
        String[] values = line.toString().split(",");
        if (values.length != 4) {
            System.out.println("err values.length!=4 len:" + values.length);
            System.out.println("input string is:" + line);
            return;
        }
        // Extract each value
        byte[] row = Bytes.toBytes(values[0]);
        byte[] family = Bytes.toBytes(values[1]);
        byte[] qualifier = Bytes.toBytes(values[2]);
        byte[] value = Bytes.toBytes(values[3]);
        Put put = new Put(row);
        put.add(family, qualifier, value);
        try {
            context.write(new ImmutableBytesWritable(row), put);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Set status every checkpoint lines
        if (++count % checkpoint == 0) {
            context.setStatus("Emitting Put " + count);
        }
    }
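The three listings above are fragments of a single class; the class declaration, imports, and the NAME, count, and checkpoint fields are not shown in this tutorial. A minimal sketch of how they could fit together (the checkpoint value and the nesting of Uploader inside CSV2HBase are assumptions, not the tutorial's exact source):

    package iu.pti.hbaseapp;

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class CSV2HBase {

        // Job name prefix used by main() and configureJob()
        private static final String NAME = "CSV2HBase";

        // Mapper that turns each CSV line into a Put against the target table
        public static class Uploader
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

            private long checkpoint = 100;  // report status every 100 lines (assumed value)
            private long count = 0;

            // map() from the listing above goes here
        }

        // configureJob() and main() from the listings above go here
    }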

4. Run Sample Program

Follow these six steps to run the sample program.
1) Launch the Hadoop cluster:
     start-all.sh
2) Launch HBase in local mode:
     start-hbase.sh
3) Compile the program:
     ant
4) Put the input.csv file into HDFS:
     hadoop dfs -mkdir input
     hadoop dfs -copyFromLocal input.csv input
5) Create the HBase table with the following schema (one column family, f1):
     hbase shell
     create "csv_table", "f1"
6) Run the program:
     hadoop jar dist/lib/cglHBaseSummerSchool.jar iu.pti.hbaseapp.CSV2HBase input/input.csv csv_table
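
After the job finishes, you can check that the rows were written by scanning the table from the HBase shell (a quick sanity check; the rows shown depend on the contents of input.csv):
     hbase shell
     scan 'csv_table'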