Introduction

Most scientific data analyses involve processing voluminous data collected from various instruments. Efficient parallel/concurrent algorithms and frameworks are key to meeting the scalability and performance requirements of such scientific data analyses. The recently introduced MapReduce technique has gained much attention from the scientific community for its applicability to large parallel data analyses. Although there have been many evaluations of the MapReduce technique on large textual data collections, there have been only a few evaluations on scientific data analyses. The goals of this paper are twofold. First, we present our experience in applying the MapReduce technique to two scientific data analyses: (i) High Energy Physics (HEP) data analysis and (ii) Kmeans clustering. Second, we present CGL-MapReduce, a streaming-based MapReduce implementation, and compare its performance with Hadoop.

 
 
MapReduce

The framework supports:

  • Splitting of data
  • Passing the output of map functions to reduce functions
  • Sorting the inputs to the reduce function based on the intermediate keys
  • Quality of service
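To make these responsibilities concrete, below is a toy, single-process word count in Java that mimics the split / map / sort-by-key / reduce flow listed above. All class and method names are ours, for illustration only; this is not the API of Hadoop or CGL-MapReduce.

import java.util.*;

// Toy illustration of MapReduce semantics: split the data, run map on each
// split, group/sort intermediate pairs by key, then run reduce per key.
public class ToyMapReduce {

    // Map: emit a <word, 1> pair for every word in an input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Reduce: sum the values grouped under one intermediate key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        String[] splits = {"map reduce map", "reduce reduce"}; // the "split" data

        // Shuffle/sort: group intermediate pairs by key; TreeMap keeps keys sorted.
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (String split : splits) {
            for (Map.Entry<String, Integer> kv : map(split)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }

        // Reduce each group and print the final <key, value> pairs.
        groups.forEach((k, vs) -> System.out.println(k + " -> " + reduce(k, vs)));
    }
}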

 
 

CGL-MapReduce – A Streaming-based MapReduce Runtime
  • A streaming-based MapReduce runtime implemented in Java
  • All communication (control messages and intermediate results) is routed via a content dissemination network
  • Intermediate results are transferred directly from the map tasks to the reduce tasks – eliminating the intermediate local files
  • MRDriver
    • Maintains the state of the system
    • Controls the execution of map/reduce tasks
  • The user program is the composer of MapReduce computations
  • Supports both single-step and iterative MapReduce computations
 
 
CGL-MapReduce – Programming Model
  • Initialization
    • Start the map/reduce workers
    • Configure both map/reduce tasks (for configuration/fixed data)
  • Map
    • Execute map tasks passing <key, value> pairs
    • Content dissemination network transfers the map outputs directly to the reduce tasks
  • Reduce
    • Execute reduce tasks passing <key, List<value>> pairs
    • Content dissemination network transfers the reduce outputs to the combine task
  • Combine
    • Combine the outputs of all the reduce tasks
  • Termination
    • Terminate the map/reduce workers
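The following compilable Java sketch walks through this five-phase sequence in a single process. Only MRDriver and the phase names come from the text; every type and method below is a placeholder of ours, not the actual CGL-MapReduce API.

import java.util.*;

public class PhaseSketch {

    interface MapTask     { List<Map.Entry<String, Double>> map(String key, String value); }
    interface ReduceTask  { Map.Entry<String, Double> reduce(String key, List<Double> values); }
    interface CombineTask { double combine(List<Map.Entry<String, Double>> reduceOutputs); }

    public static void main(String[] args) {
        // Initialization: start the workers and configure the map/reduce tasks
        // once with fixed data, so iterative runs do not repeat this step.
        MapTask mapper = (k, v) ->
                List.of(new AbstractMap.SimpleEntry<>(k, (double) v.length()));
        ReduceTask reducer = (k, vs) ->
                new AbstractMap.SimpleEntry<>(k, vs.stream().mapToDouble(d -> d).sum());
        CombineTask combiner = outs ->
                outs.stream().mapToDouble(Map.Entry::getValue).sum();

        // Map: execute map tasks on <key, value> pairs; in CGL-MapReduce the
        // content dissemination network would stream each output straight to
        // a reduce task instead of collecting it locally as done here.
        List<Map.Entry<String, Double>> mapOut = new ArrayList<>();
        mapOut.addAll(mapper.map("split-0", "some input data"));
        mapOut.addAll(mapper.map("split-1", "more input"));

        // Reduce: group the intermediate pairs into <key, List<value>>.
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (Map.Entry<String, Double> kv : mapOut)
            grouped.computeIfAbsent(kv.getKey(), x -> new ArrayList<>()).add(kv.getValue());
        List<Map.Entry<String, Double>> reduceOut = new ArrayList<>();
        grouped.forEach((k, vs) -> reduceOut.add(reducer.reduce(k, vs)));

        // Combine: merge all reduce outputs into a single result.
        System.out.println("combined result = " + combiner.combine(reduceOut));

        // Termination: the runtime would shut down the map/reduce workers here.
    }
}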

 
 

MapReduce for HEP Data Analysis

HEP data analysis: execution time vs. the volume of data (fixed compute resources)

  • Hadoop and CGL-MapReduce both show similar performance
  • The amount of data accessed in each analysis is extremely large
  • Performance is limited by the I/O bandwidth
  • The overhead induced by the MapReduce implementations has a negligible effect on the overall computation (the map/reduce decomposition is sketched below)
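As a rough illustration of how such an analysis fits the model, a histogram-style HEP computation can be decomposed as follows: each map task histograms the events of one input file, and the reduce/combine steps merge histograms by adding bin counts. The bin count, value range, and inline event data below are assumptions of ours, not details of the benchmark.

import java.util.Arrays;

public class HepHistogramSketch {
    static final int BINS = 100;
    static final double MIN = 0.0, MAX = 200.0;  // assumed value range

    // Map: histogram the event values of one input split/file.
    static long[] map(double[] events) {
        long[] hist = new long[BINS];
        for (double e : events) {
            int bin = (int) ((e - MIN) / (MAX - MIN) * BINS);
            if (bin >= 0 && bin < BINS) hist[bin]++;
        }
        return hist;
    }

    // Reduce/combine: merging histograms is element-wise addition of bin
    // counts, so the same merge works at both stages.
    static long[] merge(long[] a, long[] b) {
        long[] out = a.clone();
        for (int i = 0; i < BINS; i++) out[i] += b[i];
        return out;
    }

    public static void main(String[] args) {
        long[] h1 = map(new double[] {12.5, 91.2, 91.3});  // map task 1
        long[] h2 = map(new double[] {91.1, 125.0});       // map task 2
        long[] total = merge(h1, h2);                      // reduce/combine
        System.out.println("total events = " + Arrays.stream(total).sum());
    }
}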
 
 

Results of HEP data analysis

 
 
Kmeans Clustering

Overheads of different runtimes vs. the number of data points

  • All three implementations perform the same Kmeans clustering algorithm
  • Each test is performed using 5 compute nodes (a total of 40 processor cores)
  • Overhead diminishes as the amount of computation grows
  • Loosely synchronous MapReduce (CGL-MapReduce) also shows overheads close to those of MPI for sufficiently large problems
  • Hadoop's high execution time is due to:
    • Lack of support for iterative MapReduce computations (the iterative pattern is sketched after this list)
    • Overhead associated with file-system-based communication
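For reference, the iterative pattern looks roughly as follows (Kmeans in one dimension for brevity). In CGL-MapReduce the fixed data points would be configured into the map tasks once, and only the small centroid array would travel each iteration; the names and the 1-D simplification are our assumptions, not the benchmarked code.

import java.util.List;

public class KmeansSketch {

    // Map: over one split of points, emit per-centroid [sum, count] pairs.
    static double[][] map(double[] split, double[] centroids) {
        double[][] partial = new double[centroids.length][2];
        for (double p : split) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++)
                if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
            partial[best][0] += p;  // running sum for the nearest centroid
            partial[best][1] += 1;  // point count for the nearest centroid
        }
        return partial;
    }

    // Reduce/combine: merge partial sums and recompute each centroid.
    static double[] reduce(List<double[][]> partials, int k) {
        double[] next = new double[k];
        for (int c = 0; c < k; c++) {
            double sum = 0, count = 0;
            for (double[][] p : partials) { sum += p[c][0]; count += p[c][1]; }
            next[c] = count > 0 ? sum / count : 0;
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] splits = { {1.0, 1.2, 0.8}, {9.9, 10.1, 10.0} }; // two map splits
        double[] centroids = {0.0, 5.0};
        for (int iter = 0; iter < 10; iter++) {  // driver-side iteration
            centroids = reduce(List.of(map(splits[0], centroids),
                                       map(splits[1], centroids)), centroids.length);
        }
        System.out.println(centroids[0] + ", " + centroids[1]);
    }
}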

 

 
MapReduce for Matrix Multiplication

Matrix multiplication time vs. dimension of a matrix

  • Each test is performed using 5 compute nodes (a total of 40 processor cores)
  • An increase in the volume of data requires more iterations
  • More iterations result in higher overheads in Hadoop due to its lack of support for iterative computations (a block decomposition is sketched below)
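One plausible decomposition that produces this iterative behavior is sketched below: A is split into row blocks (one per map task), each iteration broadcasts one column block of B, each map task multiplies its row block by it, and the combine step places the resulting sub-block of C. The blocking scheme here is our assumption, not necessarily the algorithm benchmarked above.

public class MatMulSketch {

    // Map: multiply one row block of A by the broadcast column block of B.
    static double[][] map(double[][] aRows, double[][] bCols) {
        int n = aRows.length, m = bCols[0].length, k = bCols.length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                for (int t = 0; t < k; t++)
                    c[i][j] += aRows[i][t] * bCols[t][j];
        return c;
    }

    public static void main(String[] args) {
        double[][] aBlock = { {1, 2}, {3, 4} };  // one map task's rows of A
        double[][] bCol   = { {5}, {6} };        // one iteration's column block of B
        double[][] cBlock = map(aBlock, bCol);   // combine would place this in C
        System.out.println(cBlock[0][0] + " " + cBlock[1][0]);  // 17.0 39.0
    }
}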
 

Conclusion and Future Work
  • Given sufficiently large problems, all runtimes converge in performance
  • Streaming-based MapReduce implementations provide the faster performance necessary for most composable applications
  • Support for iterative MapReduce computations expands the usability of MapReduce runtimes
  • Future work: integration of HDFS or a similar parallel file system for input/output data
  • Future work: a fault-tolerance architecture for CGL-MapReduce