MapReduce
The framework supports:
- Splitting of data
- Passing the output of map functions to reduce functions
- Sorting the inputs to the reduce function based on the intermediate keys
- Quality of service
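To make this flow concrete, here is a minimal, framework-free Java sketch of the same steps for a word count: the input is split, map emits <word, 1> pairs, the pairs are grouped and sorted by intermediate key, and reduce sums each group. The class and method names are illustrative only and do not belong to any particular MapReduce runtime.

    import java.util.*;

    // Minimal, framework-free sketch of the flow above: split the input, run map on
    // each split, group and sort the intermediate <key, value> pairs by key, then run
    // reduce on each <key, list-of-values> group. All names here are illustrative.
    public class WordCountSketch {

        // Map: emit a <word, 1> pair for every word in one input split.
        static List<Map.Entry<String, Integer>> map(String split) {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : split.split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new AbstractMap.SimpleEntry<>(word.toLowerCase(), 1));
                }
            }
            return out;
        }

        // Reduce: sum the values grouped under one intermediate key.
        static int reduce(String key, List<Integer> values) {
            int sum = 0;
            for (int v : values) sum += v;
            return sum;
        }

        public static void main(String[] args) {
            // Data splitting: each string stands in for one input split.
            String[] splits = { "map reduce map", "reduce reduce map" };

            // Shuffle/sort: group map outputs by intermediate key (TreeMap keeps keys sorted).
            SortedMap<String, List<Integer>> grouped = new TreeMap<>();
            for (String split : splits) {
                for (Map.Entry<String, Integer> kv : map(split)) {
                    grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
                }
            }

            // Reduce phase: one call per intermediate key.
            for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                System.out.println(e.getKey() + " -> " + reduce(e.getKey(), e.getValue()));
            }
        }
    }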


CGL-MapReduce – A Streaming-based MapReduce Runtime
- A streaming-based MapReduce runtime implemented in Java
- All communication (control messages and intermediate results) is routed via a content dissemination network
- Intermediate results are transferred directly from the map tasks to the reduce tasks, eliminating intermediate local files
- MRDriver
  - Maintains the state of the system
  - Controls the execution of map/reduce tasks
- The user program is the composer of MapReduce computations
- Supports both single-step and iterative MapReduce computations
CGL-MapReduce – Programming Model
- Initialization
  - Start the map/reduce workers
  - Configure the map/reduce tasks (with configuration parameters and any fixed data)
- Map
  - Execute the map tasks, passing <key, value> pairs
  - The content dissemination network transfers the map outputs directly to the reduce tasks
- Reduce
  - Execute the reduce tasks, passing <key, List<value>> pairs
  - The content dissemination network transfers the reduce outputs to the combine task
- Combine
  - Combine the outputs of all the reduce tasks
- Termination
  - Terminate the map/reduce workers
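A single-process Java sketch of this lifecycle, using power iteration as the example computation: the matrix rows play the role of the fixed data configured once per map task, and the current vector is the variable data passed at every iteration. The names are hypothetical and only mirror the steps above; this is not the CGL-MapReduce API.

    import java.util.*;

    // Illustrative, single-process sketch of the lifecycle above. Each "map task"
    // is configured once with a fixed matrix row; the current vector is the
    // variable data sent every iteration; the driver iterates until convergence.
    public class IterativeMapReduceSketch {

        public static void main(String[] args) {
            // Initialization: configure the map tasks with fixed data (one matrix row each).
            double[][] rows = {{2, 1}, {1, 3}};
            double[] v = {1, 1};
            double eigenvalue = 0;

            for (int iter = 0; iter < 100; iter++) {
                // Map: task i emits <i, dot(row_i, v)>; outputs go straight to the reducers.
                Map<Integer, Double> mapOut = new TreeMap<>();
                for (int i = 0; i < rows.length; i++) {
                    double dot = 0;
                    for (int j = 0; j < v.length; j++) dot += rows[i][j] * v[j];
                    mapOut.put(i, dot);
                }

                // Reduce: with a single value per key, each reducer just forwards it.
                double[] w = new double[v.length];
                for (Map.Entry<Integer, Double> kv : mapOut.entrySet()) w[kv.getKey()] = kv.getValue();

                // Combine: merge the reduce outputs, normalize, and test for convergence.
                double norm = 0;
                for (double x : w) norm += x * x;
                norm = Math.sqrt(norm);
                for (int i = 0; i < w.length; i++) w[i] /= norm;

                if (Math.abs(norm - eigenvalue) < 1e-12) break;   // converged: stop iterating
                eigenvalue = norm;
                v = w;                                            // feed back for the next iteration
            }
            System.out.println("Dominant eigenvalue estimate: " + eigenvalue);
            // Termination: a real runtime would now shut down the map/reduce workers.
        }
    }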

MapReduce for HEP data analysis 
High Energy Physics (HEP) Data Analysis
HEP data analysis: execution time vs. the volume of data (fixed compute resources)
- Hadoop and CGL-MapReduce both show similar performance
- The amount of data accessed in each analysis is extremely large
- Performance is limited by the I/O bandwidth
- The overhead induced by the MapReduce implementations has negligible effect on the overall computation

Results of HEP data analysis

Kmeans Clustering
Overheads of different runtimes vs. the number of data points
- All three implementations perform the same Kmeans clustering algorithm
- Each test is performed using 5 compute nodes (a total of 40 processor cores)
- Overhead diminishes as the amount of computation increases
- Loosely synchronous MapReduce (CGL-MapReduce) also shows overheads close to MPI for sufficiently large problems
- Hadoop's high execution time is due to:
  - Lack of support for iterative MapReduce computations
  - Overhead associated with file-system-based communication
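For reference, a single-process Java sketch of the iterative MapReduce structure of Kmeans: map assigns each point in a partition to its nearest centroid and emits partial sums keyed by centroid index, reduce averages the sums per centroid, and the driver iterates until the centroids stop moving. The names and data are illustrative, not the benchmarked code.

    import java.util.*;

    // Sketch of Kmeans as an iterative MapReduce computation. Map: assign each point
    // to its nearest centroid and accumulate a partial sum per centroid index (the
    // intermediate key). Reduce/combine: average the sums into new centroids, then
    // iterate until the centroids stop moving.
    public class KmeansMapReduceSketch {

        public static void main(String[] args) {
            double[][][] partitions = {                        // two data partitions of 2-D points
                {{1.0, 1.0}, {1.2, 0.8}, {8.0, 8.0}},
                {{0.9, 1.1}, {8.2, 7.9}, {7.8, 8.1}}
            };
            double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}}; // initial guesses

            for (int iter = 0; iter < 50; iter++) {
                // Map phase over every partition: nearest-centroid assignment and partial sums.
                double[][] sums = new double[centroids.length][2];
                int[] counts = new int[centroids.length];
                for (double[][] part : partitions) {
                    for (double[] p : part) {
                        int best = 0;
                        double bestDist = Double.MAX_VALUE;
                        for (int c = 0; c < centroids.length; c++) {
                            double d = Math.pow(p[0] - centroids[c][0], 2)
                                     + Math.pow(p[1] - centroids[c][1], 2);
                            if (d < bestDist) { bestDist = d; best = c; }
                        }
                        sums[best][0] += p[0];
                        sums[best][1] += p[1];
                        counts[best]++;
                    }
                }

                // Reduce + combine: new centroid = partial sum / count; track total movement.
                double moved = 0;
                for (int c = 0; c < centroids.length; c++) {
                    if (counts[c] == 0) continue;              // keep empty clusters in place
                    double nx = sums[c][0] / counts[c];
                    double ny = sums[c][1] / counts[c];
                    moved += Math.abs(nx - centroids[c][0]) + Math.abs(ny - centroids[c][1]);
                    centroids[c][0] = nx;
                    centroids[c][1] = ny;
                }
                if (moved < 1e-9) break;                       // converged: stop iterating
            }
            System.out.println("Centroids: " + Arrays.deepToString(centroids));
        }
    }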

MapReduce for Matrix Multiplication

Matrix Multiplication
Matrix multiplication time vs. dimension of a matrix
- Each test is performed using 5 compute nodes (a total of 40 processor cores)
- An increase in the volume of data requires more iterations
- More iterations result in higher overheads in Hadoop due to its lack of support for iterative computations
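One simple way to cast C = A x B as a MapReduce computation is sketched below in plain Java: each map task multiplies its row block of A by B, and the combine step assembles the row blocks of C. This is an illustrative, single-pass decomposition; the benchmarked runs process the matrices in blocks over multiple iterations, which this sketch does not show.

    import java.util.*;

    // Sketch of matrix multiplication as a MapReduce computation: each map task
    // computes one row block of C = A x B, and the combine step stacks the blocks.
    public class MatrixMultiplySketch {

        // Map: multiply one row block of A by the full matrix B.
        static double[][] mapRowBlock(double[][] aBlock, double[][] b) {
            int n = b[0].length;
            double[][] cBlock = new double[aBlock.length][n];
            for (int i = 0; i < aBlock.length; i++)
                for (int k = 0; k < b.length; k++)
                    for (int j = 0; j < n; j++)
                        cBlock[i][j] += aBlock[i][k] * b[k][j];
            return cBlock;
        }

        public static void main(String[] args) {
            double[][] a = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};
            double[][] b = {{1, 0}, {0, 1}};

            // Data splitting: divide A into row blocks, one per map task.
            double[][][] blocks = {
                Arrays.copyOfRange(a, 0, 2),
                Arrays.copyOfRange(a, 2, 4)
            };

            // Map phase: each "task" produces its row block of C.
            List<double[][]> partials = new ArrayList<>();
            for (double[][] blk : blocks) partials.add(mapRowBlock(blk, b));

            // Combine: stack the row blocks back into the full result matrix.
            double[][] c = new double[a.length][b[0].length];
            int row = 0;
            for (double[][] blk : partials)
                for (double[] r : blk) c[row++] = r.clone();

            System.out.println(Arrays.deepToString(c));
        }
    }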

Conclusion and Future Work
- Given sufficiently large problems, all runtimes converge in performance
- Streaming-based MapReduce implementations provide the faster performance necessary for most composable applications
- Support for iterative MapReduce computations expands the usability of MapReduce runtimes
- Integration of HDFS or a similar parallel file system for input/output data
- A fault-tolerance architecture for CGL-MapReduce