Workshop Schedule
(In Central Time)
July 26
- 10:00AM - Keynote: Data Intensive Computing
Alex Szalay
@The Johns Hopkins University
- 11:30AM - Break (lunch for Eastern, Central time)
- 12:30PM - Making the most of the I/O Software Stack
Rob Latham
@Argonne National Lab
- 2:00PM - Break (lunch Mountain, Pacific time)
- 3:00PM - Data movement & Storage (Data Capacitor WAN Filesystem)
Justin Miller
@Indiana University
- 4:00PM - Scalable and Distributed Visualization using Paraview
Eric Wernert
@Indiana University
- 5:30PM - Local Reception
July 27
- 10:00AM - Overview of FutureGrid
Geoffrey Fox
@Indiana University
- 10:45AM - Tutorial on using FutureGrid
Craig Stewart
@Indiana University
- 11:30AM - Break (lunch for Eastern, Central time)
- 12:30PM - Tutorial on MapReduce and Hadoop
SALSA Group
@Indiana University
- 2:00PM - Break (lunch Mountain, Pacific time)
- 3:00PM - Introduction to Amazon EC2
Thilina Gunarathne
@Indiana University & IBM Research
- 4:30PM - Break
- 5:00PM - Hands-on & Laboratory Time
- 7:00PM - Local Activities
July 28
- 10:00AM - Overview of Cloud Computing Platforms
Judy Qiu
@Indiana University
- 10:45AM - Introduction to Azure
Jaliya Ekanayake
@Indiana University & Microsoft Research
- 11:30AM - Break (lunch for Eastern, Central time)
- 12:30PM - Introduction to DryadLINQ
Christophe Poulain
@Microsoft Research
- 1:30PM - Iterative MapReduce
Jaliya Ekanayake
@Indiana University & Microsoft Research
- 2:00PM - Break (lunch Mountain, Pacific time)
- 3:00PM - Iterative MapReduce (continued)
Jaliya Ekanayake
@Indiana University & Microsoft Research
- 4:30PM - AzureMapReduce
Thilina Gunarathne
@Indiana University & IBM Research
- 5:00PM - Hands-on & Laboratory Time
- 7:00PM - Local Activities
July 29
- 10:00AM - Data transport (with specific TG examples) and file systems
Mahidhar Tatineni
@SDSC
- 11:30AM - Break (lunch for Eastern, Central time)
- 12:30PM - Studying Science from Large-Scale Usage Data
Johan Bollen
@Indiana University
- 2:00PM - Break (lunch Mountain, Pacific time)
- 3:00PM - Big Data in Drug Discovery
David Wild
@Indiana University
- 4:00PM - Cancer epigenomics study using the next generation sequencing data
Sun Kim
@Indiana University
- 4:30PM - Hands-on & Laboratory Time
- 6:00PM - Local Activities
July 30
- 10:00AM - Keynote: Distributed Data-Parallel Computing (Sector/Sphere)
Robert Grossman
@University of Illinois at Chicago
Tutorial: Sector/Sphere Installation and Usage
- 11:30AM - Break (lunch for Eastern, Central time)
- 12:30PM - Plug-and-play virtual appliance clusters running Hadoop
Renato Figueiredo
@University of Florida
- 3:00PM - Virtual Observatory Technologies
Tamas Budavari
@The Johns Hopkins University
- 4:00PM - Final Q&A; Surveys; Adjourn
Big Data for Science Workshop
July 26-30, 2010, NCSA Summer School

Humans are generating, sensing, and harvesting massive amounts of digital data, and many of these unprecedentedly large data sets will be archived in their entirety. We find ourselves surrounded by huge volumes of "data at rest," that is, data written once and destined to live forever. Data movement will become the exception rather than the rule.
Digital data owners will control the data distribution channels via "cloud computing" infrastructure where data is unstructured and devoid of schema, begging for semantic metadata, preservation, and curation. The familiar notions of sequential or random access files no longer apply in the cloud. Instead developers will write code that mines this mass of unstructured data, extracts what is of interest, and then inserts the resulting data subset into a relational database or other structured data store where it will be analyzed and visualized.
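As a rough illustration of the mine, extract, and insert workflow described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the sample lines, the `PATTERN` regular expression, the `observations` table, and the use of an in-memory SQLite database as the structured store are hypothetical and are not taken from the workshop material.

```python
import re
import sqlite3

# Hypothetical unstructured input: free-form lines with no schema.
RAW_LINES = [
    "obs 2010-07-26 star=Vega mag=0.03 note: clear sky",
    "random chatter with nothing of interest",
    "obs 2010-07-27 star=Sirius mag=-1.46",
]

# Assumed pattern describing the records "of interest".
PATTERN = re.compile(r"obs (\d{4}-\d{2}-\d{2}) star=(\w+) mag=(-?\d+(?:\.\d+)?)")


def extract(lines):
    """Yield (date, star, magnitude) tuples from lines that match PATTERN."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            date, star, mag = match.groups()
            yield date, star, float(mag)


def load(rows):
    """Insert extracted rows into an in-memory SQLite table; return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations (date TEXT, star TEXT, magnitude REAL)"
    )
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load(extract(RAW_LINES))

# Once the subset is structured, ordinary SQL can analyze it.
row_count = conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]
brightest = conn.execute(
    "SELECT star FROM observations ORDER BY magnitude ASC LIMIT 1"
).fetchone()[0]
print(row_count, brightest)
```

The sketch runs on a single machine. In the setting the paragraph describes, the extraction step would typically run where the data lives, and only the much smaller structured subset would be moved.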
The disciplines on the forefront of this paradigm shift are astroscience, bioscience, geoscience, and the social sciences. Science communities will learn how to manage this morass of data by refining the techniques pioneered by Google and Facebook and, more importantly, by inventing new techniques that meet the specific demands of their scientific disciplines.
As the computing landscape becomes increasingly data-centric, computational scientists will employ new tools based on new models of computation. In a data-intensive world where the sheer volume of data demands new approaches and techniques, the inclination is to move the computation to the data, a basic theme underlying this course. Called the "fourth paradigm" (after theory, experiment, and computation), data-intensive computing is poised to transform scientific research.
Students will learn about:
- The notion of "data at rest" and its impact on data movement and computation
- The role of cloud infrastructure in data-intensive computing
- The need for semantic metadata, preservation, and curation of digital data
Participants will get hands-on programming experience with data-intensive programming models such as MapReduce.
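For readers new to MapReduce, the following is a minimal single-process sketch of the programming model in Python, using word counting as the example. It is not Hadoop, Twister, or any framework used in the course; the function names `map_phase`, `shuffle`, `reduce_phase`, and `map_reduce` are hypothetical, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = map_reduce(["big data for science", "big data at rest"])
print(counts["big"], counts["rest"])
```

Because the map step handles each document independently and the reduce step handles each key independently, both can be run in parallel next to the stored data, which is what makes the model attractive for "data at rest."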
Instructors:
Geoffrey C. Fox, distinguished scientist and director, Community Grids Lab, Pervasive Technology Institute, Indiana University
Judy Qiu, assistant director, Community Grids Lab, Pervasive Technology Institute, Indiana University
Prerequisites:
- Experience working in a Unix environment
- Experience developing and running scientific codes written in C, C++, Java, or a similar high-level programming language
Course outline:
- Opening Keynote: Data-intensive Computing
- Data Movement & Storage
- Data Mining
- Semantic Web
- Keynote: Distributed Data-Parallel Computing
- Cloud Computing Platforms (e.g., Hadoop, Azure)
- MapReduce for Big Data
- Hybrid Approaches to Big Data (e.g., Twister, HadoopDB, Sector/Sphere)
- MapReduce vs. SQL
- Performance Considerations
- Visualization of Large Data Sets
- Case Studies:
  - Astronomy
  - Bioinformatics
  - Earth Science
- Hands-on Lab
NOTE: Students are required to provide their own laptops.
The following sites are fully participating in the Big Data for Science course:
- Arkansas High Performance Computing Center, University of Arkansas, Fayetteville
- Electronic Visualization Laboratory, University of Illinois at Chicago
- Indiana University, Bloomington
  Location Info: Room 105, Indiana University Innovation Center, 2719 East 10th St, Bloomington, IN 47408
- Institute for Digital Research and Education, University of California, Los Angeles
- Michigan State University, East Lansing
- Pennsylvania State University, University Park
- University of Iowa, Iowa City
- University of Minnesota Supercomputing Institute, Minneapolis
- University of Notre Dame, Notre Dame, Indiana
- University of Texas at El Paso
The following sites will host remote presenters (with no audience):
- IBM Almaden Research Center (San Jose, California)
- University of Washington, Seattle
- San Diego Supercomputer Center