Multicore architectures are bringing parallel computing to a broad range of applications, with profound impact on hardware, systems software and applications [1] [2] [3]. The programming models and runtimes that will be used on multicore architectures are the subject of substantial academic and industry debate, as they must bridge between current commercial desktop and server systems, commercial parallel databases, distributed Grid environments and the massively parallel supercomputers largely aimed at science and engineering [4]. Dubey [5] has examined classes of possible future desktop applications, which he terms RMS (Recognition, Mining and Synthesis). This can be illustrated by a sophisticated datamining application that first accesses a parallel database and then runs analysis algorithms, including a multithreaded branch and bound search and an SVM (Support Vector Machine) built on parallel linear algebra, followed by sophisticated visualization. This composite application would need to be controlled by a coarse grain executive similar to Grid workflow [6] or Google MapReduce [7]. The individual datamining filters could use either the thread parallelism appropriate for the search algorithm or MPI-style messaging for the parallel linear algebra. Further, we would like this job to run efficiently and seamlessly either internally to a single CPU or across a tightly coupled cluster or distributed Grid. We start by addressing a small part of the multicore puzzle: what runtime could span these different environments and platforms and support the heterogeneous composite applications that could be common in future personal, commercial and research use of multicore systems.

We examine the issues of building a multi-paradigm runtime with a particular example, CCR (Concurrency and Coordination Runtime) [8] [9], a runtime designed for robotics applications but also investigated [10] as a general programming paradigm. CCR supports efficient thread management for handlers (continuations) spawned in response to messages being posted to ports. The ports (queues) are managed by CCR, which has several primitives supporting the initiation of handlers when different message/port assignments are recognized. Note that CCR supports a particular flavor of threading where information is passed by messages, allowing simple correctness criteria. We evaluate a possible low level runtime that could support a variety of different parallel programming models mapping down onto it. In particular, the new generation of parallel languages from DARPA's HPCS (High Productivity Computing Systems) program supports the three execution styles (dynamic threading, MPI, coarse grain functional parallelism) that we investigate here.

We conduct performance measurements of Microsoft's CCR as a multi-paradigm runtime supporting three key parallel models:

  • full MPI collective messaging
  • asynchronous threading
  • coarse grain functional parallelism or workflow

Basic computation test

The basic computation test measures the performance of a basic computation unit, which is defined as a variation of the summation:

\( \sum_{i=1}^{n} \frac{1}{i\,(i + \mathrm{const})} \)    (Equation 1)

The basic computation unit is measured as the execution time of the above computation running sequentially on a single core, in seconds (see table of Basic computation test). We code the basic computation unit so that it performs a fixed amount of computation, with n (in Equation 1) split into a core computation activity (inner loop of fixed size = 100) and repetitions (outer loop that varies up to loop = 10^7). The const parameter is set to 9.87654321. Based on this core computation unit (see code segment below), we use the loop and stage variables together to achieve an identical computational load in a suite of CCR tests of MPI-style patterns.

  double dsum = 0.0;
  const double c = 9.87654321;                  // the "const" parameter of Equation 1

  for (int j = 1; j <= loop; j++)               // repetitions (outer loop)
  {
      double sum = 0.0;

      for (int i = 1; i <= size; i++)           // core computation (inner loop, size = 100)
      {
          sum += 1.0 / (((double)i + c) * (double)i);
      }

      dsum += sum;                              // accumulate each repetition's partial sum
  }
  return dsum;
Code segment of the basic computation unit
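The per-unit times reported in Table 1 can, in principle, be obtained by wrapping this computation in a simple timer and dividing the elapsed time by the number of outer-loop repetitions. The sketch below shows one way to do this with System.Diagnostics.Stopwatch; it is our illustration rather than the original measurement code, and the class and method names (BasicComputationBenchmark, ComputeUnit) are ours.

  using System;
  using System.Diagnostics;

  class BasicComputationBenchmark
  {
      // ComputeUnit wraps the basic computation unit from the code segment above.
      static double ComputeUnit(int loop, int size)
      {
          const double c = 9.87654321;
          double dsum = 0.0;
          for (int j = 1; j <= loop; j++)
          {
              double sum = 0.0;
              for (int i = 1; i <= size; i++)
                  sum += 1.0 / (((double)i + c) * (double)i);
              dsum += sum;
          }
          return dsum;
      }

      static void Main()
      {
          int loop = 10000000;                  // 10^7 outer repetitions, as in the text
          int size = 100;                       // fixed inner-loop size
          Stopwatch sw = Stopwatch.StartNew();
          double checksum = ComputeUnit(loop, size);
          sw.Stop();
          double usPerUnit = sw.Elapsed.TotalMilliseconds * 1000.0 / loop;
          Console.WriteLine("Per-unit time: {0:F3} microseconds (checksum {1})",
                            usPerUnit, checksum);
      }
  }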

The corresponding benchmark performance of the individual multicore machines is shown in Table 1. The values are the averages obtained by running the unit computation program sequentially on the different machines that we examine (for further details, see Testing environment).

Table 1: Machines used and benchmark performance

Name    | Model                  | Benchmark performance | System details
AMD4    | HP xw9300 workstation  | 1.388 μs              | 2 AMD Opteron 275 CPUs @ 2.19 GHz, L2 cache 2x1 MB (per chip), 4 GB memory, XP Pro 64-bit
Intel4  | Dell Precision PWS670  | 1.475 μs              | 2 Intel Xeon CPUs @ 2.80 GHz, L2 cache 2x2 MB, 4 GB memory, XP Pro 64-bit
Intel8a | Dell Precision PWS690  | 1.696 μs              | 2 Intel Xeon E5320 CPUs @ 1.86 GHz, L2 cache 2x4 MB, 8 GB memory, XP Pro 64-bit
Intel8b | Dell Precision PWS690  | 1.188 μs              | 2 Intel Xeon 5355 CPUs @ 2.66 GHz, L2 cache 2x4 MB, 4 GB memory, Vista Ultimate 64-bit

CCR test

Concurrency and Coordination Runtime (CCR) provides a framework for building general collective communication where threads can write to a general set of ports and read one or more messages from one or more ports. The framework manages both ports and threads with optimized dispatchers that can efficiently iterate over multiple threads. The current applications and provided primitives support what we call the dynamic threading model with capabilities that include:

  1. FromHandler:   Spawn threads without reading ports
  2. Receive:   Each handler reads one item from a single port
  3. MultipleItemReceive:   Each handler reads a prescribed number of items of a given type from a given port. Note that items in a port can be general structures but all must have the same type.
  4. MultiplePortReceive:   Each handler reads one item of a given type from multiple ports.
  5. JoinedReceive:   Each handler reads one item from each of two ports. The items can be of different types.
  6. Choice:   Execute a choice of two or more port-handler pairings
  7. Interleave:   Consists of a set of arbiters (port-handler pairs) of three types: Concurrent, Exclusive or Teardown (called at the end for clean-up). Concurrent arbiters are run concurrently but exclusive handlers are not.
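As a minimal sketch of the dynamic threading model (capability 2, Receive), assuming the Microsoft.Ccr.Core API distributed with Microsoft Robotics Studio: a dispatcher owns a pool of threads, a persistent receiver is activated on a port, and each item posted to the port causes the handler to be scheduled on one of those threads. The class name and the messages are illustrative only.

  using System;
  using System.Threading;
  using Microsoft.Ccr.Core;

  class ReceiveSketch
  {
      static void Main()
      {
          // A thread count of 0 lets CCR size the pool to the number of cores.
          using (Dispatcher dispatcher = new Dispatcher(0, "ccr-pool"))
          using (DispatcherQueue queue = new DispatcherQueue("work", dispatcher))
          {
              Port<int> port = new Port<int>();

              // Persistent receiver: run the handler once for every posted item.
              Arbiter.Activate(queue, Arbiter.Receive(true, port,
                  delegate(int item)
                  {
                      Console.WriteLine("handled {0} on thread {1}",
                          item, Thread.CurrentThread.ManagedThreadId);
                  }));

              for (int i = 0; i < 8; i++)
                  port.Post(i);            // each post may spawn a concurrent handler

              Console.ReadLine();          // keep the process alive while handlers run
          }
      }
  }

Because any state a handler needs travels in the posted message rather than in shared variables, the simple correctness criteria mentioned earlier are preserved.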

As part of our experiment, we used the CCR framework to build five communication patterns, as shown in the figures below:

Five communication patterns using CCR to test spawned dynamic threading (a, b, c) and MPI-style rendezvous (d, e)

One can spawn threads that consume messages, as is natural in a dynamic search application where threads correspond to links in a tree (see results of spawned MPI patterns). However one can also have long running threads for the "Pipeline", "Shift" and "Two Shifts" patterns (see MPINonRendezvousSuite.cs), as well as the "Rendezvous" pattern where messages are sent and consumed at rendezvous points as used in traditional MPI applications (see results of Rendezvous patterns). Note that "active messages" correspond to the spawning model of CCR and can be straightforwardly supported. Further, CCR takes care of all the needed queuing and asynchronous operations that avoid race conditions in complex messaging. We built an optimized collective operation corresponding to the MPI "exchange" operation but used existing capabilities for the "reduce" and "shift" patterns (see overhead of MPI patterns). We believe one can extend this work to provide all MPI messaging patterns. Note that all our work was for managed code in C#, which is an important implementation language for commodity desktop applications.
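To make the rendezvous-style usage concrete, here is a rough sketch (our own construction, not the optimized exchange collective described above) of one exchange step built from the primitives listed earlier: every rank posts its value to each peer's port and then uses MultipleItemReceive so that its completion handler fires only once all cores - 1 incoming messages have arrived.

  using System;
  using Microsoft.Ccr.Core;

  class ExchangeSketch
  {
      static void Main()
      {
          int cores = 4;
          Port<double>[] ports = new Port<double>[cores];
          for (int i = 0; i < cores; i++) ports[i] = new Port<double>();
          Port<int> done = new Port<int>();

          using (Dispatcher dispatcher = new Dispatcher(cores, "exchange"))
          using (DispatcherQueue queue = new DispatcherQueue("q", dispatcher))
          {
              for (int rank = 0; rank < cores; rank++)
              {
                  int r = rank;                       // capture for the closures below

                  // Rendezvous: the handler runs only after all peer messages arrive.
                  Arbiter.Activate(queue, Arbiter.MultipleItemReceive(false, ports[r],
                      cores - 1, delegate(double[] items)
                      {
                          double sum = 0.0;
                          foreach (double v in items) sum += v;   // combine received values
                          done.Post(r);               // rank r completed its exchange
                      }));

                  // Send phase: post this rank's value to every other rank's port.
                  for (int peer = 0; peer < cores; peer++)
                      if (peer != r) ports[peer].Post((double)r);
              }

              // Report when all ranks have completed the exchange step.
              Arbiter.Activate(queue, Arbiter.MultipleItemReceive(false, done, cores,
                  delegate(int[] ranks) { Console.WriteLine("exchange complete"); }));

              Console.ReadLine();
          }
      }
  }

This sketch only shows how rendezvous semantics can be expressed with the stock primitives; the exchange collective whose overhead we report is a separately optimized implementation.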

We measure CCR message latency and bandwidth on two-processor multicore systems based on AMD and Intel architectures with a total of four and eight cores (see Testing environment). Generating up to 4 million messages per second on a single PC, we find latencies on the AMD-based PC ranging from 4 µs in basic asynchronous threading and point-to-point mode to 20 µs for a full MPI_SENDRECV exchange, with all threads (one per core) sending and receiving two messages at a traditional MPI loosely synchronous rendezvous. Workflow latencies are measured at 40 µs, with all results obtained using the CCR freely available as part of the Microsoft Robotics Studio distribution. The Intel latencies we present are significantly higher.

DSS test

Decentralized Software Services (DSS) is a lightweight, service oriented application model that sits on top of CCR. It is particularly suited for creating coarse grain applications in the Web style, as compositions of services running in a distributed environment. Services are isolated from each other, even when running within the same node, and are only exposed through their state and a uniform set of operations over that state. The DSS runtime provides a hosting environment for managing services and a set of infrastructure services that can be used for service creation, discovery, logging, debugging, monitoring, and security. DSS builds on the existing Web architecture and extends the application model provided by HTTP through structured data manipulation and event notification. Interaction with DSS services happens either through HTTP or through DSSP, a SOAP-based protocol for managing structured data manipulations and event notifications.

The communication between two DSS services is measured with "Get" ("request/response") and "Replace" ("notification") operations (see DSS test).
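As a simple illustration of the HTTP access path mentioned above (the latency measurements themselves use DSSP between two services), the sketch below issues a plain HTTP GET against a DSS service and prints the XML state it returns. The node address and service path are placeholders, not part of the measured configuration.

  using System;
  using System.IO;
  using System.Net;

  class DssHttpGetSketch
  {
      static void Main()
      {
          // Placeholder URI: a DSS node hosting a service instance at /myservice.
          string serviceUri = "http://localhost:50000/myservice";

          HttpWebRequest request = (HttpWebRequest)WebRequest.Create(serviceUri);
          request.Method = "GET";

          using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
          using (StreamReader reader = new StreamReader(response.GetResponseStream()))
          {
              // A DSS service exposes its current state, returned here as XML.
              Console.WriteLine(reader.ReadToEnd());
          }
      }
  }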

References

[1]   Annotated list of multicore Internet sites from Geoffrey Fox http://www.connotea.org/user/crmc/
[2]   Jack Dongarra Editor The Promise and Perils of the Coming Multicore Revolution and Its Impact, CTWatch Quarterly Vol 3 No. 1 February 07, http://www.ctwatch.org/quarterly/archives/february-2007
[3]   Herb Sutter, The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Dr. Dobb's Journal, 30(3), March 2005.
[4]   Geoffrey Fox tutorial at Microsoft Research Parallel Computing 2007: Lessons for a Multicore Future from the Past February 26 to March 1 2007
[5]   Pradeep Dubey Teraflops for the Masses: Killer Apps of Tomorrow Workshop on Edge Computing Using New Commodity Architectures, UNC 23 May 2006 http://gamma.cs.unc.edu/EDGE/SLIDES/dubey.pdf
[6]   Dennis Gannon and Geoffrey Fox, Workflow in Grid Systems Concurrency and Computation: Practice & Experience 18 (10), 1009-19 (Aug 2006), Editorial of special issue prepared from GGF10 Berlin http://grids.ucs.indiana.edu/ptliupages/publications/Workflow-overview.pdf
[7]   Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004. http://labs.google.com/papers/mapreduce.html
[8]   George Chrysanthakopoulos, Concurrency Runtime: An Asynchronous Messaging Library for C# 2.0, Channel9 Wiki, Microsoft. http://channel9.msdn.com/wiki/default.aspx/Channel9.ConcurrencyRuntime
[9]   Jeffrey Richter, Concurrent Affairs: Concurrency and Coordination Runtime, MSDN Magazine, Microsoft. http://msdn.microsoft.com/msdnmag/issues/06/09/ConcurrentAffairs/default.aspx
[10]   Georgio Chrysanthakopoulos and Satnam Singh, An Asynchronous Messaging Library for C#, Workshop on Synchronization and Concurrency in Object-Oriented Languages (SCOOL) at OOPSLA, San Diego, CA, October 2005. http://urresearch.rochester.edu/handle/1802/2105