We currently assume that the “kernels” of “scalable” parallel algorithms, applications and libraries will be built by experts, while a broader group of programmers composes library members into complete applications. Experience from parallel programming (largely for scientific applications) divides parallelism into two areas: first, building parallel “kernels” (libraries), and second, composing parallel library components into complete applications:
a) Scalable Parallel Components: There are no agreed, broadly applicable high-level programming environments for building library members. However, lower-level approaches in which experts define parallelism explicitly are available and have clear performance models. These include MPI and CCR for messaging, or simply locks within a single shared memory. Several patterns must be supported here, including the collective synchronization of MPI, the dynamic irregular thread parallelism needed in search algorithms, and more specialized cases such as discrete event simulation.
b) Composition of Parallel Components: The composition step has many excellent solutions, because it does not face the same drastic synchronization and correctness constraints as the scalable parallelism of a). Approaches to composition include task parallelism in languages such as C++, C#, Java and Fortran90; general scripting languages like Python; and domain-specific environments like MATLAB. More recent approaches include MapReduce, F# and DSS. Graphical interfaces were popularized by AVS and Khoros 10-15 years ago and are now seen in Grid/Web Service workflow systems such as Taverna, InforSense KDE, Pipeline Pilot (from SciTegic) and the LEAD environment built at Indiana University. Mash-ups from Web 2.0 are also usable here. Many scientific applications use MPI for the coarse-grain composition of b) as well as for the fine-grain parallelism of a). The new languages from DARPA’s HPCS program support task parallelism (composition of parallel components), but we expect that decoupling composition from scalable parallelism will remain popular and must be supported.
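The split between a) and b) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions and does not use MPI, CCR or DSS: the names `parallel_sum` and `compose` are hypothetical, the kernel uses standard-library threads with a lock and a barrier to stand in for explicit expert-written synchronization, and the composition step is a simple map-then-reduce that treats the kernel as a black box. In CPython the threads show the structure rather than a speedup.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def parallel_sum(data, num_workers=4):
    """Area a): a hypothetical kernel with explicit, expert-written synchronization."""
    shared = {"total": 0}
    lock = threading.Lock()
    barrier = threading.Barrier(num_workers)
    seen = [None] * num_workers

    def worker(rank):
        # Explicit decomposition: each worker owns a strided slice of the data.
        local = sum(data[rank::num_workers])
        # A lock protects the accumulator held in shared memory.
        with lock:
            shared["total"] += local
        # Collective synchronization: no worker continues until all have contributed.
        barrier.wait()
        # After the barrier every worker reads the same global value.
        seen[rank] = shared["total"]

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen[0]


def compose(kernel, combine, datasets, initial):
    """Area b): coarse-grain composition that never looks inside the kernel."""
    with ThreadPoolExecutor(max_workers=max(1, len(datasets))) as pool:
        partial_results = list(pool.map(kernel, datasets))
    return reduce(combine, partial_results, initial)


# Illustrative data only: three independent inputs handled by the same kernel.
datasets = [list(range(10)), list(range(10, 20)), list(range(20, 30))]
grand_total = compose(parallel_sum, lambda a, b: a + b, datasets, 0)
print(grand_total)
```

The composing programmer only calls `compose`; all locking and barrier logic stays inside the kernel, which is the decoupling the two areas describe.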
Note that a) and b) must be supported both inside chips (the multicore problem) and between machines in clusters (the traditional parallel computing problem) or Grids. The scalable parallelism problem a) is typically only interesting on true parallel computers, as the algorithms require low communication latency. However, composition b) appears similar in both parallel and distributed scenarios, and it seems useful to allow Grid and Web composition tools to be used for the parallel problem. We therefore suggest capturing parallel library members as (some variant of) services. For parallelism expressed in CCR, DSS represents the natural service model. Note that we are not assuming a uniform implementation; in fact, we expect good service composition inside a multicore chip to often require highly optimized communication mechanisms between services that minimize memory bandwidth use. Between systems, interoperability could motivate very different mechanisms for integrating services. Further, bandwidth and latency requirements relax as the grain size of services increases, which again suggests that the smaller services inside closely coupled cores and machines will have stringent communication requirements. The above discussion defines the “Service Aggregation” term in SALSA: library members will be built as services that can be used by non-expert programmers.
We generalize the well-known CSP (Communicating Sequential Processes) model of Hoare to describe the low-level approaches to area a) as “Linked Sequential Activities” in SALSA. We use the term activities so that services can be built from threads, processes (the usual MPI choice) or even other services. We choose the term linkage to denote the different ways of synchronizing parallel activities, which may involve shared memory rather than some form of messaging or communication.
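As a rough illustration of sequential activities joined by a linkage, the following sketch runs two activities as Python threads and links them with message queues. It is not SALSA, CCR or MPI code: the names `producer`, `squarer` and `run_pipeline` are hypothetical, `queue.Queue` is just one possible linkage (a shared-memory linkage would be equally valid), and the sketch assumes the input items never include `None`, which is used as the end-of-stream marker.

```python
import queue
import threading


def producer(out_link, items):
    """First activity: purely sequential code that sends items over its link."""
    for item in items:
        out_link.put(item)
    out_link.put(None)  # end-of-stream marker (assumes None is never a real item)


def squarer(in_link, out_link):
    """Second activity: sequential code that blocks on its input link."""
    while True:
        item = in_link.get()
        if item is None:
            out_link.put(None)  # forward the end-of-stream marker
            break
        out_link.put(item * item)


def run_pipeline(items):
    """Link the two activities with queues and collect the results in order."""
    first_link = queue.Queue()
    second_link = queue.Queue()
    activities = [
        threading.Thread(target=producer, args=(first_link, items)),
        threading.Thread(target=squarer, args=(first_link, second_link)),
    ]
    for activity in activities:
        activity.start()

    results = []
    while True:
        value = second_link.get()
        if value is None:
            break
        results.append(value)

    for activity in activities:
        activity.join()
    return results


print(run_pipeline([1, 2, 3]))
```

Each activity is ordinary sequential code; only the links carry synchronization, so the threads could in principle be replaced by processes or services without changing the body of either activity.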
There are several engineering and research issues glossed over above. We have already mentioned the critical problem of communication optimization. We still need to discuss what we mean by services, the requirements of multi-language support, and support for implementations on multicore, cluster or Grid infrastructure. Further, it seems useful to re-examine MPI and define a simpler model that naturally supports threads or processes and the full set of communication patterns mentioned above.