![[CleanShot 2022-09-07 at 02.55.46@2x.png]]
Fig.1. Execution model of MPC, with an example involving both MPC-tasks and MPC-threads (hybrid distributed/shared memory approach)
The Message Passing Interface (MPI) has become a very successful parallel programming environment for distributed memory architectures such as clusters.
- However, the architecture of cluster nodes is currently evolving from small symmetric shared memory multiprocessors to massively multicore, Non-Uniform Memory Access (NUMA) hardware
- In this paper, we present the MultiProcessor Communications environment (MPC), which aims at providing programmers with an efficient runtime system for their existing MPI, POSIX Thread or hybrid MPI+Thread applications.
- The emergence of these deeply hierarchical architectures raises the need for a careful distribution of threads and data
- Parallel programming methods have to match the underlying architecture closely to achieve high performance
- Shared memory approaches, based on explicit multithreading, are better suited to shared memory architectures
- Hybrid MPI + OpenMP Approaches
- Advanced MPI Implementations
- Process Virtualization
- Separate runtimes, such as MPI and OpenMP, are each highly optimized for specific architectures, but are not aware of each other
- The aim of the lazy update method is to perform the lowest number of updates
- It only requires a check at each collective communication call to determine if the current core used by the calling task is the same as the one used for the previous collective call
- In the migration case, the calling task is first temporarily moved back to its previous core, where it performs the collective communication call
- Finally, it schedules a collective communication initialization which will be performed by all tasks at the next collective call
- An advection benchmark
- In SPMD parallel mode, the scheme just requires one point-to-point communication per direction (update of ghost cells), and one reduction for the prediction of the next time step (CFL condition).
- The second numerical kernel is a conduction benchmark, corresponding to a 2D Cartesian grid implicit heat conduction solver, based on a five-point stencil and a Conjugate Gradient method with diagonal preconditioning.
- MPI and MPC implementations reach similar performance on both architectures
- Overloading reduces execution time by more than 10% and up to 20%, without any modification to the original code
- Memory Allocation and Data Placement Results
- To evaluate the NUMA-aware and thread-aware performance of MPC, the advection benchmark was run on TERA-10
- Results are given in Figure 3
- The first part of each curve illustrates the scalability of MPC versus MPI
- Comparison of MPI Bull to MPC Bull shows that MPC achieves rather good performance
- Evaluation of scalability and overloading method on representative scientific computing 2D code
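
The lazy update method summarized in the notes above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: `Task`, `LazyCollectives`, and `allreduce_sum` are hypothetical names invented for illustration, not part of the MPC API, and the "collective" is a plain Python sum rather than real communication. It shows the three steps from the notes: one core check per collective call, a temporary move back to the previous core on migration, and a re-initialization deferred to the next collective call.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """Hypothetical task that the scheduler may migrate between cores."""
    rank: int
    core: int                  # core the task currently runs on
    last_collective_core: int  # core used at the previous collective call


@dataclass
class LazyCollectives:
    """Hypothetical runtime state shared by all tasks (illustration only)."""
    reinit_pending: bool = False
    reinit_count: int = 0
    temporary_moves: List[int] = field(default_factory=list)

    def allreduce_sum(self, tasks: List[Task], values: List[int]) -> int:
        # Deferred step: a migration seen at the previous collective call
        # triggers one re-initialization, performed by all tasks, now.
        if self.reinit_pending:
            for task in tasks:
                task.last_collective_core = task.core
            self.reinit_count += 1
            self.reinit_pending = False

        # Cheap check: is each task still on the core it used last time?
        migrated = []
        for task in tasks:
            if task.core != task.last_collective_core:
                # Migration case: temporarily move back to the previous core
                # so the existing collective structures remain valid.
                migrated.append((task, task.core))
                task.core = task.last_collective_core
                self.temporary_moves.append(task.rank)
                # Schedule the re-initialization for the next collective.
                self.reinit_pending = True

        result = sum(values)  # stand-in for the real collective operation

        # Migrated tasks return to the core the scheduler chose for them.
        for task, new_core in migrated:
            task.core = new_core
        return result
```

In this sketch a task that migrates several times between two collective calls still causes only one re-initialization, which is the sense in which the update count is kept low.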
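
The communication pattern of the advection benchmark (a ghost-cell update per direction, plus one reduction for the next time step) can be sketched in one dimension. The notes do not give the numerical scheme or the time-step formula, so everything here is an assumption for illustration: the subdomains live in one Python process, boundaries are periodic, and the step uses the textbook bound `cfl * dx / max|u|`. The function names are hypothetical.

```python
from typing import List


def exchange_ghost_cells(subdomains: List[List[float]]) -> None:
    """Fill ghost cells in place.

    Each subdomain is laid out as [ghost_left, interior..., ghost_right].
    Periodic neighbours are assumed. Only ghost cells are written and only
    interior cells are read, so the update order does not matter.
    """
    n = len(subdomains)
    for r, sub in enumerate(subdomains):
        left = subdomains[(r - 1) % n]
        right = subdomains[(r + 1) % n]
        sub[0] = left[-2]    # last interior cell of the left neighbour
        sub[-1] = right[1]   # first interior cell of the right neighbour


def local_time_step(dx: float, velocities: List[float], cfl: float = 0.5) -> float:
    """Largest stable step for one subdomain under the assumed CFL bound."""
    max_speed = max(abs(u) for u in velocities)
    if max_speed == 0.0:
        return float("inf")  # nothing moves, so no restriction from this subdomain
    return cfl * dx / max_speed


def global_time_step(local_steps: List[float]) -> float:
    """Stand-in for the single min-reduction across all tasks."""
    return min(local_steps)
```

In a real SPMD code, `exchange_ghost_cells` would correspond to point-to-point messages between neighbouring tasks and `global_time_step` to a single reduction with a minimum operator.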
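
For the conduction benchmark, the notes mention a five-point stencil and a Conjugate Gradient method with diagonal preconditioning. A minimal serial sketch of that combination is given below. It is not the benchmark's code: the operator `(1 + 4*alpha) * u - alpha * (sum of four neighbours)`, the zero boundary values, and the function names are assumptions chosen to give a small symmetric positive definite system.

```python
from typing import List

Grid = List[List[float]]


def apply_operator(u: Grid, alpha: float) -> Grid:
    """Assumed implicit heat operator on a five-point stencil.

    Cells outside the grid are treated as zero.
    """
    n, m = len(u), len(u[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            if i > 0:
                s += u[i - 1][j]
            if i < n - 1:
                s += u[i + 1][j]
            if j > 0:
                s += u[i][j - 1]
            if j < m - 1:
                s += u[i][j + 1]
            out[i][j] = (1.0 + 4.0 * alpha) * u[i][j] - alpha * s
    return out


def dot(a: Grid, b: Grid) -> float:
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def axpy(scale: float, x: Grid, y: Grid) -> Grid:
    """Return scale * x + y."""
    return [[scale * xi + yi for xi, yi in zip(rx, ry)] for rx, ry in zip(x, y)]


def pcg(b: Grid, alpha: float, tol: float = 1e-10, max_iter: int = 500) -> Grid:
    """Conjugate Gradient with a diagonal (Jacobi) preconditioner."""
    diag = 1.0 + 4.0 * alpha              # constant diagonal of the operator
    x = [[0.0] * len(b[0]) for _ in b]    # initial guess: zero
    r = [row[:] for row in b]             # residual b - A x for x = 0
    z = [[v / diag for v in row] for row in r]
    p = [row[:] for row in z]
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        ap = apply_operator(p, alpha)
        step = rz / dot(p, ap)
        x = axpy(step, p, x)
        r = axpy(-step, ap, r)
        z = [[v / diag for v in row] for row in r]
        rz_new = dot(r, z)
        p = axpy(rz_new / rz, p, z)
        rz = rz_new
    return x
```

With this particular operator the diagonal is constant, so the preconditioner only rescales the residual; it is kept to show where the preconditioning step sits. In a distributed version, each `dot` would become a global reduction and `apply_operator` would need up-to-date ghost cells from neighbouring tasks.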