MPC—A Unified Parallel Runtime for Clusters of NUMA Machines

![[CleanShot 2022-09-07 at 02.55.46@2x.png]]

Fig.1. Execution model of MPC, with an example involving both MPC-tasks and MPC-threads (hybrid distributed/shared memory approach)

The Message Passing Interface (MPI) has become a very successful parallel programming environment for distributed memory architectures such as clusters

However, the architecture of cluster nodes is currently evolving from small symmetric shared memory multiprocessors to massively multicore, Non-Uniform Memory Access (NUMA) hardware
In this paper, we present the MultiProcessor Communications environment (MPC), which aims at providing programmers with an efficient runtime system for their existing MPI, POSIX Thread or hybrid MPI+Thread applications.

Currently, the architecture of cluster nodes is evolving from small symmetric shared memory multiprocessors towards massively multicore, Non-Uniform Memory Access (NUMA) hardware
The emergence of these deeply hierarchical architectures raises the need for a careful distribution of threads and data
Parallel programming methods have to closely match the underlying architecture to achieve high performance
Shared memory approaches, based on explicit multithreading, are better suited to shared memory architectures
Hybrid MPI + OpenMP Approaches
Advanced MPI Implementations
Process Virtualization
Programming models such as MPI and OpenMP are each highly optimized for specific architectures, but they are not designed to work well with each other

The aim of the lazy update method is to perform the lowest number of updates
It only requires a check at each collective communication call to determine if the current core used by the calling task is the same as the one used for the previous collective call
In the migration case, the calling task is first temporarily moved back to its previous core, where it performs the collective communication call
Finally, it schedules a collective communication initialization which will be performed by all tasks at the next collective call
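
The lazy update steps above can be illustrated with a minimal single-task sketch. It is not MPC's implementation: the `CollectiveState` class, the `collective_call` function and the step names are hypothetical, and the step that returns the task to its new core is an assumption inferred from the move being described as temporary. The sketch only records which actions would be taken at each collective call.

```python
class CollectiveState:
    """Hypothetical per-task bookkeeping for lazy collective updates."""

    def __init__(self, core):
        self.last_core = core        # core used at the previous collective call
        self.reinit_pending = False  # True when a re-initialization is scheduled


def collective_call(state, current_core):
    """Return the list of actions one task takes for one collective call."""
    steps = []
    if state.reinit_pending:
        # Re-initialization scheduled by an earlier migration is done now.
        steps.append("reinit")
        state.last_core = current_core
        state.reinit_pending = False
    if current_core != state.last_core:
        # Migration detected: reuse the old structures from the previous core,
        # then schedule a re-initialization for the next collective call.
        steps.append("move_to_previous_core")
        steps.append("collective")
        steps.append("return_to_new_core")  # assumption: the move is temporary
        state.reinit_pending = True
    else:
        # Common case: a single comparison, no update at all.
        steps.append("collective")
    return steps
```

In this model a task that never migrates pays only for one comparison per collective call, and a migration costs one temporary move plus one re-initialization, however many collective calls follow.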

An advection benchmark
In SPMD parallel mode, the scheme just requires one point-to-point communication per direction (update of ghost cells), and one reduction for the prediction of the next time step (CFL condition).
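
The communication pattern of such a scheme can be sketched in a single process. Everything here is an illustrative assumption rather than the benchmark's code: the problem is reduced to 1D, the scheme is first-order upwind with a constant positive velocity and periodic boundaries, list copies stand in for point-to-point messages, and `min` stands in for the reduction. Each subdomain stores one ghost cell at each end.

```python
def exchange_ghosts(subdomains):
    """Fill ghost cells from neighbours; stands in for point-to-point messages."""
    n = len(subdomains)
    for r, u in enumerate(subdomains):
        left = subdomains[(r - 1) % n]
        right = subdomains[(r + 1) % n]
        u[0] = left[-2]   # last interior cell of the left neighbour
        u[-1] = right[1]  # first interior cell of the right neighbour


def step(subdomains, velocity, dx, cfl=0.5):
    """Advance every subdomain by one upwind time step and return dt."""
    exchange_ghosts(subdomains)
    # Each "task" proposes a local time step; the minimum plays the role of
    # the reduction. With a constant velocity all proposals are equal.
    local_dts = [cfl * dx / abs(velocity) for _ in subdomains]
    dt = min(local_dts)
    c = velocity * dt / dx
    for u in subdomains:
        old = u[:]
        for i in range(1, len(u) - 1):
            # First-order upwind update, valid for velocity > 0.
            u[i] = old[i] - c * (old[i] - old[i - 1])
    return dt
```

With two subdomains of four interior cells each, one call to `step` moves part of each pulse one cell to the right while the sum over interior cells stays constant, which is a quick sanity check on the ghost cell exchange.
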
The second numerical kernel is a conduction benchmark, corresponding to a 2D Cartesian grid implicit heat conduction solver, based on a five-point stencil and a Conjugate Gradient method with diagonal preconditioning.
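
A minimal serial sketch of this kind of solver is given below. It is an illustration under stated assumptions, not the benchmark's code: the operator is taken to be I - k·Δ discretized with the five-point stencil on an n × n grid with zero Dirichlet boundaries, where k (a hypothetical parameter) stands for the diffusion coefficient times the time step divided by the squared grid spacing. The function names are also hypothetical.

```python
def apply_operator(u, k):
    """Return (I - k * Laplacian) u using the five-point stencil."""
    n = len(u)
    y = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = (1.0 + 4.0 * k) * u[i][j]
            if i > 0:
                s -= k * u[i - 1][j]
            if i < n - 1:
                s -= k * u[i + 1][j]
            if j > 0:
                s -= k * u[i][j - 1]
            if j < n - 1:
                s -= k * u[i][j + 1]
            y[i][j] = s
    return y


def dot(a, b):
    """Inner product of two grids (a global reduction in a parallel code)."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def axpy(alpha, x, y):
    """Return alpha * x + y for two grids."""
    return [[alpha * xv + yv for xv, yv in zip(rx, ry)] for rx, ry in zip(x, y)]


def pcg(b, k, tol=1e-10, max_iter=200):
    """Solve (I - k * Laplacian) x = b with diagonally preconditioned CG."""
    n = len(b)
    diag = 1.0 + 4.0 * k  # diagonal of the operator, constant in this sketch
    x = [[0.0] * n for _ in range(n)]
    r = [row[:] for row in b]
    z = [[v / diag for v in row] for row in r]
    p = [row[:] for row in z]
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        ap = apply_operator(p, k)
        alpha = rz / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        z = [[v / diag for v in row] for row in r]
        rz_new = dot(r, z)
        p = axpy(rz_new / rz, p, z)  # p = z + beta * p
        rz = rz_new
    return x
```

Because the diagonal is constant here, the preconditioner is only a uniform scaling; it would matter more with spatially varying coefficients. In a distributed version, `apply_operator` would need a ghost cell exchange and each `dot` a reduction, which is why such a kernel stresses collective communication more than the advection scheme does.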

MPI and MPC implementations reach similar performance on both architectures
Overloading reduces execution time by more than 10% and up to 20%, without any modification to the original code
Memory Allocation and Data Placement Results
In order to evaluate the performance of MPC's NUMA-aware and thread-aware memory allocation, the advection benchmark was run on TERA-10
Results are given in Figure 3
The first part of each curve illustrates the scalability of MPC versus MPI
Comparison of MPI Bull to MPC Bull shows the rather good performance achieved by MPC
Evaluation of scalability and overloading method on representative scientific computing 2D code