@edecoux
Created September 7, 2022 22:08
Unified Parallel Runtime for NUMA Machines.md

MPC—A Unified Parallel Runtime for Clusters of NUMA Machines

![[CleanShot 2022-09-07 at 02.55.46@2x.png]]

Fig.1. Execution model of MPC, with an example involving both MPC-tasks and MPC-threads (hybrid distributed/shared memory approach)

The Message Passing Interface (MPI) Has Become a Very Successful Parallel Programming Environment for Distributed Memory Architectures Such as Clusters

  • However, the architecture of cluster nodes is currently evolving from small symmetric shared memory multiprocessors to massively multicore, Non-Uniform Memory Access (NUMA) hardware
  • In this paper, we present the MultiProcessor Communications environment (MPC), which aims at providing programmers with an efficient runtime system for their existing MPI, POSIX Thread or hybrid MPI+Thread applications.

MPC: A Unified Parallel Runtime for Clusters of NUMA Machines

  • Currently, the architecture of cluster nodes is evolving from small symmetric shared memory multiprocessors towards massively multicore, Non-Uniform Memory Access (NUMA) hardware
  • The emergence of these deeply hierarchical architectures raises the need for a careful distribution of threads and data
  • Parallel programming methods have to closely match the underlying architecture to achieve high performance
  • Shared memory approaches, based on explicit multithreading, are better suited to shared memory architectures
  • Hybrid MPI + OpenMP Approaches
  • Advanced MPI Implementations
  • Process Virtualization
  • Runtimes for models such as MPI and OpenMP are highly optimized for specific architectures, but they are not aware of each other

Optimized NUMA-aware and Thread-aware Allocator

  • The aim of the lazy update method is to perform the lowest number of updates
  • It only requires a check at each collective communication call to determine if the current core used by the calling task is the same as the one used for the previous collective call
  • In the migration case, the calling task is first temporarily moved back to its previous core, where it performs the collective communication call
  • Finally, it schedules a collective communication initialization which will be performed by all tasks at the next collective call
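The lazy update steps above can be sketched as a small single-task simulation. This is only an illustration of the control flow described in the list, not MPC's implementation: the names `CollectiveState`, `collective_call`, `reinit_pending` and the event log are hypothetical, and the "re-initialization by all tasks" is reduced to one task's view of it.

```python
class CollectiveState:
    """Per-task bookkeeping for the lazy update check (hypothetical names)."""

    def __init__(self, core):
        self.last_core = core        # core used at the previous collective call
        self.reinit_pending = False  # True when a re-initialization is scheduled
        self.reinit_count = 0        # number of re-initializations performed


def collective_call(state, current_core, log):
    """Simulate one collective call issued by a task running on current_core."""
    if state.reinit_pending:
        # A migration was detected at the previous call: rebuild the
        # collective structures for the core the task runs on now.
        state.last_core = current_core
        state.reinit_pending = False
        state.reinit_count += 1
        log.append(("reinit", current_core))

    if current_core != state.last_core:
        # Migration case: run the collective on the previous core, then
        # schedule a re-initialization for the next collective call.
        log.append(("collective", state.last_core))
        state.reinit_pending = True
    else:
        # Common case: a single comparison, no update needed.
        log.append(("collective", current_core))


state = CollectiveState(core=0)
log = []
# The task starts on core 0 and is migrated to core 2 before its second call.
for core in [0, 2, 2, 2]:
    collective_call(state, core, log)
print(log)
```

In this sketch the only cost paid on the common path is the core comparison; the expensive update happens once per migration, at the call that follows it.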

Experiments Reported Here Are Based on Two Basic Numerical Kernels:

  • An advection benchmark
  • In SPMD parallel mode, the scheme just requires one point-to-point communication per direction (update of ghost cells), and one reduction for the prediction of the next time step (CFL condition).
  • The second numerical kernel is a conduction benchmark, corresponding to a 2D Cartesian grid implicit heat conduction solver, based on a five-point stencil and a Conjugate Gradient method with diagonal preconditioning.
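To make the conduction kernel more concrete, here is a minimal serial sketch of a Conjugate Gradient solver with diagonal (Jacobi) preconditioning applied to a five-point stencil. Everything specific is an assumption made for illustration: the operator `I - r * Laplacian` with a constant coefficient `r`, zero Dirichlet boundaries, and the function names. The benchmark's actual discretization and parallel decomposition are not given in these notes.

```python
def apply_operator(u, r):
    """Five-point stencil for (I - r * Laplacian), zero Dirichlet boundaries."""
    n, m = len(u), len(u[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            north = u[i - 1][j] if i > 0 else 0.0
            south = u[i + 1][j] if i < n - 1 else 0.0
            west = u[i][j - 1] if j > 0 else 0.0
            east = u[i][j + 1] if j < m - 1 else 0.0
            out[i][j] = (1.0 + 4.0 * r) * u[i][j] - r * (north + south + west + east)
    return out


def dot(a, b):
    """Grid dot product (a global reduction in a distributed setting)."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def axpy(alpha, x, y):
    """Return alpha * x + y for two grids of the same shape."""
    return [[alpha * xv + yv for xv, yv in zip(rx, ry)] for rx, ry in zip(x, y)]


def pcg(b, r, tol=1e-10, max_iter=500):
    """Solve (I - r * Laplacian) u = b with Jacobi-preconditioned CG."""
    inv_diag = 1.0 / (1.0 + 4.0 * r)  # inverse of the (constant) diagonal
    u = [[0.0] * len(b[0]) for _ in b]
    res = [row[:] for row in b]       # residual b - A*u with u = 0
    z = [[inv_diag * v for v in row] for row in res]
    p = [row[:] for row in z]
    rz = dot(res, z)
    b_norm2 = dot(b, b)
    for it in range(max_iter):
        if dot(res, res) <= tol * tol * b_norm2:
            return u, it
        ap = apply_operator(p, r)
        alpha = rz / dot(p, ap)
        u = axpy(alpha, p, u)
        res = axpy(-alpha, ap, res)
        z = [[inv_diag * v for v in row] for row in res]
        rz_new = dot(res, z)
        p = axpy(rz_new / rz, p, z)   # new search direction z + beta * p
        rz = rz_new
    return u, max_iter


rhs = [[0.0] * 8 for _ in range(8)]
rhs[4][4] = 1.0                       # single heat source on an 8x8 grid
solution, iterations = pcg(rhs, r=0.5)
print(iterations)
```

In a domain-decomposed version, `apply_operator` would need the neighbouring subdomains' boundary values (ghost cells) and each `dot` would become a global reduction, which is why this kind of kernel exercises both point-to-point and collective communication. With a constant diagonal, as assumed here, the preconditioner is only a uniform scaling; it matters more when the coefficient varies across the grid.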

Scalability Results with Domain Overloading

  • MPI and MPC implementations reach similar performance on both architectures
  • Overloading reduces the execution time by more than 10% and up to 20%, without any modification to the original code
  • Memory Allocation and Data Placement Results
  • To evaluate the performance of MPC's NUMA-aware and thread-aware allocator, the advection benchmark was run on TERA-10
  • Results are given in Figure 3
  • The first part of each curve illustrates the scalability of MPC versus MPI
  • Comparing MPI Bull to MPC Bull shows the rather good performance achieved by MPC
  • Evaluation of scalability and overloading method on representative scientific computing 2D code
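As a rough illustration of domain overloading, the sketch below assumes that overloading means running several tasks per core, so that the same global domain is cut into more, smaller subdomains. The notes do not define the term, so this reading, the 1D split, the placement rule and the names `split_evenly` and `overloaded_layout` are all assumptions.

```python
def split_evenly(n_cells, n_parts):
    """Split n_cells into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(n_cells, n_parts)
    return [base + (1 if k < extra else 0) for k in range(n_parts)]


def overloaded_layout(n_cells, n_cores, tasks_per_core):
    """Return (task, core, subdomain_size) tuples for an overloaded run.

    Assumption: consecutive tasks share a core, so neighbouring
    subdomains are handled by the same core.
    """
    n_tasks = n_cores * tasks_per_core
    sizes = split_evenly(n_cells, n_tasks)
    return [(task, task // tasks_per_core, size) for task, size in enumerate(sizes)]


# 1000 cells on 4 cores: one task per core versus two tasks per core.
baseline = overloaded_layout(1000, n_cores=4, tasks_per_core=1)
overloaded = overloaded_layout(1000, n_cores=4, tasks_per_core=2)
for task, core, size in overloaded:
    print(f"task {task} -> core {core}, {size} cells")
```

The application code is unchanged between the two layouts; only the number of tasks launched differs, which is consistent with the remark that overloading requires no modification to the original code.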