How can one understand and predict the performance of multicore algorithms?
As a first step, I'd recommend acquiring a basic understanding of the MSI protocol. Modern systems use somewhat more complex protocols, but a basic understanding of MSI should be enough to predict the most important phenomena.
Below is an even further simplified model of shared memory based on the MSI protocol. While the model is not precise and makes several simplifications not grounded in actual hardware, it should be accurate enough to predict many phenomena, including the false-sharing performance pitfall.
The memory is divided into locations called cache lines. Each core has its own copy of the whole memory (all cache lines) and additionally considers each of its own cache lines to be in one of the following three states:
- Modified
- Shared
- Invalid
State changes do not happen spontaneously. Some core needs to perform a read from or a write to a cache line.
When a core holds a cache line in the Modified state, both reads and writes are cheap. The cache line must be in the Invalid state in all the other cores, and neither a read nor a write requires changing the state of the cache line in those other cores.
Reading a cache line in the Shared state is cheap. Writing to a cache line in the Shared state is more expensive: the write changes the state of the cache line in the writing core to Modified and, additionally, changes the state of the corresponding cache line in all the other cores to Invalid.
When a core wants to read a cache line in the Invalid state, it needs another core that has that cache line in either the Modified or Shared state to send a copy. If the sender was in the Modified state, it must change its state to Shared. The reader also sets its state to Shared. Writing to an Invalid cache line likewise requires another core that has the cache line in the Modified or Shared state to send a copy; then all the cores that have the cache line in the Modified or Shared state must change their state to Invalid, and the writer sets its state to Modified.
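As a concrete illustration, here is a minimal sketch (my own, not part of the gist) of this simplified model in C: the per-core states of a single cache line and the transitions caused by a read or a write from one core. The core count and function names are just illustrative.

```c
#include <stdio.h>

/* The three per-core states of a cache line in the simplified model. */
enum state { INVALID, SHARED, MODIFIED };

#define NCORES 4

/* A read by `core`: an Invalid reader gets a copy from a core holding the
   line in the Modified or Shared state; a Modified sender downgrades to
   Shared. Reads from Modified or Shared change no state. */
static void msi_read(enum state line[NCORES], int core) {
    if (line[core] == INVALID) {
        for (int i = 0; i < NCORES; i++)
            if (line[i] == MODIFIED)
                line[i] = SHARED;
        line[core] = SHARED;
    }
}

/* A write by `core`: every other copy becomes Invalid and the writer's
   copy becomes Modified, regardless of its previous state. */
static void msi_write(enum state line[NCORES], int core) {
    for (int i = 0; i < NCORES; i++)
        line[i] = (i == core) ? MODIFIED : INVALID;
}

int main(void) {
    enum state line[NCORES] = { MODIFIED, INVALID, INVALID, INVALID };
    msi_read(line, 1);   /* core 1 reads: core 0 downgrades to Shared   */
    msi_write(line, 2);  /* core 2 writes: cores 0 and 1 become Invalid */
    for (int i = 0; i < NCORES; i++)
        printf("core %d: %d\n", i, line[i]);
    return 0;
}
```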
The various state changes and transmissions of cache line contents have
potentially different costs. Transitions from the Invalid state to other
states tend to be expensive.
While drastically simplified, this model should be accurate enough to help predict the performance of many multicore algorithms. To predict the performance of one algorithm relative to another, determine and count the different state changes and cache line transfers for both.
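For example, false sharing falls straight out of this counting exercise: if two threads keep writing to different variables that happen to live in the same cache line, every write finds the line Invalid or Shared in the writer's core and pays the expensive transitions above, even though the threads never touch each other's data. Below is a hedged C sketch of the two layouts; the 64-byte line size, names, and iteration count are my assumptions, not taken from the gist.

```c
#include <pthread.h>
#include <stdint.h>

/* Layout 1: both counters share one cache line, so each increment by one
   thread invalidates the other thread's copy and the next increment pays
   a write-to-Invalid (or write-to-Shared) transition. */
struct shared_line {
    uint64_t a;
    uint64_t b;
};

/* Layout 2: padding pushes the counters onto separate cache lines
   (assuming 64-byte lines), so each core keeps its own line in the
   Modified state and every write is cheap. */
struct padded {
    uint64_t a;
    char pad[56];
    uint64_t b;
};

static struct padded counters;   /* change the type to struct shared_line
                                    to observe the false-sharing slowdown */

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000; i++)
        __atomic_fetch_add(&counters.a, 1, __ATOMIC_RELAXED);
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000; i++)
        __atomic_fetch_add(&counters.b, 1, __ATOMIC_RELAXED);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Counting transitions in the model already predicts the ordering: with the shared layout almost every increment is an expensive write to a Shared or Invalid line, while with the padded layout each increment is a cheap write to a line the core already has in the Modified state.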
This gist also includes a program written for Apple M1 to try to quantify the relative costs of various state changes. On my Apple M1 laptop I get roughly the following output from the program:
Relative to fastest:
Read M: 1.00 *
Write M: 2.58 ***
Read S: 1.04 *
Write S: 85.40 *************************************************************************************
Read Is: 2.36 **
Write Is: 31.34 *******************************
Read Im: 5.02 *****
Write Im: 32.61 *********************************
Note that the program uses constants (related to the L1 cache line size and cache size) that are specific to the Apple M1.
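For reference, the general shape of such a measurement might look like the sketch below. This is my own illustration, not the gist's program: one thread keeps a cache line in the Shared state by reading it in a loop, while the main thread times writes both to that line and to a line nobody else touches. Alignment, iteration counts, and the lack of thread pinning are simplifying assumptions.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 10000000

static _Alignas(64) _Atomic long contended;    /* reader keeps this Shared     */
static _Alignas(64) _Atomic long private_line; /* only the writer touches this */
static atomic_int stop;
static volatile long sink_out;                 /* keeps the reader loop alive  */

static void *reader(void *arg) {
    long sink = 0;
    (void)arg;
    while (!atomic_load_explicit(&stop, memory_order_relaxed))
        sink += atomic_load_explicit(&contended, memory_order_relaxed);
    sink_out = sink;
    return NULL;
}

/* Time ITERS writes to one cache line and return the elapsed nanoseconds. */
static double time_writes(_Atomic long *line) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        atomic_store_explicit(line, i, memory_order_relaxed);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, reader, NULL);

    double write_m = time_writes(&private_line); /* roughly "Write M" */
    double write_s = time_writes(&contended);    /* roughly "Write S" */

    atomic_store(&stop, 1);
    pthread_join(t, NULL);

    printf("ns per write, line kept Modified: %.2f\n", write_m / ITERS);
    printf("ns per write, line kept Shared:   %.2f\n", write_s / ITERS);
    return 0;
}
```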