Let's say you have to perform a 4-way case analysis and are given a choice between cascaded 2-way branches and a single 4-way branch.
What's the difference in branch misprediction penalties, assuming the 4 cases are uniformly random?
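To make the two shapes concrete, here's a minimal sketch (handle0 through handle3 are hypothetical stand-ins for the per-case work; whether the switch actually becomes a jump table is up to the compiler):

```cpp
// Stand-ins for the per-case work.
int handle0() { return 0; }
int handle1() { return 1; }
int handle2() { return 2; }
int handle3() { return 3; }

// Single 4-way branch: a dense switch like this is typically lowered to
// an indirect jump through a jump table (though the compiler may choose
// branches or conditional moves instead).
int dispatch_4way(unsigned tag) {
    switch (tag & 3) {
    case 0:  return handle0();
    case 1:  return handle1();
    case 2:  return handle2();
    default: return handle3();
    }
}

// Cascaded 2-way branches: every path retires exactly two conditional
// branches, each of which can mispredict independently.
int dispatch_cascaded(unsigned tag) {
    if (tag & 2) return (tag & 1) ? handle3() : handle2();
    else         return (tag & 1) ? handle1() : handle0();
}
```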
Denote the branch misprediction penalty by x (which is 15-20 cycles on Skylake according to Agner Fog).
With the 4-way branch, the worst-case branch penalty is x. The expected branch penalty is 3/4 x: there's a 3/4 chance of x penalty.
With the cascaded 2-way branches, the worst-case branch penalty is 2x. The expected branch penalty is 1/2 x + 1/4 · 2x = x: there's a 1/4 chance of a 2x penalty, and a 1/4 + 1/4 = 1/2 chance of an x penalty.
A difference of 1/4 x for x = 20 cycles is 5 cycles. If you want a 4-way branch implemented with a jump table to win, you need to keep the fixed overhead under 5 cycles, at least in this toy model. An L1-hitting load with a complex addressing mode is 5 cycles on Skylake, but instructions that only feed a branch aren't part of the subsequent dependency chain as far as latency goes, so it isn't as simple as adding up the latencies.
Now suppose we extend this comparison to an n-way case analysis where n = 2^k. The expected penalty for the multi-way branch is (n-1)/n x, which is ~x for larger n. The expected penalty for the cascaded 2-way branches is given by the recurrence f(n) = 1/2 x + f(n/2) with f(1) = 0, which has the closed-form solution f(n) = (k/2) x.
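A quick sanity check of the closed form against the recurrence, taking x = 20 cycles as above:

```cpp
#include <cstdio>

int main() {
    const double x = 20.0;  // misprediction penalty in cycles
    double f = 0.0;         // f(1) = 0
    for (int k = 1, n = 2; k <= 6; ++k, n *= 2) {
        f += 0.5 * x;       // recurrence: f(n) = 1/2 x + f(n/2)
        printf("n=%2d  multiway=%5.2f  cascaded=%5.2f  (k/2)x=%5.2f\n",
               n, (n - 1.0) / n * x, f, k * x / 2.0);
    }
}
```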
There isn't really any grand conclusion here beyond a more quantitative sense of how multi-way branches widen the gap as n increases. I'm just trying to see if there's some useful napkin math for thinking about different branch structures for case analysis. For example: how skewed does the case distribution need to be before you should pull the most frequent cases out of the multi-way branch into leading 2-way branches? The easy answer is to test it on real hardware with real datasets and pick the winner, but it would be nice to have a mental model that captures the major factors in simpler instances.
What if the probabilities of the 4 cases are not all 25%?
For example, take probabilities of

40% 30% 20% 10%

and model each branch as being predicted each way in proportion to its historical frequency. The 4-way chance of an x penalty is then

0.4 · 0.6 + 0.3 · 0.7 + 0.2 · 0.8 + 0.1 · 0.9 = 0.7

For the cascaded 2-way branches (splitting {40%, 30%} from {20%, 10%}), the first branch goes one way with probability 0.7 and the other way with probability 0.3, so it mispredicts with probability 2 · 0.7 · 0.3 = 0.42, and the expected penalty in units of x is

0.42 (first branch) + 0.4 · 3/7 + 0.3 · 4/7 + 0.2 · 1/3 + 0.1 · 2/3 ≈ 0.90
Both numbers are lower than in the uniform case (0.75x and x respectively), and the gap between them narrows from 0.25x to roughly 0.2x, which probably means the break-even point (once you count the jump table's fixed overhead) sits near this level of unevenness. Computing that break-even point is left as an exercise for the reader.
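For anyone who wants to work that exercise numerically, here's a small sketch under the same probability-matching model; the {p0, p1} vs {p2, p3} split and the helper names are mine:

```cpp
#include <cstdio>

// Probability-matching model: a branch whose outcomes have frequencies
// q and 1-q is predicted each way with those same frequencies, so it
// mispredicts with probability 2 q (1-q).
static double mispredict2(double q) { return 2.0 * q * (1.0 - q); }

// Expected penalty in units of x for the single 4-way branch.
double penalty_4way(const double p[4]) {
    double m = 0.0;
    for (int i = 0; i < 4; ++i) m += p[i] * (1.0 - p[i]);
    return m;
}

// Expected penalty in units of x for cascaded 2-way branches that
// split {p0, p1} from {p2, p3}.
double penalty_cascaded(const double p[4]) {
    double left = p[0] + p[1], right = p[2] + p[3];
    double m = mispredict2(left);            // first branch
    m += left  * mispredict2(p[0] / left);   // second branch, left arm
    m += right * mispredict2(p[2] / right);  // second branch, right arm
    return m;
}

int main() {
    const double p[4] = {0.4, 0.3, 0.2, 0.1};
    printf("4-way:    %.3f x\n", penalty_4way(p));      // 0.700
    printf("cascaded: %.3f x\n", penalty_cascaded(p));  // 0.896
}
```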
But now what if the probabilities are not just uneven but also correlated in time? That math gets too hairy for a worked example, but my guess is that the cascaded 2-way branches would hit break-even even earlier, with the benefit from uneven probabilities compounding; I am just guessing here, though.
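Where the math gets hairy, simulation stays cheap. Below is a toy Monte Carlo sketch; the predictor models (a 2-bit saturating counter per 2-way branch, predict-last-target for the 4-way indirect jump) and the repeat-with-probability-r tag stream are simplifying assumptions of mine, not measured hardware behavior:

```cpp
#include <cstdio>
#include <random>

// Update a 2-bit saturating counter (0..3, >= 2 means "predict taken");
// returns 1 on a misprediction.
static int predict_and_update(int &c, bool taken) {
    int miss = ((c >= 2) != taken) ? 1 : 0;
    if (taken) { if (c < 3) ++c; }
    else       { if (c > 0) --c; }
    return miss;
}

int main() {
    const double p[4] = {0.4, 0.3, 0.2, 0.1};  // skewed case distribution
    const double r = 0.8;   // chance that the next tag repeats the last
    const long   N = 1000000;

    std::mt19937 rng(42);
    std::discrete_distribution<int> fresh({p[0], p[1], p[2], p[3]});
    std::uniform_real_distribution<double> coin(0.0, 1.0);

    int last_target = 0;                     // 4-way model: predict last target
    int c_top = 2, c_left = 2, c_right = 2;  // one 2-bit counter per branch
    long miss4 = 0, miss2 = 0;

    int tag = fresh(rng);
    for (long i = 0; i < N; ++i) {
        // 4-way indirect branch: miss unless the tag repeats.
        if (tag != last_target) ++miss4;
        last_target = tag;

        // Cascaded: top branch tests tag >= 2, second level tests the low bit.
        bool top = tag >= 2;
        miss2 += predict_and_update(c_top, top);
        miss2 += predict_and_update(top ? c_right : c_left, (tag & 1) != 0);

        // Correlated stream: repeat with probability r, else a fresh draw.
        if (coin(rng) >= r) tag = fresh(rng);
    }

    // Expected penalties per dispatch, in units of x.
    printf("4-way:    %.3f x\n", (double)miss4 / N);
    printf("cascaded: %.3f x\n", (double)miss2 / N);
}
```

Cranking r up or down (or reshaping p) gives a quick feel for how much each structure gains from repetition in the tag stream.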
A real-life example is this code, which reads a 1/2/4/8-byte scalar from memory where the size is dynamic:
https://github.com/google/flatbuffers/blob/f12cca8bcb017618773aaa0f19623f44168bd57a/include/flatbuffers/flexbuffers.h#L130-L168
Here the probabilities are indeed uneven (the smaller sizes are much more frequent) and also correlated in time (it is often used to access a vector whose elements all have the same size), which is why I assumed the cascaded version would be faster. But it really depends on the data and access patterns.
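For reference, a condensed paraphrase of that pattern (not the actual flexbuffers source, which templates over the result and element types and goes through its own ReadScalar helper):

```cpp
#include <cstdint>
#include <cstring>

// Read a scalar of dynamic byte width (1/2/4/8) and widen it to 64 bits,
// using the same cascaded 2-way shape as the linked code. Assumes a
// little-endian host.
uint64_t ReadSizedScalar(const uint8_t *data, uint8_t byte_width) {
    if (byte_width < 4) {
        if (byte_width < 2) {
            return data[0];                                // 1 byte
        } else {
            uint16_t v; std::memcpy(&v, data, sizeof(v));  // 2 bytes
            return v;
        }
    } else {
        if (byte_width < 8) {
            uint32_t v; std::memcpy(&v, data, sizeof(v));  // 4 bytes
            return v;
        } else {
            uint64_t v; std::memcpy(&v, data, sizeof(v));  // 8 bytes
            return v;
        }
    }
}
```

The 4-way alternative would be a switch on byte_width that the compiler can lower to a jump table; which shape wins is exactly the trade-off discussed above.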