Let's say you have to perform a 4-way case analysis and are given a choice between cascaded 2-way branches or a single 4-way branch.
What's the difference in branch misprediction penalties, assuming the 4 cases are random?
Denote the branch penalty by x (which is 15-20 cycles on Sky Lake according to Agner Fog).
With the 4-way branch, the worst-case branch penalty is x. The expected branch penalty is 3/4 x: there's a 3/4 chance of x penalty.
With the cascaded 2-way branches, the worst-case branch penalty is 2x. The expected branch penalty is 1/2 x + 1/4 2x = x: there's a 1/4 chance of 2x penalty, and there's a 1/4 + 1/4 = 1/2 chance of x penalty.
A difference of 1/4 x for x = 20 cycles is 5 cycles. If you want a 4-way branch implemented with a jump table to win, you need to keep the fixed overhead down to less than 5 cycles, at least in this toy model. An L1-hitting load with a complex addressing mode is 5 cycles on Sky Lake, but instructions that only feed into a branch instruction aren't part of the subsequent dependency chain as far as latency goes, so it isn't as simple as adding up the latencies.
Now suppose we extend this comparison to a n-way case analysis where n = 2^k. The expected penalty for the multi-way branch is (n-1)/n x, which is ~x for larger n. The expected penalty for the cascaded 2-way branches is given by the recurrence relation f(n) = 1/2 x + f(n/2), f(1) = 0, which has the closed-form solution f(n) = k/2 x.
There isn't really any grand conclusion from this other than a more quantitative assessment of how multi-way branches widen the gap as n increases. I'm just trying to see if there's some useful napkin math for thinking about different branch structures for case analysis. E.g. how skewed does the case distribution need to be before you should pull out the most frequent cases from the multi-way branch into leading 2-way branches. The easy answer is that you should test it on real hardware with real datasets and pick the winner, but it would be nice to have a mental model that captures the major factors in simpler instances.
The reason I kept the analysis to the simplest case of fully random cases is that it seems too hard to analyze with napkin math beyond that point given how modern branch predictors work. You can see some examples in this thread: https://twitter.com/pervognsen/status/1256369404817715200