Let's say you have to perform a 4-way case analysis and are given a choice between cascaded 2-way branches or a single 4-way branch.
What's the difference in branch misprediction penalties, assuming the 4 cases are random?
Denote the branch penalty by x (which is 15-20 cycles on Sky Lake according to Agner Fog).
With the 4-way branch, the worst-case branch penalty is x. The expected branch penalty is 3/4 x: there's a 3/4 chance of x penalty.
With the cascaded 2-way branches, the worst-case branch penalty is 2x. The expected branch penalty is 1/2 x + 1/4 2x = x: there's a 1/4 chance of 2x penalty, and there's a 1/4 + 1/4 = 1/2 chance of x penalty.
A difference of 1/4 x for x = 20 cycles is 5 cycles. If you want a 4-way branch implemented with a jump table to win, you need to keep the fixed overhead down to less than 5 cycles, at least in this toy model. An L1-hitting load with a complex addressing mode is 5 cycles on Sky Lake, but instructions that only feed into a branch instruction aren't part of the subsequent dependency chain as far as latency goes, so it isn't as simple as adding up the latencies.
Now suppose we extend this comparison to a n-way case analysis where n = 2^k. The expected penalty for the multi-way branch is (n-1)/n x, which is ~x for larger n. The expected penalty for the cascaded 2-way branches is given by the recurrence relation f(n) = 1/2 x + f(n/2), f(1) = 0, which has the closed-form solution f(n) = k/2 x.
There isn't really any grand conclusion from this other than a more quantitative assessment of how multi-way branches widen the gap as n increases. I'm just trying to see if there's some useful napkin math for thinking about different branch structures for case analysis. E.g. how skewed does the case distribution need to be before you should pull out the most frequent cases from the multi-way branch into leading 2-way branches. The easy answer is that you should test it on real hardware with real datasets and pick the winner, but it would be nice to have a mental model that captures the major factors in simpler instances.
I just looked at the code you linked. I'm not sure how it would work out in practice given your use case, but all my code relies heavily on unaligned reads and writes for exactly this reason. You can see an example here: https://gist.github.com/pervognsen/12d358c8fdd531d21fbccdca251fc2cd#file-spaceskip-c-L15. You can usually ensure that the data you're reading has sufficient end-padding to avoid the issue with unowned memory. For something like a serialization format where you own both the write side and the read side, it's especially easy; for something like a programming language lexer as in the above example, it's a little trickier (but usually doable with some trickery) if you want to use memory-mapped file IO or read user data directly.