There's a nice Xilinx application note from Ken Chapman:
Multiplexer Design Techniques for Datapath Performance with Minimized Routing Resources https://www.xilinx.com/support/documentation/application_notes/xapp522-mux-design-techniques.pdf
The most intriguing part is what he calls a data selector, but which might be more descriptively termed a one-hot mux, since it's a mux where the control signals are one-hot coded:
onehotmux(data, sel) = (data[0] & sel[0]) | ... | (data[n] & sel[n])
Chapman shows how onehotmux can be implemented with the help of an FPGA's carry chains, but the explanation wasn't very clear to me. So I'll try to explain it here in programmer terms.
Low-level-oriented software hackers are familiar with bitwise tricks that exploit "carry ramps" where a carry propagates from a lower part of a word through a ramp of 1s into a higher part. You can understand Chapman's trick quite simply with this concept. Let's look at a few examples:
11111011
+ 00000001
= 011111100
11111111
+ 00000001
= 100000000
11011111
+ 00000001
= 011100000
If you look at the carry bit of the result (the 9th bit in these examples) as the output of the computation, then you can see that the output is 1 if and only if the input bits are all 1s.
Back to the one-hot mux. Here's the formula for what Chapman does the way I would write it:
onehotmux(data, sel) = ((data | ~sel) + 1)[n+1]
The idea is that we selectively introduce a gap in the carry ramp to control the carry output.
Since it's a one-hot mux, all but one of the selector signals is 0, so all but one of the addition term bits will be 1. Say the 4th bit is selected:
1111x111
+ 00000001
Here sel[3] is 1, so x = (data[3] | ~sel[3]) = data[3] and the carry-out is hence equal to data[3].
There's a nice logical interpretation since a | ~b is equivalent to material implication b => a.
In Chapman's application note, he combines this with a standard FPGA trick of doing a 3:1 selection in the LUT6 before feeding into the carry chain, so you can do up to 4x3 = 12 one-hot mux selections with the resources in a single CLB.
Aside from conceptual clarity, which is an admittedly subjective quality, there's at least one objective advantage to my formulation of this trick: FPGA synthesis tools should have no problem mapping this formula onto carry chains, whereas Chapman's formulation requires hand-mapping the CLB resources, which is hard to maintain and highly vendor specific.