Here's a definition of the packed 16-bit RGB565 pixel format as a C struct:
typedef struct { unsigned r : 5, g : 6, b : 5; } pixel;
and a couple of functions that operate on it:
void process_green(short green);
void process_pixel(pixel px) {
  process_green(px.g << 2);
}
void process_grb(unsigned g, unsigned r, unsigned b) {
  process_pixel((pixel){r, g, b});
}
These functions both involve many bit-packing operations: process_pixel must extract the 6-bit g field, shift it left to become 8 bits wide, then convert it to short to pass it on. In the x86-64 calling convention, short arguments have 16 bits of data, which should be sign-extended to 32 bits and then zero-extended from there to 64 bits. (This seems, and is, very odd. However, it's what you naturally get on x86 using the 16-to-32-bit sign extension instructions.) process_grb is similar, needing an extra few shifting and masking operations to assemble the packed bits. I've chosen an unusual grb order for the channels here so that the g parameter stays in the rdi register the whole time, making the resulting assembly slightly easier to read.
Clang/LLVM 20 and GCC 15 both generate good code for these functions. Below is the Clang version, but GCC is almost identical:
process_pixel(pixel):
        shr edi, 3
        and edi, 252
        jmp process_green(short)@PLT
process_grb(unsigned int, unsigned int, unsigned int):
        shl edi, 2
        movzx edi, dil
        jmp process_green(short)@PLT
All of the bit-packing has collapsed to the optimal sequence of a shift and a mask. (The movzx instruction is a zero-extension instruction from 8 bits, equivalent to masking with 0xff.)
But if we change the code to use signed fields, by replacing unsigned with int above, then the codegen gets much noisier, with Clang producing:
process_pixel(pixel):
        shr edi, 3
        movsx eax, dil
        and eax, 65532
        movsx edi, ax
        jmp process_green(short)@PLT
process_grb(int, int, int):
        shl edi, 10
        movsx edi, di
        sar edi, 8
        jmp process_green(short)@PLT
and GCC producing:
process_pixel(pixel):
        sal edi, 5
        sar di, 10
        sal edi, 2
        movsx edi, di
        jmp process_green(short)
process_grb(int, int, int):
        sal edi, 2
        sar dil, 2
        movsx di, dil
        sal edi, 2
        movsx edi, di
        jmp process_green(short)
Neither of these is optimal. This function is genuinely trickier than the unsigned version: in particular, the sign-extension to short cannot simply be optimised away. However, there's no need to do multiple sign-extension operations in each function.
Here are some notes on compiling such sequences to good code, with an implementation attached below. The intended application is the OCaml compiler, which is currently behind Clang and GCC for this sort of code. Despite that, it's particularly important for OCaml: all standard library integer types are signed (so sign-extensions are common), and the tagged integer representation means that some bit-packing occurs even in programs that just use plain int.
With these techniques, the two functions above can be compiled to:
process_pixel(pixel):
        shl edi, 21
        sar edi, 26
        shl edi, 2
        jmp process_green(short)
process_grb(int, int, int):
        shl edi, 26
        sar edi, 24
        jmp process_green(short)
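(To see why this is right: shl edi, 21 moves the g field's bits [10:5] up to [31:26]; sar edi, 26 brings them back down to [5:0], filling the bits above with copies of the field's sign bit while the 32-bit operation clears the upper half of rdi; and the final shl edi, 2 leaves the sign-extended value of g << 2 in edi, which is exactly the short argument convention described above. process_grb uses the same idea with the final shift folded into the arithmetic shift count, which is safe there because its g argument has no extra bits below the field.)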
The approach here is based on bit-windowing functions W(x), where the function W can be expressed in normal form as a 6-tuple (i,j,k,l,s,T), describing the following operation:
      64                 j        i                0
       +-----------------+--------+----------------+
  x:   |                 |  data  |                |
       +-----------------+--------+----------------+

      64      s               l        k           0
       +------+---------------+--------+-----------+
W(x):  |  0   |     sign      |  data  |     T     |
       +------+---------------+--------+-----------+
That is, bits [j:i] of the input are extracted and shifted to position [l:k] of the output, which is then sign-extended to position s and zero-extended from there to the end. The low bits (below k) are filled with the constant T.
To be a well-formed bit-window function, the tuple must satisfy the following (assuming a 64-bit machine):
- 0 ≤ i < j ≤ 64
- 0 ≤ k < l ≤ s ≤ 64
- j - i = l - k
- T < 2^k
Note in particular the strict <: the window must include at least one bit.
The third constraint, that the input and output windows are the same length, means that the 6-tuple representation is slightly redundant: we always have l = k + j - i. However, it's much simpler to manipulate these tuples if we have l available explicitly.
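To make the representation concrete, here's a minimal OCaml sketch of the 6-tuple as a record, along with a check of the well-formedness conditions above. (This is purely illustrative rather than the attached implementation; the field names just mirror the tuple components, and T is held as an int64.)

type window = { i : int; j : int; k : int; l : int; s : int; t : int64 }

let well_formed w =
  0 <= w.i && w.i < w.j && w.j <= 64
  && 0 <= w.k && w.k < w.l && w.l <= w.s && w.s <= 64
  && w.j - w.i = w.l - w.k
  && Int64.unsigned_compare w.t (Int64.shift_left 1L w.k) < 0  (* T < 2^k *)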
The T component is not essential: more or less everything below still works fine if you assume T to always be zero. However, for the use-case of OCaml integer tagging (shift left by one and set the low bit), it's convenient to combine updating the low bits into bit-windowing functions.
To make things a bit more readable, I'll write the tuples in the syntax [j:i] → s/[l:k]+T.
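Spelled out as code, a rough evaluator for a window in this form looks like the following (again just a sketch over int64 values; the Window/Const constructors and the make helper are minimal stand-ins for the ones used by the composition function further down):

type result = Window of window | Const of int64

let make ~i ~j ~k ~l ~s ~t = { i; j; k; l; s; t }

(* eval w x: take bits [j:i] of x, place them at [l:k], sign-extend bit l-1
   up to s, zero everything from s upwards, and OR in the constant t *)
let eval w x =
  let width = w.j - w.i in
  let data =
    Int64.logand (Int64.shift_right_logical x w.i)
      (if width = 64 then -1L else Int64.sub (Int64.shift_left 1L width) 1L)
  in
  let placed = Int64.shift_left data w.k in
  (* a left/arithmetic-right shift pair sign-extends bit l-1 to the top *)
  let sign_extended =
    Int64.shift_right (Int64.shift_left placed (64 - w.l)) (64 - w.l)
  in
  let below_s =
    if w.s = 64 then -1L else Int64.sub (Int64.shift_left 1L w.s) 1L
  in
  Int64.logor (Int64.logand sign_extended below_s) w.t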
The 6-tuple is a useful representation of bit-windowing functions:
- They include basic shift and mask operations:
  - [64-n:0] → 64/[ 64:n]+0 is left shift by n
  - [ 64:n] → 64-n/[64-n:0]+0 is right logical shift by n
  - [ 64:n] → 64/[64-n:0]+0 is right arithmetic shift by n
  - [ 8:0] → 8/[ 8:0]+0 is 8- to 64-bit zero extension
  - [ 32:0] → 64/[ 32:0]+0 is 32- to 64-bit sign extension
  - [ j:i] → j/[ j:i]+0 is masking to keep only bits [j:i]
- They also include some weirder x86-specific instructions:
  - [16:8] → 32/[8:0]+0 is x86 movsx eax, ah: extract bits 8 through 16 of register rax, sign-extend to 32 bits, then zero-extend to 64 bits.
  - [32:10] → 32/[22:0]+0 is x86 sar eax, 10: do an arithmetic right shift of the low 32 bits of rax, zeroing the upper 32 bits.
- The 6-tuple form is unique: two bit-windowing operations map the same inputs to the same outputs if and only if they have the same 6-tuple representation.
- The 6-tuple form is closed (*ish) under composition, and compositions are easy (*ish) to compute.
The first (*ish) caveat above is that the composition of two bit-windowing functions is either another bit-windowing function or a constant. For instance, the composition of two nonoverlapping masks is the constant zero, and constant functions are not representable by a 6-tuple. The class of functions consisting of bit-windowing and constant maps is closed under composition, and the full composition function is as follows, whence the second (*ish):
let compose w1 w2 =
  let d1 = w1.k - w1.i in
  let d2 = w2.k - w2.i in
  let i = max w1.i (w2.i - d1) in
  let j = min w1.j (w2.j - d1) in
  let k = max (w1.k + d2) w2.k in
  let l = min (w1.l + d2) w2.l in
  assert (l - k = j - i);
  let s =
    if w2.j <= w1.s then w2.s  (* w2 sign extension *)
    else w1.s + d2             (* w1 sign extension, w2 extends a zero *)
  in
  let t = eval w2 w1.t in
  if i < j then
    Window (make ~i ~j ~k ~l ~s ~t)
  else begin
    (* No overlap between windows, so probably constant *)
    assert (w2.j <= w1.k || w1.l <= w2.i);
    if w2.i < w1.s && w1.k < w2.j then
      (* Tricky case: Even though windows don't overlap, this isn't constant:
         it depends on some sign-extended bits from w1, so we extract just the
         sign bit *)
      Window (make ~i:(j-1) ~j ~k ~l:(k+1) ~s ~t)
    else
      Const t
  end
The intuition here is that you compose two windowing functions W₁ and W₂ by computing the overlap between the output window of W₁ and the input window of W₂. If the windows overlap, then the result maps the input of W₁ to the output of W₂ (suitably narrowed), and if they don't then the output is usually a constant. (The tricky case is when the windows don't overlap, but the output is non-constant because the window of W₂ extracts sign-extension bits of W₁)
(Someday I'll get around to convincing Z3 or Rocq that this composition function works. For now, the evidence is some paper scribbling, plus the fact that it's passed a few million random tests)
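For what it's worth, the shape of those random tests is easy to sketch, reusing eval and compose from above (with T fixed at zero to keep the generator simple, and Random.bits64 requiring a reasonably recent OCaml):

let random_window () =
  (* pick a nonempty input window [j:i], then legal k, l and s *)
  let rec pick_ij () =
    let i = Random.int 64 and j = Random.int 65 in
    if i < j then (i, j) else pick_ij ()
  in
  let i, j = pick_ij () in
  let k = Random.int (65 - (j - i)) in
  let l = k + (j - i) in
  let s = l + Random.int (65 - l) in
  { i; j; k; l; s; t = 0L }

let check_once () =
  let w1 = random_window () and w2 = random_window () in
  let x = Random.bits64 () in
  let expected = eval w2 (eval w1 x) in
  let got =
    match compose w1 w2 with
    | Window w -> eval w x
    | Const c -> c
  in
  assert (Int64.equal expected got)

let () =
  Random.self_init ();
  for _ = 1 to 1_000_000 do check_once () done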
The process_pixel function above is the composition of several shift functions:
[11:5]→32/[ 6:0]+0 ;; Extract 6-bit int bitfield to 32-bit signed int
[ 6:0]→32/[ 8:2]+0 ;; 32-bit left shift by 2
[16:0]→32/[16:0]+0 ;; Conversion to short (sign-extend to 32, zero-extend to 64)
Using the window composition function, this transforms to a single window operation:
[11:5]→32/[ 8:2]+0
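(This is just the compose arithmetic from above applied by hand: for the first two windows, d1 = 0 - 5 = -5 and d2 = 2 - 0 = 2, so i = max(5, 0 + 5) = 5, j = min(11, 6 + 5) = 11, k = max(0 + 2, 2) = 2 and l = min(6 + 2, 8) = 8, with s = 32 since 6 ≤ 32; composing the result with the conversion to short then leaves it unchanged.)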
The process_grb function is more complicated. It starts by constructing a pixel by converting its arguments to bitfields, then applies process_pixel, giving:
process_grb(g, r, b) = Wp(Wr(r) | Wg(g) | Wb(b))
where
Wp = [11:5]→32/[ 8: 2]+0 ;; as above
Wr = [ 5:0]→ 5/[ 5: 0]+0
Wg = [ 6:0]→11/[11: 5]+0
Wb = [ 5:0]→16/[16:11]+0
To optimise this, we notice that every window function W satisfies W(a|b) = W(a) | W(b), so we can simplify to:
Since process_pixel extracts only the g component, the two windows Wp⋅Wr and Wp⋅Wb are the constant functions 0, and simplifying x|0 = x yields a single window [6:0]→32/[8:2]+0.
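(Again this falls out of the compose function: in both cases the constructor's output window doesn't overlap Wp's input window [11:5], and since neither Wr nor Wb does any sign extension (each has l = s), the tricky case in compose doesn't apply, so both compositions come out as Const 0.)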
A few more transformations are possible for general window functions:
- Bitwise AND distributes over windows just like OR, and XOR works if you zero one side's T component (so you don't end up XOR-ing it with itself).
- An operation involving a window and a small constant, W(x) op N, can often be folded into a single window operation, if 0 ≤ T op N < 2^k and op is addition, subtraction, AND, OR, or XOR.
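(For example, the OCaml tag operation (x << 1) | 1 starts from the left-shift window [63:0] → 64/[64:1]+0; here T op N = 0 | 1 = 1 and 0 ≤ 1 < 2^1, so it folds to the single window [63:0] → 64/[64:1]+1.)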
The above means that any sequence of window functions can be collapsed to a single one, and distributed over bitwise or as needed, so what remains is to generate good code for a single window function.
If T = 0, then this can be done in two instructions if no sign extension is involved (that is, if l = s), or three instructions in general. This becomes 3 or 4 instructions, respectively, if a nonzero T has to be added afterwards.
In the l = s case, you can use a shift instruction to shift by k - i (that is, either left or right as needed to place bit i of the input into position k of the output), and then use a bitwise-AND instruction to select only the bits from k to l.
In the l < s case, where sign extension occurs, the first step is to shift left by 64 - j, putting the bit to be sign-extended into the top bit; then to perform an arithmetic right shift by 64 - l (to move the bits to the right location, sign-extending all the way to the top bit); and finally a third instruction to select the required bits.
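Here's that simple strategy as an OCaml sketch over the window record from earlier, emitting an abstract list of 64-bit instructions (it assumes T = 0 and makes no attempt at the 32-bit forms or single-instruction special cases):

type insn =
  | Shl of int          (* 64-bit left shift *)
  | Shr of int          (* 64-bit logical right shift *)
  | Sar of int          (* 64-bit arithmetic right shift *)
  | And of int64        (* AND with a constant mask *)

(* mask with bits [hi:lo] set *)
let bits ~lo ~hi =
  let ones n = if n >= 64 then -1L else Int64.sub (Int64.shift_left 1L n) 1L in
  Int64.logand (ones hi) (Int64.lognot (ones lo))

let compile_simple w =
  assert (w.t = 0L);
  if w.l = w.s then
    (* no sign extension: move bit i to position k, then keep bits [l:k]
       (the mask is emitted even when it happens to be redundant) *)
    (if w.k > w.i then [ Shl (w.k - w.i) ]
     else if w.k < w.i then [ Shr (w.i - w.k) ]
     else [])
    @ [ And (bits ~lo:w.k ~hi:w.l) ]
  else
    (* sign extension: put bit j-1 at the top, arithmetic-shift the data down
       to [l:k] with sign copies above it, then keep bits [s:k] *)
    [ Shl (64 - w.j); Sar (64 - w.l); And (bits ~lo:w.k ~hi:w.s) ]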
This simple strategy, plus some special cases to detect cases that can be implemented as a single instruction, generates reasonably short code - never more than a few instructions for an arbitrary sequence of shifting, masking and sign- or zero-extension.
Since window functions have an easily computable normal form, it's not much more work to generate optimal code. As always, "optimal" means "optimal with respect to a cost model", and the cost model I'm going to use here is:
- Window functions are compiled to a sequence of x86-64 instructions, consisting of 32- and 64-bit left/right logical/arithmetic shifts, AND with constant masks, movzx (zero-extension), and movsxd (sign-extension).
- Each instruction has equal cost (all are single-cycle-latency instructions on recent machines), except for AND with a mask that does not fit in a 32-bit immediate, as these require an extra instruction to load the immediate. These large-mask operations have a cost greater than one simple instruction but less than two.
Each of the instructions considered can be expressed as a window function itself. This means that a sequence of such instructions can be decompiled back into a 6-tuple, by composing the window functions of each instruction. This makes it easy to establish correctness and optimality:
- The compilation is correct iff decompiling the code generated for any window function yields the same window function.
- The compilation is optimal iff decompiling an arbitrary sequence of instructions and compiling the resulting window function yields a sequence of instructions of at most the same cost.
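To give the flavour of the decompilation side, here's a sketch restricted to the 64-bit subset of the instructions above: each instruction is mapped to its window (an AND is only accepted if its mask is one contiguous run of bits) and the results are folded together with compose:

let window_of_insn = function
  | Shl n -> { i = 0; j = 64 - n; k = n; l = 64; s = 64; t = 0L }
  | Shr n -> { i = n; j = 64; k = 0; l = 64 - n; s = 64 - n; t = 0L }
  | Sar n -> { i = n; j = 64; k = 0; l = 64 - n; s = 64; t = 0L }
  | And m ->
    (* a mask is a window only if its set bits form one run [j:i] *)
    let bit b = Int64.logand (Int64.shift_right_logical m b) 1L = 1L in
    let rec next_set b = if b = 64 || bit b then b else next_set (b + 1) in
    let rec next_clear b = if b = 64 || not (bit b) then b else next_clear (b + 1) in
    let i = next_set 0 in
    let j = next_clear i in
    assert (i < j && next_set j = 64);
    { i; j; k = i; l = j; s = j; t = 0L }

let decompile insns =
  let identity = { i = 0; j = 64; k = 0; l = 64; s = 64; t = 0L } in
  List.fold_left
    (fun acc insn ->
       let w = window_of_insn insn in
       match acc with
       | Const c -> Const (eval w c)
       | Window so_far -> compose so_far w)
    (Window identity) insns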
The number of window functions is a bit too big to comfortably test exhaustively, so correctness is tested with a large number of random tests. Optimality can be tested exhaustively, though: the worst case code generated by an arbitrary window is two shifts and a mask, so we need only test code sequences cheaper than that, of which there are only so many.
The optimal code generator ends up about 80 lines of (admittedly tricky) code, written by hacking until the correctness and optimality checkers stop complaining. Each example of nonoptimal code that the optimality checker came up with (that is, code sequences that are cheaper than what the compiler produces) was generalised to a strategy for certain window functions, yielding a number of interesting tricks:
- Some shift-and-mask operations are cheaper to do as a pair of shifts due to the size of the masks involved, especially if you're willing to mix-and-match 32-bit and 64-bit shifts (and benefit from the implicit zero-extension of the 32-bit ones).
- When sign-extension is required, there are three different shift counts that should be considered for the arithmetic right shift. (The optimal generator tries all three and picks the cheapest)
  - Shift by 64-l: This is the basic strategy described above, which puts the data in the right place but needs masking.
  - Shift by i: This right-aligns the data. This likely requires a left shift to place it correctly, but may save a mask (sometimes more expensive than a shift).
  - Shift by l-s: This leaves the data in the wrong place, but generates the right number of sign bits, so no masking is needed to remove excess ones. Again, sometimes this can be finished off with a simple shift, saving a mask.
- When sign-extension is required, sometimes it's better to move the sign bit into position 32 and use a 32-bit sign extension, instead of position 64.
- When both logical shifting and masking are being done, masking should be done before a left shift but after a right shift. (This means that the set bits of the mask are in the lower bit positions both times, which maximises the chance of them fitting in a small immediate)
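(For example, to move bits [24:4] up to [40:20], masking first with bits [24:4] uses the immediate 0xfffff0, which fits in 32 bits, and a left shift by 16 then finishes the job; shifting first would leave a mask of bits [40:20], i.e. 0xfffff00000, which doesn't fit.)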