This section will discuss the loop unroll decision making process (which loops to unroll, by how much, and why). Loop unrolling is just duplicating a loop body so that each trip through the loop does "multiple iterations per iteration", for example:
size_t i = 0;
for (; i + 1 < N; i += 2) {
    work(i), work(i + 1);
}
// fallback loop, for when the unroll factor doesn't divide N cleanly,
// in this example the trip count of this second loop is 0-1
// so really it could be an if, not a loop.
for (; i < N; i++) {
    work(i);
}
This is usually done to allow for better instruction-level parallelism (if you interleave the iterations, the processor can potentially run their instructions in parallel since they don't depend on each other) or to avoid the overhead of jumping back to the top of the loop (this isn't as big a deal on modern hardware).
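A classic case where the ILP angle pays off is a reduction: a single accumulator forms one long dependency chain, but a 2x unroll can split it into two independent chains that the CPU can keep in flight at the same time. A quick hand-written sketch (the function and array names are just for illustration; note that for floats this reassociates the sum, which a compiler won't do on its own without permission):
#include <stddef.h>

double sum_simple(const double* x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += x[i]; // every add waits on the previous one
    }
    return s;
}
double sum_unrolled(const double* x, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += x[i];     // these two chains are independent,
        s1 += x[i + 1]; // so the adds can overlap in the pipeline
    }
    for (; i < n; i++) {
        s0 += x[i]; // fallback for the odd element
    }
    return s0 + s1;
}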
Let's start by going through a simple pattern that advocates for a 2x unroll (two iterations per iteration):
// fib
int c = a + b;
a = b;
b = c;
2x unrolled in SSA:
c0 = a0 + b0;
a1 = b0;
b1 = c0;
c2 = a1 + b1;
a2 = b1;
b2 = c2;
VVV
a2 = a0 + b0;
b2 = b0 + a2;
VVV
a += b
b += a
In this example we've simplified the swap-add in the Fibonacci iteration (sound familiar 😏, it's what XADD does, though it's normally used for atomic fetch-add, not Fibonacci) so that it's just two fancy additions. Let's break this down so we can make a more generic pattern.
// original
int a = 0;
int b = 1;
for (int i = 0; i < n-1; i++) {
    int c = a + b;
    a = b;
    b = c;
}
return b;
// SSA-form
i1 := 0;
a1 := 0;
b1 := 1;
for (;;) {
    i2 := phi(i1, i3);
    if (i2 >= n-1) break;
    a2 := phi(a1, b2);
    b2 := phi(b1, c);
    c  := a2 + b2;
    i3 := i2 + 1;
}
return phi(b1, b2);
You might notice that a2 depends on b2, which lives in the same basic block and indirectly depends on a2 through c; this is a textbook case of a swap. We can generalize it into a pattern that looks like this:
a' := phi(a, b');
b' := phi(b, c);
c := OPERATION(..., a', ..., b', ...);
There are no restrictions on what OPERATION may be, only that it depends on both a' and b'. Now you may be asking why we care about swaps, and it's really simple: swaps cancel each other out, so if we unroll twice we can kill the swaps and the SSA can statically understand the bindings once more.
swap(a, b) // a is b, b is a
b += a // so this is technically a += b
swap(a, b) // we flip back
b += a // and this is then b += a
which is exactly the simplification we derived earlier in SSA.
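If you want to double check that numerically, here's a small standalone C program (the names and structure are mine, just mirroring the snippets above) comparing the naive loop against the hand-unrolled a += b; b += a; version, with a fallback iteration for when n-1 is odd:
#include <assert.h>
#include <stdio.h>

// naive loop, same shape as the original snippet above
static int fib_naive(int n) {
    int a = 0, b = 1;
    for (int i = 0; i < n-1; i++) {
        int c = a + b;
        a = b;
        b = c;
    }
    return b;
}
// 2x unrolled with the swap cancelled: each pass does two iterations
static int fib_unrolled(int n) {
    int a = 0, b = 1;
    int i = 0;
    for (; i + 1 < n-1; i += 2) {
        a += b; // was: c = a + b; a = b; b = c;
        b += a; // second iteration, swap cancelled
    }
    for (; i < n-1; i++) { // fallback for when n-1 is odd, runs 0-1 times
        int c = a + b;
        a = b;
        b = c;
    }
    return b;
}
int main(void) {
    for (int n = 1; n <= 40; n++) {
        assert(fib_naive(n) == fib_unrolled(n));
    }
    printf("unrolled fib matches the naive loop\n");
    return 0;
}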
If we were writing a proper optimization pass for this we'd want some heuristics on the size of OPERATION, but other than that, this makes for one good recipe for the loop unroller.
TODO: probably explain what a good size heuristic looks like.
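As a rough placeholder for that, here's a sketch of what the swap-phi detection plus a size gate might look like over a toy SSA node representation (the Node struct, opcode names, and budget constant are all made up for illustration, not taken from any real compiler):
#include <stdbool.h>
#include <stdio.h>

typedef enum { OP_PHI, OP_ADD } Op;
typedef struct Node Node;
struct Node {
    Op     op;
    int    input_count;
    Node** inputs;
};
// matches: a' := phi(a, b'); b' := phi(b, c); c := OPERATION(..., a', ..., b', ...)
// assuming inputs[0] is the entry value and inputs[1] the loop-carried one,
// like in the pseudo-SSA above.
static bool is_swap_phi(const Node* a_phi) {
    if (a_phi->op != OP_PHI || a_phi->input_count != 2) return false;
    const Node* b_phi = a_phi->inputs[1];
    if (b_phi->op != OP_PHI || b_phi->input_count != 2) return false;
    const Node* c = b_phi->inputs[1];
    // OPERATION has to read both phis for this to be the swap we care about
    bool uses_a = false, uses_b = false;
    for (int i = 0; i < c->input_count; i++) {
        if (c->inputs[i] == a_phi) uses_a = true;
        if (c->inputs[i] == b_phi) uses_b = true;
    }
    return uses_a && uses_b;
}
// toy size gate: only unroll if doubling the body won't blow up code size
static bool small_enough_to_unroll(int body_op_count) {
    enum { UNROLL_BUDGET = 16 }; // completely arbitrary number for this sketch
    return body_op_count * 2 <= UNROLL_BUDGET;
}
int main(void) {
    // build the fib pattern by hand: a' := phi(a0, b'), b' := phi(b0, c), c := a' + b'
    Node a0 = { OP_ADD, 0, NULL }, b0 = { OP_ADD, 0, NULL };
    Node a_phi, b_phi, c;
    Node* a_in[2] = { &a0, &b_phi };
    Node* b_in[2] = { &b0, &c };
    Node* c_in[2] = { &a_phi, &b_phi };
    a_phi = (Node){ OP_PHI, 2, a_in };
    b_phi = (Node){ OP_PHI, 2, b_in };
    c     = (Node){ OP_ADD, 2, c_in };
    printf("swap: %d, unroll: %d\n", is_swap_phi(&a_phi), small_enough_to_unroll(3));
    return 0;
}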