Skip to content

Instantly share code, notes, and snippets.

@RealNeGate
Last active September 14, 2023 14:52
Show Gist options
  • Save RealNeGate/c772bb5544a07321e71992d1e3ab4269 to your computer and use it in GitHub Desktop.
Save RealNeGate/c772bb5544a07321e71992d1e3ab4269 to your computer and use it in GitHub Desktop.

Let's compare Clang and Cuik/TB a bit to get a picture of what i need to do.

Given this C code:

void max_array(size_t n, float* x, float* y) {
    for (size_t i = 0; i < n; i++) {
        x[i] = x[i] > y[i] ? x[i] : y[i];
    }
}

Clang and I generate:

# cuik -O1
max_array:
  xor RAX, RAX
L0000021FE98F03C0:
  mov R9, RAX
  cmp R9, RCX
  jnb L0000021FE98F0540
L0000021FE98F0440:
  movss XMM0, xmmword [R8 + RAX*4]
  maxss XMM0, xmmword [RDX + RAX*4]
  movss xmmword [RDX + RAX*4], XMM0
  add RAX, 1
  jmp L0000021FE98F03C0
L0000021FE98F0540:
  ret



# clang -O1
max_array:
  test    rdi, rdi
  je      .LBB0_3
  xor     eax, eax
.LBB0_2:
  movss   xmm0, dword ptr [rsi + 4*rax]
  maxss   xmm0, dword ptr [rdx + 4*rax]
  movss   dword ptr [rsi + 4*rax], xmm0
  inc     rax
  cmp     rdi, rax
  jne     .LBB0_2
.LBB0_3:
  ret

you'll notice two checks instead of one this is because they do loop rotation simple rundown is this

while (cond) {
  ...
}

// into...

if (cond) {
  // loop invariants are placed here
  //   none in this example :p
  do {
    ...
  } while (cond);
}

the big win is making it easier to do later loop optimizations. their codegen generates inc rax instead of add rax, 1 which doesn't really matter to me. oh and the other one is that my register allocator does not coalesce a copy of the loop index. In practice these loops run at about the same speed but the gap grows in more complex examples. Loop rotation eventually becomes helpful once you need to hoist memory ops from loops:

// broadcast item across all elements
for (int i = 0; i < n; i++) {
  dst[i] = src[0];
}

given the src is a pointer i haven't yet accessed and n can be 0, i can't hoist that load because just placing it outside of the loop means it'll force a load even if n == 0 which could be a problem. but when we do loop rotation...

size_t i = 0;
if (i < n) {
  do {
    dst[i] = src[0];
    i += 1;
  } while (i < n);
}

we can now guarentee that once we're in the loop body we're doing at least one trip so hoisting is now fair game:

+ // if logic proved we do one trip which means we're met the condition to read src[0]
+ size_t hoisted = src[0];
  do {
-   dst[i] = src[0];
+   dst[i] = hoisted;
  } while (i < n);

Yea that's my ted talk. I started implementing this a while back and stopped because I only do optimizer stuff on the side, it's low priority for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment