Skip to content

Instantly share code, notes, and snippets.

@RealNeGate
Last active August 9, 2022 01:37
Show Gist options
  • Select an option

  • Save RealNeGate/96526c35158bbb23a9e42060322437fa to your computer and use it in GitHub Desktop.

Select an option

Save RealNeGate/96526c35158bbb23a9e42060322437fa to your computer and use it in GitHub Desktop.

return

Walking through an example

i'll eventually write the rest but here's the gist (get it) of it.

Crap: https://godbolt.org/z/59xTYsK1x

Let's start with a small example:

for (size_t i = 0; i < n; i++) {
  lines += (data[i] == '\n') ? 1 : 0;
}

And now we desugar it to show the important details in a nicer format:

size_t i = 0;
while (i < n) {
  lines += (data[i] == '\n') ? 1 : 0;
  i += 1;
}

For the loop vectorizer to run, we start with a loop and try to verify that it has some amount of "eventual consistency" meaning that we can reorder and run multiple iterations at once and converge on the answer later (i'll explain this verification process later because it can get complex). From there we can try to vectorize by first walking it and seeing what the ideal vector width is for the operations, it's also a complex process which I'll go through but for now let's assume in this case that's 16wide because we're "mainly" operating on characters and the simplest native vector width we can go is 128bit so 128bit/8bit = 16 wide. Now we just walk the IR and widen each step by hand.

// widen
size_t unrolled_part = n & ~15;

size_t i = 0;
while (i < unrolled_part) {
  // originally: 
  //   lines += (data[i] == '\n') ? 1 : 0;
  // but we can widen the comparison by doing a vector compare 
  // then tallying up the results via popcount
  uint8x16_t compare = (uint8x16_load(&data[i]) == uint8x16_set1('\n'));
  lines += popcount(movemask(compare));
  
  // originally:
  //   i += 1;
  // but by doing 16 iterations we just add 16 instead.
  i += 16;
}

// this separate loop is identical to the original loop and acts as the
// fallback because we can't guarentee the loads at the very end of the loop.
//
// this loop can be removed if we can assume that either the count is a multiple of 16
while (i < n) {
  lines += (data[i] == '\n') ? 1 : 0;
  i += 1;
}

Slack space

let's imagine that we could say to a pointer "for any spot that you read from, there's at least N bytes available next to it" what does something like that actually imply? Mainly that we can drop the fallthrough loop since the last iteration can thus be done SIMD and just mask out the bytes we think are garbage, they're still read and operated on in SIMD we just consider some slots undefined. Let's show this example:

// widen
size_t normal_part = n & ~15;
size_t remainder = n & 15;

size_t i = 0;
while (i < n) {
  // this mask is used to cutoff the last piece where we have the garbage data
  uint16_t cutoff = i >= normal_part ? remainder : 0xFFFF;

  uint8x16_t compare = (uint8x16_load(&data[i]) == uint8x16_set1('\n'));
  lines += popcount(movemask(compare) & cutoff);

  i += 16;
}

Blah blah eventual consistency:

something something we have eventual consistency if:

  • Each iteration doesn't write any memory (this one is semi-niche but it's kinda the example i showed)
  • Each iteration that does write memory does so in a way where it doesn't overlap with any other iterations. This includes stuff like a a[i] += b[i] where a and b must not alias to achieve eventual consistency.
  • Something else i didn't think about.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment