This is a trick mentioned by Cliff Click from his time at Azul: if you've got a stack, you've got a cheap thread-local buffer.

#include <sys/mman.h>   // MAP_HUGE_2MB may also need <linux/mman.h> on older libcs

// 2MiB aligned 2MiB stack (size of a large page on x86); MAP_HUGETLB is what
// makes MAP_HUGE_2MB take effect and gives us the 2MiB alignment
enum { STACK_SIZE = 2*1024*1024 };
char* stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE, MAP_HUGETLB | MAP_HUGE_2MB | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
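
if huge pages aren't available you can still get the alignment the trick needs; here's a rough sketch (not from the gist, and `alloc_aligned_stack` is just an illustrative name) that over-allocates with normal pages and unmaps the slack:

#include <stdint.h>
#include <sys/mman.h>

static char* alloc_aligned_stack(void) {
    size_t len = 2 * (size_t)STACK_SIZE;
    char* raw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;

    // round up to the next 2MiB boundary, then trim both ends
    uintptr_t base = ((uintptr_t)raw + STACK_SIZE - 1) & ~(uintptr_t)(STACK_SIZE - 1);
    size_t head = base - (uintptr_t)raw;
    size_t tail = len - head - STACK_SIZE;
    if (head) munmap(raw, head);
    if (tail) munmap((char*)base + STACK_SIZE, tail);
    return (char*)base;
}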

because now we can store thread locals at the base and always locate the base by chopping off the low bits of the stack pointer:

mov  rax, rsp
and  rax, -0x200000     ; clear the low 21 bits -> 2MiB-aligned stack base
; rax now points at the base of the stack, which is where the thread locals live

so long as we have a stack whose alignment matches its size, the base is trivial to find.
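
the same masking works from plain C; a small sketch (the `ThreadLocals` layout and `get_thread_locals` name are mine, not from the gist) that takes the address of any local variable and clears the low bits:

#include <stdint.h>

typedef struct {
    int thread_id;
    // ... whatever per-thread data the runtime wants ...
} ThreadLocals;

static inline ThreadLocals* get_thread_locals(void) {
    char probe;  // any stack variable's address works as an anchor
    return (ThreadLocals*)((uintptr_t)&probe & ~(uintptr_t)(STACK_SIZE - 1));
}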

we can take it further if we store a safepoint polling page at the bottom:

mov  rax, rsp
and  rax, -0x200000     ; find the stack base again
test rax, [rax]         ; touch the poll page; faults when a safepoint is requested
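
in C terms the poll is roughly this (a sketch; `safepoint_poll` is my name, and the fault-handling side is only described, not shown):

#include <stdint.h>

static inline void safepoint_poll(void) {
    char probe;
    uintptr_t base = (uintptr_t)&probe & ~(uintptr_t)(STACK_SIZE - 1);
    char v = *(volatile char*)base;  // this load is the poll
    (void)v;
}

the runtime arms it by making the poll page at the base unreadable, so the next poll on that stack faults.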

According to UICA, the mov/and/test sequence above has a throughput of around 0.81 cycles; add the cost of an L1 hit (3-4 cycles) and we've got a nice 4-5-ish cycles per poll. Note that it doesn't take 5 cycles before you can do anything else: other work can proceed while the load is in flight, so really you're only tying up ports for about 0.81 cycles, and roughly 4 cycles later the load completes and you're happy (or the poll page has been made inaccessible and you get a segfault, which is how the runtime asks the thread to pause). Even 1-2 cycles can still be costly in a tight loop, so we can improve things further...
