This is a trick mentioned by Cliff Click from his time at Azul: if you've got a stack, you've got a cheap thread-local buffer.
// 2MiB-aligned 2MiB stack (the size of a large page on x86);
// MAP_HUGETLB is what actually gives us the huge backing page and its alignment
enum { STACK_SIZE = 2*1024*1024 };
char* stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB, -1, 0);
because now we can store thread locals at the base of the stack and always locate that base by chopping the low bits off the stack pointer:
mov rax, rsp          ; current stack pointer
and rax, -0x200000    ; clear the low 21 bits -> the 2MiB-aligned stack base
... you know where the thread locals are now ...
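The same mask is easy to do from C as well. Here's a minimal sketch (the struct and function names are mine, not from Cliff's description), assuming the thread is running on a stack allocated as above with a struct tls sitting at its base and STACK_SIZE defined as before:

#include <stdint.h>

struct tls { int thread_id; /* ...whatever else you want at the base... */ };

static inline struct tls* current_tls(void) {
    // any address inside the current frame is on this stack, close enough to rsp
    uintptr_t sp = (uintptr_t)__builtin_frame_address(0);
    // chop off the low 21 bits to land on the 2MiB-aligned stack base
    return (struct tls*)(sp & ~((uintptr_t)STACK_SIZE - 1));
}

Nothing here needs a reserved register or a TLS segment lookup: the base falls out of whatever stack address you already have in hand.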
so long as we have a stack whose alignment matches its size, the base is trivial to find.
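If huge pages aren't available, the alignment itself is still cheap to manufacture. A common trick (my sketch, not something from Cliff's talk) is to over-map by 2x and trim the misaligned head and tail:

#include <stdint.h>
#include <sys/mman.h>

static char* map_aligned_stack(void) {
    // map twice the size so a STACK_SIZE-aligned region must fall inside it
    char* raw = mmap(NULL, 2*STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;
    uintptr_t base = ((uintptr_t)raw + STACK_SIZE - 1) & ~((uintptr_t)STACK_SIZE - 1);
    // unmap the misaligned head (if any) and the leftover tail
    if (base != (uintptr_t)raw)
        munmap(raw, base - (uintptr_t)raw);
    munmap((char*)(base + STACK_SIZE),
           (uintptr_t)raw + 2*STACK_SIZE - (base + STACK_SIZE));
    return (char*)base;
}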
we can take it further if we place a safepoint polling page at the bottom of the stack:
mov rax, rsp          ; current stack pointer
and rax, -0x200000    ; 2MiB-aligned stack base, i.e. the polling page
test rax, [rax]       ; touch the polling page; faults if the runtime has protected it
which according to uiCA has a throughput of around 0.81 cycles; add the cost of the L1 load (3-4 cycles) and a poll comes to a nice 4-5 ish cycles. To be clear, it doesn't stall you for those 5 cycles: other work keeps issuing while the load is in flight, so you're really paying about 0.81 cycles of port pressure, and roughly 4 cycles later the load either completes and you carry on, or the runtime has made the page inaccessible and you take a segfault, which is how it asks you to pause. Even 1-2 cycles can be costly in a tight loop, though, so we can improve things further...
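Before getting to that, for a sense of how the poll actually stops a thread, here's a rough sketch of the arming side. This is my reconstruction, not Azul's code: it assumes the bottom page of the stack can be protected on its own (with a single 2MiB huge page you'd have to protect the whole region), and it shows one poll page where a real runtime would flip one per thread.

#include <signal.h>
#include <stdatomic.h>
#include <sys/mman.h>

enum { POLL_PAGE_SIZE = 4096 };

static char* g_poll_page;                  // bottom page of the aligned stack
static atomic_int g_safepoint_requested;

// arm: every `test rax, [rax]` now faults on its next poll
static void arm_safepoint(void) {
    atomic_store(&g_safepoint_requested, 1);
    mprotect(g_poll_page, POLL_PAGE_SIZE, PROT_NONE);
}

// disarm: make the page readable again and release the parked threads
static void disarm_safepoint(void) {
    mprotect(g_poll_page, POLL_PAGE_SIZE, PROT_READ | PROT_WRITE);
    atomic_store(&g_safepoint_requested, 0);
}

static void on_segv(int sig, siginfo_t* info, void* ctx) {
    (void)ctx;
    if (info->si_addr == g_poll_page) {
        // a poll fault: park until the runtime disarms, then return so the
        // faulting load re-executes against the now-readable page
        while (atomic_load(&g_safepoint_requested)) { /* spin / park */ }
        return;
    }
    signal(sig, SIG_DFL);                  // genuine crash: hand back to the default handler
    raise(sig);
}

static void install_poll_handler(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
}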