
@RealNeGate
Created August 3, 2023 06:58

I asked Cliff Click about the caveats of sampling a thread after a pause; he's mentioned before that the OS will give you incorrect data:

"after a pause" has a few meanings...

  • The CPU takes an interrupt
    • Perhaps a TLB-fault/null-check - intentionally remapping to a Java NPE or a Java low-frequency null-check branch
    • Perhaps for a random stack sample profile
  • The JVM stops at a Safepoint, but the exact mechanism isn't specified
    • Perhaps via a breakpoint instruction (early HotSpot)
    • Perhaps via an explicit test/branch (Azul, maybe later HotSpot); see the sketch after this list
    • Sometimes from various failed cache checks
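
A minimal sketch of the explicit test/branch flavor in C, assuming a hypothetical per-thread poll flag (real VMs often poll a protected page instead, so the test becomes a load that faults on demand; all the names here are made up, not HotSpot's or Azul's actual API):

#include <stdatomic.h>

/* Hypothetical per-thread state. */
typedef struct {
    _Atomic int safepoint_requested;   /* set by the VM when it wants a pause */
} JavaThread;

void block_at_safepoint(JavaThread *t);  /* park until the VM releases the world */

/* Emitted (conceptually) at loop back-edges and method returns. */
static inline void safepoint_poll(JavaThread *t) {
    if (atomic_load_explicit(&t->safepoint_requested, memory_order_acquire))
        block_at_safepoint(t);
}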

On a hardware interrupt:

  • On some x86s, the CPU dumps "stack puke" on the user stack, then enters ring0/root
    • And root cleans up
    • Except you took a TLB miss on the stack, then a stack overflow miss, then ran out of swap
    • Somewhere the OS records this thread as blocked
  • Some remote thread, e.g. a GC thread, asks the kernel for the SP/PC of the blocked thread (see the sketch after this list)
  • The kernel has no freak'n clue, due to nested errors, and reports back garbage
  • GC thread crawls garbage, mangles heap.
  • Later the thread recovers then dies due to mangled heap.
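
For a concrete feel of "asking for the SP/PC of another thread", here is roughly what an in-process sampler on Linux/x86-64 does: poke the target with a signal and read the registers out of the ucontext the kernel hands to the handler. This is only an illustrative sketch, not the JVM's actual mechanism, but it is exactly the kind of kernel-supplied snapshot the failure path above says you can't fully trust:

#define _GNU_SOURCE
#include <signal.h>
#include <stdatomic.h>
#include <ucontext.h>

static _Atomic unsigned long g_sampled_pc, g_sampled_sp;

/* Runs on the target thread when the sampler sends it SIGPROF. */
static void sample_handler(int sig, siginfo_t *info, void *ucv) {
    (void)sig; (void)info;
    ucontext_t *uc = ucv;
    atomic_store(&g_sampled_pc, (unsigned long)uc->uc_mcontext.gregs[REG_RIP]);
    atomic_store(&g_sampled_sp, (unsigned long)uc->uc_mcontext.gregs[REG_RSP]);
}

static void install_sampler(void) {
    struct sigaction sa = {0};
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = sample_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);
    /* The sampler thread then hits targets with pthread_kill(tid, SIGPROF). */
}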

On a hardware interrupt:

  • After a CALL but before the prologue (PUSH RBP; SUB RSP,#stack_size) has run
  • Proper kernel action enters the user interrupt callback
  • User looks at the RPC, a hashtable reverses it to the JIT'd code, and gets the stack size (see the sketch after this list)
  • The frame hasn't been pushed yet, but the user lookup thinks it has, so it adjusts its stack model
  • User then crawls the stack for e.g. GC or profiling
  • Since the stack is mid-call, the user thread barfs and dies
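
Sketching that lookup-and-crawl in C (FrameInfo, find_frame_info, and the tiny table are made-up stand-ins; the point is that a fixed per-method frame size is only true once the prologue has actually executed):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical metadata the runtime keeps per JIT'd method. */
typedef struct {
    uintptr_t code_start, code_end;   /* PC range of the compiled code */
    size_t    frame_size;             /* bytes the prologue reserves on the stack */
} FrameInfo;

/* Tiny stand-in for the hashtable that reverses a PC to its JIT'd method. */
static const FrameInfo g_methods[] = {
    { 0x401000, 0x401200, 64 },
    { 0x401200, 0x401500, 96 },
};

static const FrameInfo *find_frame_info(uintptr_t pc) {
    for (size_t i = 0; i < sizeof g_methods / sizeof g_methods[0]; i++)
        if (pc >= g_methods[i].code_start && pc < g_methods[i].code_end)
            return &g_methods[i];
    return NULL;
}

/* Naive crawl: step over each frame using its fixed frame size.
   If the sample landed after the CALL but before the prologue ran,
   frame_size hasn't been reserved yet, the first step lands in the wrong
   place, and every "return address" read after that is garbage. */
static void crawl(uintptr_t pc, uintptr_t sp) {
    const FrameInfo *fi;
    while ((fi = find_frame_info(pc)) != NULL) {
        sp += fi->frame_size;            /* skip this frame's locals/spills */
        pc  = *(const uintptr_t *)sp;    /* read the saved return address   */
        sp += sizeof(uintptr_t);         /* pop it                          */
    }
}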

There are several other failure paths. I asked several kernel engineers about the state of reliably asking for another thread's RSP/RPC and was told "not reliable". It's correct 99.99% of the time; after a few million stack probes, you get it wrong and the JVM crashes. So Azul always manually saves the crawlable SP & PC to mark a frame as crawlable by the GC/profiler. There are 2 back-to-back stores, and they MUST happen in order - so some archs require a ST/ST barrier. On the clock-cycle of the 2nd store, you MIGHT get a remote thread crawling the self-stack, so you'd better be prepared for that.

ST [RSP+tls_pc],RPC // Sequence varies by chip & code-gen.  Can be e.g. #imm store, or a CALL/POP/ST.
ST [RSP+tls_sp],RSP // On this clock cycle, the stack might be crawled
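
The same two-store publication sketched with C11 atomics, with made-up field names; the release store on saved_sp plays the role of the ST/ST barrier, so the PC is guaranteed visible before the SP that "unlocks" the frame:

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-thread slots a GC/profiler is allowed to read. */
typedef struct {
    _Atomic uintptr_t saved_pc;
    _Atomic uintptr_t saved_sp;   /* nonzero == frame is crawlable ("unlocked") */
} CrawlableFrame;

static inline void publish_frame(CrawlableFrame *f, uintptr_t pc, uintptr_t sp) {
    atomic_store_explicit(&f->saved_pc, pc, memory_order_relaxed);
    /* Release ordering keeps the PC store ahead of the SP store; the instant
       this store lands, a remote thread may legitimately start crawling. */
    atomic_store_explicit(&f->saved_sp, sp, memory_order_release);
}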

This 2nd store of RSP "unlocks" the stack. It had been locked by the self/user thread. It's now unlocked. A GC might lock it, then crawl it & mutate stack elements above the saved RSP. So to start using your own stack again, you must lock it first!

CAS [RSP+tls_sp],0 // Attempt to self-lock stack.  Will fail if GC has it locked
JNE fail; // Blocked by e.g. GC
...  // And away we go...
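
Continuing the same C sketch, the matching re-lock: before the thread runs on its own stack again it must CAS the published SP back to 0, and a GC that wants to crawl does the mirror-image CAS first (GC_LOCK_VALUE and handle_blocked_by_gc are illustrative stand-ins for whatever the "fail" path really does):

#define GC_LOCK_VALUE ((uintptr_t)1)           /* made-up "GC owns this stack" marker */

void handle_blocked_by_gc(CrawlableFrame *f);  /* the "JNE fail" path: wait for the GC */

/* Self thread: reclaim ("lock") the stack before touching it again.
   Fails if the GC already swapped in its own lock value. */
static inline void relock_own_frame(CrawlableFrame *f, uintptr_t my_sp) {
    uintptr_t expected = my_sp;
    if (!atomic_compare_exchange_strong_explicit(
            &f->saved_sp, &expected, (uintptr_t)0,
            memory_order_acquire, memory_order_relaxed))
        handle_blocked_by_gc(f);
}

/* GC thread: try to take the frame for crawling; fails if the owner already
   relocked it (slot is 0) or another crawler beat us to it. */
static inline int try_lock_frame_for_crawl(CrawlableFrame *f, uintptr_t *sp_out) {
    uintptr_t sp = atomic_load_explicit(&f->saved_sp, memory_order_acquire);
    if (sp == 0 || sp == GC_LOCK_VALUE)
        return 0;
    if (!atomic_compare_exchange_strong_explicit(
            &f->saved_sp, &sp, GC_LOCK_VALUE,
            memory_order_acquire, memory_order_relaxed))
        return 0;
    *sp_out = sp;
    return 1;
}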