I've asked Cliff Click about the caveats of "after a pause"; he's mentioned before that the OS will give you incorrect data:
"after a pause" has a few meanings...
- The CPU takes an interrupt
  - Perhaps a TLB-fault/null-check - intentionally remapping to a Java NPE or a Java low-frequency null-check branch
  - Perhaps for a random stack sample profile
- The JVM stops at a Safepoint, but the exact mechanism isn't specified
  - Perhaps via a breakpoint instruction (early HotSpot)
  - Perhaps via an explicit test/branch (Azul, maybe later HotSpot)
  - Sometimes from various failed cache checks
On a hardware interrupt:
- On some X86's, the CPU dumps "stack puke" on the user stack, then enters ring0/root
  - And root cleans up
  - Except you took a TLB miss on the stack, then a stack overflow miss, then ran out of swap
- Somewhere the OS records this thread as blocked
- Some remote thread, e.g. a GC thread, asks the kernel for the SP/PC of the blocked thread
- The kernel has no freak'n clue, due to nested errors, and reports back garbage
- GC thread crawls garbage, mangles heap.
- Later the thread recovers, then dies due to the mangled heap.
On a hardware interrupt:
- After a CALL, but before the PUSH RBP; SUB RSP,#stack_size prologue
- Proper kernel action enters the user interrupt callback
- User looks at the RPC, hashtable-reverses it to the JIT'd code, gets the stack size
- The stack frame has not been pushed yet, but the user lookup thinks it has, so it adjusts its stack model
- User then crawls the stack, e.g. for GC or profiling
- Since the stack is mid-call, the user thread barfs and dies
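The off-by-one-frame failure above can be sketched as follows; the metadata struct, its field names, and the simplification that a mid-prologue frame contributes nothing to the stack are all hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical JIT metadata the crawler's hashtable maps a PC to. */
struct jit_method {
    uintptr_t code_start;    /* first byte of the JIT'd method */
    uintptr_t prologue_end;  /* first byte after PUSH RBP; SUB RSP,#stack_size */
    size_t    stack_size;    /* frame size once the prologue has run */
};

/* A naive crawler assumes the full frame exists and rewinds by
 * stack_size. If the interrupt hit inside the prologue, RSP still
 * points just below the return address, so this answer is wrong by
 * stack_size: the "adjusts its stack model" failure in the list above. */
static uintptr_t naive_caller_sp(const struct jit_method *m, uintptr_t rsp) {
    return rsp + m->stack_size;
}

/* A PC-aware crawler must special-case the prologue window.
 * (Simplified: a real unwinder also has to model partial pushes.) */
static uintptr_t pc_aware_caller_sp(const struct jit_method *m,
                                    uintptr_t pc, uintptr_t rsp) {
    if (pc >= m->code_start && pc < m->prologue_end)
        return rsp;                 /* frame not allocated yet */
    return rsp + m->stack_size;
}
```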
There are several other failure paths. I asked several kernel engineers about the state of reliably asking, cross-thread, for another thread's RSP/RPC, and was told "not reliable". It's correct 99.99% of the time. After a few million stack probes, you get it wrong and the JVM crashes.
So Azul always manually saves the crawlable stack & PC to mark a frame as crawlable by GC/profiler. There are 2 back-to-back stores, and they MUST happen in order - so some archs require a ST/ST barrier. On the clock-cycle of the 2nd store, you MIGHT get a remote thread crawling the self-stack. So you'd better be prepared for that.
ST [RSP+tls_pc],RPC // Sequence varies by chip & code-gen. Can be e.g. #imm store, or a CALL/POP/ST.
ST [RSP+tls_sp],RSP // On the clock cycle, the stack might be crawled
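In C11-atomic terms, the ordering contract on those two stores might look like this; the struct layout and names are hypothetical, but the release store expresses the ST/ST barrier requirement:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-thread anchor the crawler reads; names illustrative. */
struct crawl_anchor {
    atomic_uintptr_t pc;  /* saved RPC */
    atomic_uintptr_t sp;  /* saved RSP; nonzero == "stack is crawlable" */
};

/* The two back-to-back stores. The release store of sp is the ST/ST
 * barrier: a crawler that observes sp != 0 with an acquire load is
 * guaranteed to also see the pc that was stored before it. */
static void publish_frame(struct crawl_anchor *a, uintptr_t pc, uintptr_t sp) {
    atomic_store_explicit(&a->pc, pc, memory_order_relaxed);
    atomic_store_explicit(&a->sp, sp, memory_order_release); /* unlocks stack */
}
```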
This 2nd store of RSP "unlocks" the stack. It had been locked by the self/user thread. It's now unlocked. A GC might lock it, then crawl it & mutate stack elements above the saved RSP. So to start using your own stack again, you must lock it first!
CAS [RSP+tls_sp],0 // Attempt to self-lock stack. Will fail if GC has it locked
JNE fail; // Blocked by e.g. GC
... // And away we go...
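The re-lock step can be sketched with a C11 compare-and-swap; the slot layout and the convention that 0 means "self-locked, thread running" are assumptions taken from the prose above, not any real VM's encoding:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical TLS slot holding the published SP; names illustrative.
 * Nonzero = "crawlable/unlocked", 0 = "self-locked, thread running". */

/* Matches the CAS [RSP+tls_sp],0 sequence above: succeed only if the
 * slot still holds the SP this thread published, i.e. no crawler
 * (GC/profiler) has claimed the stack. On failure, the thread must
 * block until the crawler releases it. */
static int try_relock_stack(atomic_uintptr_t *sp_slot, uintptr_t published_sp) {
    uintptr_t expected = published_sp;
    return atomic_compare_exchange_strong(sp_slot, &expected, 0);
}
```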