
@RealNeGate
Created August 3, 2023 06:58

I asked Cliff Click about the caveats of sampling a thread after a pause; he's mentioned before that the OS will give you incorrect data:

"after a pause" has a few meanings...

  • The CPU takes an interrupt
    • Perhaps a TLB-fault/null-check - intentionally remapping to a Java NPE or a Java low-frequency null-check branch
    • Perhaps for a random stack sample profile
  • The JVM stops at a Safepoint, but the exact mechanism isn't specified
    • Perhaps via a breakpoint instruction (early HotSpot)
    • Perhaps via an explicit test/branch (Azul, maybe later HotSpot); see the sketch after this list
    • Sometimes from various failed cache checks
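
A minimal sketch of the explicit test/branch flavor in C, assuming a hypothetical per-thread poll flag (real VMs often poll a protected page instead, so the test becomes a load that faults on demand; all the names here are made up, not HotSpot's or Azul's actual API):

#include <stdatomic.h>

/* Hypothetical per-thread state. */
typedef struct {
    _Atomic int safepoint_requested;   /* set by the VM when it wants a pause */
} JavaThread;

void block_at_safepoint(JavaThread *t);  /* park until the VM releases the world */

/* Emitted (conceptually) at loop back-edges and method returns. */
static inline void safepoint_poll(JavaThread *t) {
    if (atomic_load_explicit(&t->safepoint_requested, memory_order_acquire))
        block_at_safepoint(t);
}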

On a hardware interrupt:

  • On some x86s, the CPU dumps "stack puke" on the user stack, then enters ring0/root
    • And root cleans up
    • Except you took a TLB miss on the stack, then a stack overflow miss, then ran out of swap
    • Somewhere the OS records this thread as blocked
  • Some remote thread, e.g. a GC thread, asks the kernel for the SP/PC of the blocked thread (see the sketch after this list)
  • The kernel has no freak'n clue, due to nested errors, and reports back garbage
  • GC thread crawls garbage, mangles heap.
  • Later the thread recovers then dies due to mangled heap.
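
For a concrete feel of "asking for the SP/PC of another thread", here is roughly what an in-process sampler on Linux/x86-64 does: poke the target with a signal and read the registers out of the ucontext the kernel hands to the handler. This is only an illustrative sketch, not the JVM's actual mechanism, but it is exactly the kind of kernel-supplied snapshot the failure path above says you can't fully trust:

#define _GNU_SOURCE
#include <signal.h>
#include <stdatomic.h>
#include <ucontext.h>

static _Atomic unsigned long g_sampled_pc, g_sampled_sp;

/* Runs on the target thread when the sampler sends it SIGPROF. */
static void sample_handler(int sig, siginfo_t *info, void *ucv) {
    (void)sig; (void)info;
    ucontext_t *uc = ucv;
    atomic_store(&g_sampled_pc, (unsigned long)uc->uc_mcontext.gregs[REG_RIP]);
    atomic_store(&g_sampled_sp, (unsigned long)uc->uc_mcontext.gregs[REG_RSP]);
}

static void install_sampler(void) {
    struct sigaction sa = {0};
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = sample_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);
    /* The sampler thread then hits targets with pthread_kill(tid, SIGPROF). */
}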

On a hardware interrupt:

  • After a CALL but before the prologue (PUSH RBP; SUB RSP,#stack_size) has run
  • Proper kernel action enters the user interrupt callback
  • User looks at the RPC, a hashtable reverses it to the JIT'd code, and gets the stack size (see the sketch after this list)
  • The frame hasn't been pushed yet, but the user lookup thinks it has, so it adjusts its stack model
  • User then crawls the stack for e.g. GC or profiling
  • Since the stack is mid-call, the user thread barfs and dies
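
Sketching that lookup-and-crawl in C (FrameInfo, find_frame_info, and the tiny table are made-up stand-ins; the point is that a fixed per-method frame size is only true once the prologue has actually executed):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical metadata the runtime keeps per JIT'd method. */
typedef struct {
    uintptr_t code_start, code_end;   /* PC range of the compiled code */
    size_t    frame_size;             /* bytes the prologue reserves on the stack */
} FrameInfo;

/* Tiny stand-in for the hashtable that reverses a PC to its JIT'd method. */
static const FrameInfo g_methods[] = {
    { 0x401000, 0x401200, 64 },
    { 0x401200, 0x401500, 96 },
};

static const FrameInfo *find_frame_info(uintptr_t pc) {
    for (size_t i = 0; i < sizeof g_methods / sizeof g_methods[0]; i++)
        if (pc >= g_methods[i].code_start && pc < g_methods[i].code_end)
            return &g_methods[i];
    return NULL;
}

/* Naive crawl: step over each frame using its fixed frame size.
   If the sample landed after the CALL but before the prologue ran,
   frame_size hasn't been reserved yet, the first step lands in the wrong
   place, and every "return address" read after that is garbage. */
static void crawl(uintptr_t pc, uintptr_t sp) {
    const FrameInfo *fi;
    while ((fi = find_frame_info(pc)) != NULL) {
        sp += fi->frame_size;            /* skip this frame's locals/spills */
        pc  = *(const uintptr_t *)sp;    /* read the saved return address   */
        sp += sizeof(uintptr_t);         /* pop it                          */
    }
}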

There are several other failure paths. I asked several kernel engineers about the state of reliably asking for another thread's RSP/RPC and was told "not reliable". It's correct 99.99% of the time; after a few million stack probes, you get it wrong and the JVM crashes. So Azul always manually saves the crawlable SP & PC to mark a frame as crawlable by the GC/profiler. There are 2 back-to-back stores, and they MUST happen in order - so some archs require a ST/ST barrier. On the clock-cycle of the 2nd store, you MIGHT get a remote thread crawling the self-stack, so you'd better be prepared for that.

ST [RSP+tls_pc],RPC // Sequence varies by chip & code-gen.  Can be e.g. #imm store, or a CALL/POP/ST.
ST [RSP+tls_sp],RSP // On this clock cycle, the stack might be crawled
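
The same two-store publication sketched with C11 atomics, with made-up field names; the release store on saved_sp plays the role of the ST/ST barrier, so the PC is guaranteed visible before the SP that "unlocks" the frame:

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-thread slots a GC/profiler is allowed to read. */
typedef struct {
    _Atomic uintptr_t saved_pc;
    _Atomic uintptr_t saved_sp;   /* nonzero == frame is crawlable ("unlocked") */
} CrawlableFrame;

static inline void publish_frame(CrawlableFrame *f, uintptr_t pc, uintptr_t sp) {
    atomic_store_explicit(&f->saved_pc, pc, memory_order_relaxed);
    /* Release ordering keeps the PC store ahead of the SP store; the instant
       this store lands, a remote thread may legitimately start crawling. */
    atomic_store_explicit(&f->saved_sp, sp, memory_order_release);
}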

This 2nd store of RSP "unlocks" the stack. It had been locked by the self/user thread. It's now unlocked. A GC might lock it, then crawl it & mutate stack elements above the saved RSP. So to start using your own stack again, you must lock it first!

CAS [RSP+tls_sp],0 // Attempt to self-lock stack.  Will fail if GC has it locked
JNE fail; // Blocked by e.g. GC
...  // And away we go...
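
Continuing the same C sketch, the matching re-lock: before the thread runs on its own stack again it must CAS the published SP back to 0, and a GC that wants to crawl does the mirror-image CAS first (GC_LOCK_VALUE and handle_blocked_by_gc are illustrative stand-ins for whatever the "fail" path really does):

#define GC_LOCK_VALUE ((uintptr_t)1)           /* made-up "GC owns this stack" marker */

void handle_blocked_by_gc(CrawlableFrame *f);  /* the "JNE fail" path: wait for the GC */

/* Self thread: reclaim ("lock") the stack before touching it again.
   Fails if the GC already swapped in its own lock value. */
static inline void relock_own_frame(CrawlableFrame *f, uintptr_t my_sp) {
    uintptr_t expected = my_sp;
    if (!atomic_compare_exchange_strong_explicit(
            &f->saved_sp, &expected, (uintptr_t)0,
            memory_order_acquire, memory_order_relaxed))
        handle_blocked_by_gc(f);
}

/* GC thread: try to take the frame for crawling; fails if the owner already
   relocked it (slot is 0) or another crawler beat us to it. */
static inline int try_lock_frame_for_crawl(CrawlableFrame *f, uintptr_t *sp_out) {
    uintptr_t sp = atomic_load_explicit(&f->saved_sp, memory_order_acquire);
    if (sp == 0 || sp == GC_LOCK_VALUE)
        return 0;
    if (!atomic_compare_exchange_strong_explicit(
            &f->saved_sp, &sp, GC_LOCK_VALUE,
            memory_order_acquire, memory_order_relaxed))
        return 0;
    *sp_out = sp;
    return 1;
}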