(Claude Code; Opus 4.6)
Runtime stack memory corruption on Windows amd64 causes a DEP violation (Exception
0xc0000005 code 0x8) when a goroutine jumps to a corrupted return address. The
corruption always overwrites the high 32 bits of a return address, replacing a valid
code pointer (e.g. 0x00007ff6XXXXXXXX) with a value whose high 32 bits are a
small number (e.g. 0x00000010XXXXXXXX). The low 32 bits are preserved. The
corrupted value is a normal heap/stack address that gets written over the return
address's upper dword.
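The pattern described above can be expressed as two small predicates. This is only an illustrative sketch: the function names, the image base and size, and the sample addresses are all made up for the example and do not come from the crash dumps.

```go
package main

import "fmt"

// highBitsClobbered reports whether got looks like want with only its upper
// 32 bits replaced: the low dwords match and the high dwords differ.
func highBitsClobbered(want, got uint64) bool {
	return uint32(want) == uint32(got) && want>>32 != got>>32
}

// inImage reports whether addr falls inside [base, base+size), the range a
// valid return address into the executable image would occupy.
func inImage(addr, base, size uint64) bool {
	return addr >= base && addr-base < size
}

func main() {
	// Hypothetical values, chosen only to match the shape of the report.
	const (
		imageBase = 0x00007ff6_12340000
		imageSize = 0x0200_0000
		good      = 0x00007ff6_1239abcd // valid return address
		bad       = 0x00000010_1239abcd // same low dword, small high dword
	)
	fmt.Println(highBitsClobbered(good, bad))                                        // true
	fmt.Println(inImage(good, imageBase, imageSize), inImage(bad, imageBase, imageSize)) // true false
}
```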
The crash was initially reported as a Go 1.26 regression, but testing showed it also reproduces with Go 1.25.0 and Go tip (master). It may have become more frequent in 1.26 due to changes in binary layout or stack usage patterns.
The crash reliably reproduces by running tailscale.com/tsnet tests on Windows
amd64 with -test.count=3:
GOOS=windows GOARCH=amd64 go test -c -o tsnet_test.exe ./tsnet/
tsnet_test.exe -test.timeout=90s -test.count=3
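To compare runs under the environment settings discussed in this report, a small harness along the following lines could be used. It is a sketch, not the tooling actually used for the investigation: the `classify` helper is hypothetical, and the output substrings it matches ("Exception 0xc0000005" for the crash, "test timed out" for the timeout) are assumptions about what the test binary prints.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// classify maps the combined output of one test run to a coarse outcome.
// failed is true when the process exited with a non-zero status.
func classify(output string, failed bool) string {
	switch {
	case strings.Contains(output, "Exception 0xc0000005"):
		return "crash"
	case strings.Contains(output, "test timed out"):
		return "timeout"
	case failed:
		return "fail"
	default:
		return "pass"
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: harness path\\to\\tsnet_test.exe")
		return
	}
	// The empty string is the baseline run with no extra setting.
	variants := []string{"", "GODEBUG=asyncpreemptoff=1", "GODEBUG=gcshrinkstackoff=1", "GOGC=off"}
	for _, v := range variants {
		cmd := exec.Command(os.Args[1], "-test.timeout=90s", "-test.count=3")
		cmd.Env = os.Environ()
		if v != "" {
			cmd.Env = append(cmd.Env, v)
		}
		out, err := cmd.CombinedOutput()
		fmt.Printf("%-28q %s\n", v, classify(string(out), err != nil))
	}
}
```

Settings such as stackMin or stackPoisonCopy are constants in the runtime source, so they need a rebuilt toolchain and cannot be toggled by a harness like this.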
The crash typically occurs during TestConn or later tests, in a goroutine running
derpserver.(*sclient).run which reads DERP frames via ReadFrameHeader. The
crashing goroutine was created by net/http.(*Server).Serve for an HTTPS DERP
connection.
The crash was tested on:
- Windows 11 build 26200 (12th Gen Intel i7-1255U, hybrid P+E cores)
- GitHub Actions Windows runners (Azure VMs, various CPUs)
Observed corruption pattern:
- Always the high 32 bits of an 8-byte return address on the goroutine stack
- The corrupted return address was pushed by a CALL instruction (to ReadFrameHeader)
- The goroutine's RET pops this corrupted value and jumps to it, causing a DEP violation
- The value written over the high 32 bits is a normal-range address (not a special constant)
- GODEBUG=asyncpreemptoff=1 prevents the crash (test times out instead)
- This was the first and most definitive finding
- Setting stackMin=4096 (or higher) prevents the crash (test times out instead)
- The default stackMin=2048 allows the crash
- The crashing goroutine consistently has a stackcopycount of 10-12, meaning its stack was copied/grown 10-12 times during its lifetime
- The combination of async preemption + small stacks + stack growth is the trigger
- Instrumented preemptM to verify the resumePC was correctly written to the goroutine stack by PushCall and still correct after SetThreadContext
- The value was always correct at that point
- The corruption occurs later, after the goroutine has been resumed
- GOGC=off still crashes
- GODEBUG=gcshrinkstackoff=1 still crashes
- stackPoisonCopy=1 (fills old stack with 0xfd after copy) still crashes with the same pattern (no 0xfd values in the corrupted data)
- stackFaultOnFree=1 (maps old stack pages as inaccessible) still crashes with the same pattern (no access violation on old stack pages)
- Added a post-adjustframe verification in copystack that compared every 8-byte value in the new stack against the old stack for the corruption pattern. It did not fire. This means the corruption is not introduced by memmove or adjustframe during the copy itself.
- debugCheckBP=true detected 98 instances of invalid frame pointers during copystack when goroutines in Windows cgo callback chains (e.g., the desktop session watcher's pumpThreadMessages -> getMessage -> Windows callback -> wndProc -> destroyWindow -> Windows callback -> callbackWrap) need stack growth
- The BP chain in these goroutines crosses from the Go goroutine stack to the Windows system/thread stack. adjustframe encounters a BP value outside the goroutine's stack range.
- In the non-debug path, adjustpointer correctly skips adjustment of out-of-range values, so this is "safe" in that it doesn't corrupt data, but it means the BP chain is broken after the stack copy.
- This is a SEPARATE bug from the DERP return address corruption. The DERP goroutine has no cgo frames. Both bugs involve copystack but in different goroutines with different stack structures.
The Go linker emits .pdata (Windows SEH function table) entries unsorted,
violating the PE/COFF spec requirement. Windows RtlLookupFunctionEntry does a
binary search on these entries. The sort fix exists on master but is not in Go
1.26.x. However, the crash still occurs with the sort fix applied (tested on
master). The .pdata issue is a real bug but is not the cause of this crash.
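Whether a given binary has an unsorted .pdata table can be checked with a short program like the sketch below. It assumes the amd64 RUNTIME_FUNCTION layout of three little-endian 32-bit RVAs per entry (begin, end, unwind info); the `firstUnsorted` helper is a name invented for this example.

```go
package main

import (
	"debug/pe"
	"encoding/binary"
	"fmt"
	"os"
)

// firstUnsorted parses raw .pdata bytes as 12-byte amd64 RUNTIME_FUNCTION
// records and returns the index of the first record whose begin RVA is lower
// than its predecessor's, or -1 if the table is sorted.
func firstUnsorted(pdata []byte) int {
	const recSize = 12
	var prev uint32
	for i := 0; (i+1)*recSize <= len(pdata); i++ {
		begin := binary.LittleEndian.Uint32(pdata[i*recSize:])
		if i > 0 && begin < prev {
			return i
		}
		prev = begin
	}
	return -1
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pdatacheck file.exe")
		return
	}
	f, err := pe.Open(os.Args[1])
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	defer f.Close()
	sec := f.Section(".pdata")
	if sec == nil {
		fmt.Println("no .pdata section")
		return
	}
	data, err := sec.Data()
	if err != nil {
		fmt.Println("read .pdata:", err)
		return
	}
	// Drop file-alignment padding so trailing zero bytes are not parsed as entries.
	if int(sec.VirtualSize) < len(data) {
		data = data[:sec.VirtualSize]
	}
	if i := firstUnsorted(data); i >= 0 {
		fmt.Printf("entry %d is out of order\n", i)
	} else {
		fmt.Println(".pdata is sorted")
	}
}
```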
Added a check for GetThreadContext returning 0 (failure). It never failed.
Added a loop to scan the goroutine stack for the corruption pattern immediately
after PushCall + SetThreadContext. No corruption was found at that point,
confirming the corruption happens later.
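A scan of this kind might look roughly like the following, written here as ordinary Go over a snapshot of stack words rather than as the actual runtime instrumentation. The heuristic is an assumption: a word is flagged if putting the image's upper dword back onto its lower dword yields an address inside the image, while the word itself has a different upper dword. The names `suspicious` and `scan` and all addresses are hypothetical, and the check ignores images that straddle a 4 GiB boundary.

```go
package main

import "fmt"

// suspicious reports whether w looks like a pointer into the image
// [base, base+size) whose upper 32 bits were replaced.
func suspicious(w, base, size uint64) bool {
	repaired := base&^0xffffffff | w&0xffffffff
	// If repaired < base the subtraction wraps to a huge value and the
	// comparison is false, which is the desired result.
	return w>>32 != base>>32 && repaired-base < size
}

// scan returns the indices of all suspicious words in a stack snapshot.
func scan(words []uint64, base, size uint64) []int {
	var hits []int
	for i, w := range words {
		if suspicious(w, base, size) {
			hits = append(hits, i)
		}
	}
	return hits
}

func main() {
	const (
		imageBase = 0x00007ff6_12340000 // hypothetical
		imageSize = 0x0200_0000         // hypothetical
	)
	stack := []uint64{
		0x000000c0_00012340, // ordinary heap pointer
		0x00007ff6_1239abcd, // intact return address
		0x00000010_1239abcd, // same low dword, clobbered high dword
		0,
	}
	fmt.Println(scan(stack, imageBase, imageSize)) // [2]
}
```

The heuristic can produce false positives for unrelated data whose low 32 bits happen to line up with the image, so hits would still need manual inspection.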
Setting stackMin to 4096, 8192, or 65536 all prevent the crash, but also cause
the test to time out. Larger stacks mean less stack growth, which avoids the bug.
This doesn't pinpoint the mechanism but confirms stack growth is part of the
trigger.
Wrote two minimal Go programs that create many goroutines doing frame-reading over TCP/TLS connections with small stack frames, similar to the DERP server pattern. Neither crashed. The full tsnet test suite is needed to trigger the bug, suggesting it requires a specific combination of goroutine count, stack depth, I/O patterns, and timing.
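For illustration, the general shape of such a program is sketched below. It is not one of the two programs mentioned above: it uses in-memory net.Pipe connections instead of TCP/TLS, and the 5-byte header (1-byte type, 4-byte big-endian length) is an assumption modelled on DERP-style framing. All function names are hypothetical.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"net"
	"sync"
)

// readFrameHeader reads an assumed 5-byte header: type, then payload length.
func readFrameHeader(br *bufio.Reader) (typ byte, n uint32, err error) {
	var hdr [5]byte
	if _, err = io.ReadFull(br, hdr[:]); err != nil {
		return 0, 0, err
	}
	return hdr[0], binary.BigEndian.Uint32(hdr[1:]), nil
}

// serve counts frames on c until EOF or error, discarding payloads.
func serve(c net.Conn) (frames int) {
	defer c.Close()
	br := bufio.NewReader(c)
	for {
		_, n, err := readFrameHeader(br)
		if err != nil {
			return frames
		}
		if _, err := br.Discard(int(n)); err != nil {
			return frames
		}
		frames++
	}
}

// run starts conns reader/writer goroutine pairs and returns the total
// number of frames the readers saw.
func run(conns, framesPerConn int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for i := 0; i < conns; i++ {
		client, server := net.Pipe()
		wg.Add(2)
		go func() {
			defer wg.Done()
			n := serve(server)
			mu.Lock()
			total += n
			mu.Unlock()
		}()
		go func() {
			defer wg.Done()
			defer client.Close()
			frame := []byte{1, 0, 0, 0, 3, 'a', 'b', 'c'} // type 1, length 3
			for j := 0; j < framesPerConn; j++ {
				if _, err := client.Write(frame); err != nil {
					return
				}
			}
		}()
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(run(1000, 100)) // 100000
}
```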
The goroutine is deep in a call stack when async-preempted. It yields via
asyncPreempt -> asyncPreempt2 -> mcall -> gopreempt_m. When later
rescheduled, it returns through asyncPreempt back to the interrupted code. The
interrupted code (or a subsequent function call) triggers stack growth via
morestack -> newstack -> copystack.
During this stack growth, something goes wrong. The memmove and adjustframe
produce correct results (verified), but something after copystack returns uses a
stale or incorrect reference to the old stack location, writing data to what is now
either freed memory or another goroutine's stack. This write overwrites 4 bytes of
a return address with heap/stack address data.
The stackFaultOnFree test should have caught a write to freed stack pages but
didn't, which means either:
- The old stack pages were immediately reused (returned to the stack pool and given to another goroutine), making them accessible
- The corruption is on the NEW stack, not the old one, but the post-copy check missed it (perhaps due to timing - the corruption happens after copystack returns)
- The corruption involves a different mechanism entirely
When a goroutine is async-preempted via SuspendThread + SetThreadContext +
ResumeThread, and then its stack is grown, there might be a subtle interaction
where Windows retains internal references to the old stack (e.g., for APC delivery,
exception handling, or thread context restoration) that become stale after the
stack moves. This wouldn't be caught by stackFaultOnFree if Windows keeps its
own mappings.
- Use a real debugger: Set a hardware data breakpoint (DR0-DR3) on the return address location to catch the exact instruction that overwrites it. This requires a Windows debugger (WinDbg) attached to the process.
- Add per-goroutine stack-copy tracking with frame validation: After each copystack, walk the new stack's frame pointer chain and validate that all return addresses look like valid code pointers (high bits in the expected image range).
- Bisect the stack growth: Instead of growing the stack to a new allocation, try growing it in-place (remap to a larger region) to eliminate the memmove/pointer-adjustment path.
- Test with stackNoCache=1: Prevent stack page reuse to see if the corruption changes (would confirm if old stack pages being reused by other goroutines is part of the story).
- Investigate the cgo callback BP bug: The 98 invalid-BP-during-copystack warnings are a real bug that should be filed separately. While not the direct cause of the DERP corruption, they indicate that copystack has difficulty with certain frame layouts on Windows.
This has been fixed in Go 1.26.2 (seemingly the latest release) as golang/go@1a44be4, though I only tested my own program(s) with the master branch (1.27 devel), which also had the following commit (currently not in any Go release): golang/go@6ab37c1
My thanks to everyone involved!