(Claude Code; Opus 4.6)

Windows Stack Memory Corruption Investigation

Summary

Runtime stack memory corruption on Windows amd64 causes a DEP violation (exception 0xc0000005, code 0x8) when a goroutine returns to a corrupted return address. The corruption always overwrites the high 32 bits of a return address, replacing a valid code pointer (e.g. 0x00007ff6XXXXXXXX) with a value whose high 32 bits are a small number (e.g. 0x00000010XXXXXXXX). The low 32 bits are preserved. The data written over the return address's upper dword looks like part of a normal heap/stack address, not a special constant.

The crash was initially reported as a Go 1.26 regression, but testing showed it also reproduces with Go 1.25.0 and Go tip (master). It may have become more frequent in 1.26 due to changes in binary layout or stack usage patterns.

Reproduction

The crash reliably reproduces by running tailscale.com/tsnet tests on Windows amd64 with -test.count=3:

GOOS=windows GOARCH=amd64 go test -c -o tsnet_test.exe ./tsnet/
tsnet_test.exe -test.timeout=90s -test.count=3

The crash typically occurs during TestConn or later tests, in a goroutine running derpserver.(*sclient).run which reads DERP frames via ReadFrameHeader. The crashing goroutine was created by net/http.(*Server).Serve for an HTTPS DERP connection.

Reproduction was tested on:

Windows 11 build 26200 (12th Gen Intel i7-1255U, hybrid P+E cores)
GitHub Actions Windows runners (Azure VMs, various CPUs)

What We Know For Sure

The corruption pattern

Always the high 32 bits of an 8-byte return address on the goroutine stack
The corrupted return address was pushed by a CALL instruction (to ReadFrameHeader)
The goroutine's RET pops this corrupted value and jumps to it, causing DEP violation
The value written over the high 32 bits is a normal-range address (not a special constant)
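
The pattern above can be written as a small predicate, which may be handy when scanning crash dumps or instrumented logs. This is a minimal sketch: the function name and the sample addresses are hypothetical, chosen only to match the shape described in the summary.

```go
package main

import "fmt"

// highBitsClobbered reports whether got is want with only its upper 32 bits
// replaced: the low dword is intact and the high dword differs.
func highBitsClobbered(want, got uint64) bool {
	return uint32(want) == uint32(got) && want>>32 != got>>32
}

func main() {
	// Hypothetical values shaped like the examples in the summary.
	want := uint64(0x00007ff612345678) // plausible code pointer in the image
	got := uint64(0x0000001012345678)  // same low dword, small high dword
	fmt.Printf("0x%016x -> 0x%016x clobbered=%v\n", want, got, highBitsClobbered(want, got))
}
```

A value that differs in its low dword, or does not differ at all, does not match the pattern, so the predicate stays narrow.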

Async preemption is required

GODEBUG=asyncpreemptoff=1 prevents the crash (test times out instead)
This was the first and most definitive finding

Stack growth is involved

Setting stackMin=4096 (or higher) prevents the crash (test times out instead)
The default stackMin=2048 allows the crash
The crashing goroutine consistently has stackcopycount of 10-12, meaning its stack was copied/grown 10-12 times during its lifetime
The combination of async preemption + small stacks + stack growth is the trigger
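
For readers unfamiliar with stack growth, the sketch below shows why a copy is involved at all: when a goroutine outgrows its stack, the runtime allocates a larger one and moves the frames, so the address of a local variable can change. The function names, frame size, and recursion depth are illustrative assumptions, and `unsafe` is used only to observe the address as an integer.

```go
package main

import (
	"fmt"
	"unsafe"
)

// grow recurses so the goroutine needs far more stack than it starts with.
//
//go:noinline
func grow(depth int) int {
	var pad [1024]byte
	pad[depth%len(pad)] = byte(depth)
	if depth == 0 {
		return int(pad[0])
	}
	return grow(depth-1) + int(pad[depth%len(pad)])
}

// stackMoved runs deep recursion on a fresh goroutine and reports whether the
// address of one of its locals differed before and after.
func stackMoved() bool {
	done := make(chan bool)
	go func() {
		var local int
		before := uintptr(unsafe.Pointer(&local))
		local = grow(4096)
		after := uintptr(unsafe.Pointer(&local))
		done <- before != after
	}()
	return <-done
}

func main() {
	fmt.Println("local variable moved:", stackMoved())
}
```

On typical builds this reports that the variable moved, but the exact behavior depends on the Go version and the goroutine's starting stack size.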

The corruption does NOT happen during PushCall/SetThreadContext

Instrumented preemptM to verify the resumePC was correctly written to the goroutine stack by PushCall and still correct after SetThreadContext
The value was always correct at that point
The corruption occurs later, after the goroutine has been resumed

Not caused by GC or stack shrinking

GOGC=off still crashes
GODEBUG=gcshrinkstackoff=1 still crashes
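
The environment-variable experiments from this section and the async-preemption section can be scripted so each variant runs against the same test binary. This is a sketch under assumptions: the `experiment` type, the `command` helper, and the relative path to `tsnet_test.exe` are hypothetical, and the exit error alone does not distinguish a crash from a timeout, so the test output still needs to be inspected.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// experiment pairs a label with extra environment settings.
type experiment struct {
	name string
	env  []string
}

// The settings are the ones named in the notes.
var experiments = []experiment{
	{"baseline", nil},
	{"async preemption off", []string{"GODEBUG=asyncpreemptoff=1"}},
	{"GC off", []string{"GOGC=off"}},
	{"stack shrinking off", []string{"GODEBUG=gcshrinkstackoff=1"}},
}

// command builds the test invocation for one experiment. Entries appended
// after os.Environ() win, because os/exec keeps the last value of a duplicate key.
func command(bin string, e experiment) *exec.Cmd {
	cmd := exec.Command(bin, "-test.timeout=90s", "-test.count=3")
	cmd.Env = append(os.Environ(), e.env...)
	return cmd
}

func main() {
	for _, e := range experiments {
		err := command("./tsnet_test.exe", e).Run()
		fmt.Printf("%-22s err=%v\n", e.name, err)
	}
}
```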

Not caused by stale/freed stack references

stackPoisonCopy=1 (fills old stack with 0xfd after copy) still crashes with the same pattern (no 0xfd values in the corrupted data)
stackFaultOnFree=1 (maps old stack pages as inaccessible) still crashes with the same pattern (no access violation on old stack pages)

The copy itself appears correct

Added a post-adjustframe verification in copystack that compared every 8-byte value in the new stack against the old stack for the corruption pattern. It did not fire. This means the corruption is not introduced by memmove or adjustframe during the copy itself.
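
The idea behind such a check can be modeled outside the runtime. The sketch below is a simplified model, not the instrumentation that was actually added: it treats both stacks as equal-length slices of 8-byte words and accepts a word only if it is unchanged or if it pointed into the old stack and was shifted by the distance the stack moved. The function name and all addresses are hypothetical.

```go
package main

import "fmt"

// verifyCopy compares old and new stack images word by word. oldBase and
// newBase are the addresses of word 0 in each image. It returns the indexes
// of words that are neither unchanged nor correctly relocated stack pointers.
// Assumption: both slices cover the same used region and have equal length.
func verifyCopy(oldWords, newWords []uint64, oldBase, newBase uint64) []int {
	var bad []int
	oldEnd := oldBase + uint64(len(oldWords))*8
	delta := newBase - oldBase // unsigned wraparound handles moves in either direction
	for i := range oldWords {
		o, n := oldWords[i], newWords[i]
		if n == o {
			continue // return addresses and scalars must be bit-identical
		}
		if o >= oldBase && o < oldEnd && n == o+delta {
			continue // pointer into the old stack, relocated by delta
		}
		bad = append(bad, i)
	}
	return bad
}

func main() {
	const oldBase, newBase = 0xc000100000, 0xc000200000 // hypothetical stack addresses
	oldWords := []uint64{0x00007ff612345678, oldBase + 8, 42}
	newWords := []uint64{0x00007ff612345678, newBase + 8, 42}
	fmt.Println("clean copy, bad words:", verifyCopy(oldWords, newWords, oldBase, newBase))

	newWords[0] = 0x0000001012345678 // simulate the observed corruption
	fmt.Println("corrupted copy, bad words:", verifyCopy(oldWords, newWords, oldBase, newBase))
}
```

A check of this kind only sees the state at the moment it runs, which is consistent with the finding that it did not fire if the corruption happens after the copy completes.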

A separate bug exists in cgo callback stack growth

debugCheckBP=true detected 98 instances of invalid frame pointers during copystack. They occur when goroutines in Windows cgo callback chains need stack growth (e.g., the desktop session watcher's pumpThreadMessages -> getMessage -> Windows callback -> wndProc -> destroyWindow -> Windows callback -> callbackWrap).
The BP chain in these goroutines crosses from the Go goroutine stack to the Windows system/thread stack. adjustframe encounters a BP value outside the goroutine's stack range.
In the non-debug path, adjustpointer correctly skips adjustment of out-of-range values, so this is "safe" in that it doesn't corrupt data, but it means the BP chain is broken after the stack copy.
This is a SEPARATE bug from the DERP return address corruption. The DERP goroutine has no cgo frames. Both bugs involve copystack but in different goroutines with different stack structures.

What We Tried That Didn't Pan Out

.pdata sorting (commit bbed50aaa3)

The Go linker emits .pdata (Windows SEH function table) entries unsorted, violating the PE/COFF spec requirement that the entries be sorted by function address. Windows RtlLookupFunctionEntry does a binary search on these entries. The sort fix exists on master but is not in Go 1.26.x. However, the crash still occurs with the sort fix applied (tested on master). The .pdata issue is a real bug but is not the cause of this crash.
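
Whether a given binary has sorted .pdata can be checked offline. The sketch below uses the standard debug/pe package and assumes the amd64 layout, where each RUNTIME_FUNCTION record is three little-endian 32-bit fields (BeginAddress, EndAddress, UnwindInfoAddress). The helper names `pdataSorted` and `checkFile` are hypothetical.

```go
package main

import (
	"debug/pe"
	"encoding/binary"
	"fmt"
	"os"
)

// pdataSorted reports whether the 12-byte records in data are in ascending
// BeginAddress order. A trailing partial record is ignored.
func pdataSorted(data []byte) bool {
	const entrySize = 12
	var prev uint32
	for off := 0; off+entrySize <= len(data); off += entrySize {
		begin := binary.LittleEndian.Uint32(data[off:])
		if off > 0 && begin < prev {
			return false
		}
		prev = begin
	}
	return true
}

// checkFile loads the .pdata section of a PE file and checks its ordering.
func checkFile(path string) (bool, error) {
	f, err := pe.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()
	sec := f.Section(".pdata")
	if sec == nil {
		return false, fmt.Errorf("%s: no .pdata section", path)
	}
	data, err := sec.Data()
	if err != nil {
		return false, err
	}
	// Drop file-alignment padding past the section's virtual size.
	if vs := int(sec.VirtualSize); vs < len(data) {
		data = data[:vs]
	}
	return pdataSorted(data), nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: pdatacheck file.exe")
		return
	}
	sorted, err := checkFile(os.Args[1])
	fmt.Println("sorted:", sorted, "err:", err)
}
```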

GetThreadContext return value checking

Added a check for GetThreadContext returning 0 (failure). It never failed.

Stack scanning in preemptM

Added a loop to scan the goroutine stack for the corruption pattern immediately after PushCall + SetThreadContext. No corruption was found at that point, confirming the corruption happens later.

Larger initial stacks

Setting stackMin to 4096, 8192, or 65536 each prevents the crash, but also causes the test to time out. Larger stacks mean less stack growth, which avoids the bug. This doesn't pinpoint the mechanism but confirms stack growth is part of the trigger.

Minimal reproducers

Wrote two minimal Go programs that create many goroutines doing frame-reading over TCP/TLS connections with small stack frames, similar to the DERP server pattern. Neither crashed. The full tsnet test suite is needed to trigger the bug, suggesting it requires a specific combination of goroutine count, stack depth, I/O patterns, and timing.
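
The reproducer programs themselves are not included in these notes. As a rough illustration of the shape described, the sketch below runs many goroutines that each read small frames in a loop. It makes several assumptions: the 5-byte header layout (1-byte type, 4-byte big-endian length) is hypothetical, net.Pipe stands in for the TCP/TLS connections, and all function names are invented. Like the originals, it is not expected to trigger the crash.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
	"sync"
)

// readFrameHeader reads a hypothetical header: type byte plus payload length.
func readFrameHeader(r io.Reader) (typ byte, n uint32, err error) {
	var hdr [5]byte
	if _, err = io.ReadFull(r, hdr[:]); err != nil {
		return 0, 0, err
	}
	return hdr[0], binary.BigEndian.Uint32(hdr[1:]), nil
}

// serve reads frames until the connection fails or closes and returns the
// number of complete frames it saw.
func serve(c net.Conn) int {
	frames := 0
	for {
		_, n, err := readFrameHeader(c)
		if err != nil {
			return frames
		}
		if _, err := io.CopyN(io.Discard, c, int64(n)); err != nil {
			return frames
		}
		frames++
	}
}

// run starts conns reader/writer pairs, sends perConn frames on each, and
// returns the total number of frames read.
func run(conns, perConn int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	total := 0
	for i := 0; i < conns; i++ {
		client, server := net.Pipe()
		wg.Add(2)
		go func() {
			defer wg.Done()
			n := serve(server)
			mu.Lock()
			total += n
			mu.Unlock()
		}()
		go func() {
			defer wg.Done()
			defer client.Close() // closing ends the reader's loop
			frame := []byte{1, 0, 0, 0, 3, 'a', 'b', 'c'}
			for j := 0; j < perConn; j++ {
				if _, err := client.Write(frame); err != nil {
					return
				}
			}
		}()
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println("frames read:", run(200, 1000))
}
```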

Theories

Most likely: corruption during stack growth after async preemption resume

The goroutine is deep in a call stack when async-preempted. It yields via asyncPreempt -> asyncPreempt2 -> mcall -> gopreempt_m. When later rescheduled, it returns through asyncPreempt back to the interrupted code. The interrupted code (or a subsequent function call) triggers stack growth via morestack -> newstack -> copystack.

During this stack growth, something goes wrong. The memmove and adjustframe produce correct results (verified), but something after copystack returns uses a stale or incorrect reference to the old stack location, writing data to what is now either freed memory or another goroutine's stack. This write overwrites 4 bytes of a return address with heap/stack address data.

The stackFaultOnFree test should have caught a write to freed stack pages but didn't, which means either:

The old stack pages were immediately reused (returned to the stack pool and given to another goroutine), making them accessible
The corruption is on the NEW stack, not the old one, but the post-copy check missed it (perhaps due to timing - the corruption happens after copystack returns)
The corruption involves a different mechanism entirely

Alternative: Windows thread context interaction

When a goroutine is async-preempted via SuspendThread + SetThreadContext + ResumeThread, and then its stack is grown, there might be a subtle interaction where Windows retains internal references to the old stack (e.g., for APC delivery, exception handling, or thread context restoration) that become stale after the stack moves. This wouldn't be caught by stackFaultOnFree if Windows keeps its own mappings.

Next Steps

Use a real debugger: Set a hardware data breakpoint (DR0-DR3) on the return address location to catch the exact instruction that overwrites it. This requires a Windows debugger (WinDbg) attached to the process.
Add per-goroutine stack-copy tracking with frame validation: After each copystack, walk the new stack's frame pointer chain and validate that all return addresses look like valid code pointers (high bits in the expected image range).
Bisect the stack growth: Instead of growing the stack to a new allocation, try growing it in-place (remap to a larger region) to eliminate the memmove/pointer-adjustment path.
Test with stackNoCache=1: Prevent stack page reuse to see if the corruption changes (would confirm if old stack pages being reused by other goroutines is part of the story).
Investigate the cgo callback BP bug: The 98 invalid-BP-during-copystack warnings are a real bug that should be filed separately. While not the direct cause of the DERP corruption, they indicate that copystack has difficulty with certain frame layouts on Windows.
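
The frame-validation step can be prototyped at user level before touching the runtime. The sketch below is only an analog of the proposed check: instead of walking the frame pointer chain inside copystack, it captures return addresses with runtime.Callers and flags any that the runtime cannot attribute to a function. The helper names and the recursion depth are assumptions, and a check inside the runtime would need to read the stack words directly.

```go
package main

import (
	"fmt"
	"runtime"
)

// badReturnPCs captures up to 64 of the innermost return addresses on the
// calling goroutine's stack and returns those that do not resolve to any
// known function.
func badReturnPCs() []uintptr {
	pcs := make([]uintptr, 64)
	n := runtime.Callers(1, pcs)
	var bad []uintptr
	for _, pc := range pcs[:n] {
		// pc is a return address; pc-1 lies inside the calling instruction.
		if runtime.FuncForPC(pc-1) == nil {
			bad = append(bad, pc)
		}
	}
	return bad
}

// deep recurses so the stack has to grow, then validates at the bottom.
//
//go:noinline
func deep(n int) int {
	if n == 0 {
		return len(badReturnPCs())
	}
	return deep(n - 1)
}

func main() {
	fmt.Println("unresolvable return addresses:", deep(10000))
}
```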

bradfitz/win-corrupt.md

workturnedplay commented Apr 8, 2026
