Skip to content

Instantly share code, notes, and snippets.

@KristofferC
Created June 15, 2026 11:13
Show Gist options
  • Select an option

  • Save KristofferC/ec195fdaa333c49a58ac9ddb18f87dea to your computer and use it in GitHub Desktop.

Select an option

Save KristofferC/ec195fdaa333c49a58ac9ddb18f87dea to your computer and use it in GitHub Desktop.

Why Julia's atomic modify doesn't fold to atomicrmw (and when it actually does)

Investigation notes for mwe.jl. Versions tested: 1.12.6, 1.13.0-rc1, 1.14.0-DEV.2212 (062a90bc8c). Source references are to julia master (6e40a4a448, 2026-06-08).

TL;DR

The MWE's code_llvm-based counting is partly misled by a reflection artifact. The real situation, measured on native code of the actual JIT output (code_native(...; dump_module=false)):

case 1.12 native 1.13+/nightly native
unsafe_modify!(Ptr{Int}, +) cmpxchg + call lock add (fixed)
@atomic AtomicMemory{Int}[i] += v cmpxchg + call lock add (fixed)
unsafe_modify!(Ptr{Float64}, +) cmpxchg + call cmpxchg + call (still broken)
@atomic AtomicMemory{Float64}[i] += v cmpxchg + call cmpxchg + call (still broken)
code_llvm (any of the above) loop + call loop + call (misleading for Int!)

So:

  1. Int was fixed in 1.13 by PR #57010 — but @code_llvm/@code_native dump_module=true don't show it (open issue #59645; exact mechanism pinned down below, not yet on the issue).
  2. Float64 — the Ferrite case — is still broken on all versions including nightly. No atomicrmw fadd, and the + is not even inlined into the CAS loop, so the register-pressure cost from the opaque call persists.
  3. On 1.10–1.12 nothing can be fixed retroactively; the manual load/unsafe_replace! CAS loop stays the right workaround there (it works because the old + v call exists in Julia IR, where the Julia inliner can reach it).
  4. Update: both gaps are fixed by the prototype on the kc/atomicmodify-alwaysinline branch in ~/julia — Float64 += now compiles to a single atomicrmw fadd, and unmatchable ops inline into the CAS loop. See "Fix sketches" below for results and caveats.

Root cause, layer by layer

The architectural problem (all versions)

The op of modifyfield!/memoryrefmodify!/atomic_pointermodify is a function value passed to a builtin; the call op(old, v) happens inside the intrinsic's semantics. The CAS retry loop is synthesized in C++ codegen (typed_store in src/cgutils.cpp) and never exists as Julia IR, so the Julia-level inliner cannot inline + into it. The optimizer only devirtualizes: handle_modifyop!_call! (Compiler/src/ssair/inlining.jl:1480) rewrites the call to :invoke_modify carrying a resolved CodeInstance for +(T,T), which codegen turns into a direct specsig call (julia_+_NNN) — specialized, but a call.

On the LLVM side, Julia's pipeline deliberately contains no general inliner — only AlwaysInlinerPass (src/pipeline.cpp:632); inlining decisions belong to the Julia inliner. So an inlinehint function emitted into the module is never inlined by LLVM either. That's why the call survives everywhere it appears, and why the hand-written cas_f64! is fine (its + is inlined by the Julia inliner before codegen, leaving a plain fadd in Julia IR).

This was understood from the start: PRs #41859/#42017 (vtjnash, 2021) created the codegen CAS-loop lowering and explicitly deferred "the invoke part".

What 1.13 added (PR #57010, vtjnash, merged May 2025)

For atomic modify, codegen now emits a fake call

%r = call {i64,i64} @julia.atomicmodify.i64.p0(ptr %ptr, ptr @helper, i8 ord, i8 ssid, ...)

where @helper is a tiny private alwaysinline function built by emit_modifyhelper (src/codegen.cpp:7130) whose body invokes the resolved op. Because cross-module IR isn't visible to LLVM, the same PR added emit_always_inline (src/codegen.cpp:10318): the JIT (src/jitlayers.cpp:514) and AOT both emit the op's definition (private linkage + inlinehint) into the same module.

A new late pass, ExpandAtomicModifyPass (src/llvm-expand-atomic-modify.cpp, scheduled after LateLowerGCPass, src/pipeline.cpp:582), then calls InlineFunction itself and pattern-matches the op body (patternMatchAtomicRMWOp) to a real atomicrmw add/sub/and/or/xor/nand/min/max/fadd/fsub/fmin/fmax/xchg; if it can't match, it expands to a cmpxchg loop (block names atomicrmw.start/atomicrmw.end — these names appear only on this fallback path; an actual atomicrmw has no such blocks). This works for Int: native code is a single lock add.

Why Float64 still fails (three stacked reasons)

  1. Gated out of the new path. The julia.atomicmodify emission in typed_store requires !isboxed && realelty == elty && !intcast && elty->isIntegerTy() && ... (src/cgutils.cpp:2697). Any FP element type takes the intcast route (src/cgutils.cpp:2625-2631: elty becomes iN, an intcast alloca is created), so !intcast fails and Float64 falls into the legacy CAS loop (xchg/done_xchg blocks — note: not atomicrmw.start; on 1.13+ only the Int path gets those names, in the reflection fallback).

  2. The legacy loop's call can't be inlined by anyone. The loop body calls the op via emit_invoke(..., always_inline=true) (src/cgutils.cpp:2535), and emit_always_inline does put a private define fastcc double @julia_+_NNN (a single fadd!) into the module — but tagged only InlineHint (src/codegen.cpp:10380). Since the pipeline has no general inliner and AlwaysInlinerPass only handles alwaysinline, the one-instruction function is never inlined. Native code keeps movabs r15, offset "+"; ... call r15 inside the loop.

  3. Even via the new path, the pass can't match FP yet. The helper for an FP type would be iN → bitcast → fadd → bitcast → iN, and patternMatchAtomicRMWOp cannot look through bitcasts — explicit TODOs at src/llvm-expand-atomic-modify.cpp:213 and :321 ("be able to ignore some simple bitcasts (particularly f64 to i64)"). The pass's FAdd/FSub/FMin/FMax arms are currently unreachable from Julia-emitted code. (Its cmpxchg fallback helper already handles FP via bitcast, and LLVM supports atomicrmw fadd since LLVM 9, so the backend is not the problem.)

Why code_llvm lies about Int on 1.13+ (refines open issue #59645)

jl_get_llvmf_defn (reflection) emits with the non-swiftcc ABI: functions get their pgcstack via a call %pgcstack = call ptr @julia.get_pgcstack(), and that intrinsic is declared with no attributes at that point in the pipeline. When ExpandAtomicModifyPass pattern-matches the op, its canReorderWithRMW check (src/llvm-expand-atomic-modify.cpp:177) conservatively scans the op's whole body for memory writes; the unattributed julia.get_pgcstack() call counts as may-write, so the match is rejected and reflection shows the fallback loop + call fastcc @julia_+_0. Verified with JULIA_LLVM_ARGS="--print-before=ExpandAtomicModify --print-module-scope":

  • real JIT module: op is swiftcc with gcstack as an argument, body is a bare add → pass fires → atomicrmw add;
  • reflection module: same op body plus one dead julia.get_pgcstack() call → pass bails.

So on 1.13+ the only trustworthy view is code_native(...; dump_module=false) (disassembles the actually-JITted code) or code_llvm(...; raw=true) before optimization.

Prior art (issues / PRs)

  • #41843 "Atomic read-modify-write gets much slower with the new API" (2021) — the canonical perf issue; closed July 2025 as fixed by #57010.
  • #41563, #41859, #42017 (vtjnash, 2021) — atomic intrinsic codegen; introduced the CAS-loop lowering, op call left unoptimized knowingly ("within 2x" of Threads.Atomic).
  • #46971 (aviatesk, 2022) — ModifyOpInfo/:invoke_modify plumbing; why the op is a specialized call rather than jl_apply_generic.
  • #45122 (tkf, 2022, closed stale 2026) — alternative design: recognize known ops in Julia IR/codegen and emit atomicrmw directly; vtjnash objected to dispatch-in-codegen; superseded by #57010.
  • #44932/#45147 (tkf) — same story for @atomicswap on boxed fields; fixed with atomicrmw xchg on pointers + carried LLVM patch.
  • #57010 (vtjnash, merged May 2025 → 1.13) — julia.atomicmodify + ExpandAtomicModifyPass + emit_always_inline. Fixed integer ops in real codegen. In-source TODOs: upstream into LLVM AtomicExpand, bitcast peeking (the FP blocker), inline-cost heuristics, longer sequences.
  • #50980 "Atomic operations don't generate optimal LLVM code" (2023) — still open, zero comments; now accurate only for FP + reflection.
  • #59645 "Codegen introspection misleading for atomic_max!" (2025) — still open; the reflection artifact. The pgcstack/canReorderWithRMW mechanism above is not yet recorded there.
  • #61583 — sysimage build slowdown caused by #57010's linking rework (collateral, fixed).

Fix sketches

Ordered by leverage/effort for the FP case:

  • (A) Minimal, fixes the register pressure for every unmatched op: in emit_always_inline, tag the emitted op Attribute::AlwaysInline instead of InlineHint (src/codegen.cpp:10380). AlwaysInlinerPass runs early in the pipeline (src/pipeline.cpp:632, on by default), so the legacy CAS loop would get the fadd inlined — matching the hand-written cas_f64!. Risk: these targets are only created for modify ops today, but a large op body would be force-inlined; could gate on inlining cost (jl_ir_inlining_cost is already consulted in jl_get_method_ir, src/codegen.cpp:10303).

    Tried it (branch kc/atomicmodify-alwaysinline in ~/julia, one-line diff on top of 6e40a4a448) — it works. Native code on the patched build (code_native dump_module=false):

    case before after
    unsafe_modify!(Ptr{Int}, +) lock add lock add (unchanged)
    unsafe_modify!(Ptr{Float64}, +) cmpxchg + call "+" cmpxchg + inlined vaddsd, no call
    @atomic AtomicMemory{Float64}[1] += v cmpxchg + call cmpxchg + inlined vaddsd (only remaining call: cold ijl_bounds_error_int)
    unsafe_modify!(Ptr{Float64}, addtwice) (user op x + 2v) cmpxchg + call cmpxchg + 2 inlined vaddsd, no call

    The patched f64 loop is byte-for-byte the shape of the hand-written cas_f64!: vmovq/vaddsd/vmovq/lock cmpxchg/jne with no spills. The MWE dead-branch benchmark gap disappears (11.07 vs 10.76 ms, within variance), threaded @atomic += correctness checks pass, and test/atomics.jl passes. Two caveats for a real PR:

    • addFnAttr(AlwaysInline) is applied on both paths in emit_always_inline, including the case where the op was already emitted as a regular (externally visible) function in the same compile unit — in AOT that marks a shared definition alwaysinline module-wide. Should be restricted to the freshly-emitted private copy, or gated on jl_ir_inlining_cost.
    • Reflection (@code_llvm) for Int now shows the fallback loop calling the private helper @0 instead of julia_+ — cosmetically different, same #59645 artifact (the unattributed julia.get_pgcstack() call in the non-swiftcc reflection ABI still defeats patternMatchAtomicRMWOp). Real JIT output is unaffected. Fix (D) would clean this up.
    • This gets the op inlined into the loop; it does not produce atomicrmw fadd. That still needs (B)/(C).
  • (B) Real fix, gets atomicrmw fadd: teach patternMatchAtomicRMWOp to peek through size-preserving bitcasts (the in-source TODOs at llvm-expand-atomic-modify.cpp:213/:321), and relax the typed_store gate (src/cgutils.cpp:2697) to admit intcast && intcast_eltyp->isFloatingPointTy()emit_modifyhelper needs to bitcast the iN argument to the FP type before invoking the op. The pass's cmpxchg fallback already does the FP bitcasting dance, so only the matcher and the gate are missing.

  • (C) Alternative to B: emit julia.atomicmodify.f64 with the FP type directly and let the pass operate on FP values (its createWeakCmpXchgInstFun already bitcasts for the cmpxchg); avoids matching through bitcasts but touches more of the intcast machinery in typed_store.

    Implemented (C) — it works, and no bitcast-peeking was needed. It turns out the pass's own test suite (test/llvmpasses/atomic-modify.ll) already declares julia.atomicmodify.f64 with FP ops and expects atomicrmw fadd — the FP-typed form was the anticipated design; codegen just never emitted it. Changes on kc/atomicmodify-alwaysinline (on top of fix A):

    • src/cgutils.cpp (typed_store fast path): gate becomes (intcast ? intcast_eltyp->isFloatingPointTy() : elty->isIntegerTy()); the helper, the pseudo-intrinsic name (julia.atomicmodify.f64.p0 etc.), and the {T,T} return struct use the FP type directly. emit_modifyhelper needed no changes.
    • src/llvm-expand-atomic-modify.cpp: see "latent bugs" below.

    code_llvm on the patched build (reflection now truthful thanks to the pgcstack fix): unsafe_modify!(Ptr{Float64}, +), Float32, -, @atomic AtomicMemory{Float64}[1] += v, and @atomic a.x += v on an atomic field all produce a single atomicrmw fadd/fsub (plus the cold boundserror branch for AtomicMemory). An unmatchable op (weird(x,v) = x > 0 ? x+v : x-v) becomes a cmpxchg loop with the body inlined, zero calls. x86_64 native code for fadd is still a cmpxchg loop (no FP-RMW instruction exists; LLVM's backend legalizes it), so the runtime win vs fix A is on targets with native FP atomics (GPUs, ARM LSFE); the IR win is canonicality and optimizability.

    Validation: test/atomics.jl 12145 pass, test/intrinsics.jl 536 pass, test/llvmpasses/atomic-modify.ll passes (two CHECK lines updated: the reversed-sub tests now expect the inlined sub instead of an opaque call — the improvement, not a regression), threaded += and old/new pair correctness checks pass.

    Latent pass bugs found and fixed along the way (all in src/llvm-expand-atomic-modify.cpp; the first one initially broke 5 test/intrinsics.jl cases). Full writeups with repros and reachability analysis in PASS_BUGS.md:

    1. Throwing ops could commit a garbage store. For a type-unstable op (e.g. unsafe_modify!(p::Ptr{Int8}, (x,v)->v, UInt32(1)), which must throw TypeError), the helper ends in the typecheck-throw with no ret. Once fix A pre-inlines the op, the loaded-value argument becomes unused and patternMatchAtomicRMWOp hit its use_empty → Xchg shortcut before checking a return exists, emitting atomicrmw xchg ptr, poison ahead of the throw. Fixed by scanning for the unique ret first.
    2. Use-before-def for computed Xchg valuessegfaults stock 1.13.0-rc1 and nightly (LLVM RegisterCoalescer crash; 1.12 is fine): atomic_pointermodify(p, (o,v)->2v, 3, :monotonic). The placeholder RMW stays before the inlined code that defines its value operand. Fixed by moving the RMW to the original modify site — which also guarantees a throwing op never reaches the store. 5-line repro in PASS_BUGS.md; should be filed upstream promptly (release-relevant for 1.13).
    3. std::get<BinOp> on a false variant (bad_variant_access) when inlining+InstSimplify folds the op into an unconvertible shape. Fixed by recreating the op call and taking the cmpxchg-loop fallback.
    4. FP identity ops (fadd x, -0.0 folds to x) hit the fence conversion atomicrmw or %p, 0, which is integer-only — now emitted on the equivalent integer type with the old value bitcast back.
    5. Unmatchable ops left an opaque call in the fallback loop (the pass only ran InlineFunction on the RMW-attempt path, and AlwaysInlinerPass runs earlier in the pipeline than this pass). The fallback now inlines the op body into the loop when the callee is a definition.
  • (D) Reflection fidelity (#59645): either give the reflection-ABI julia.get_pgcstack declaration proper attributes (memory(none)/nosync-equivalent) so canReorderWithRMW ignores it, or have canReorderWithRMW special-case known Julia intrinsics. Small, self-contained.

Nothing here is backportable to 1.12 (the whole #57010 machinery is absent); Ferrite's manual CAS loop remains correct for 1.10–1.12, and on stock 1.13+ it is still the better codegen for Float64 until (A)/(C) land upstream.

The combined prototype (A + C + the pass fixes) lives on the kc/atomicmodify-alwaysinline branch in ~/julia (uncommitted working tree, ~84 lines over 4 files: src/cgutils.cpp, src/codegen.cpp, src/llvm-expand-atomic-modify.cpp, test/llvmpasses/atomic-modify.ll). Before PRing: restrict the AlwaysInline attr to the freshly emitted private copy (or gate on jl_ir_inlining_cost), add FP + noreturn + identity cases to test/llvmpasses/atomic-modify.ll, and file the pass bugs from PASS_BUGS.md — bug 2 there segfaults stock 1.13.0-rc1/nightly and deserves its own immediate issue + one-line fix.

Draft GitHub issue (not filed)

Title: @atomic x[i] += v / unsafe_modify! on Float64 still compiles to a CAS loop with a non-inlined call to + (no atomicrmw fadd)

Since #57010 (1.13), integer atomic modify correctly compiles to a single atomicrmw instruction (lock add on x86_64) in actual JIT output. Floating-point modify does not — on 1.13.0-rc1 and 1.14.0-DEV.2212 it still produces the pre-#57010 lowering: a cmpxchg loop that calls the op through a function pointer:

modify_f64!(p::Ptr{Float64}, v::Float64) =
    (Base.unsafe_modify!(p, +, v, :monotonic); nothing)
# same for: @atomic :monotonic m[1] += v with m::AtomicMemory{Float64}
; code_native(modify_f64!, (Ptr{Float64}, Float64); dump_module=false), 1.14.0-DEV.2212
        movabs  r15, offset "+"
L48:    vmovq   xmm0, r14
        vmovsd  xmm1, qword ptr [rbp - 32]
        call    r15                      ; <- + not inlined, in the loop
        vmovq   rcx, xmm0
        mov     rax, r14
        lock cmpxchg qword ptr [rbx], rcx
        mov     r14, rax
        jne     L48

Expected: atomicrmw fadd (supported since LLVM 9), or at minimum the fadd inlined into the loop (a hand-written unsafe_replace! CAS loop compiles to exactly that, since the Julia inliner sees the + call there).

Besides throughput of the atomic itself, the opaque call has a second-order cost: when the modify sits in a rarely/never-taken branch of a hot loop, values live across the potential call site must be kept in callee-saved registers or spilled. In Ferrite.jl's sparse-assembly inner loop this costs a stable ~20%; the workaround is a manual load/unsafe_replace! loop.

Why it happens (refs to current master):

  1. FP element types are gated out of the julia.atomicmodify path: typed_store requires !intcast && elty->isIntegerTy() (src/cgutils.cpp:2697), and FP always sets intcast (src/cgutils.cpp:2625). So Float64 takes the legacy CAS-loop emission.
  2. In the legacy path the op is emitted into the module by emit_always_inline (a private single-fadd function!) but only with inlinehint (src/codegen.cpp:10380), and Julia's LLVM pipeline has no general inliner — only AlwaysInlinerPass (src/pipeline.cpp:632) — so nothing ever inlines it.
  3. Even if routed through ExpandAtomicModifyPass, the matcher cannot yet look through the i64f64 bitcasts; the FAdd/FSub/FMin/FMax arms are currently dead code (TODOs at src/llvm-expand-atomic-modify.cpp:213,321).

Possible fixes: (a) mark emit_always_inline-emitted ops alwaysinline (cheap; fixes the register-pressure/call problem for all unmatched ops); (b) implement the bitcast-peeking TODO and relax the typed_store gate so FP reaches the pass and folds to atomicrmw fadd.

Related: #57010 (the integer fix), #50980 (open; now FP-specific), #41843 (closed by #57010), #59645 (separate issue: @code_llvm shows the fallback loop even for Int because the reflection ABI's unattributed julia.get_pgcstack() call defeats the pass's canReorderWithRMW scan — happy to file details there).

Corrections to mwe.jl's comments

  • "identical on 1.12.6, 1.13.0-rc1, 1.14.0-DEV" — only true of code_llvm output. Real native code differs: Int is fixed on 1.13+.
  • "names the expansion blocks atomicrmw.start without emitting any atomicrmw" — applies to the Int case in reflection only; the f64 path keeps the old xchg/done_xchg blocks on all versions.

Latent bugs in ExpandAtomicModifyPass (src/llvm-expand-atomic-modify.cpp)

Found while extending the pass to FP types (see NOTES.md). The pass was added in PR #57010 (Julia 1.13). Source references are to master 6e40a4a448; "the branch" = kc/atomicmodify-alwaysinline in ~/julia, where all of these are fixed.

Severity summary:

# What Reachable on stock? Severity
1 Throwing op commits atomicrmw xchg ptr, poison before the throw latent (needs op pre-inlined into helper) memory corruption
2 Use-before-def for computed Xchg values yes — segfaults 1.13.0-rc1 and nightly compiler crash
3 std::get on false variant (bad_variant_access) no known repro defensive
4 FP identity op emits invalid atomicrmw or double only with FP gate relaxed invalid IR
5 Unmatchable op leaves opaque call in fallback loop yes missed optimization

Bug 2 first — it crashes stock Julia, file this one

Repro (5 lines, segfaults 1.13.0-rc1 and 1.14.0-DEV.2212; fine on 1.12):

double2(o, v) = 2v
h(p, v) = Core.Intrinsics.atomic_pointermodify(p, double2, v, :monotonic)
r = Ref(5)
GC.@preserve r begin
    h(Base.unsafe_convert(Ptr{Int}, r), 3)   # SIGSEGV in LLVM RegisterCoalescer
end

Crash is inside LLVM during JIT compilation (LiveRange::QueryRegisterCoalescer::reMaterializeTrivialDefcompileModule, jitlayers.cpp:1414) — i.e. broken IR reached the backend. Release builds run no IR verifier, so the use-before-def goes straight to ISel.

Mechanism. When the op ignores the loaded value, patternMatchAtomicRMWOp returns Xchg via the Old->use_empty() case. In the expansion (expandAtomicModifyToCmpXchg, the RMWOp == AtomicRMWInst::Xchg / ValOp == nullptr arm at src/llvm-expand-atomic-modify.cpp:415), the value to store is NewVal — the op's return value, produced by the inlined op body. But the placeholder atomicrmw was created before the call site (line 371-372), so after inlining, its value operand is defined after it:

%rmw = atomicrmw xchg ptr %p, i64 %mul   ; uses %mul before it exists
%mul = shl i64 %v, 1                     ; inlined op body

Any op of the shape (old, v) -> f(v) with a non-constant result triggers it (Returns(constant) is fine — constants dominate everything — which is why the existing tests pass).

Fix on the branch: move the placeholder RMW to the original julia.atomicmodify call site (RMW->moveBeforePreserving(Modify.getIterator())) before wiring Val = NewVal. That position is after all inlined op code, so operands dominate — and as a bonus the store provably happens after the op evaluation, which also closes the throwing-op corner of bug 1.


Bug 1: throwing ops can commit a poison store (memory corruption)

Symptom. A modify whose op must throw (the result fails the eltype check) writes garbage to the target before throwing. Observed as 5 failures in test/intrinsics.jl (line 443: memory contained junk after a @test_throws TypeError atomic_pointermodify(p::Ptr{Int8}, (x,v)->v, UInt32(1), ...)).

Mechanism. emit_modifyhelper (src/codegen.cpp:7130) appends emit_typecheck to the helper; when inference proves the op's return type never matches the element type, the typecheck is an unconditional throw and the helper has no ret (body ends in unreachable). If, additionally, the helper's loaded-value argument is unused, patternMatchAtomicRMWOp takes the Old->use_empty() → Xchg shortcut (line 222-226) — which runs before the unique-ret scan — and the expansion then inlines the noreturn helper, recovers NewVal == poison from the freeze, and commits atomicrmw xchg ptr %p, poison in the still-reachable code ahead of the throw.

Reachability. On stock master the op is a separate inlinehint function called from the helper, so the loaded argument is "used" by that call and the shortcut doesn't fire; the no-ret scan then rejects the helper. It becomes reachable as soon as the op's body is visible in the helper before the pass runs — e.g. the branch's AlwaysInline change, or any future improvement that pre-inlines the op (also plausibly via a const-return op whose CodeInstance is already compiled, though that variant did not reproduce on stock nightly). The hole is in the matcher either way.

Fix on the branch: in patternMatchAtomicRMWOp (verifyop mode), scan for the unique ReturnInst first and bail if there is none; only then allow the use_empty → Xchg shortcut. Together with the bug-2 fix (store moved after the op evaluation), a throwing op can no longer reach the store.


Bug 3: std::get<AtomicRMWInst::BinOp> on a false variant

Mechanism. In the expansion loop, after InlineFunction the shape is re-matched: BinOp = patternMatchAtomicRMWOp(RMW, &ValOp, NewVal) (line 391). The else branch (line 406-408) asserts the result isn't the true token and immediately does std::get<AtomicRMWInst::BinOp>(BinOp) — but the matcher can also return false (shape unconvertible), giving bad_variant_access (or UB without exceptions). Requires the pre-inlining check to match but InstSimplify folding during inlining to produce a non-convertible, non-identity shape. For integer ops every such fold lands on the identity (add x, 0 → x) which returns the Or fence token, so no stock repro is known; it's a landmine for any extension of the matcher (FP made it nearly reachable via fadd x, -0.0 → x until bug 4's fix routed that to Or too).

Fix on the branch: on false, recreate the op call (construction hoisted into a CreateModifyOp lambda), disconnect the dead leftover inlined code from the placeholder RMW (replaceAllUsesWith(undef) — it is write-free per canReorderWithRMW, so this is safe), and break to the cmpxchg-loop fallback.


Bug 4: FP identity ops emit invalid atomicrmw or double

Mechanism. An op that folds to the identity (fadd x, -0.0 → x under InstSimplify; or literally (o, v) -> o) returns the fence token (RetVal == Old → AtomicRMWInst::Or, line 244-248), and the expansion emits atomicrmw or %ptr, 0 with ConstantInt::getNullValue(Ty) (line 418-420). For an FP Ty this is invalid IR (or is integer-only) and the null constant is a ConstantFP. Unreachable on stock only because typed_store never emits FP-typed julia.atomicmodify (the integer-only gate, src/cgutils.cpp:2697); the pass itself advertises FP support and its test file declares julia.atomicmodify.f64, so any user of that interface hits this.

Repro on the branch (pre-fix): Base.unsafe_modify!(p::Ptr{Float64}, +, -0.0, :monotonic).

Fix on the branch: when Ty is not an integer, emit the fence RMW on the equivalent integer type (atomicrmw or ptr %p, iN 0) and bitcast the loaded value back to Ty for the old/new results.


Bug 5: unmatchable ops keep an opaque call in the fallback loop

Not a correctness bug — the original register-pressure problem from NOTES.md. The pass only runs InlineFunction on the RMW-conversion attempt; when the initial pattern match says "not an RMW shape" (e.g. sub i8 %b, %loaded, operands reversed — see mod_i8_subx in test/llvmpasses/atomic-modify.ll — or any op with control flow), the fallback insertRMWCmpXchgLoop keeps the op as a call. Since AlwaysInlinerPass runs much earlier in the pipeline (src/pipeline.cpp:632) and Julia's pipeline has no general inliner, nothing ever inlines it: the loop carries an opaque call site, with the callee-saved-register/spill cost that entails.

Fix on the branch: after building the fallback loop, InlineFunction the op call when the callee is a definition (tracking the return value through a temporary FreezeInst, same trick as the conversion path). The two mod_i8_subx* CHECK lines in test/llvmpasses/atomic-modify.ll were updated for the now-inlined output.


Suggested upstreaming order

  1. Bug 2 alone is a minimal, stock-crashing issue — file with the 5-line repro (crashes 1.13.0-rc1: release-relevant) and the one-line moveBeforePreserving fix.
  2. Bugs 1 and 3 are matcher/expansion hardening that should accompany any change that makes op bodies visible to the pass (the AlwaysInline change or FP support).
  3. Bug 4's fix belongs in the FP-enablement PR (NOTES.md fix C).
  4. Bug 5 is an independent quality improvement; needs a decision on inlining cost (the pass has a TODO for exactly this).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment