Investigation notes for mwe.jl. Versions tested: 1.12.6, 1.13.0-rc1,
1.14.0-DEV.2212 (062a90bc8c). Source references are to julia master
(6e40a4a448, 2026-06-08).
The MWE's code_llvm-based counting is partly misled by a reflection
artifact. The real situation, measured on native code of the actual JIT
output (code_native(...; dump_module=false)):

| case | 1.12 native | 1.13+/nightly native |
|---|---|---|
| `unsafe_modify!(Ptr{Int}, +)` | cmpxchg + call | `lock add` (fixed) |
| `@atomic AtomicMemory{Int}[i] += v` | cmpxchg + call | `lock add` (fixed) |
| `unsafe_modify!(Ptr{Float64}, +)` | cmpxchg + call | cmpxchg + call (still broken) |
| `@atomic AtomicMemory{Float64}[i] += v` | cmpxchg + call | cmpxchg + call (still broken) |
| `code_llvm` (any of the above) | loop + call | loop + call (misleading for Int!) |

So:

- Int was fixed in 1.13 by PR #57010, but `@code_llvm`/`@code_native dump_module=true` don't show it (open issue #59645; exact mechanism pinned down below, not yet on the issue).
- Float64 (the Ferrite case) is still broken on all versions including nightly. No `atomicrmw fadd`, and the `+` is not even inlined into the CAS loop, so the register-pressure cost from the opaque call persists.
- On 1.10–1.12 nothing can be fixed retroactively; the manual load/`unsafe_replace!` CAS loop stays the right workaround there (it works because the `old + v` call exists in Julia IR, where the Julia inliner can reach it).
- Update: both gaps are fixed by the prototype on the `kc/atomicmodify-alwaysinline` branch in `~/julia`: Float64 `+=` now compiles to a single `atomicrmw fadd`, and unmatchable ops inline into the CAS loop. See "Fix sketches" below for results and caveats.
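
The manual load/`unsafe_replace!` workaround mentioned above can be sketched as follows. This is a minimal illustration, not Ferrite's actual code: the name `cas_f64!` is borrowed from these notes, the `:monotonic` ordering is an assumption, and it presumes a Julia version that provides `Base.unsafe_replace!` and the ordered `unsafe_load` method (1.10 or later, as far as I know).

```julia
# Hand-written CAS loop: the `old + v` lives in Julia IR, so the Julia
# inliner turns it into a plain floating-point add before codegen.
function cas_f64!(p::Ptr{Float64}, v::Float64)
    old = unsafe_load(p, :monotonic)           # atomic load of the current value
    while true
        # Try to swap `old` for `old + v`; the comparison is bitwise.
        r = Base.unsafe_replace!(p, old, old + v, :monotonic, :monotonic)
        r.success && return old                # return the value we replaced
        old = r.old                            # lost the race: retry with the fresh value
    end
end

buf = [1.5]
prev = GC.@preserve buf cas_f64!(pointer(buf), 2.0)
println("previous = ", prev, ", now = ", buf[1])
```

The loop returns the old value only for convenience; callers that ignore it get the same codegen shape.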
The op of modifyfield!/memoryrefmodify!/atomic_pointermodify is a
function value passed to a builtin; the call op(old, v) happens inside
the intrinsic's semantics. The CAS retry loop is synthesized in C++ codegen
(typed_store in src/cgutils.cpp) and never exists as Julia IR, so the
Julia-level inliner cannot inline + into it. The optimizer only
devirtualizes: handle_modifyop!_call!
(Compiler/src/ssair/inlining.jl:1480) rewrites the call to
:invoke_modify carrying a resolved CodeInstance for +(T,T), which codegen
turns into a direct specsig call (julia_+_NNN) — specialized, but a call.
On the LLVM side, Julia's pipeline deliberately contains no general
inliner — only AlwaysInlinerPass (src/pipeline.cpp:632); inlining
decisions belong to the Julia inliner. So an inlinehint function emitted
into the module is never inlined by LLVM either. That's why the call
survives everywhere it appears, and why the hand-written cas_f64! is fine
(its + is inlined by the Julia inliner before codegen, leaving a plain
fadd in Julia IR).
This was understood from the start: PRs #41859/#42017 (vtjnash, 2021) created the codegen CAS-loop lowering and explicitly deferred "the invoke part".
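
To make the "op is a function value passed to a builtin" point concrete, here is a small sketch of the user-facing call. `addtwice` mirrors the user op `x + 2v` used later in these notes; the buffer and the `:monotonic` ordering are illustrative assumptions. Nothing in this snippet is a Julia-level loop: the retry logic and the `op(old, v)` call are both produced behind the builtin.

```julia
# Any two-argument callable can be the op; the builtin calls op(old, v) itself.
addtwice(x, v) = x + 2v

buf = [10]
result = GC.@preserve buf begin
    p = pointer(buf)
    # Lowers to atomic_pointermodify; returns the pair old => new.
    Base.unsafe_modify!(p, addtwice, 3, :monotonic)
end

println(result)   # the old and new values as a Pair
println(buf[1])   # the stored value after the modify
```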
What 1.13 added (PR #57010, vtjnash, merged May 2025)
For atomic modify, codegen now emits a fake call
`%r = call {i64,i64} @julia.atomicmodify.i64.p0(ptr %ptr, ptr @helper, i8 ord, i8 ssid, ...)`, where `@helper` is a tiny private `alwaysinline` function built by
emit_modifyhelper (src/codegen.cpp:7130) whose body invokes the resolved
op. Because cross-module IR isn't visible to LLVM, the same PR added
emit_always_inline (src/codegen.cpp:10318): the JIT
(src/jitlayers.cpp:514) and AOT both emit the op's definition (private
linkage + inlinehint) into the same module.
A new late pass, ExpandAtomicModifyPass
(src/llvm-expand-atomic-modify.cpp, scheduled after LateLowerGCPass,
src/pipeline.cpp:582), then calls InlineFunction itself and
pattern-matches the op body (patternMatchAtomicRMWOp) to a real
atomicrmw add/sub/and/or/xor/nand/min/max/fadd/fsub/fmin/fmax/xchg; if it
can't match, it expands to a cmpxchg loop (block names
atomicrmw.start/atomicrmw.end — these names appear only on this
fallback path; an actual atomicrmw has no such blocks). This works for
Int: native code is a single lock add.
- Gated out of the new path. The `julia.atomicmodify` emission in `typed_store` requires `!isboxed && realelty == elty && !intcast && elty->isIntegerTy() && ...` (`src/cgutils.cpp:2697`). Any FP element type takes the intcast route (`src/cgutils.cpp:2625-2631`: `elty` becomes `iN`, an `intcast` alloca is created), so `!intcast` fails and Float64 falls into the legacy CAS loop (`xchg`/`done_xchg` blocks; note: not `atomicrmw.start`; on 1.13+ only the Int path gets those names, in the reflection fallback).
- The legacy loop's call can't be inlined by anyone. The loop body calls the op via `emit_invoke(..., always_inline=true)` (`src/cgutils.cpp:2535`), and `emit_always_inline` does put a private `define fastcc double @julia_+_NNN` (a single `fadd`!) into the module, but tagged only `InlineHint` (`src/codegen.cpp:10380`). Since the pipeline has no general inliner and `AlwaysInlinerPass` only handles `alwaysinline`, the one-instruction function is never inlined. Native code keeps `movabs r15, offset "+"; ... call r15` inside the loop.
- Even via the new path, the pass can't match FP yet. The helper for an FP type would be `iN → bitcast → fadd → bitcast → iN`, and `patternMatchAtomicRMWOp` cannot look through bitcasts: explicit TODOs at `src/llvm-expand-atomic-modify.cpp:213` and `:321` ("be able to ignore some simple bitcasts (particularly f64 to i64)"). The pass's FAdd/FSub/FMin/FMax arms are currently unreachable from Julia-emitted code. (Its cmpxchg fallback helper already handles FP via bitcast, and LLVM supports `atomicrmw fadd` since LLVM 9, so the backend is not the problem.)
Why code_llvm lies about Int on 1.13+ (refines open issue #59645)
jl_get_llvmf_defn (reflection) emits with the non-swiftcc ABI: functions
get their pgcstack via a call %pgcstack = call ptr @julia.get_pgcstack(),
and that intrinsic is declared with no attributes at that point in the
pipeline. When ExpandAtomicModifyPass pattern-matches the op, its
canReorderWithRMW check (src/llvm-expand-atomic-modify.cpp:177)
conservatively scans the op's whole body for memory writes; the unattributed
julia.get_pgcstack() call counts as may-write, so the match is rejected
and reflection shows the fallback loop + call fastcc @julia_+_0. Verified
with JULIA_LLVM_ARGS="--print-before=ExpandAtomicModify --print-module-scope":
- real JIT module: op is `swiftcc` with gcstack as an argument, body is a bare `add` → pass fires → `atomicrmw add`;
- reflection module: same op body plus one dead `julia.get_pgcstack()` call → pass bails.
So on 1.13+ the only trustworthy view is code_native(...; dump_module=false) (disassembles the actually-JITted code) or
code_llvm(...; raw=true) before optimization.
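
A rough way to script the "trustworthy view" described above is to capture the `code_native(...; dump_module=false)` text and grep it. This is only a sketch: the helper names (`native_asm`, `has_call`, `has_lock_add`) are made up for illustration, and the mnemonic checks assume x86_64 with the default Intel syntax, so the printed answers depend on the Julia version and the host CPU.

```julia
using InteractiveUtils  # provides code_native outside the REPL

modify_int!(p::Ptr{Int}, v::Int) = (Base.unsafe_modify!(p, +, v, :monotonic); nothing)
modify_f64!(p::Ptr{Float64}, v::Float64) = (Base.unsafe_modify!(p, +, v, :monotonic); nothing)

# Disassembly of the code the JIT actually produced, as a String.
native_asm(f, types) = sprint() do io
    code_native(io, f, types; dump_module=false, debuginfo=:none)
end

# Crude textual checks (hypothetical helpers, x86_64 mnemonics only).
has_call(asm) = occursin(r"\bcall\b", asm)
has_lock_add(asm) = occursin(r"lock\s+x?add", asm)

for (f, T) in ((modify_int!, Int), (modify_f64!, Float64))
    asm = native_asm(f, (Ptr{T}, T))
    println(f, ": call = ", has_call(asm), ", lock add = ", has_lock_add(asm))
end
```

Text matching like this is fragile (a `call` can also be a cold error path), so it is best used to flag cases for manual inspection rather than as a pass/fail test.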
- #41843 "Atomic read-modify-write gets much slower with the new API" (2021): the canonical perf issue; closed July 2025 as fixed by #57010.
- #41563, #41859, #42017 (vtjnash, 2021): atomic intrinsic codegen; introduced the CAS-loop lowering, op call left unoptimized knowingly ("within 2x" of `Threads.Atomic`).
- #46971 (aviatesk, 2022): `ModifyOpInfo`/`:invoke_modify` plumbing; why the op is a specialized call rather than `jl_apply_generic`.
- #45122 (tkf, 2022, closed stale 2026): alternative design that recognizes known ops in Julia IR/codegen and emits `atomicrmw` directly; vtjnash objected to dispatch-in-codegen; superseded by #57010.
- #44932/#45147 (tkf): same story for `@atomicswap` on boxed fields; fixed with `atomicrmw xchg` on pointers + carried LLVM patch.
- #57010 (vtjnash, merged May 2025 → 1.13): `julia.atomicmodify` + `ExpandAtomicModifyPass` + `emit_always_inline`. Fixed integer ops in real codegen. In-source TODOs: upstream into LLVM `AtomicExpand`, bitcast peeking (the FP blocker), inline-cost heuristics, longer sequences.
- #50980 "Atomic operations don't generate optimal LLVM code" (2023): still open, zero comments; now accurate only for FP + reflection.
- #59645 "Codegen introspection misleading for `atomic_max!`" (2025): still open; the reflection artifact. The pgcstack/`canReorderWithRMW` mechanism above is not yet recorded there.
- #61583: sysimage build slowdown caused by #57010's linking rework (collateral, fixed).
Ordered by leverage/effort for the FP case:
- (A) Minimal, fixes the register pressure for every unmatched op: in `emit_always_inline`, tag the emitted op `Attribute::AlwaysInline` instead of `InlineHint` (`src/codegen.cpp:10380`). `AlwaysInlinerPass` runs early in the pipeline (`src/pipeline.cpp:632`, on by default), so the legacy CAS loop would get the `fadd` inlined, matching the hand-written `cas_f64!`. Risk: these targets are only created for modify ops today, but a large op body would be force-inlined; could gate on inlining cost (`jl_ir_inlining_cost` is already consulted in `jl_get_method_ir`, `src/codegen.cpp:10303`).

  Tried it (branch `kc/atomicmodify-alwaysinline` in `~/julia`, one-line diff on top of 6e40a4a448): it works. Native code on the patched build (`code_native dump_module=false`):

  | case | before | after |
  |---|---|---|
  | `unsafe_modify!(Ptr{Int}, +)` | `lock add` | `lock add` (unchanged) |
  | `unsafe_modify!(Ptr{Float64}, +)` | cmpxchg + `call "+"` | cmpxchg + inlined `vaddsd`, no call |
  | `@atomic AtomicMemory{Float64}[1] += v` | cmpxchg + call | cmpxchg + inlined `vaddsd` (only remaining call: cold `ijl_bounds_error_int`) |
  | `unsafe_modify!(Ptr{Float64}, addtwice)` (user op `x + 2v`) | cmpxchg + call | cmpxchg + 2 inlined `vaddsd`, no call |

  The patched f64 loop is byte-for-byte the shape of the hand-written `cas_f64!`: `vmovq`/`vaddsd`/`vmovq`/`lock cmpxchg`/`jne` with no spills. The MWE dead-branch benchmark gap disappears (11.07 vs 10.76 ms, within variance), threaded `@atomic +=` correctness checks pass, and `test/atomics.jl` passes. Two caveats for a real PR, plus one limitation:

  - `addFnAttr(AlwaysInline)` is applied on both paths in `emit_always_inline`, including the case where the op was already emitted as a regular (externally visible) function in the same compile unit; in AOT that marks a shared definition `alwaysinline` module-wide. Should be restricted to the freshly-emitted private copy, or gated on `jl_ir_inlining_cost`.
  - Reflection (`@code_llvm`) for Int now shows the fallback loop calling the private helper `@0` instead of `julia_+`: cosmetically different, same #59645 artifact (the unattributed `julia.get_pgcstack()` call in the non-swiftcc reflection ABI still defeats `patternMatchAtomicRMWOp`). Real JIT output is unaffected. Fix (D) would clean this up.
  - This gets the op inlined into the loop; it does not produce `atomicrmw fadd`. That still needs (B)/(C).
- (B) Real fix, gets `atomicrmw fadd`: teach `patternMatchAtomicRMWOp` to peek through size-preserving bitcasts (the in-source TODOs at `llvm-expand-atomic-modify.cpp:213`/`:321`), and relax the `typed_store` gate (`src/cgutils.cpp:2697`) to admit `intcast && intcast_eltyp->isFloatingPointTy()`; `emit_modifyhelper` needs to bitcast the `iN` argument to the FP type before invoking the op. The pass's cmpxchg fallback already does the FP bitcasting dance, so only the matcher and the gate are missing.
(C) Alternative to B: emit
julia.atomicmodify.f64with the FP type directly and let the pass operate on FP values (itscreateWeakCmpXchgInstFunalready bitcasts for the cmpxchg); avoids matching through bitcasts but touches more of the intcast machinery intyped_store.Implemented (C) — it works, and no bitcast-peeking was needed. It turns out the pass's own test suite (
test/llvmpasses/atomic-modify.ll) already declaresjulia.atomicmodify.f64with FP ops and expectsatomicrmw fadd— the FP-typed form was the anticipated design; codegen just never emitted it. Changes onkc/atomicmodify-alwaysinline(on top of fix A):src/cgutils.cpp(typed_storefast path): gate becomes(intcast ? intcast_eltyp->isFloatingPointTy() : elty->isIntegerTy()); the helper, the pseudo-intrinsic name (julia.atomicmodify.f64.p0etc.), and the{T,T}return struct use the FP type directly.emit_modifyhelperneeded no changes.src/llvm-expand-atomic-modify.cpp: see "latent bugs" below.
code_llvmon the patched build (reflection now truthful thanks to the pgcstack fix):unsafe_modify!(Ptr{Float64}, +),Float32,-,@atomic AtomicMemory{Float64}[1] += v, and@atomic a.x += von an atomic field all produce a singleatomicrmw fadd/fsub(plus the cold boundserror branch for AtomicMemory). An unmatchable op (weird(x,v) = x > 0 ? x+v : x-v) becomes a cmpxchg loop with the body inlined, zero calls.x86_64native code forfaddis still a cmpxchg loop (no FP-RMW instruction exists; LLVM's backend legalizes it), so the runtime win vs fix A is on targets with native FP atomics (GPUs, ARM LSFE); the IR win is canonicality and optimizability.Validation:
test/atomics.jl12145 pass,test/intrinsics.jl536 pass,test/llvmpasses/atomic-modify.llpasses (two CHECK lines updated: the reversed-subtests now expect the inlinedsubinstead of an opaque call — the improvement, not a regression), threaded+=and old/new pair correctness checks pass.Latent pass bugs found and fixed along the way (all in
src/llvm-expand-atomic-modify.cpp; the first one initially broke 5test/intrinsics.jlcases). Full writeups with repros and reachability analysis in PASS_BUGS.md:- Throwing ops could commit a garbage store. For a type-unstable op
(e.g.
unsafe_modify!(p::Ptr{Int8}, (x,v)->v, UInt32(1)), which must throwTypeError), the helper ends in the typecheck-throw with noret. Once fix A pre-inlines the op, the loaded-value argument becomes unused andpatternMatchAtomicRMWOphit itsuse_empty → Xchgshortcut before checking a return exists, emittingatomicrmw xchg ptr, poisonahead of the throw. Fixed by scanning for the uniqueretfirst. - Use-before-def for computed Xchg values — segfaults stock
1.13.0-rc1 and nightly (LLVM RegisterCoalescer crash; 1.12 is fine):
atomic_pointermodify(p, (o,v)->2v, 3, :monotonic). The placeholder RMW stays before the inlined code that defines its value operand. Fixed by moving the RMW to the original modify site — which also guarantees a throwing op never reaches the store. 5-line repro in PASS_BUGS.md; should be filed upstream promptly (release-relevant for 1.13). std::get<BinOp>on afalsevariant (bad_variant_access) when inlining+InstSimplify folds the op into an unconvertible shape. Fixed by recreating the op call and taking the cmpxchg-loop fallback.- FP identity ops (
fadd x, -0.0folds tox) hit the fence conversionatomicrmw or %p, 0, which is integer-only — now emitted on the equivalent integer type with the old value bitcast back. - Unmatchable ops left an opaque call in the fallback loop (the pass
only ran
InlineFunctionon the RMW-attempt path, andAlwaysInlinerPassruns earlier in the pipeline than this pass). The fallback now inlines the op body into the loop when the callee is a definition.
- (D) Reflection fidelity (#59645): either give the reflection-ABI `julia.get_pgcstack` declaration proper attributes (`memory(none)`/`nosync`-equivalent) so `canReorderWithRMW` ignores it, or have `canReorderWithRMW` special-case known Julia intrinsics. Small, self-contained.
Nothing here is backportable to 1.12 (the whole #57010 machinery is absent); Ferrite's manual CAS loop remains correct for 1.10–1.12, and on stock 1.13+ it is still the better codegen for Float64 until (A)/(C) land upstream.
The combined prototype (A + C + the pass fixes) lives on the
kc/atomicmodify-alwaysinline branch in ~/julia (uncommitted working tree,
~84 lines over 4 files: src/cgutils.cpp, src/codegen.cpp,
src/llvm-expand-atomic-modify.cpp, test/llvmpasses/atomic-modify.ll).
Before PRing: restrict the AlwaysInline attr to the freshly emitted private
copy (or gate on jl_ir_inlining_cost), add FP + noreturn + identity cases
to test/llvmpasses/atomic-modify.ll, and file the pass bugs from
PASS_BUGS.md — bug 2 there segfaults stock 1.13.0-rc1/nightly
and deserves its own immediate issue + one-line fix.
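
The threaded `+=` correctness checks mentioned in the validation notes could look roughly like the sketch below. It is an illustration, not the script that was actually run: the function name `hammer!` and the iteration count are made up, and it uses `unsafe_modify!` on a raw pointer so that it also runs on versions without `AtomicMemory`.

```julia
# Many tasks add 1.0 to the same Float64 slot; with a correct atomic
# read-modify-write no increment is lost, and the sum is exact because
# every intermediate value is a small integer representable in Float64.
function hammer!(buf::Vector{Float64}, n::Int)
    GC.@preserve buf begin
        p = pointer(buf)
        Threads.@threads for _ in 1:n
            Base.unsafe_modify!(p, +, 1.0, :monotonic)
        end
    end
    return buf[1]
end

n = 100_000
total = hammer!([0.0], n)
println("threads = ", Threads.nthreads(), ", total = ", total)
```

The check is only meaningful when Julia is started with more than one thread (for example `julia -t 4`); with a single thread it still passes but exercises no contention.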
Title: `@atomic x[i] += v`/`unsafe_modify!` on Float64 still compiles to a CAS loop with a non-inlined call to `+` (no `atomicrmw fadd`)

Since #57010 (1.13), integer atomic modify correctly compiles to a single `atomicrmw` instruction (`lock add` on x86_64) in actual JIT output. Floating-point modify does not: on 1.13.0-rc1 and 1.14.0-DEV.2212 it still produces the pre-#57010 lowering, a cmpxchg loop that calls the op through a function pointer:

    modify_f64!(p::Ptr{Float64}, v::Float64) = (Base.unsafe_modify!(p, +, v, :monotonic); nothing)
    # same for: @atomic :monotonic m[1] += v with m::AtomicMemory{Float64}

    ; code_native(modify_f64!, (Ptr{Float64}, Float64); dump_module=false), 1.14.0-DEV.2212
            movabs  r15, offset "+"
    L48:
            vmovq   xmm0, r14
            vmovsd  xmm1, qword ptr [rbp - 32]
            call    r15                 ; <- + not inlined, in the loop
            vmovq   rcx, xmm0
            mov     rax, r14
            lock cmpxchg qword ptr [rbx], rcx
            mov     r14, rax
            jne     L48

Expected: `atomicrmw fadd` (supported since LLVM 9), or at minimum the `fadd` inlined into the loop (a hand-written `unsafe_replace!` CAS loop compiles to exactly that, since the Julia inliner sees the `+` call there).

Besides throughput of the atomic itself, the opaque call has a second-order cost: when the modify sits in a rarely/never-taken branch of a hot loop, values live across the potential call site must be kept in callee-saved registers or spilled. In Ferrite.jl's sparse-assembly inner loop this costs a stable ~20%; the workaround is a manual load/`unsafe_replace!` loop.

Why it happens (refs to current master):

1. FP element types are gated out of the `julia.atomicmodify` path: `typed_store` requires `!intcast && elty->isIntegerTy()` (`src/cgutils.cpp:2697`), and FP always sets `intcast` (`src/cgutils.cpp:2625`). So Float64 takes the legacy CAS-loop emission.
2. In the legacy path the op is emitted into the module by `emit_always_inline` (a private single-`fadd` function!) but only with `inlinehint` (`src/codegen.cpp:10380`), and Julia's LLVM pipeline has no general inliner, only `AlwaysInlinerPass` (`src/pipeline.cpp:632`), so nothing ever inlines it.
3. Even if routed through `ExpandAtomicModifyPass`, the matcher cannot yet look through the `i64`↔`f64` bitcasts; the FAdd/FSub/FMin/FMax arms are currently dead code (TODOs at `src/llvm-expand-atomic-modify.cpp:213,321`).

Possible fixes: (a) mark `emit_always_inline`-emitted ops `alwaysinline` (cheap; fixes the register-pressure/call problem for all unmatched ops); (b) implement the bitcast-peeking TODO and relax the `typed_store` gate so FP reaches the pass and folds to `atomicrmw fadd`.

Related: #57010 (the integer fix), #50980 (open; now FP-specific), #41843 (closed by #57010), #59645 (separate issue: `@code_llvm` shows the fallback loop even for Int because the reflection ABI's unattributed `julia.get_pgcstack()` call defeats the pass's `canReorderWithRMW` scan; happy to file details there).
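
The second-order cost described in the draft (an atomic modify sitting in a rarely or never taken branch of a hot loop) has roughly the shape sketched below. This is a hypothetical stand-in, not the actual `mwe.jl` or Ferrite's assembly loop: the function name `hot_loop!`, the arithmetic in the loop body, and the `flag` argument are all invented for illustration.

```julia
# The branch is never taken when flag == false, but the compiler must still
# keep values that are live across the potential call site in callee-saved
# registers (or spill them) if the op inside the modify is an opaque call.
function hot_loop!(acc::Vector{Float64}, xs::Vector{Float64}, flag::Bool)
    s = 0.0
    GC.@preserve acc begin
        p = pointer(acc)
        for x in xs
            s += x * x                       # the hot work
            if flag                          # cold branch with the atomic modify
                Base.unsafe_modify!(p, +, x, :monotonic)
            end
        end
    end
    return s
end

acc = [0.0]
s = hot_loop!(acc, [1.0, 2.0, 3.0], false)
println("s = ", s, ", acc = ", acc[1])
```

Timing this against a variant that uses a hand-written CAS loop in the cold branch is one way to look for the gap, though the size of the effect will depend on how much state the real loop keeps live.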
- "identical on 1.12.6, 1.13.0-rc1, 1.14.0-DEV": only true of `code_llvm` output. Real native code differs: Int is fixed on 1.13+.
- "names the expansion blocks atomicrmw.start without emitting any atomicrmw": applies to the Int case in reflection only; the f64 path keeps the old `xchg`/`done_xchg` blocks on all versions.