Got it — you’ve got two environments:
- Ruby 4.x on Alpine (musl)
- Ruby 3.4.8 on Debian (glibc) with Puma
jeprof works best when:
- you are actually using jemalloc as the allocator, and
- you have symbols available (debug packages help a lot).
Below I’ll give you separate “known-good” recipes for Debian vs Alpine, plus exactly how to run jeprof and interpret/diff results.
In Dockerfile:
RUN apt-get update && apt-get install -y --no-install-recommends \
libjemalloc2 libjemalloc-dev \
graphviz perl \
binutils \
    && rm -rf /var/lib/apt/lists/*
jeprof comes with libjemalloc-dev on Debian. (manpages.debian.org)
Add env vars (Deployment / Helm values):
env:
  - name: LD_PRELOAD
    value: /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
  - name: MALLOC_CONF
    value: "prof:true,prof_active:true,prof_prefix:/tmp/jeprof,lg_prof_sample:17"
(You can tune sampling later.)
Inside the pod:
worker=$(pgrep -f "puma: cluster worker" | head -n1)
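grep -E 'jemalloc' /proc/$worker/maps | head
Trigger a dump from a worker. As far as released jemalloc versions go (5.3 and earlier), the library does not install a signal handler itself, so the signal only produces a dump if the process traps it and calls jemalloc's prof.dump mallctl. If nothing does, add lg_prof_interval or prof_final:true to MALLOC_CONF so dumps get written without a signal:
kill -USR2 $worker
ls -lh /tmp/jeprof*
You should see /tmp/jeprof.<pid>.<seq>.heap.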
Top allocators (text):
jeprof --show_bytes "$(command -v ruby)" /tmp/jeprof.$worker.*.heap | head -n 60
Generate a PDF call graph:
jeprof --show_bytes --pdf "$(command -v ruby)" /tmp/jeprof.$worker.*.heap > /tmp/heap.pdf
Copy out:
kubectl cp -n <ns> <pod>:/tmp/heap.pdf ./heap.pdf
Take two dumps at different times (or after a workload), then:
jeprof --show_bytes --diff_base=/tmp/jeprof.$worker.0001.heap \
  "$(command -v ruby)" /tmp/jeprof.$worker.0002.heap | head -n 80
This highlights what grew.
Alpine has a jemalloc package available (at least on edge; varies by Alpine version). (pkgs.alpinelinux.org)
In Dockerfile:
RUN apk add --no-cache jemalloc perl graphviz binutils
Note: some Alpine releases may not ship jeprof as a separate binary even if jemalloc is present. If jeprof isn’t found, you can:
- install a jemalloc -dev package if available, or
- copy jeprof from a build stage, or
- build jemalloc from source (heavier).
On Alpine the library path is typically:
/usr/lib/libjemalloc.so.2
Set:
env:
  - name: LD_PRELOAD
    value: /usr/lib/libjemalloc.so.2
  - name: MALLOC_CONF
    value: "prof:true,prof_active:true,prof_prefix:/tmp/jeprof,lg_prof_sample:17"
Verify:
worker=$(pgrep -f "puma: cluster worker" | head -n1)
grep -E 'jemalloc' /proc/$worker/maps | head
If you see errors like “LD_PRELOAD cannot be preloaded”, it’s almost always a wrong path/arch. (Stack Overflow)
Everything else (USR2 dump, jeprof usage) is the same once jemalloc + jeprof exist.
Given your workers are dominated by RssAnon, jeprof typically points to one of these buckets:
Look for stacks mentioning:
- vm_malloc, ruby_xmalloc, rb_newobj
- rb_hash_*, rb_str_*, rb_ary_*
This often means: lots of Ruby objects (params, JSON, templating, caching).
Look for shared objects in stacks like:
- json/ext/parser.so
- nokogiri/*.so
- openssl.so
Then you optimize usage patterns (streaming parse, reuse, reduce conversions) or reduce workload.
If jeprof shows “many small allocs from many places” and RSS doesn’t fall after GC, fragmentation is likely. That’s where jemalloc itself often improves things (even without profiling), and profiling helps you find the churny sources.
- Start with the Debian pod (easier to get jeprof + symbols).
- Enable jemalloc profiling on one canary.
- Take three snapshots:
  - right after boot
  - after warm traffic
  - after 30–60 minutes steady traffic
- Run:
  jeprof --show_bytes ruby heap
  jeprof --diff_base=early.heap ruby later.heap
You’ll end up with: the specific call paths responsible for growth.
- If you want hot paths: use a higher sampling rate (more detail), e.g. lg_prof_sample:15.
- If you want leak/growth: keep lg_prof_sample:17–19, take diff snapshots.
(If you tell me your traffic level on that pod, I’ll suggest a safe sampling rate.)
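As a quick check on what those values mean: lg_prof_sample is the base-2 log of the average number of bytes allocated between samples, so the interval is 2^n bytes. Here is a minimal Ruby sketch of that arithmetic (the helper name is made up for illustration):

```ruby
# Average number of allocated bytes between profile samples
# for a given lg_prof_sample value (interval = 2**n bytes).
def sample_interval_bytes(lg_prof_sample)
  2**lg_prof_sample
end

[15, 17, 19].each do |lg|
  kib = sample_interval_bytes(lg) / 1024
  puts "lg_prof_sample:#{lg} -> about one sample per #{kib} KiB allocated"
end
```

Lower values sample more often, which gives more detail at the cost of more overhead.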
If you paste:
- ls -l /tmp/jeprof* after sending USR2, and
- the first ~40 lines of jeprof --show_bytes ruby heap
…I’ll help you interpret it into “this is Ruby churn vs native lib vs fragmentation” and what change is most likely to cut the ~350 MiB/worker.
Great question — jeprof is the right tool when most of your memory is anonymous / native allocations, like in your case.
Below is a practical, end-to-end guide that works well for Rails + Puma on k8s, with notes on why each step matters and how to read the results.
I’ll assume:
- Linux container
- You can rebuild the image (or make a canary)
- You’re OK profiling one pod, not the whole fleet
jemalloc + jeprof answers one very specific question:
Which code paths are allocating native memory (malloc), and how much?
That includes:
- Ruby’s internal malloc usage
- JSON, Nokogiri, OpenSSL
- allocator arenas & fragmentation
- anything calling malloc under the hood
It does not replace Ruby heap tools — it complements them.
RUN apt-get update && apt-get install -y \
    libjemalloc2 \
    libjemalloc-dev \
    graphviz \
    perl \
    && rm -rf /var/lib/apt/lists/*
You need:
- libjemalloc2 → allocator
- libjemalloc-dev → ships the jeprof analysis script on Debian
- graphviz → renders call graphs
Verify in the container:
jeprof --help
Set both LD_PRELOAD and MALLOC_CONF.
env:
  - name: LD_PRELOAD
    value: /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
  - name: MALLOC_CONF
    value: "prof:true,prof_active:true,prof_prefix:/tmp/jeprof,prof_leak:true,lg_prof_sample:17"
Keep the value on one line: a folded YAML scalar (value: >) would put spaces after the commas, and jemalloc's option parser is likely to reject that.
- prof:true → enable profiling
- prof_active:true → start collecting immediately
- prof_prefix:/tmp/jeprof → where dumps go
- lg_prof_sample:17 → sampling rate (on average one sample per 2^17 bytes = 128 KiB allocated, a good balance)
- prof_leak:true → report a leak summary when the process exits
💡 Do this on ONE pod first (canary). jemalloc profiling has overhead.
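The MALLOC_CONF value is a list of comma-separated key:value pairs, and as far as I know jemalloc expects it without whitespace. If you generate the value (for example from a deploy script), a small Ruby sketch like this keeps it well-formed. The malloc_conf helper is hypothetical, not part of jemalloc or any gem:

```ruby
# Join option pairs into the comma-separated key:value form used by MALLOC_CONF.
# Raises if the result contains whitespace (assumed to be invalid for jemalloc).
def malloc_conf(options)
  conf = options.map { |key, value| "#{key}:#{value}" }.join(",")
  raise ArgumentError, "whitespace in MALLOC_CONF: #{conf.inspect}" if conf.match?(/\s/)
  conf
end

profiling = {
  prof: true,
  prof_active: true,
  prof_prefix: "/tmp/jeprof",
  prof_leak: true,
  lg_prof_sample: 17
}

puts malloc_conf(profiling)
```

Dropping prof_leak or changing lg_prof_sample is then a one-line change to the hash.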
Inside the pod:
cat /proc/$(pgrep -f 'puma: cluster worker' | head -n1)/maps | grep jemalloc
You should see something like:
libjemalloc.so.2
If not → jeprof will be useless.
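The steps below request a dump by sending SIGUSR2 to a worker. As far as released jemalloc versions go (5.3 and earlier), jemalloc does not install a signal handler itself, so this only works if the process traps the signal and calls jemalloc's prof.dump mallctl. Without such a handler, set lg_prof_interval or prof_final:true in MALLOC_CONF so dumps are written automatically.
pgrep -f "puma: cluster worker"
kill -USR2 <worker_pid>
This creates files like:
/tmp/jeprof.<pid>.<seq>.heap
List them:
ls -lh /tmp/jeprof*
👉 Take snapshots: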
- after boot
- after warm traffic
- later (to see growth)
jeprof --show_bytes /path/to/ruby /tmp/jeprof.<pid>.*.heap
You’ll get output like:
Total: 412.3 MB
120.5 MB json/ext/parser.so
88.1 MB nokogiri/xml_document.c
65.7 MB vm_malloc
40.2 MB openssl.so
This already answers “which native subsystems dominate memory?”
jeprof --show_bytes --pdf /path/to/ruby /tmp/jeprof.<pid>.*.heap > /tmp/heap.pdf
Or SVG:
jeprof --show_bytes --svg /path/to/ruby /tmp/jeprof.<pid>.*.heap > /tmp/heap.svg
Copy it locally:
kubectl cp <ns>/<pod>:/tmp/heap.pdf ./heap.pdf
Look for:
- malloc
- ruby_xmalloc
- json_parse
- xmlNewDoc (Nokogiri)
- openssl
- rb_hash_aset
- rb_ary_new
| Pattern | Meaning |
|---|---|
| json/ext/parser.so | Heavy JSON parsing |
| nokogiri | XML/HTML parsing, often from templates or feeds |
| vm_malloc | Ruby VM heap / internal structures |
| rb_hash_* | Large hashes (params, caches, configs) |
| malloc_consolidate | Fragmentation |
This is huge.
jeprof --show_bytes --diff_base=jeprof.123.0001.heap \
  /path/to/ruby jeprof.123.0002.heap
This tells you:
- what grew
- what stayed stable
- what’s leaking vs transient
If memory is “leaking”, you’ll see the same call paths grow over time.
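If you collect many dumps, picking the right pair by hand gets tedious. The following Ruby sketch builds the diff command from a list of dump files by comparing the oldest against the newest. Both helper names are hypothetical, and the file-name pattern is an assumption: it matches the jeprof.<pid>.<seq>.heap form used above and also tolerates a one-letter type segment before .heap, which jemalloc may emit.

```ruby
# Extract the dump sequence number from a file name such as
# jeprof.123.0002.heap; an optional one-letter type segment
# (e.g. ".i4") before ".heap" is tolerated. Returns 0 if nothing matches.
def dump_seq(path)
  path[/\.(\d+)(?:\.[a-z]\d*)?\.heap\z/, 1].to_i
end

# Build a jeprof invocation that diffs the newest dump against the oldest.
# ruby_bin is a placeholder; pass the real interpreter path.
def diff_command(dumps, ruby_bin: "/path/to/ruby")
  ordered = dumps.sort_by { |path| dump_seq(path) }
  ["jeprof", "--show_bytes", "--diff_base=#{ordered.first}", ruby_bin, ordered.last]
end

dumps = %w[jeprof.123.0002.heap jeprof.123.0001.heap jeprof.123.0003.heap]
puts diff_command(dumps).join(" ")
```

Returning the command as an array makes it easy to pass straight to system(*cmd) without shell quoting issues.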
You observed:
- RssAnon dominates
- Private_Dirty dominates
- GC doesn’t reduce RSS much
jeprof lets you answer:
- is it Ruby heap pressure?
- or native libs?
- or fragmentation?
Typical outcomes:
- JSON / Nokogiri heavy → optimize parsing, streaming, reuse
- Hash-heavy paths → reduce object churn
- Fragmentation-heavy → jemalloc already helps; tune arenas and decay (e.g. narenas, dirty_decay_ms) or thread count
- Metrics/instrumentation → disable or batch
- Never enable jemalloc profiling fleet-wide
- Use 1 pod
- Capture a few snapshots
- Turn it off afterward (remove MALLOC_CONF)
- Expect 5–15% overhead while profiling
- Rebuild image with jemalloc + jeprof
- Enable via LD_PRELOAD + MALLOC_CONF
- kill -USR2 <worker_pid>
- jeprof --show_bytes ruby jeprof.*.heap
- Look for dominant native allocators
- Diff snapshots to find growth
If you want, tell me:
- your Ruby version
- base image (Alpine vs Debian)
- whether you want allocation hotspots or leak detection
I can tailor the exact MALLOC_CONF and jeprof commands for your setup and traffic profile.
You can break “anonymous heap” into (A) Ruby heap, (B) native malloc, (C) allocator fragmentation/arenas, and (D) app caches that cause COW breakage — but no single metric shows all of it. The trick is to use a small toolkit and compare signals.
Below is a practical, production-friendly menu, ordered from “safe + low overhead” to “deeper / heavier”.
Run inside the pod:
kubectl exec -n <ns> -it <pod> -- sh -lc \
  "bundle exec ruby -e 'p GC.stat'"
If you can run in Rails context (loads app, heavier):
kubectl exec -n <ns> -it <pod> -- sh -lc \
  "bundle exec rails runner 'pp GC.stat'"
Both commands start a new Ruby process, so the numbers describe that process, not a running Puma worker. To see a worker's own stats, GC.stat has to be called inside that worker.
Key fields to look at:
- heap pages / slots (overall Ruby heap size trend)
- malloc_increase_bytes / malloc_increase_bytes_limit (Ruby’s view of malloc growth)
- old_objects rising without coming down (possible leak pattern)
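To watch these fields as a trend rather than a single reading, one option is to take two GC.stat snapshots inside the process you care about and print the difference. This is a rough Ruby sketch: the key selection and helper names are mine, and the array allocation just stands in for real work.

```ruby
# GC.stat keys to track; Hash#slice silently skips any key a given Ruby lacks.
KEYS = %i[heap_live_slots old_objects malloc_increase_bytes heap_allocated_pages].freeze

def gc_snapshot
  GC.stat.slice(*KEYS)
end

# Per-key difference between two snapshots (after minus before).
def gc_delta(before, after)
  after.to_h { |key, value| [key, value - before.fetch(key, 0)] }
end

before = gc_snapshot
retained = Array.new(50_000) { |i| "item-#{i}" } # stand-in workload (hypothetical)
after = gc_snapshot

gc_delta(before, after).each { |key, diff| puts format("%-24s %+d", key, diff) }
puts "retained #{retained.size} strings"
```

A steadily positive old_objects delta across many snapshots is the pattern worth chasing; a single reading says little.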
kubectl exec -n <ns> -it <pod> -- sh -lc \
  "bundle exec rails runner 'require \"objspace\"; puts ObjectSpace.memsize_of_all'"
This estimates memory held by Ruby objects (not everything native).
This is the best “show me what kinds of objects are big” view:
kubectl exec -n <ns> -it <pod> -- sh -lc \
  "bundle exec rails runner '
    require \"objspace\"
    h = Hash.new(0)
    ObjectSpace.each_object do |o|
      h[o.class] += ObjectSpace.memsize_of(o) rescue 0
    end
    h.sort_by { |_,v| -v }.first(30).each { |k,v| puts \"#{k}\t#{v/1024/1024} MiB\" }
  '"
Use this when you have a quiet window; it can be expensive on large heaps.
Ruby heap stats won’t capture most of this. You want allocator-level introspection.
kubectl exec -n <ns> -it <pod> -- sh -lc \
  'cat /proc/$(pgrep -f "puma: cluster worker" | head -n1)/maps | grep -E "jemalloc|tcmalloc" | head'
If nothing shows, you’re likely on glibc malloc.
The best way is usually to run with a different allocator (jemalloc) and compare RssAnon / Private_Dirty / PSS before vs after. That’s often more actionable than trying to introspect glibc malloc deeply in-prod.
If you can add packages, libc6-dbg + gdb can inspect malloc arenas, but that’s intrusive.
This is a huge contributor in “big RssAnon + lots of threads”.
You already did this. Fragmentation typically shows as:
- RssAnon high
- Private_Dirty high
- and memory not dropping after GC / traffic dips
Inside the pod:
kubectl exec -n <ns> -it <pod> -- sh -lc \
  "bundle exec rails runner 'GC.start(full_mark: true, immediate_sweep: true); sleep 1; puts GC.stat'"
Keep in mind that rails runner starts a separate process, so for this check to mean anything the GC.start call has to run inside the worker itself.
Then re-check:
kubectl exec -n <ns> -it <pod> -- sh -lc \
  'for p in $(pgrep -f "puma: cluster worker"); do echo "== $p =="; grep -E "VmRSS|RssAnon" /proc/$p/status; done'
If Ruby heap shrinks (objects freed) but RssAnon barely changes, that points to native allocations and/or fragmentation (Ruby freed objects but allocator kept pages).
Note: Forcing GC in production can hurt latency; do this only for diagnosis.
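Since GC.start only affects the process that calls it, one way to run this comparison is from inside the worker. The Ruby sketch below reads /proc/self/status before and after a forced GC. It is Linux-only, the helper names are hypothetical, and how you get the code to run inside a worker is up to you.

```ruby
# Pull VmRSS and RssAnon (in kB) out of /proc/<pid>/status text.
def rss_fields(status_text)
  status_text.scan(/^(VmRSS|RssAnon):\s+(\d+) kB/).to_h { |name, kb| [name, kb.to_i] }
end

# Read the current process's values; returns {} where /proc is unavailable.
def self_rss
  path = "/proc/self/status"
  File.readable?(path) ? rss_fields(File.read(path)) : {}
end

before = self_rss
GC.start(full_mark: true, immediate_sweep: true)
after = self_rss

after.each { |name, kb| puts "#{name}: #{before[name]} kB -> #{kb} kB" }
```

If RssAnon barely moves here while GC.stat shows many freed objects, the allocator is holding on to the pages.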
If after enabling jemalloc your per-worker RssAnon drops meaningfully or grows slower, you’ve confirmed fragmentation/arena issues were significant.
These show up as private dirty growth after fork, often correlated with first requests.
Right after startup (before much traffic), capture:
kubectl exec -n <ns> -it <pod> -- sh -lc \
  'for p in $(pgrep -f "puma: cluster worker"); do echo "== $p =="; awk "/^(Pss|Shared_Clean|Private_Dirty):/ {print}" /proc/$p/smaps_rollup; done'
Then after warm traffic, do it again. Big increase in Private_Dirty = cache warming / post-fork writes.
If you’re using config.cache_store = :memory_store, every worker has its own in-process cache (private memory).
Switching to Redis/Memcached moves that memory out of the worker.
Your config enables PerformanceMetricsCollector per worker when PERFORMANCE_MODE is set. Try disabling and compare:
- worker PSS
- worker Private_Dirty
- worker RssAnon
If you can run jemalloc, you can enable heap profiling and get real allocation hotspots (native + Ruby’s malloc usage).
Typical pattern (depends on image & env):
- set LD_PRELOAD to jemalloc
- set MALLOC_CONF for profiling
- use jeprof to analyze
This requires changes to the container and is more involved, but it’s the most direct way to answer “which native code is allocating?”
- Ruby heap size & growth: GC.stat (and optionally ObjectSpace.memsize_of_all)
- Does GC affect RSS? Force a GC once (carefully) and compare RssAnon
- Cache/COW effect: compare Private_Dirty right after boot vs after warm traffic
- Allocator A/B: enable jemalloc in a canary and compare per-worker RssAnon + Private_Dirty
If you tell me:
- Ruby version (2.7/3.0/3.1/3.2/3.3?)
- whether you can change the image / add packages
- whether PERFORMANCE_MODE is on
- what config.cache_store is
…I’ll pick the fastest path and give you exact commands tailored to your setup (including a safe canary strategy on k8s).
You do have some sharing (~100 MiB/worker), but it’s small compared to the ~330–380 MiB/worker private dirty anonymous memory you measured. That’s why it looks like “almost no sharing”.
Copy-on-write (COW) sharing only helps for pages that are:
- created before fork, and
- never written to after fork.
In a Rails app, a lot of memory either:
- is allocated after fork (request handling, caches, DB connections, JSON parsing, template rendering), or
- starts shared but gets written to in each worker (breaking COW), turning shared pages into private dirty pages.
Given your /proc/<pid>/status shows RssAnon dominates, most of your footprint is “heap-like” memory that naturally tends to become private.
Common Rails-specific COW breakers:
- Autoloading / Zeitwerk activity after fork (loading constants lazily)
- Rails caches warming per worker (fragment cache, memory store, i18n, view/template caches)
- Runtime memoization (class variables, constants mutated after boot)
- Per-worker DB/Redis client setup (connection pools, prepared statements, SSL state)
- Instrumentation/metrics/tracing (buffers, labels, exporter state)
- malloc arenas + fragmentation (glibc creates per-thread arenas; each worker ends up with lots of private heap pages)
- GC behavior: Ruby heap pages get dirtied as objects are allocated/updated
So: low sharing is usually not “Puma is wrong” — it’s “your workers quickly diverge in heap and caches”.
Goal: allocate as much as possible before fork, so it can be shared.
In production, ensure:
- config.eager_load = true
- Avoid lazy loading code paths on first request
You can force eager load in an initializer or before fork by touching parts of the app (careful—don’t do heavy one-off work that will then be mutated per worker).
Validate it’s working: after startup but before heavy traffic, check Shared_Clean is higher and Private_Dirty lower.
Classic offenders:
- big constant hashes that workers mutate
- global caches written to on first request
- “warming” caches in each worker
Prefer:
- freeze large constants (.freeze)
- avoid global mutable state
- move caching to external stores (Redis/Memcached) when feasible
A quick code smell: ||= memoization on class-level or global objects that are touched after fork across many code paths.
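Freezing does not by itself guarantee a page stays shared, but it does make accidental post-fork writes fail loudly instead of silently dirtying memory. Here is a small Ruby sketch of the difference; the SETTINGS constant and try_mutate helper are invented for illustration:

```ruby
# A boot-time constant, frozen so workers cannot mutate it after fork.
SETTINGS = { region: "eu", retries: 3 }.freeze # hypothetical values

# Attempt a write and report whether the hash accepted it.
def try_mutate(hash)
  hash[:retries] = 5
  :mutated
rescue FrozenError
  :rejected
end

puts try_mutate(SETTINGS)     # frozen constant: the write raises FrozenError
puts try_mutate(SETTINGS.dup) # dup returns an unfrozen copy: the write succeeds
```

The second call shows the escape hatch: code that really needs a mutable version works on its own copy rather than on the shared constant.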
With preload_app!, you generally want:
before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end
on_worker_boot do
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
  # other reconnects...
end
This prevents inherited connections and reduces weird shared state. It won’t magically halve memory, but it removes a real source of post-fork churn.
Your footprint is overwhelmingly RssAnon, so allocator behavior matters.
glibc malloc tends to:
- create multiple arenas (especially with threads)
- fragment over time
- keep memory in private dirty pages
jemalloc often:
- fragments less
- returns/stabilizes memory better
- reduces “mysterious” per-worker anon growth
This is one of the best “no code change” levers for exactly the profile you have.
If you add workers, memory scales almost linearly because each worker has its own heap.
If you instead:
- keep workers low
- raise threads (within DB/pool limits)
…you can get more throughput without duplicating heaps as much.
This doesn’t increase sharing, but it reduces the need for sharing.
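To put rough numbers on that, here is a back-of-the-envelope Ruby sketch. It assumes a deliberately simple model (shared pages counted once, private pages once per worker) and plugs in the approximate figures from this thread, ~100 MiB shared and ~350 MiB private per worker. It ignores the master process and any extra per-thread growth, so treat the output as an estimate only.

```ruby
# Rough model: COW-shared pages are counted once, private pages once per worker.
def pod_memory_mib(workers:, shared_mib:, private_mib:)
  shared_mib + workers * private_mib
end

[2, 4, 8].each do |workers|
  total = pod_memory_mib(workers: workers, shared_mib: 100, private_mib: 350)
  puts "#{workers} workers -> ~#{total} MiB"
end
```

With the private part dominating, each extra worker costs close to a full heap, which is why trading workers for threads can pay off.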
Use the exact numbers you already have:
For each worker, track:
- Private_Dirty (should go down or grow slower)
- Shared_Clean (should go up after better preload/eager load)
- Pss (should go down per worker)
Run right after boot (no traffic), then after warm traffic:
for p in $(pgrep -f "puma: cluster worker"); do
  echo "== $p ==";
  awk '/^(Pss|Shared_Clean|Shared_Dirty|Private_Clean|Private_Dirty):/ {print}' /proc/$p/smaps_rollup
done
If sharing is improved, you’ll see higher Shared_Clean and lower Private_Dirty early, and slower growth over time.
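If you save the boot and warm outputs, comparing them by eye gets error-prone. The following Ruby sketch parses two smaps_rollup captures and prints the growth per field. The helper names are hypothetical and the sample numbers are made up purely to show the shape of the data.

```ruby
FIELDS = %w[Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty].freeze

# Parse the kB counters of interest out of smaps_rollup text.
def parse_rollup(text)
  text.each_line.with_object({}) do |line, out|
    match = line.match(/\A(\w+):\s+(\d+)\s+kB/)
    out[match[1]] = match[2].to_i if match && FIELDS.include?(match[1])
  end
end

# Growth (warm minus boot) per field, converted from kB to MiB.
def rollup_growth_mib(boot, warm)
  FIELDS.to_h { |field| [field, (warm.fetch(field, 0) - boot.fetch(field, 0)) / 1024] }
end

# Made-up sample captures, not real measurements.
boot = parse_rollup(<<~ROLLUP)
  Rss:              204800 kB
  Pss:              153600 kB
  Shared_Clean:     102400 kB
  Private_Dirty:    102400 kB
ROLLUP

warm = parse_rollup(<<~ROLLUP)
  Rss:              512000 kB
  Pss:              460800 kB
  Shared_Clean:     102400 kB
  Private_Dirty:    409600 kB
ROLLUP

rollup_growth_mib(boot, warm).each { |field, mib| puts format("%-14s %+d MiB", field, mib) }
```

In this invented example Shared_Clean is flat while Private_Dirty grows by 300 MiB, which is the post-fork divergence pattern described above.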
Based on your numbers:
- Private_Dirty dominates (hundreds of MiB)
- RssAnon dominates
- shared is only ~100 MiB
That combination strongly points to:
- heavy per-worker heap allocation after fork, plus
- allocator fragmentation, plus
- cache warming / lazy loads / instrumentation causing post-fork writes.
If you want, paste:
- RAILS_ENV, RAILS_SERVE_STATIC_FILES
- config/environments/production.rb values for eager_load and cache_classes
- whether PERFORMANCE_MODE is enabled
- Ruby version
…and I’ll tell you which 1–2 changes are most likely to move the needle for your setup (and how to validate with your PSS/shared/private metrics).