@joeywang
Last active January 24, 2026 22:57

Got it — you’ve got two environments:

  • Ruby 4.x on Alpine (musl)
  • Ruby 3.4.8 on Debian (glibc) with Puma

jeprof works best when:

  1. you are actually using jemalloc as the allocator, and
  2. you have symbols available (debug packages help a lot).

Below I’ll give you separate “known-good” recipes for Debian vs Alpine, plus exactly how to run jeprof and interpret/diff results.


Debian (Ruby 3.4.8) — easiest path

1) Install what you need

In Dockerfile:

RUN apt-get update && apt-get install -y --no-install-recommends \
  libjemalloc2 libjemalloc-dev \
  graphviz perl \
  binutils \
  && rm -rf /var/lib/apt/lists/*

jeprof comes with libjemalloc-dev on Debian. (manpages.debian.org)

2) Enable jemalloc + profiling

Add env vars (Deployment / Helm values):

env:
  - name: LD_PRELOAD
    value: /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
  - name: MALLOC_CONF
    value: "prof:true,prof_active:true,prof_prefix:/tmp/jeprof,lg_prof_sample:17"

(You can tune sampling later.)

3) Verify jemalloc is active

Inside the pod:

worker=$(pgrep -f "puma: cluster worker" | head -n1)
grep -E 'jemalloc' /proc/$worker/maps | head

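4) Produce a heap dump (snapshot)

jemalloc does not write a dump in response to a signal on its own (and Puma's master treats SIGUSR2 as a restart request). Dumps come from the prof.dump mallctl, or automatically from MALLOC_CONF options such as lg_prof_interval (dump roughly every 2^N bytes allocated), prof_gdump, or prof_final. One way to call prof.dump in a running worker is gdb, assuming gdb is installed and ptrace is permitted in the pod:

gdb -p "$worker" -batch -ex 'call (int) mallctl("prof.dump", 0, 0, 0, 0)'
ls -lh /tmp/jeprof*

You should see /tmp/jeprof.<pid>.<seq>.m<mseq>.heap (interval dumps use i<iseq> instead of m<mseq>).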

5) Analyze with jeprof

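Top allocators (text). Pass --text, otherwise jeprof starts its interactive shell, and pass a single dump rather than a glob, because multiple profiles are merged into one:

heap=$(ls -t /tmp/jeprof.$worker.*.heap | head -n1)
jeprof --text --show_bytes "$(command -v ruby)" "$heap" | head -n 60

Generate a PDF call graph:

jeprof --show_bytes --pdf "$(command -v ruby)" "$heap" > /tmp/heap.pdf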

Copy out:

kubectl cp -n <ns> <pod>:/tmp/heap.pdf ./heap.pdf

6) Leak / growth: diff two snapshots

Take two dumps at different times (or after a workload), then pass the earlier one as the base (jeprof's option is --base; substitute the real file names from ls):

jeprof --text --show_bytes --base=/tmp/<earlier-dump>.heap \
  "$(command -v ruby)" /tmp/<later-dump>.heap | head -n 80

This highlights what grew.


Alpine (Ruby 4.x) — works, but paths/tools differ

Alpine has a jemalloc package available (at least on edge; varies by Alpine version). (pkgs.alpinelinux.org)

1) Install dependencies

In Dockerfile:

RUN apk add --no-cache jemalloc perl graphviz binutils

Note: some Alpine releases may not ship jeprof as a separate binary even if jemalloc is present. If jeprof isn’t found, you can:

  • install a jemalloc -dev package if available, or
  • copy jeprof from a build stage, or
  • build jemalloc from source (heavier).

2) Set LD_PRELOAD path (musl)

On Alpine the library path is typically:

  • /usr/lib/libjemalloc.so.2

Set:

env:
  - name: LD_PRELOAD
    value: /usr/lib/libjemalloc.so.2
  - name: MALLOC_CONF
    value: "prof:true,prof_active:true,prof_prefix:/tmp/jeprof,lg_prof_sample:17"

Verify:

worker=$(pgrep -f "puma: cluster worker" | head -n1)
grep -E 'jemalloc' /proc/$worker/maps | head

If you see errors like “LD_PRELOAD cannot be preloaded”, it’s almost always a wrong path/arch. (Stack Overflow)

Everything else (triggering dumps, jeprof usage) is the same once jemalloc + jeprof exist.


How to read jeprof output for your “anon heap” problem

Given your workers are dominated by RssAnon, jeprof typically points to one of these buckets:

A) Ruby VM / object churn

Look for stacks mentioning:

  • vm_malloc, ruby_xmalloc, rb_newobj
  • rb_hash_*, rb_str_*, rb_ary_*

This often means: lots of Ruby objects (params, JSON, templating, caching).

B) JSON / Nokogiri / OpenSSL heavy native allocs

Look for shared objects in stacks like:

  • json/ext/parser.so
  • nokogiri/*.so
  • openssl.so

Then you optimize usage patterns (streaming parse, reuse, reduce conversions) or reduce workload.

C) Fragmentation / arenas

If jeprof shows “many small allocs from many places” and RSS doesn’t fall after GC, fragmentation is likely. That’s where jemalloc itself often improves things (even without profiling), and profiling helps you find the churny sources.


Recommended workflow for you (fast + reliable)

  1. Start with Debian pod (easier to get jeprof + symbols).

  2. Enable jemalloc profiling on one canary.

  3. Take three snapshots:

    • right after boot
    • after warm traffic
    • after 30–60 minutes steady traffic
  4. Run:

    • jeprof --text --show_bytes ruby heap
    • jeprof --text --base=early.heap ruby later.heap

You’ll end up with: the specific call paths responsible for growth.


Quick check: do you want “alloc hot paths” or “leak/growth”?

  • If you want hot paths: use a higher sampling rate (more detail), e.g. lg_prof_sample:15.
  • If you want leak/growth: keep lg_prof_sample:17–19, take diff snapshots.
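
To make the trade-off concrete: lg_prof_sample is a base-2 logarithm, so jemalloc takes on average one sample per 2^N bytes allocated. A minimal Ruby sketch of that arithmetic (the helper name is hypothetical, purely for illustration):

```ruby
# lg_prof_sample is log2 of the average number of bytes allocated
# between samples, so the interval is simply 2**lg.
def sample_interval_bytes(lg_prof_sample)
  2**lg_prof_sample
end

[15, 17, 19].each do |lg|
  kib = sample_interval_bytes(lg) / 1024
  puts "lg_prof_sample:#{lg} -> about one sample per #{kib} KiB allocated"
end
```

Lower values sample more often, which gives more detail at the cost of more profiling overhead.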

(If you tell me your traffic level on that pod, I’ll suggest a safe sampling rate.)


If you paste:

  • ls -l /tmp/jeprof* after triggering a dump, and
  • the first ~40 lines of jeprof --text --show_bytes ruby heap

…I’ll help you interpret it into “this is Ruby churn vs native lib vs fragmentation” and what change is most likely to cut the ~350 MiB/worker.


Great question — jeprof is the right tool when most of your memory is anonymous / native allocations, like in your case.

Below is a practical, end-to-end guide that works well for Rails + Puma on k8s, with notes on why each step matters and how to read the results.

I’ll assume:

  • Linux container
  • You can rebuild the image (or make a canary)
  • You’re OK profiling one pod, not the whole fleet

0️⃣ Mental model (important before touching jeprof)

jemalloc + jeprof answers one very specific question:

Which code paths are allocating native memory (malloc), and how much?

That includes:

  • Ruby’s internal malloc usage
  • JSON, Nokogiri, OpenSSL
  • allocator arenas & fragmentation
  • anything calling malloc under the hood

It does not replace Ruby heap tools — it complements them.


1️⃣ Install jemalloc + jeprof in the image

Debian / Ubuntu base image

RUN apt-get update && apt-get install -y \
  libjemalloc2 \
  libjemalloc-dev \
  graphviz \
  perl \
  && rm -rf /var/lib/apt/lists/*

You need:

  • libjemalloc → allocator
  • jeprof → analysis script
  • graphviz → renders call graphs

Verify in the container:

jeprof --help

2️⃣ Enable jemalloc with profiling

Set both LD_PRELOAD and MALLOC_CONF.

Example env vars (k8s-friendly)

env:
  - name: LD_PRELOAD
    value: /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
  - name: MALLOC_CONF
    value: "prof:true,prof_active:true,prof_prefix:/tmp/jeprof,prof_leak:true,lg_prof_sample:17"

What these mean (important):

  • prof:true → enable profiling
  • prof_active:true → start collecting immediately
  • prof_prefix:/tmp/jeprof → where dumps go
  • lg_prof_sample:17 → sampling rate (on average one sample per 128 KiB allocated, a good balance)
  • prof_leak:true → prints a leak summary when the process exits (optional for snapshot-based analysis)
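
If you generate the value from code or a template, keep it a single comma-separated string: as far as I know jemalloc's option parser does not tolerate stray spaces or newlines. A minimal Ruby sketch, where malloc_conf and PROFILING_OPTIONS are hypothetical names and not part of any library:

```ruby
# Build a MALLOC_CONF value from a Hash. jemalloc expects
# comma-separated key:value pairs, so whitespace is rejected here.
def malloc_conf(options)
  conf = options.map { |key, value| "#{key}:#{value}" }.join(",")
  raise ArgumentError, "MALLOC_CONF must not contain whitespace" if conf.match?(/\s/)
  conf
end

PROFILING_OPTIONS = {
  prof: true,
  prof_active: true,
  prof_prefix: "/tmp/jeprof",
  lg_prof_sample: 17
}.freeze

puts malloc_conf(PROFILING_OPTIONS)
```

The printed string can be pasted into the env var as-is.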

💡 Do this on ONE pod first (canary). jemalloc profiling has overhead.


3️⃣ Confirm jemalloc is actually in use

Inside the pod:

cat /proc/$(pgrep -f 'puma: cluster worker' | head -n1)/maps | grep jemalloc

You should see something like:

libjemalloc.so.2

If not → jeprof will be useless.
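
The same check can be done from Ruby if you want to log it at boot. A minimal sketch, assuming the usual six-column /proc/<pid>/maps layout; jemalloc_mappings is a hypothetical helper and the sample text is made up for illustration:

```ruby
# Return the distinct mapped file paths that mention jemalloc
# in the text of a /proc/<pid>/maps file.
def jemalloc_mappings(maps_text)
  maps_text.each_line
           .map { |line| line.split(" ", 6)[5].to_s.strip }
           .select { |path| path.include?("jemalloc") }
           .uniq
end

# Hypothetical two-line excerpt, for illustration only.
sample = <<~MAPS
  7f1c2a000000-7f1c2a021000 r--p 00000000 08:01 123 /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
  7f1c2b000000-7f1c2b1c0000 r-xp 00000000 08:01 456 /usr/lib/x86_64-linux-gnu/libc.so.6
MAPS

p jemalloc_mappings(sample)
```

Inside a worker you would pass File.read("/proc/self/maps") instead of the sample text; an empty array means jemalloc is not loaded.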


4️⃣ Generate a heap profile snapshot

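jemalloc writes heap dumps when the prof.dump mallctl is called, or automatically when MALLOC_CONF sets lg_prof_interval, prof_gdump, or prof_final. It does not dump on a signal by itself.

Find a worker PID

pgrep -f "puma: cluster worker"

Trigger a heap dump

One option, assuming gdb is installed and ptrace is permitted in the pod:

gdb -p <worker_pid> -batch -ex 'call (int) mallctl("prof.dump", 0, 0, 0, 0)'

This creates files like:

/tmp/jeprof.<pid>.<seq>.m<mseq>.heap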

List them:

ls -lh /tmp/jeprof*

👉 Take snapshots:

  • after boot
  • after warm traffic
  • later (to see growth)

5️⃣ Run jeprof (this is the magic)

Basic summary (top allocators)

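jeprof --text --show_bytes /path/to/ruby /tmp/jeprof.<pid>.*.heap

You’ll get output along these lines (simplified for illustration; real jeprof text output lists flat and cumulative bytes per function):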

Total: 412.3 MB
  120.5 MB  json/ext/parser.so
   88.1 MB  nokogiri/xml_document.c
   65.7 MB  vm_malloc
   40.2 MB  openssl.so

This already answers “which native subsystems dominate memory?”


6️⃣ Generate a call graph (visual, very powerful)

Create a PDF

jeprof --show_bytes --pdf /path/to/ruby /tmp/jeprof.<pid>.*.heap > /tmp/heap.pdf

Or SVG:

jeprof --show_bytes --svg /path/to/ruby /tmp/jeprof.<pid>.*.heap > /tmp/heap.svg

Copy it locally:

kubectl cp <ns>/<pod>:/tmp/heap.pdf ./heap.pdf

7️⃣ How to read the graph (this is key)

Big boxes = big memory

Look for:

  • malloc
  • ruby_xmalloc
  • json_parse
  • xmlNewDoc (Nokogiri)
  • openssl
  • rb_hash_aset
  • rb_ary_new

Common Rails patterns you’ll see

Pattern               Meaning
json/ext/parser.so    Heavy JSON parsing
nokogiri              XML/HTML parsing, often from templates or feeds
vm_malloc             Ruby VM heap / internal structures
rb_hash_*             Large hashes (params, caches, configs)
malloc_consolidate    glibc malloc internals; not expected in a jemalloc profile (fragmentation shows up in RSS, not in jeprof stacks)

8️⃣ Compare two snapshots (leak detection)

This is huge.

jeprof --text --show_bytes --base=jeprof.123.0001.heap \
       /path/to/ruby jeprof.123.0002.heap

This tells you:

  • what grew
  • what stayed stable
  • what’s leaking vs transient

If memory is “leaking”, you’ll see the same call paths grow over time.
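
When a pod has accumulated many dumps, picking the earliest and latest by hand gets error-prone, because the sequence number is not zero-padded and a plain string sort misorders it. A minimal Ruby sketch that builds the diff command, assuming dump names of the form <prefix>.<pid>.<seq>[.<type>].heap and jeprof's --base option; the helper names and file names are hypothetical:

```ruby
# Extract the numeric <seq> field from a dump file name.
def dump_seq(path)
  match = File.basename(path).match(/\.(\d+)\.(\d+)(?:\.[^.]+)?\.heap\z/)
  raise ArgumentError, "unexpected dump name: #{path}" unless match
  match[2].to_i
end

# Build the argv for diffing the earliest dump against the latest.
def jeprof_diff_command(ruby_bin, dumps)
  first, *_, last = dumps.sort_by { |path| dump_seq(path) }
  raise ArgumentError, "need at least two dumps" if last.nil?
  ["jeprof", "--text", "--show_bytes", "--base=#{first}", ruby_bin, last]
end

# Hypothetical file names; a string sort would put seq 10 before seq 2.
dumps = %w[/tmp/jeprof.123.10.m10.heap /tmp/jeprof.123.2.m2.heap /tmp/jeprof.123.0.m0.heap]
puts jeprof_diff_command("/usr/local/bin/ruby", dumps).join(" ")
```

The array form can be handed to system(*cmd) directly, which avoids shell quoting issues.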


9️⃣ Correlate jeprof with what you already measured

You observed:

  • RssAnon dominates
  • Private_Dirty dominates
  • GC doesn’t reduce RSS much

jeprof lets you answer:

  • is it Ruby heap pressure?
  • or native libs?
  • or fragmentation?

Typical outcomes:

  • JSON / Nokogiri heavy → optimize parsing, streaming, reuse
  • Hash-heavy paths → reduce object churn
  • Fragmentation-heavy → jemalloc already helps; tune arenas (narenas) or thread count
  • Metrics/instrumentation → disable or batch

10️⃣ Safety tips (important)

  • Never enable jemalloc profiling fleet-wide
  • Use 1 pod
  • Capture a few snapshots
  • Turn it off afterward (remove MALLOC_CONF)
  • Expect 5–15% overhead while profiling

TL;DR: Minimal working recipe

  1. Rebuild image with jemalloc + jeprof
  2. Enable via LD_PRELOAD + MALLOC_CONF
  3. Trigger a dump (prof.dump mallctl, or lg_prof_interval in MALLOC_CONF)
  4. jeprof --text --show_bytes ruby jeprof.*.heap
  5. Look for dominant native allocators
  6. Diff snapshots to find growth

If you want, tell me:

  • your Ruby version
  • base image (Alpine vs Debian)
  • whether you want allocation hotspots or leak detection

I can tailor the exact MALLOC_CONF and jeprof commands for your setup and traffic profile.


You can break “anonymous heap” into (A) Ruby heap, (B) native malloc, (C) allocator fragmentation/arenas, and (D) app caches that cause COW breakage — but no single metric shows all of it. The trick is to use a small toolkit and compare signals.

Below is a practical, production-friendly menu, ordered from “safe + low overhead” to “deeper / heavier”.


1) Ruby object heap: how big is it right now?

A. Ruby heap / GC stats (very low overhead)

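Run inside the pod. Note that every kubectl exec command in this section starts a fresh Ruby process, so the numbers describe that process (a useful baseline for a freshly loaded app), not the running Puma workers. For worker numbers, the same calls have to run inside a worker, for example from a diagnostic endpoint or a periodic log line: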

kubectl exec -n <ns> -it <pod> -- sh -lc \
"bundle exec ruby -e 'p GC.stat'"

If you can run in Rails context (loads app, heavier):

kubectl exec -n <ns> -it <pod> -- sh -lc \
"bundle exec rails runner 'pp GC.stat'"

Key fields to look at:

  • heap_allocated_pages / heap_live_slots (overall Ruby heap size trend)
  • malloc_increase_bytes / malloc_increase_bytes_limit (Ruby’s view of malloc growth)
  • old_objects rising without coming down (possible leak pattern)
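
A minimal sketch for logging just those fields. It reports on whichever process runs it, so it only describes a Puma worker when called from inside that worker; GC_KEYS and gc_snapshot are hypothetical names:

```ruby
# The GC.stat fields discussed above. Hash#slice silently skips
# any key that a given Ruby version does not report.
GC_KEYS = %i[
  heap_allocated_pages
  heap_live_slots
  old_objects
  malloc_increase_bytes
  malloc_increase_bytes_limit
].freeze

def gc_snapshot
  GC.stat.slice(*GC_KEYS)
end

gc_snapshot.each { |key, value| puts "#{key}: #{value}" }
```

Logging this every few minutes gives you the trend, which matters more than any single reading.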

B. Ruby object memory size (ObjSpace) (moderate overhead)

kubectl exec -n <ns> -it <pod> -- sh -lc \
"bundle exec rails runner 'require \"objspace\"; puts ObjectSpace.memsize_of_all'"

This estimates memory held by Ruby objects (not everything native).

C. Who (Ruby classes) is consuming object memory? (higher overhead, but super useful)

This is the best “show me what kinds of objects are big” view:

kubectl exec -n <ns> -it <pod> -- sh -lc \
"bundle exec rails runner '
require \"objspace\"
h = Hash.new(0)
ObjectSpace.each_object do |o|
  h[o.class] += ObjectSpace.memsize_of(o) rescue 0
end
h.sort_by { |_,v| -v }.first(30).each { |k,v| puts \"#{k}\t#{v/1024/1024} MiB\" }
'"

Use this when you have a quiet window; it can be expensive on large heaps.


2) Native allocations via malloc (JSON, Nokogiri, OpenSSL, etc.)

Ruby heap stats won’t capture most of this. You want allocator-level introspection.

A. Check if jemalloc is in use (quick)

kubectl exec -n <ns> -it <pod> -- sh -lc \
'grep -E "jemalloc|tcmalloc" /proc/$(pgrep -f "puma: cluster worker" | head -n1)/maps | head'

If nothing shows, you’re likely on glibc malloc.

B. glibc malloc info from inside the process (hard without tooling)

The best way is usually to run with a different allocator (jemalloc) and compare RssAnon / Private_Dirty / PSS before vs after. That’s often more actionable than trying to introspect glibc malloc deeply in-prod.

If you can add packages, libc6-dbg + gdb can inspect malloc arenas, but that’s intrusive.


3) Allocator arenas & fragmentation

This is a huge contributor in “big RssAnon + lots of threads”.

A. Use “RssAnon vs PSS vs Private_Dirty” as the fragmentation proxy

You already did this. Fragmentation typically shows as:

  • RssAnon high
  • Private_Dirty high
  • and memory not dropping after GC / traffic dips

B. Force a GC and see if RSS/PSS moves (diagnostic only)

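Inside the pod. Note that rails runner starts a separate process, so this GC does not touch the Puma workers; to test the workers themselves, GC.start has to run inside them (for example from a diagnostic endpoint):

kubectl exec -n <ns> -it <pod> -- sh -lc \
"bundle exec rails runner 'GC.start(full_mark: true, immediate_sweep: true); sleep 1; puts GC.stat'"

Then re-check (single quotes so that $(...) and $p expand inside the pod, not on your workstation):

kubectl exec -n <ns> -it <pod> -- sh -lc \
'for p in $(pgrep -f "puma: cluster worker"); do echo "== $p =="; grep -E "VmRSS|RssAnon" /proc/$p/status; done'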

If Ruby heap shrinks (objects freed) but RssAnon barely changes, that points to native allocations and/or fragmentation (Ruby freed objects but allocator kept pages).

Note: Forcing GC in production can hurt latency; do this only for diagnosis.

C. The clean A/B: switch allocator to jemalloc

If after enabling jemalloc your per-worker RssAnon drops meaningfully or grows slower, you’ve confirmed fragmentation/arena issues were significant.


4) Per-worker caches & COW breakers (Rails-level)

These show up as private dirty growth after fork, often correlated with first requests.

A. Compare memory right after boot vs after warm traffic

Right after startup (before much traffic), capture:

kubectl exec -n <ns> -it <pod> -- sh -lc \
'for p in $(pgrep -f "puma: cluster worker"); do echo "== $p =="; awk "/^(Pss|Shared_Clean|Private_Dirty):/ {print}" /proc/$p/smaps_rollup; done'

Then after warm traffic, do it again. Big increase in Private_Dirty = cache warming / post-fork writes.

B. Check if Rails memory store cache is enabled (common culprit)

If you’re using config.cache_store = :memory_store, every worker has its own in-process cache (private memory). Switching to Redis/Memcached moves that memory out of the worker.

C. Instrumentation / metrics

Your config enables PerformanceMetricsCollector per worker when PERFORMANCE_MODE is set. Try disabling and compare:

  • worker PSS
  • worker Private_Dirty
  • worker RssAnon

5) Best “deep dive” tool: jemalloc profiling (if you can use jemalloc)

If you can run jemalloc, you can enable heap profiling and get real allocation hotspots (native + Ruby’s malloc usage).

Typical pattern (depends on image & env):

  • set LD_PRELOAD to jemalloc
  • set MALLOC_CONF for profiling
  • use jeprof to analyze

This requires changes to the container and is more involved, but it’s the most direct way to answer “which native code is allocating?”


A minimal plan that usually identifies the culprit quickly

  1. Ruby heap size & growth: GC.stat (and optionally ObjectSpace.memsize_of_all)
  2. Does GC affect RSS? Force a GC once (carefully) and compare RssAnon
  3. Cache/COW effect: compare Private_Dirty right after boot vs after warm traffic
  4. Allocator A/B: enable jemalloc in a canary and compare per-worker RssAnon + Private_Dirty

If you tell me:

  • Ruby version (2.7/3.0/3.1/3.2/3.3?)
  • whether you can change the image / add packages
  • whether PERFORMANCE_MODE is on
  • what config.cache_store is

…I’ll pick the fastest path and give you exact commands tailored to your setup (including a safe canary strategy on k8s).


You do have some sharing (~100 MiB/worker), but it’s small compared to the ~330–380 MiB/worker private dirty anonymous memory you measured. That’s why it looks like “almost no sharing”.

Why sharing is low (even with preload_app!)

Copy-on-write (COW) sharing only helps for pages that are:

  1. created before fork, and
  2. never written to after fork.

In a Rails app, a lot of memory either:

  • is allocated after fork (request handling, caches, DB connections, JSON parsing, template rendering), or
  • starts shared but gets written to in each worker (breaking COW), turning shared pages into private dirty pages.

Given your /proc/<pid>/status shows RssAnon dominates, most of your footprint is “heap-like” memory that naturally tends to become private.

Common Rails-specific COW breakers:

  • Autoloading / Zeitwerk activity after fork (loading constants lazily)
  • Rails caches warming per worker (fragment cache, memory store, i18n, view/template caches)
  • Runtime memoization (class variables, constants mutated after boot)
  • Per-worker DB/Redis client setup (connection pools, prepared statements, SSL state)
  • Instrumentation/metrics/tracing (buffers, labels, exporter state)
  • malloc arenas + fragmentation (glibc creates multiple arenas for threaded processes, by default up to 8 per core on 64-bit; each worker ends up with lots of private heap pages)
  • GC behavior: Ruby heap pages get dirtied as objects are allocated/updated

So: low sharing is usually not “Puma is wrong” — it’s “your workers quickly diverge in heap and caches”.


How to improve sharing (most effective → least)

1) Make boot fully eager, so less code/data is loaded after fork

Goal: allocate as much as possible before fork, so it can be shared.

In production, ensure:

  • config.eager_load = true
  • Avoid lazy loading code paths on first request

You can force eager load in an initializer or before fork by touching parts of the app (careful—don’t do heavy one-off work that will then be mutated per worker).

Validate it’s working: after startup but before heavy traffic, check Shared_Clean is higher and Private_Dirty lower.


2) Reduce post-fork writes to big shared structures (COW friendliness)

Classic offenders:

  • big constant hashes that workers mutate
  • global caches written to on first request
  • “warming” caches in each worker

Prefer:

  • freeze large constants (.freeze)
  • avoid global mutable state
  • move caching to external stores (Redis/Memcached) when feasible

A quick code smell: ||= memoization on class-level or global objects that are touched after fork across many code paths.
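
To illustrate the difference, here is a minimal sketch with two hypothetical classes. It is not a guarantee of sharing (Ruby's GC can still dirty pages), just the shape of the pattern:

```ruby
# COW-unfriendly: the table is built lazily, so each forked worker
# allocates and writes its own copy on first use.
class LazyCountries
  def self.lookup
    @lookup ||= { "GB" => "United Kingdom", "FR" => "France" }
  end
end

# COW-friendlier: the table is built at load time (before fork when
# the app is preloaded) and frozen, so workers never write to it.
class EagerCountries
  LOOKUP = { "GB" => "United Kingdom", "FR" => "France" }.freeze

  def self.lookup
    LOOKUP
  end
end

puts EagerCountries.lookup.frozen?
```

The frozen version also fails loudly if some code path tries to mutate the table after boot, which is a handy way to find hidden post-fork writes.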


3) Fix fork hygiene for DB and other connections (correctness + sometimes helps COW)

With preload_app!, you generally want:

before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

on_worker_boot do
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
  # other reconnects...
end

This prevents inherited connections and reduces weird shared state. It won’t magically halve memory, but it removes a real source of post-fork churn.


4) Use jemalloc to reduce private heap fragmentation (often a big win)

Your footprint is overwhelmingly RssAnon, so allocator behavior matters.

glibc malloc tends to:

  • create multiple arenas (especially with threads)
  • fragment over time
  • keep memory in private dirty pages

jemalloc often:

  • fragments less
  • returns/stabilizes memory better
  • reduces “mysterious” per-worker anon growth

This is one of the best “no code change” levers for exactly the profile you have.


5) Reduce worker divergence by changing concurrency strategy

If you add workers, memory scales almost linearly because each worker has its own heap.

If you instead:

  • keep workers low
  • raise threads (within DB/pool limits)

…you can get more throughput without duplicating heaps as much.

This doesn’t increase sharing, but it reduces the need for sharing.
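
A back-of-the-envelope model shows why. This is a rough sketch that assumes shared pages are counted once and private pages once per worker, using the ~100 MiB shared and ~350 MiB private figures from your measurements; the function name is hypothetical:

```ruby
# Rough model: shared pages are counted once, private pages once
# per worker. All sizes are in MiB.
def estimated_total_mib(workers:, shared_mib:, private_mib_per_worker:)
  shared_mib + workers * private_mib_per_worker
end

[2, 4, 8].each do |workers|
  total = estimated_total_mib(workers: workers, shared_mib: 100, private_mib_per_worker: 350)
  puts "#{workers} workers -> ~#{total} MiB"
end
```

With private memory this dominant, the shared portion barely matters, so each extra worker costs close to its full private footprint.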


How to measure whether improvements worked

Use the exact numbers you already have:

For each worker, track:

  • Private_Dirty (should go down or grow slower)
  • Shared_Clean (should go up after better preload/eager load)
  • Pss (should go down per worker)

Run right after boot (no traffic), then after warm traffic:

for p in $(pgrep -f "puma: cluster worker"); do
  echo "== $p ==";
  awk '/^(Pss|Shared_Clean|Shared_Dirty|Private_Clean|Private_Dirty):/ {print}' /proc/$p/smaps_rollup
done

If sharing is improved, you’ll see higher Shared_Clean and lower Private_Dirty early, and slower growth over time.
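
If you save the two readings, the comparison is easy to script. A minimal Ruby sketch that parses smaps_rollup text and reports the growth per field; the helper names are hypothetical and the numbers are invented for illustration:

```ruby
# Fields to extract (values are in kB) from /proc/<pid>/smaps_rollup.
SMAPS_FIELDS = %w[Pss Shared_Clean Private_Dirty].freeze

def parse_smaps_rollup(text)
  text.each_line.each_with_object({}) do |line, result|
    name, value = line.split(":", 2)
    next unless SMAPS_FIELDS.include?(name)
    result[name] = value.to_i # "  123456 kB" -> 123456
  end
end

def growth_kb(before, after)
  SMAPS_FIELDS.to_h { |field| [field, after.fetch(field) - before.fetch(field)] }
end

# Hypothetical readings, for illustration only.
boot = parse_smaps_rollup("Pss:  150000 kB\nShared_Clean:  90000 kB\nPrivate_Dirty:  120000 kB\n")
warm = parse_smaps_rollup("Pss:  380000 kB\nShared_Clean:  90000 kB\nPrivate_Dirty:  350000 kB\n")
p growth_kb(boot, warm)
```

A large Private_Dirty delta with a flat Shared_Clean is the signature of post-fork writes described above.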


The most likely reasons in your case

Based on your numbers:

  • Private_Dirty dominates (hundreds of MiB)
  • RssAnon dominates
  • shared is only ~100 MiB

That combination strongly points to:

  1. heavy per-worker heap allocation after fork, plus
  2. allocator fragmentation, plus
  3. cache warming / lazy loads / instrumentation causing post-fork writes.

If you want, paste:

  • RAILS_ENV, RAILS_SERVE_STATIC_FILES, config/environments/production.rb values for eager_load and cache_classes
  • whether PERFORMANCE_MODE is enabled
  • Ruby version

…and I’ll tell you which 1–2 changes are most likely to move the needle for your setup (and how to validate with your PSS/shared/private metrics).
