The Varnish Cache Incident: A Post-Mortem in Three Facepalms

March 26, 2026 — prod namespace

Background

Varnish was getting OOM-killed. We increased memory limits. The OOMs stopped, but the cache hit ratio looked wrong. We went to investigate and found the dashboard itself was lying to us. What followed was a cascade of "wait, that's not right either" moments.

Facepalm #1: The Dashboard

The cache hit ratio query used max by (job) across 3 varnish pods — picking the highest hit rate from one pod and the highest miss rate from a different pod, then dividing. Meaningless math, confidently displayed on a gauge.
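A toy illustration of why the max-based gauge was meaningless. The per-pod rates below are made up (the real ones are not in this report), and the ratio is assumed to be hit / (hit + miss); the actual dashboard query may have divided differently.

```python
# Hypothetical per-pod rates in req/s; not taken from the incident data.
PODS = {
    "varnish-0": {"hit": 100.0, "miss": 5.0},
    "varnish-1": {"hit": 100.0, "miss": 5.0},
    "varnish-2": {"hit": 100.0, "miss": 100.0},
}


def ratio_with_max(pods):
    """Mimic max by (job): hit and miss are each maxed across pods independently."""
    hit = max(p["hit"] for p in pods.values())
    miss = max(p["miss"] for p in pods.values())
    return hit / (hit + miss)


def ratio_with_sum(pods):
    """Mimic sum by (job): both terms are totals over the same set of pods."""
    hit = sum(p["hit"] for p in pods.values())
    miss = sum(p["miss"] for p in pods.values())
    return hit / (hit + miss)


if __name__ == "__main__":
    print(f"max-based ratio: {ratio_with_max(PODS):.2f}")
    print(f"sum-based ratio: {ratio_with_sum(PODS):.2f}")
```

With these invented numbers the max version reports 0.50 while the fleet-wide ratio is about 0.73. The max version pairs one pod's hits with another pod's misses, so the result describes no pod and no fleet.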

Fixed to sum(). Numbers changed. Numbers still didn't add up.

Facepalm #2: The Missing 900 req/s

Widget-proxy was reporting 1.6K req/s. Varnish hits + misses totaled ~700 req/s. Roughly 900 req/s, more than half our traffic, was unaccounted for.

Turns out hit + miss isn't the whole picture. Varnish also has pass — requests that skip the cache entirely. 873 req/s of pass-through traffic, invisible in the dashboard because nobody thought to include it.
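The accounting can be checked with the rounded figures quoted above. This is only a sanity-check sketch: the function name is ours, and the small remainder is presumably rounding plus other categories such as synth.

```python
def unaccounted(client_req, *counted):
    """Return the request rate not explained by the counted categories."""
    return client_req - sum(counted)


CLIENT_REQ = 1600.0    # widget-proxy's "1.6K req/s", rounded
HIT_PLUS_MISS = 700.0  # "~700 req/s" from the dashboard
PASS = 873.0           # pass-through rate once we looked for it

gap_without_pass = unaccounted(CLIENT_REQ, HIT_PLUS_MISS)
gap_with_pass = unaccounted(CLIENT_REQ, HIT_PLUS_MISS, PASS)

if __name__ == "__main__":
    print(gap_without_pass, gap_with_pass)
```

Counting pass shrinks the gap from 900 req/s to 27 req/s, which is within the rounding of the inputs.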

The culprit: a URL regex in the VCL that was supposed to cache POST /api/widget/badges:

req.url ~ "^/api/svc/backend-widget-proxy/widget/badges"

The actual URL Varnish sees: /api/widget/badges. The regex never matched, so every badges POST fell through to return (pass). This had been broken since the VCL was written. We just couldn't see it because the dashboard was also broken.
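The mismatch is easy to reproduce outside Varnish. VCL's ~ operator does an unanchored PCRE search, and for a plain anchored prefix like this one Python's re.search behaves the same way. The corrected pattern below is assumed from the fix described later in this report; the exact expression in the VCL may differ.

```python
import re

OLD_PATTERN = r"^/api/svc/backend-widget-proxy/widget/badges"
NEW_PATTERN = r"^/api/widget/badges"  # assumed form of the corrected regex

# The path Varnish actually receives, per the report.
SEEN_URL = "/api/widget/badges"


def matches(pattern, url):
    """Approximate VCL's `req.url ~ pattern` with an unanchored regex search."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    print("old pattern matches:", matches(OLD_PATTERN, SEEN_URL))
    print("new pattern matches:", matches(NEW_PATTERN, SEEN_URL))
```

A check like this against a sample of real request paths would have caught the bug before it shipped.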

One-line fix. Feeling smart. Push. Check metrics. Pass rate unchanged.

Turns out we forgot to restart Varnish: the ConfigMap had updated, but in our deployment Varnish only loads the VCL at startup. We'd been congratulating ourselves while the old VCL was still running. Restart. Check again.

Facepalm #3: 405 Method Not Allowed (Cached!)

All badges requests now returning 405. Varnish has a fun default behavior: when a POST goes through hash and misses, it converts the backend fetch to GET. Widget-proxy doesn't accept GET on that endpoint. Varnish then caches the 405 and serves it on subsequent hits.
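A toy model of that failure mode, not Varnish itself. The backend function is a hypothetical stand-in for widget-proxy, the cache is keyed by URL only (a real setup would also need to account for the request body), and every response is treated as cacheable, as the 405 evidently was here.

```python
def backend(method, url):
    """Hypothetical stand-in for widget-proxy: the endpoint accepts POST only."""
    return 200 if method == "POST" else 405


class ToyCache:
    """Minimal lookup/fetch/store loop that mimics the POST-to-GET rewrite."""

    def __init__(self, restore_method):
        self.restore_method = restore_method
        self.store = {}

    def handle(self, method, url):
        if url in self.store:
            # Hit: serve whatever status was stored, good or bad.
            return self.store[url]
        # Miss: the backend fetch goes out as GET unless the method is restored.
        fetch_method = method if self.restore_method else "GET"
        status = backend(fetch_method, url)
        self.store[url] = status
        return status


if __name__ == "__main__":
    broken = ToyCache(restore_method=False)
    fixed = ToyCache(restore_method=True)
    for name, cache in (("broken", broken), ("fixed", fixed)):
        first = cache.handle("POST", "/api/widget/badges")
        second = cache.handle("POST", "/api/widget/badges")
        print(name, first, second)
```

In the broken variant the first request stores the 405 and the second one is served from cache without touching the backend. That is why the errors persisted instead of being a one-off.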

We needed vcl_backend_fetch to restore the POST method. Added it, waited for Flux, restarted again, and finally:

5788x  200  ✓
  98x  401  (bad API keys)
  21x  404  (unknown shops)
   0x  405  gone

The Actual Fix for the OOMs

With badges properly cached, pass-through traffic dropped dramatically. Those ~900 req/s of pass requests weren't just uncached — each one held a varnish worker thread open for the full duration of the backend round-trip. Badges requests come from slow client connections all over the world, so each pass tied up a thread waiting on a sluggish socket. Multiply that by 900/s and you get thread exhaustion, backed-up connections, and ballooning memory.
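The multiplication can be made concrete with Little's law: average threads in use is roughly arrival rate times the time each request holds a thread. We did not measure the hold time, so the 0.5 s below is a purely hypothetical figure chosen to show the scale.

```python
def busy_threads(rate_per_s, hold_seconds):
    """Little's law estimate: average number of threads occupied at once."""
    return rate_per_s * hold_seconds


PASS_RATE = 900.0           # approximate pass-through rate from this report
HYPOTHETICAL_HOLD_S = 0.5   # assumed average hold time per pass; not measured

if __name__ == "__main__":
    print(busy_threads(PASS_RATE, HYPOTHETICAL_HOLD_S))
```

Under that assumption about 450 threads would be tied up by pass traffic alone, and the figure scales linearly with however slow the connections really were.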

After the fix: worker threads freed up, CPU usage dropped, memory usage dropped. The OOMs weren't caused by the cache being too big for its limits — they were caused by a regex that didn't match.

What We Fixed

Dashboard: added per-service memory/CPU charts with limits, changed the hit ratio to hit/client_req, added pass and synth to the request breakdown, and made the breakdown chart stacked
VCL URL: /api/svc/backend-widget-proxy/widget/badges → /api/widget/badges
VCL POST preservation: added vcl_backend_fetch to stop varnish from "helpfully" converting POST to GET
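A short sketch of why the hit ratio denominator matters. The hit/miss split below is hypothetical (the report only gives their sum, ~700 req/s); only the pass figure comes from the incident. Synth and other categories are left out for simplicity.

```python
def hit_ratio_cache_only(hit, miss):
    """Old-style ratio: ignores requests that never reach the cache lookup."""
    return hit / (hit + miss)


def hit_ratio_all_requests(hit, client_req):
    """Dashboard fix: hits as a share of every client request."""
    return hit / client_req


HIT = 600.0    # hypothetical split of the ~700 req/s
MISS = 100.0   # hypothetical split of the ~700 req/s
PASS = 873.0   # from the report
CLIENT_REQ = HIT + MISS + PASS

if __name__ == "__main__":
    print(f"hit / (hit + miss): {hit_ratio_cache_only(HIT, MISS):.2f}")
    print(f"hit / client_req:   {hit_ratio_all_requests(HIT, CLIENT_REQ):.2f}")
```

With this assumed split the cache-only ratio reads about 0.86 while only about 0.38 of all requests were actually served from cache. That gap is exactly what the pass traffic was hiding.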

Metric                         Before                After
Badges caching                 Broken since day one  Working
Pass-through traffic           ~873 req/s            Dropping
Worker threads / CPU / Memory  OOM territory         Chill
Dashboard accuracy             Creative fiction      Mostly honest

diegoeche/battle-report.md
