@diegoeche
Created March 26, 2026 03:02

The Varnish Cache Incident: A Post-Mortem in Three Facepalms

March 26, 2026 — prod namespace


Background

Varnish was getting OOM-killed. We increased memory limits. The OOMs stopped, but the cache hit ratio looked wrong. We went to investigate and found the dashboard itself was lying to us. What followed was a cascade of "wait, that's not right either" moments.

Facepalm #1: The Dashboard

The cache hit ratio query used max by (job) across 3 varnish pods — picking the highest hit rate from one pod and the highest miss rate from a different pod, then dividing. Meaningless math, confidently displayed on a gauge.

Fixed to sum(). Numbers changed. Numbers still didn't add up.
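To see why the max-based number was meaningless, here is a small sketch with made-up per-pod rates. The pod names and values are hypothetical, and it assumes the gauge computed hit / (hit + miss); the real query may have divided differently.

```python
# Hypothetical per-pod rates in req/s. These are NOT measurements from the incident.
PODS = {
    "varnish-0": {"hit": 300.0, "miss": 20.0},
    "varnish-1": {"hit": 50.0, "miss": 200.0},
    "varnish-2": {"hit": 100.0, "miss": 30.0},
}


def ratio_from_max(pods):
    """Mimic the old query: highest hit rate and highest miss rate,
    which may come from two different pods."""
    hit = max(p["hit"] for p in pods.values())
    miss = max(p["miss"] for p in pods.values())
    return hit / (hit + miss)


def ratio_from_sum(pods):
    """Mimic the fixed query: add up all pods before dividing."""
    hit = sum(p["hit"] for p in pods.values())
    miss = sum(p["miss"] for p in pods.values())
    return hit / (hit + miss)


if __name__ == "__main__":
    print(f"max-based: {ratio_from_max(PODS):.3f}")
    print(f"sum-based: {ratio_from_sum(PODS):.3f}")
```

With these numbers the max-based ratio pairs varnish-0's hits with varnish-1's misses, a combination that no single pod (and not the fleet as a whole) ever served.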

Facepalm #2: The Missing 900 req/s

Widget-proxy was reporting 1.6K req/s. Varnish hits + misses totaled ~700 req/s. Roughly 900 req/s, more than half our traffic, was unaccounted for.

Turns out hit + miss isn't the whole picture. Varnish also has pass — requests that skip the cache entirely. 873 req/s of pass-through traffic, invisible in the dashboard because nobody thought to include it.
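The arithmetic, as a quick sanity check. The inputs are the rounded figures quoted above (1.6K, ~700, 873); the function name is just for illustration.

```python
def traffic_gap(client_rate, hit_plus_miss_rate, pass_rate=0.0):
    """Return the req/s that the included counters do not explain."""
    return client_rate - (hit_plus_miss_rate + pass_rate)


# Rounded figures from the write-up, in req/s.
gap_without_pass = traffic_gap(1600.0, 700.0)
gap_with_pass = traffic_gap(1600.0, 700.0, pass_rate=873.0)

if __name__ == "__main__":
    print(f"unexplained without pass: {gap_without_pass:.0f} req/s")
    print(f"unexplained with pass:    {gap_with_pass:.0f} req/s")
```

Once pass is counted, the leftover is small enough to be rounding in the quoted figures.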

The culprit: a URL regex in the VCL that was supposed to cache POST /api/widget/badges:

req.url ~ "^/api/svc/backend-widget-proxy/widget/badges"

The actual URL varnish sees: /api/widget/badges. The regex never matched. Every badges POST fell through to return (pass). This had been broken since the VCL was written. We just couldn't see it because the dashboard was also broken.
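A minimal sketch of the mismatch, using Python's re module as a stand-in for the PCRE matching that VCL's ~ operator performs. The corrected pattern is an assumption based on the URL varnish actually sees; the real VCL line may differ.

```python
import re

# Pattern from the original VCL.
OLD_PATTERN = r"^/api/svc/backend-widget-proxy/widget/badges"
# Assumed corrected pattern, inferred from the URL varnish receives.
NEW_PATTERN = r"^/api/widget/badges"

SEEN_URL = "/api/widget/badges"


def vcl_match(pattern, url):
    """Approximate VCL's `~`: an unanchored search unless the pattern anchors itself."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    print("old pattern matches:", vcl_match(OLD_PATTERN, SEEN_URL))
    print("new pattern matches:", vcl_match(NEW_PATTERN, SEEN_URL))
```

A check like this against a URL copied from varnishlog would have caught the problem before the VCL shipped.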

One-line fix. Feeling smart. Push. Check metrics. Pass rate unchanged.

Turns out we forgot to restart varnish — the configmap updated but varnish reads VCL at startup. We'd been congratulating ourselves while the old VCL was still running. Restart. Check again.

Facepalm #3: 405 Method Not Allowed (Cached!)

All badges requests now returning 405. Varnish has a fun default behavior: when a POST goes through hash and misses, it converts the backend fetch to GET. Widget-proxy doesn't accept GET on that endpoint. Varnish then caches the 405 and serves it on subsequent hits.

We needed vcl_backend_fetch to restore the POST method. Added it, waited for Flux, restarted again, and finally:

5788x  200  ✓
  98x  401  (bad API keys)
  21x  404  (unknown shops)
   0x  405  gone

The Actual Fix for the OOMs

With badges properly cached, pass-through traffic dropped dramatically. Those ~900 req/s of pass requests weren't just uncached — each one held a varnish worker thread open for the full duration of the backend round-trip. Badges requests come from slow client connections all over the world, so each pass tied up a thread waiting on a sluggish socket. Multiply that by 900/s and you get thread exhaustion, backed-up connections, and ballooning memory.
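A back-of-the-envelope version of that multiplication, using Little's law (average threads busy = arrival rate x time each request holds a thread). The 900 req/s is from above; the hold times are hypothetical, since we did not record per-request durations here.

```python
def threads_in_use(arrival_rate_per_s, seconds_held):
    """Little's law: average concurrency = arrival rate * time in system."""
    return arrival_rate_per_s * seconds_held


PASS_RATE = 900.0  # req/s of pass traffic, from the incident

# Hypothetical hold times per pass request, in seconds.
HOLD_TIMES = [0.5, 2.0]

if __name__ == "__main__":
    for held in HOLD_TIMES:
        busy = threads_in_use(PASS_RATE, held)
        print(f"{held:.1f}s per request -> ~{busy:.0f} worker threads busy on average")
```

The point is only the shape of the relationship: at a fixed request rate, thread usage scales linearly with how long each pass request holds its thread.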

After the fix: worker threads freed up, CPU usage dropped, memory usage dropped. The OOMs weren't caused by the cache being too big for its limits — they were caused by a regex that didn't match.

What We Fixed

  1. Dashboard: per-service memory/CPU charts with limits, fixed hit ratio to use hit/client_req, added pass/synth to the breakdown, made it stacked
  2. VCL URL: /api/svc/backend-widget-proxy/widget/badges → /api/widget/badges
  3. VCL POST preservation: added vcl_backend_fetch to stop varnish from "helpfully" converting POST to GET

Metric                         Before                After
Badges caching                 Broken since day one  Working
Pass-through traffic           ~873 req/s            Dropping
Worker threads / CPU / Memory  OOM territory         Chill
Dashboard accuracy             Creative fiction      Mostly honest