why-was-the-widget-so-slow.md

Why Was The Widget So Slow?

The Restaurant Analogy

Imagine a restaurant with 4 waiters (that's Kestrel's thread pool).

The old code worked like this: a waiter takes your order, walks to the kitchen, and stands there staring at the chef until the food is ready. They don't take any other orders while waiting. If 4 tables order at the same time, every waiter is standing in the kitchen doing nothing, and the 5th table? They can't even get a glass of water. That's thread starvation.

The fix: the waiter drops off the order, goes back to serve other tables, and picks up the food when the kitchen rings the bell. That's await.

But It Worked Fine Before!

On ECS, the app ran on IIS, which is basically a restaurant with 400 waiters. You can afford to have most of them standing around in the kitchen, since there are always more available. Nobody noticed the inefficiency because brute force solved it.

Kestrel (what we run on k8s) is designed to run with 4 very efficient waiters. It assumes they won't stand around waiting. When someone writes `db.Reviews.ToList()` instead of `await db.Reviews.ToListAsync()`, they've just sent a waiter to stare at the kitchen.
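Here is a minimal sketch of the two waiter styles, with a fake kitchen instead of a real database. `FakeKitchen`, `CookAsync`, `Waiter`, and both `Serve` methods are names invented for this illustration, and `Task.Delay` stands in for the database round trip.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the database: "cooking" is 50 ms of pure waiting.
public static class FakeKitchen
{
    public static async Task<string> CookAsync(string order)
    {
        await Task.Delay(50); // simulated I/O; no thread is busy during this wait
        return $"{order} is ready";
    }
}

public static class Waiter
{
    // Old style: the calling thread stands in the kitchen until the food is done.
    public static string ServeBlocking(string order) =>
        FakeKitchen.CookAsync(order).GetAwaiter().GetResult();

    // New style: the calling thread returns to the pool and resumes when the bell rings.
    public static async Task<string> ServeAsync(string order) =>
        await FakeKitchen.CookAsync(order);
}
```

Both methods return the same string. The difference is what the thread does in the meantime: with `ServeBlocking` it is unavailable for the full 50 ms, with `ServeAsync` it can pick up other requests.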

The Greatest Hits

The Badges Endpoint had a foreach loop over products. For each product it would:

Query the database for reviews (waiter goes to kitchen, stares)
Query the database for questions (waiter goes to kitchen again, stares again)
Read from Redis (waiter walks to the bar, stares)
Write to Redis (walks back to bar, stares some more)

16 products × 4 blocking calls = 64 trips where a waiter stands around doing nothing, one after another, for a single request. With 4 waiters available, you do the math.

We replaced this with 2 bulk queries upfront, then filtered in memory. 9.85s → 3.39s.
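The shape of that change can be sketched against in-memory lists. Everything here is hypothetical (`FakeDb`, `Review`, `Question`, `Badges.Build`); the real code talks to a `DbContext`, where a `Contains` filter over the product IDs would typically be translated into a SQL `IN` clause. The fake database counts queries so the "2 queries total" claim is visible.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Review(int ProductId, int Rating);
public record Question(int ProductId, string Text);

// Hypothetical in-memory "database" that counts how many queries it receives.
public class FakeDb
{
    private readonly List<Review> _reviews;
    private readonly List<Question> _questions;
    public int QueryCount { get; private set; }

    public FakeDb(List<Review> reviews, List<Question> questions)
    {
        _reviews = reviews;
        _questions = questions;
    }

    // One bulk query for all requested products.
    public List<Review> ReviewsFor(IEnumerable<int> productIds)
    {
        QueryCount++;
        var ids = productIds.ToHashSet();
        return _reviews.Where(r => ids.Contains(r.ProductId)).ToList();
    }

    // One bulk query for all requested products.
    public List<Question> QuestionsFor(IEnumerable<int> productIds)
    {
        QueryCount++;
        var ids = productIds.ToHashSet();
        return _questions.Where(q => ids.Contains(q.ProductId)).ToList();
    }
}

public static class Badges
{
    // Two queries upfront, then grouping in memory; products with no rows get zero counts.
    public static Dictionary<int, (int Reviews, int Questions)> Build(
        FakeDb db, IReadOnlyList<int> productIds)
    {
        var reviews = db.ReviewsFor(productIds).ToLookup(r => r.ProductId);
        var questions = db.QuestionsFor(productIds).ToLookup(q => q.ProductId);

        return productIds.Distinct().ToDictionary(
            id => id,
            id => (reviews[id].Count(), questions[id].Count()));
    }
}
```

The query count stays at 2 whether the page shows 1 product or 16, which is the whole point of dropping the per-product loop.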

The Widget Endpoint had 7 separate SELECT COUNT(*) queries to get the star rating breakdown — one for each star rating plus recommend counts. That's 7 round trips to the kitchen when you could just ask "give me all the counts grouped by rating" once.
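A rough sketch of the "ask once, grouped by rating" idea, using LINQ over an in-memory sequence. `WidgetReview`, `RatingSummary`, and `RatingBreakdown` are made-up names, and storing the recommendation as a `bool` is an assumption about the schema. Against a real `IQueryable`, a `GroupBy` with `Count` like this should become a single `GROUP BY` query.

```csharp
using System.Collections.Generic;
using System.Linq;

// Assumed shape of a review row: a 1-5 star rating and a recommend flag.
public record WidgetReview(int Rating, bool Recommends);

public record RatingSummary(
    IReadOnlyDictionary<int, int> CountByStars,
    int Recommend,
    int DoNotRecommend);

public static class RatingBreakdown
{
    public static RatingSummary Summarize(IEnumerable<WidgetReview> reviews)
    {
        // One grouped pass instead of a separate COUNT per star rating and recommend flag.
        var groups = reviews
            .GroupBy(r => new { r.Rating, r.Recommends })
            .Select(g => new { g.Key.Rating, g.Key.Recommends, Count = g.Count() })
            .ToList();

        // Fill in all five star levels so missing ratings show up as zero.
        var byStars = Enumerable.Range(1, 5).ToDictionary(
            stars => stars,
            stars => groups.Where(g => g.Rating == stars).Sum(g => g.Count));

        return new RatingSummary(
            byStars,
            groups.Where(g => g.Recommends).Sum(g => g.Count),
            groups.Where(g => !g.Recommends).Sum(g => g.Count));
    }
}
```

The grouped result has at most 10 rows (5 ratings × 2 flag values), so deriving all 7 numbers from it in memory is effectively free.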

It also had the usual AsyncHelper.RunSync pattern everywhere — which is the coding equivalent of calling the kitchen on the phone, putting yourself on hold, and blocking the phone line so nobody else can use it. In an async method. Inside a restaurant that only has 4 phone lines.
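For anyone who hasn't met the pattern: the source of `AsyncHelper` isn't shown here, so the helper below is an assumption, modelled on a widely copied sync-over-async wrapper. `AsyncHelperSketch`, `RatingService`, and its methods are hypothetical. The sketch shows why the call is expensive: the caller blocks in `GetResult()` while a second pool thread is needed to start the work.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Assumed shape of the helper; the real AsyncHelper may differ.
public static class AsyncHelperSketch
{
    private static readonly TaskFactory Factory = new TaskFactory(
        CancellationToken.None,
        TaskCreationOptions.None,
        TaskContinuationOptions.None,
        TaskScheduler.Default);

    // Queues func to the thread pool, then blocks the calling thread until it completes.
    public static T RunSync<T>(Func<Task<T>> func) =>
        Factory.StartNew(func).Unwrap().GetAwaiter().GetResult();
}

public static class RatingService
{
    // Hypothetical async data access; Task.Delay simulates the network hop.
    public static async Task<int> LoadRatingAsync(int productId)
    {
        await Task.Delay(20);
        return productId % 5 + 1;
    }

    // Before: returns a Task, but a thread is still blocked for the whole call.
    public static Task<int> GetRatingBefore(int productId) =>
        Task.FromResult(AsyncHelperSketch.RunSync(() => LoadRatingAsync(productId)));

    // After: no thread is held while the call is in flight.
    public static async Task<int> GetRatingAfter(int productId) =>
        await LoadRatingAsync(productId);
}
```

With only a handful of pool threads, a few concurrent `RunSync` calls are enough to leave nothing free to run the queued work, which is how the requests pile up.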

The SQS Surprise

The review rendering loop calls SubmitReviewForCensorProcess for every uncensored review. This sends a synchronous AWS SQS message per review. Each one is a network round trip to AWS. For a shop with 16 reviews displayed, that's 16 sequential API calls to Amazon during a widget page load. In the render loop. For a read-only endpoint. Someone really wanted those reviews censored right now.

What We Fixed

| What | Before | After |
| --- | --- | --- |
| Badges N+1 queries | 2 DB queries per product | 2 queries total |
| Badges 16 products | 9.85s | 3.39s |
| Widget rating counts | 7 separate COUNTs | 1 GroupBy |
| Widget sku_only path | N+1 loop + in-memory queryable that crashes with async | Single IN query, proper IQueryable |
| Blocking calls | `AsyncHelper.RunSync` everywhere | `await` everywhere |
| Redis | Hardcoded ElastiCache, sync calls, legacy HGET fallback | Configurable via env vars, async, no HGET |
| Error handling | Silent `catch {}` swallowing everything | Report to Sentry |
| Debug noise | ~90 lines of `Stopwatch`/`Debugging`/`Console.WriteLine` | Deleted |

Active Requests: Before vs After

Before the async fix, pods accumulated 2000+ active requests each and health checks timed out. After:

8 pods handling 15% of production traffic
6-20 active requests per pod (was 2000+)
Zero restarts
Thread pool: ~10 threads per pod (healthy)

We bumped to 30% traffic and it's holding steady.

Still TODO

The per-review SQS censor call should be batched or moved out of the render path
GetProductGroupIds2 opens its own DbContext per call
More sync callers exist outside the widget/badges hot path
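One possible direction for the first TODO item, sketched under assumptions: collect the uncensored review IDs while rendering, then send them in batches afterwards. `ICensorQueue`, `SendBatchAsync`, `CensorSubmitter`, and `RecordingQueue` are hypothetical names; a real implementation would wrap the AWS SDK's batch send. The batch size of 10 reflects the SQS limit on messages per batch request, and `Chunk` requires .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical abstraction over the queue; not the AWS SDK's own interface.
public interface ICensorQueue
{
    Task SendBatchAsync(IReadOnlyList<int> reviewIds);
}

public static class CensorSubmitter
{
    // SQS accepts at most 10 messages per batch request.
    private const int MaxBatchSize = 10;

    // Sends the review IDs in as few round trips as possible and returns that number.
    public static async Task<int> SubmitAsync(ICensorQueue queue, IEnumerable<int> reviewIds)
    {
        var roundTrips = 0;
        foreach (var chunk in reviewIds.Distinct().Chunk(MaxBatchSize))
        {
            await queue.SendBatchAsync(chunk);
            roundTrips++;
        }
        return roundTrips;
    }
}

// In-memory fake for trying the sketch without AWS.
public class RecordingQueue : ICensorQueue
{
    public List<IReadOnlyList<int>> Batches { get; } = new();

    public Task SendBatchAsync(IReadOnlyList<int> reviewIds)
    {
        Batches.Add(reviewIds);
        return Task.CompletedTask;
    }
}
```

For the 16-review shop mentioned earlier, this would mean 2 round trips instead of 16, and they no longer have to happen inside the render loop.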

diegoeche/why-was-the-widget-so-slow.md
