Production was down ~2 hours (7:30–9:30 AM ET). No data loss.
Cause: ES license expired → web workers hung → thundering herd on recovery → exposed latent N+1-style COUNT queries in the asset show endpoint that couldn't survive 130 concurrent connections. DB pinned at 100% CPU. Compounded by ~100K queued Sidekiq jobs and a Kintzing API client hitting us at 140 req/s.
Fix: Shipped 5 targeted fixes during the incident, eliminating 6+ expensive COUNT queries per asset show request (upload progress counts, contribution counts, requirement checks). These values now come from cached counter columns, or the queries are skipped entirely for completed uploads and API traffic.