Post-mortem: 2026-06-07 → 2026-06-11 — Helius saturation, Postgres flap, and the safety nets that weren't

Authors: Stephen Margheim, with assistance from Claude

Status: Resolved; resilience follow-ups in flight

TL;DR

Helius caps each webhook at 95k addresses. We knew this would eventually bind, and we'd built an auto-scaling provisioner ahead of time. When the cap was hit on 2026-06-07, the provisioner broke in a way that nothing alerted on. From that single silent failure, a four-day cascade unfolded: roughly 86k new addresses were never registered, so deposits to them triggered no webhooks; the recovery exhausted our Helius credits; credit exhaustion paused webhooks for everyone, not just the backlog; a separate silent failure inside WalletCatchUpJob treated transient 429s as "permanent," which made the recovery lossy; and underneath all of it, Postgres crashed repeatedly under load.

| Metric | Value |
| --- | --- |
| Time auto-scale was broken | ~64 hours (06-07 16:54 → 06-10 09:00 UTC) |
| Solana addresses left unregistered with Helius at peak | 86,697 |
| Known missed customer deposits during the window | ≥7 (recovered) |
| Transfers permanently failed by 429s during recovery | 8 (recovered) |
| Honeybadger faults that fired before customers noticed | 0 |
| Funds lost | $0 |

The story

A cap we expected, a guard we'd built

Helius limits each webhook to 95,000 addresses. As the project's first webhook filled up, WalletRegistrationJob was supposed to call Webhook::Helius.provision!, get a fresh webhook from Helius, and continue batching new addresses onto it. The code existed. It had been written months earlier specifically to anticipate this moment.

It had never run in production. The original webhook had been created manually, seeded with addresses, so the provisioner's empty-address path had no end-to-end coverage:

def self.provision!(chain:)
  response = Clients::Helius.new.create_webhook   # addresses: [] by default
  create!(chain: chain, external_id: response["webhookID"], address_count: 0)
end

Helius rejects empty-address create requests with 400 Bad Request: "At least one account address is required". WalletRegistrationJob rescued the resulting Clients::ClientError, wrote Rails.logger.error("permanent failure (400): ..."), and returned. No Honeybadger.notify. No alert. No monitor.
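The two changes this failure calls for (seed addresses required up front, and a notification instead of a bare log line) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from PR #137: `FakeHeliusClient` and `Notifier` are hypothetical stand-ins for `Clients::Helius` and `Honeybadger`, and the return value is a plain hash rather than an ActiveRecord row.

```ruby
# Hypothetical stand-ins so the sketch runs without Rails, Helius, or Honeybadger.
class ClientError < StandardError; end

class FakeHeliusClient
  # Mirrors the behavior described above: an empty address list is rejected with a 400.
  def create_webhook(addresses: [])
    if addresses.empty?
      raise ClientError, "400 Bad Request: At least one account address is required"
    end
    { "webhookID" => "wh_#{addresses.size}" }
  end
end

module Notifier
  # Stand-in for an error tracker; collects notices so they can be inspected.
  def self.notices
    @notices ||= []
  end

  def self.notify(error, context: {})
    notices << [error, context]
  end
end

# Require at least one seed address, so the failure is local and explicit
# instead of a 400 discovered in production.
def provision!(client:, chain:, seed_addresses:)
  raise ArgumentError, "seed_addresses must not be empty" if seed_addresses.empty?

  response = client.create_webhook(addresses: seed_addresses)
  { chain: chain, external_id: response["webhookID"], address_count: seed_addresses.size }
end

# A job that cannot make progress reports the error rather than only logging it.
def register(client:, chain:, addresses:)
  provision!(client: client, chain: chain, seed_addresses: addresses)
rescue ClientError, ArgumentError => e
  Notifier.notify(e, context: { chain: chain, pending: addresses.size })
  nil
end
```

The point of the sketch is the `rescue` branch: whatever the job decides to do with a permanent failure, a human-visible notice is emitted before it returns.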

64 hours of silence

2026-06-07 16:54 UTC: the existing webhook ticked over to 95,000 addresses. From that moment, every Solana address derived for every new wallet was queued for registration, hit the same 400, and was logged-and-forgotten. Hundreds of times per hour. Customers depositing to those addresses received no webhook — no TransferSwapJob, no settlement — and as far as our dashboards were concerned, the system was perfectly healthy.

At the same time, wallet creation traffic was running ~700× our baseline: hourly creation peaked at 7,641 on 2026-06-08 14:00 UTC, against a normal 4–24 per hour. Two things compounded on Postgres: the write volume from wallet creation itself, and the broken registration job pulling all webhook_id IS NULL rows into memory on every invocation (no BATCH_SIZE cap in the pre-PR-#137 code).

06-07 22:00 UTC: the first cluster of ActiveRecord::ConnectionNotEstablished errors appeared. Render's basic_1gb Postgres plan has 1 GB RAM shared between PG, OS, and buffers. Memory climbed from 288 MB at 16:00 to 484 MB by 22:15, then snapped to 234 MB at 22:20 — a process restart, not allocation churn. The banner that followed (the database system is in recovery mode) is emitted only when PG performs WAL replay after a crash. Graceful shutdowns skip recovery. This pattern repeated through 06-08 and 06-09 as load oscillated. Each restart dropped the TCP listener for seconds to several minutes; the web tier's DatabaseConnectionRetry middleware absorbed the short ones, but the longer recoveries surfaced as 5xx.

(I initially attributed these PG events to Render maintenance restarts. Stephen pushed back. The 90-day baseline showed zero such errors prior; the metrics showed a memory-and-CPU profile of a crash; the messages were recovery-mode banners. It was crashes under load, not maintenance.)

Discovery, and the mass-catchup that lit a fire

2026-06-10 ~07:00 UTC: customer-support reports started arriving — "I sent USDT but never received USDC." Stephen manually ran WalletCatchUpJob.perform_now for the reported wallets while the investigation started.

~08:30: a grep of Render logs found the smoking gun: WalletRegistrationJob permanent failure (400) firing every ~30 seconds since 06-07 16:54.

~09:00: SQL confirmed the blast radius — 86,697 Solana addresses had webhook_id IS NULL, all created after the cap was hit.

~09:30: PR #137 opened, with three changes: require seed addresses in provision!, call Honeybadger.notify on permanent failure in both registration jobs (no more silent swallowing), and add a recurring UnregisteredAddressesCheckJob that would have caught the original incident within about 11 minutes.

~10:00: the unblock script ran in console. A single seeded address provisioned a new Helius webhook, then WalletRegistrationJob.perform_later fanned out across the 86k backlog. Solid Queue depth peaked near 64k catch-up jobs.

We didn't yet know that the recovery itself was about to break the system.

Credit exhaustion: the recovery cost more than the bug

WalletCatchUpJob calls Clients::Helius#get_address_transactions, which goes to Helius's Enhanced Transactions endpoint. Per Helius's published costs, Enhanced Transactions is 100 credits per call, not 1 as we'd assumed. Our Developer plan includes 10M credits/month. The 86,697 catch-up calls would cost ~8.67M credits (86,697 × 100), or 87% of the monthly allotment, in one drain.

A subtler point that surfaced afterward: the bug didn't change the total credits we'd spend in this billing cycle, only the timing. Without the bug, the same 86k wallets would each have been registered + caught up in real time over the 64-hour window — same 100 credits/call, same 8.67M total, just spread out. We were structurally under-provisioned for our wallet-creation rate (~1,355/hour during the bug window projects to ~98M monthly catch-up credits against the 10M plan); the bug just decided which afternoon the cap hit.
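The arithmetic behind these figures is short enough to check directly. The sketch below uses only numbers quoted in this post-mortem; the 30-day month used for the projection is an assumption.

```ruby
# Figures taken from this post-mortem; the 30-day month is an assumption.
CREDITS_PER_ENHANCED_CALL = 100
MONTHLY_ALLOTMENT         = 10_000_000
BACKLOG_ADDRESSES         = 86_697
BUG_WINDOW_HOURS          = 64

# Cost of draining the backlog in one go.
backlog_credits    = BACKLOG_ADDRESSES * CREDITS_PER_ENHANCED_CALL
share_of_allotment = backlog_credits.to_f / MONTHLY_ALLOTMENT

# Steady-state projection if the bug-window creation rate continued all month.
wallets_per_hour = BACKLOG_ADDRESSES.to_f / BUG_WINDOW_HOURS
monthly_credits  = wallets_per_hour * 24 * 30 * CREDITS_PER_ENHANCED_CALL

puts "backlog: #{backlog_credits} credits (#{(share_of_allotment * 100).round}% of allotment)"
puts "rate: #{wallets_per_hour.round}/hour, projected #{(monthly_credits / 1_000_000).round(1)}M credits/month"
# prints:
#   backlog: 8669700 credits (87% of allotment)
#   rate: 1355/hour, projected 97.5M credits/month
```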

06-10 11:14–11:18 UTC: Helius started returning 429: max usage reached. Both the REST API and the Solana RPC behind the Solace gem throttled. Webhook deliveries themselves are also credit-billable on Helius, so as credits exhausted, inbound webhooks paused too — not just our outbound calls. This window is responsible for at least one separately-investigated missed deposit (signature uVKdn4c2...) where the deposit address was registered correctly but no webhook ever fired.

A separate failure flared during this window: Solace::Errors::HTTPError was not in Transfers::Swap::RETRYABLE_ERRORS. A 429 — definitionally transient — dropped 8 Transfers into terminal failed state instead of swap_retrying.
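The failure mode here is an allow-list that defaults to "terminal". A minimal sketch of that classification, assuming a simplified state machine with only the two states named above; `TimeoutLikeError` and `next_state_for` are hypothetical, and the `Solace::Errors::HTTPError` class is redefined locally so the example runs without the gem.

```ruby
# Local stand-in for the gem's error class so the sketch is self-contained.
module Solace
  module Errors
    class HTTPError < StandardError; end
  end
end

# Hypothetical example of an error that was already considered retryable.
class TimeoutLikeError < StandardError; end

# Errors listed here lead to a retry; anything else is terminal.
# Before the fix, Solace::Errors::HTTPError was absent from this list.
RETRYABLE_ERRORS = [TimeoutLikeError, Solace::Errors::HTTPError].freeze

def next_state_for(error)
  RETRYABLE_ERRORS.any? { |klass| error.is_a?(klass) } ? :swap_retrying : :failed
end
```

With the class in the list, `next_state_for(Solace::Errors::HTTPError.new("429"))` yields `:swap_retrying`; any unlisted error still falls through to `:failed`.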

Helius dashboard alerts fired: 75% of autoscaling, then 100% of monthly. We manually bumped credits and the system stabilized through the afternoon.

The catchup's own silent failure

There was one more layer to peel back, and we didn't find it until 2026-06-11, when another customer reported a missed deposit: tx signature 3bwgGR..., 1.016918 USDC, deposited 2026-06-09 18:43 UTC.

The catchup we'd run on 06-10 should have caught this. The SQL trail showed that the job had in fact run: WalletCatchUpJob for this wallet was enqueued at 06-10 10:22:25 UTC, sat 31 min behind the 64k backlog, and ran at 10:53:56. It completed without an exception. But no Transfer was ever created.

The reason was in the prod version of the job:

def perform(wallet)
  # ...
  transactions = helius.get_address_transactions(address.address)
  Transfers::ProcessTransactions.call(transactions:)
rescue Clients::ClientError
  # 4xx errors are permanent — discard silently
end

The job ran during the 429 spike. Honeybadger Insights shows 17 ClientError 429: max usage reached / rate limited events in the five-minute window covering 10:50–10:55 UTC. The catchup's get_address_transactions hit one of them. Clients::Base raises ClientError for the entire 4xx band; the rescue treated it as permanent and threw the deposit away.

429 is not permanent. The comment was wrong. Every catchup that happened to fire during the credit-exhaustion window silently lost its deposit the same way. We don't know yet how many other deposits this affected; the manual recoveries we've done so far have been customer-reported.
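One way to keep a 429 from being discarded is to give it its own error class and let only genuinely permanent 4xx responses reach the silent branch. The sketch below is an illustration under assumptions, not the pending fix: `RateLimitedError` and `FlakyHelius` are hypothetical, and a real job would more likely use Active Job's `retry_on` with backoff than an inline `retry`.

```ruby
# Hypothetical error hierarchy: 429 gets its own class instead of sharing ClientError.
class ClientError      < StandardError; end # permanent 4xx
class RateLimitedError < StandardError; end # 429, transient

# Stand-in client that returns 429 a fixed number of times, then succeeds.
class FlakyHelius
  def initialize(rate_limited_calls:)
    @remaining = rate_limited_calls
  end

  def get_address_transactions(_address)
    if @remaining > 0
      @remaining -= 1
      raise RateLimitedError, "429: max usage reached"
    end
    [{ "signature" => "sig-1" }]
  end
end

# Retries transient errors a bounded number of times, then re-raises so the
# failure stays visible. Only permanent 4xx responses are treated as "nothing to do".
def catch_up(helius, address, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    helius.get_address_transactions(address)
  rescue RateLimitedError
    retry if attempts < max_attempts
    raise
  rescue ClientError
    []
  end
end
```

The important property is that exhausting the retries raises instead of returning an empty result, so a lost deposit shows up as a failed job rather than a successful one.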

This is the most important finding from the whole incident. The recovery mechanism we were leaning on was lossy at exactly the moment it was needed.

Stabilization

06-10 ~12:00–14:00 UTC: three more PRs went out — SLO-named queues so that swap-the-user's-deposit can't sit behind 64k catch-up calls again (#138), a read-only admin SQL console so the next incident doesn't need MCP round-trips (#139), and the Solace::Errors::HTTPError-in-RETRYABLE_ERRORS fix (#140). All known affected customer transfers were recovered. No funds were lost; only delayed.
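The strict-priority idea behind the SLO-named queues can be sketched in a few lines. This is a toy model, not the Solid Queue configuration from PR #138: the fully expanded queue names and the `next_job` helper are assumptions made for illustration.

```ruby
# Queue names ordered from tightest to loosest SLO (expanded names are assumed).
QUEUES_BY_PRIORITY = %w[within_1_minute within_5_minutes within_1_hour within_1_day].freeze

# Takes the next job from the highest-priority non-empty queue.
# `queues` maps a queue name to an array of pending jobs (oldest first).
def next_job(queues)
  QUEUES_BY_PRIORITY.each do |name|
    job = queues.fetch(name, []).shift
    return [name, job] if job
  end
  nil
end
```

Under this model a swap enqueued on `within_1_minute` is picked up before any catch-up job waiting on `within_1_day`, no matter how deep that backlog is.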

06-11 07:27 UTC: the most recent missed deposit was caught up manually and the 429-swallow root cause was confirmed. The fix for that one is still in flight.

What it cost

| Surface | Impact |
| --- | --- |
| Customer | ≥7 deposits missed during the unregistered-address window; 8 transfers marked `failed` during credit exhaustion; 6 transfers stuck `deposit_detected` during queue starvation. All recovered. No funds lost. |
| Operational | ~5 hours active investigation on 06-10, plus follow-up on 06-11. 64 hours of accumulated silent breakage before that. |
| Trust | Customers reported the problem to support before any internal signal showed that anything was wrong. |

Resilience work shipped

| PR | Status | Effect |
| --- | --- | --- |
| #136 | merged | Widened `DatabaseConnectionRetry` middleware backoffs from ~2.5 s to ~16 s; raised PG `connect_timeout` 2 s → 5 s. Web absorbs short PG recoveries before propagating 5xx. Symptom treatment, not cure. |
| #137 | merged | Fixed `provision!` to require seed addresses; `Honeybadger.notify` on permanent failure in both registration jobs; `UnregisteredAddressesCheckJob` (every 5 min, pages on `webhook_id IS NULL` older than 10 min), which would have caught the original incident within minutes; `BATCH_SIZE = 5000` cap on registration job to bound memory; partial index `chain_addresses(chain_id, created_at) WHERE webhook_id IS NULL`. |
| #138 | open | SLO-named queues (`within_1_minute / 5_minutes / 1_hour / 1_day`) with strict priority worker config; `job.completed{queue, job_class, slo_met}` counter to Honeybadger Insights so SLO compliance is observable; `retry_on` PG-connection errors in `ApplicationJob` so worker jobs survive future PG restarts. |
| #139 | open | `POST /admin/sql`: HTTP-Basic-authed read-only SQL endpoint backed by `SET TRANSACTION READ ONLY` + `DECLARE CURSOR`. Cuts the round-trip during incidents. |
| #140 | open | Adds `Solace::Errors::HTTPError` and `CircuitBreaker::CircuitOpenError` to `RETRYABLE_ERRORS` in `Transfers::Swap` and `Transfers::Distribute`. Future 429s won't terminally-fail transfers. |
| #141 | open | Cheap-first `WalletCatchUpJob`: a 1-credit `getSignaturesForAddress` check before the 100-credit Enhanced Transactions parse. ~50× credit reduction at our current scale. Adds `WalletCatchUpSweepJob` recurring safety net (filter on `webhook_id IS NOT NULL`, cheap-check, escalate only when activity exists). |
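The cheap-first pattern from PR #141 can be illustrated with a client that counts the credits it spends. This is a sketch under assumptions: `CountingClient` and its method names are hypothetical, the per-call costs come from the appendix, and the "1 active address in 100" ratio in the usage note is illustrative rather than measured.

```ruby
# Per-call costs as listed in the appendix.
SIGNATURE_CHECK_CREDITS = 1
ENHANCED_PARSE_CREDITS  = 100

# Hypothetical client that tracks credits instead of calling Helius.
class CountingClient
  attr_reader :credits_spent

  def initialize(activity)
    @activity = activity # address => array of signatures
    @credits_spent = 0
  end

  # Cheap call: does this address have any signatures at all?
  def signatures_for(address)
    @credits_spent += SIGNATURE_CHECK_CREDITS
    @activity.fetch(address, [])
  end

  # Expensive call: full parsed transactions.
  def enhanced_transactions(address)
    @credits_spent += ENHANCED_PARSE_CREDITS
    @activity.fetch(address, []).map { |sig| { "signature" => sig } }
  end
end

# Escalate to the expensive call only when the cheap one finds activity.
def cheap_first_catch_up(client, address)
  return [] if client.signatures_for(address).empty?

  client.enhanced_transactions(address)
end
```

For 100 addresses of which one has activity, this spends 100 × 1 + 1 × 100 = 200 credits, against 10,000 for calling the expensive endpoint on every address. The actual saving depends on the share of addresses with activity.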

Roadmap

In rough priority order.

| Item | Why |
| --- | --- |
| Reclassify 429 as transient in `Clients::Base` | The single most impactful fix on this list. Today's discovery: every 4xx, including 429, is rescued as a permanent `ClientError`. The fix is to raise a distinct `Clients::RateLimitedError < ServerError` for 429 so it joins `TRANSIENT_ERRORS` and gets retry + circuit-breaker treatment everywhere. Affects `WalletCatchUpJob`, `WalletRegistrationJob`, `EVMWalletRegistrationJob`, the new sweep, and any future Helius caller. |
| Upgrade Postgres to `pro_4gb` | Crashes recur on the next traffic spike if unchanged. 4× RAM headroom is the actual lever; HA is a secondary benefit. Dashboard-only change. |
| Upgrade Helius plan or split into two projects | At current scale we project ~98M catch-up credits/month against the 10M Developer-plan allotment. Two separate Helius projects (not just two keys, since keys share the project's budget) give true isolation between real-time webhook delivery and backfill. |
| Helius credit/autoscaling monitor | Page at 70% of monthly OR 70% of autoscaling. Webhook deliveries pause at the autoscaling cap, so this is a customer-visible threshold, not just a billing one. Would have warned us hours before the exhaustion. |
| Drift reconciliation job | At least 22 addresses currently in our DB as registered are not actually in Helius's webhook list: silent failures from earlier 4xx handling. An hourly diff job either nulls the local `webhook_id` (to re-register) or PUTs the missing addresses back to Helius. |
| Schedule `WalletCatchUpSweepJob` | Build is done in PR #141; cadence pending the Helius plan-tier decision. Hourly is responsive (~95k credits/hour at current scale); daily is safer. |
| Per-Address `last_seen_signature` | Today the sweep uses the wallet's last deposit signature as its cursor. Wallets with only non-deposit activity (NFTs, spam, etc.) get caught up indefinitely. A per-Address `last_seen_signature` column closes that gap. |
| Throttle `WalletCatchUpJob` enqueue rate during recoveries | Smooth the per-second rate-limit curve during long drains. ~30 LOC, token-bucket or random jitter. |
| Recovery runbook | Document the established recovery primitives (`WalletCatchUpJob.perform_now`, `TransferRetryJob.perform_later`, `/admin/sql`) so the next responder doesn't have to reconstruct them. |
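The core of the drift reconciliation item in the roadmap above is a set difference between what our database believes is registered and what the provider actually lists. A minimal sketch, assuming both sides can be loaded as arrays of address strings; `webhook_drift` and its keyword names are hypothetical.

```ruby
require "set"

# Compares our view of registered addresses with the provider's webhook list.
# Returns addresses we think are registered but the provider does not have,
# and addresses the provider has that we have no record of.
def webhook_drift(local_registered:, remote_registered:)
  local  = local_registered.to_set
  remote = remote_registered.to_set

  {
    missing_remotely: (local - remote).to_a.sort,
    unknown_locally:  (remote - local).to_a.sort
  }
end
```

Everything in `missing_remotely` is a candidate for either re-registration or having its local `webhook_id` cleared; which of the two is the right repair is a policy decision the sketch does not make.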

Lessons

A permanent-looking failure must page. Rails.logger.error("permanent failure") + return is how this incident stayed invisible for 64 hours. If a job decides it cannot make progress, the next step is Honeybadger.notify, not a log line.
Recurring monitors catch what code can't. UnregisteredAddressesCheckJob (one query, every 5 min) would have alerted at minute 11. Always pair a code path with an external invariant check on its observable effect.
"Transient" is a property of the error class list, not of operator memory. Both Solace::Errors::HTTPError and a 429 from Helius should have been retryable. Neither was. Every error category gets explicitly classified into TRANSIENT_ERRORS or it's permanent by default — and "permanent" must mean we actively chose that.
Read the upstream's pricing model before reasoning about it. For a full revision cycle I argued that we wouldn't have hit Helius's cap organically, working from a mental model where 1 call = 1 credit. Enhanced Transactions is 100 credits/call. The Insights data that looked like evidence of "bursts cause exhaustion" was actually evidence of when the cumulative budget ran out, not of whether it would.
Check the baseline before anchoring on a hypothesis. Twice I attributed something to a familiar cause (Render maintenance, 1-credit calls) without first asking "what does the baseline look like?" Both times the 90-day data showed I was wrong. The cheap diagnostic ("what does this look like normally?") belongs at the start of the investigation, not after pushback.
Use the cheapest tool for the question. WalletCatchUpJob was paying 100 credits per call to answer "does this address have any activity?" — when a 1-credit standard RPC answers exactly that. Apply this everywhere: every external call should answer the narrowest question we need, and escalate only when the cheap answer is "yes."
A backfill workload sharing a credit budget with real-time customer flow is a single point of failure. Even after we upgrade Helius, the backfill drain shouldn't be able to consume credits the customer path needs.
Investigation tooling deserves the same priority as feature work. PR #139 wasn't on any roadmap; the incident made it obvious. Build the diagnostic surfaces before the next incident makes them obvious.
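The recurring-monitor lesson above boils down to checking an invariant on observable state rather than trusting the code path that is supposed to maintain it. A minimal sketch of such a check, assuming rows are plain hashes with `:address`, `:webhook_id`, and `:created_at`; the helper name and row shape are hypothetical, and the 10-minute threshold is the one quoted for `UnregisteredAddressesCheckJob`.

```ruby
# Rows older than this with no webhook are considered a violated invariant.
STALE_AFTER_SECONDS = 10 * 60

# Returns the rows that should already have been registered but were not.
# `now` is injectable so the check can be exercised deterministically.
def stale_unregistered(rows, now: Time.now)
  rows.select do |row|
    row[:webhook_id].nil? && (now - row[:created_at]) > STALE_AFTER_SECONDS
  end
end
```

A scheduled job would page whenever this returns anything. Because it inspects the effect (rows without a webhook) and not the cause, it fires no matter which code path failed.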

Appendix

Helius per-call credit costs (docs): Standard RPC = 1; Enhanced Transactions = 100; Webhook event delivered = 1. Credit accounting is cumulative per month, with no burst or rate component.
Wallet creation rate during the bug window: baseline 4–24/hour through 06-05; peak 7,641/hour at 06-08 14:00 UTC; back to hundreds/hour by 06-09 evening.
Postgres memory at first crash (06-07): 288 MB → 484 MB peak → 234 MB after restart. CPU 5% → 25%+ over the same window. basic_1gb plan = 1 GB RAM shared with OS/buffers.
PG connection-error baseline: 0 in the prior 90 days; 245 across 06-07 → 06-10.
Helius 429 spike on 06-10: 248 notices total, 17 in the five-min window covering 10:50–10:55 UTC — exactly when WalletCatchUpJob for the wallet investigated on 06-11 ran.
Backlog drain rate measured: ~30–60 jobs/sec at JOB_CONCURRENCY=4 × 4 instances × 3 threads. 64k jobs drained in ~30 min. Limited by Helius rate, not by us.

