@allan-gar2x
Created April 10, 2026 05:52
AWS Cost Investigation - Budget Alert April 10, 2026

AWS Cost Investigation — April 10, 2026

Triggered by: AWS Budget Notification, "$90/day All Accounts Budget"
Account: 806877424398 (MilliononMars management)
Budget: $2,700.00/month | Alert threshold: $2,295.00 (85%) | Actual (Apr 1–9): $2,480.72
Investigated: April 10, 2026 at ~12:30 PM


Bottom Line Up Front

Two overlapping problems:

  1. A real spike: Polaris Dev Bedrock usage jumped more than 10x on April 6 (from a typical ~$3/day to $118/day) and is still running at $46–73/day. Needs investigation immediately.
  2. A structural problem: the org actually spends ~$7,000–8,000/month, so the $2,700 budget is roughly 3x too low. March full-month actual: $7,334.

Current burn rate: ~$286/day → Month-end forecast: ~$8,600 (3.2x the budget)
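The burn-rate and forecast figures come from a simple linear extrapolation of the 9-day total in the account breakdown. Below is a minimal sketch of that projection. It assumes a 30-day April and a flat daily rate. Real spend is spiky, so this is a rough projection, not Cost Explorer's own forecast.

```python
# Linear month-end projection from month-to-date spend.
# Assumes the current average daily rate holds for the rest of the month.

def project_month_end(mtd_spend: float, days_elapsed: int,
                      days_in_month: int = 30) -> tuple[float, float]:
    """Return (average daily burn rate, projected month-end total)."""
    daily_rate = mtd_spend / days_elapsed
    return daily_rate, daily_rate * days_in_month


MTD_SPEND = 2578.05   # April 1-9 total from the account breakdown
BUDGET = 2700.00      # "$90/day All Accounts Budget" x 30 days

rate, forecast = project_month_end(MTD_SPEND, days_elapsed=9)
print(f"burn rate: ${rate:,.2f}/day")
print(f"forecast:  ${forecast:,.2f} ({forecast / BUDGET:.1f}x budget)")
```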


Account-Level Breakdown (April 1–9, All 27 Accounts)

| Rank | Account | Name | Spend | % of Total |
|---|---|---|---|---|
| 1 | 209418081336 | Polaris Dev | $819.01 | 31.8% |
| 2 | 806206016804 | Polaris Prod | $410.98 | 15.9% |
| 3 | 924932513137 | Bike4Mind Dev | $389.74 | 15.1% |
| 4 | 806877424398 | MilliononMars (mgmt) | $307.58 | 11.9% |
| 5 | 191072691096 | Q Portal Prod | $169.35 | 6.6% |
| 6 | 115229011504 | Bike4Mind Prod | $154.57 | 6.0% |
| 7 | 137805459015 | AcmePrivateCo | $125.41 | 4.9% |
| 8 | 614006037970 | Q Portal Dev | $102.61 | 4.0% |
| 9 | 446072083824 | ErikBethkeMoM | $35.06 | 1.4% |
| 10 | Others (18 accounts) | Various | $63.74 | 2.5% |
| | Total | | $2,578.05 | 100% |

Top Cost Drivers (All Accounts, April 1–9)

| Rank | Service | 9-Day Spend | Annualized Pace | Status |
|---|---|---|---|---|
| 1 | Amazon Bedrock | $652 | ~$26,000/yr | 🔴 Surging |
| 2 | AWS Lambda | $401 | ~$16,000/yr | 🟡 Elevated |
| 3 | Amazon S3 | ~$220 | ~$8,800/yr | 🟡 Growing |
| 4 | Amazon ECS | ~$180 | ~$7,200/yr | 🟢 Stable |
| 5 | EC2 + EC2-Other | ~$175 | ~$7,000/yr | 🟡 Recurring spikes |
| 6 | Amazon CloudWatch | ~$110 | ~$4,400/yr | 🟡 Elevated |
| 7 | AWS Support (Business) | $100 | $1,200/yr ($100/month) | 🟢 Fixed |
| 8 | Amazon SQS | ~$90 | ~$3,600/yr | 🟡 Elevated |
| 9 | Amazon OpenSearch | ~$75 | ~$3,000/yr | 🟡 Watch |
| 10 | AWS WAF | ~$25 | ~$1,000/yr | 🟢 Normal |
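For usage-based lines, the "Annualized Pace" column appears to scale each 9-day figure by roughly 365/9 (about 40x). The fixed AWS Support fee is instead annualized as $100/month x 12. Below is a minimal sketch of the usage-based conversion, using a few rows from the table.

```python
# Annualize a 9-day spend figure by scaling it to 365 days.
# Only appropriate for usage-based charges; fixed monthly fees should be x12.

WINDOW_DAYS = 9

def annualize(spend: float, window_days: int = WINDOW_DAYS) -> float:
    return spend * 365 / window_days

nine_day_spend = {
    "Amazon Bedrock": 652.0,
    "AWS Lambda": 401.0,
    "Amazon S3": 220.0,
}

for service, spend in nine_day_spend.items():
    print(f"{service:<15} ${annualize(spend):>9,.0f}/yr")
```

The exact 365/9 factor lands slightly above the rounded table values (for example ~$8,900 rather than ~$8,800 for S3). This suggests the table used a round 40x multiplier.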

The Smoking Gun — Polaris Dev Bedrock Spike

Something changed on or around April 5–6 on Polaris Dev (account 209418081336).

Daily Bedrock Spend — Polaris Dev

| Date | Spend | Primary Driver |
|---|---|---|
| Apr 1 | $10.69 | Sonnet 4.5 + Haiku |
| Apr 2 | $3.84 | Mixed |
| Apr 3 | $25.65 | Opus 4.6 spike ($17.30) |
| Apr 4 | $2.18 | Light |
| Apr 5 | $0.38 | Minimal |
| Apr 6 | $118.62 | Opus 4.6: $65.52 + Sonnet 4.5: $46.78 |
| Apr 7 | $91.22 | Opus 4.6: $53.46 + Sonnet 4.5: $36.44 |
| Apr 8 | $73.01 | Sonnet 4.5: $40.90 + Opus 4.6: $30.99 |
| Apr 9 | $46.26 | Sonnet 4.5 + Sonnet 4.6 |
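Apr 3 also had an Opus spike, so the size of the jump depends on how the "before" baseline is measured. Below is a quick check of the multiplier using the daily figures in the table. It compares April 6 against the mean and the median of Apr 1–5.

```python
from statistics import mean, median

# Polaris Dev Bedrock daily spend (USD), April 1-9, from the daily table.
daily = {
    1: 10.69, 2: 3.84, 3: 25.65, 4: 2.18, 5: 0.38,
    6: 118.62, 7: 91.22, 8: 73.01, 9: 46.26,
}

baseline = [daily[d] for d in range(1, 6)]   # Apr 1-5, before the spike
spike = [daily[d] for d in range(6, 10)]     # Apr 6-9

print(f"baseline mean:   ${mean(baseline):.2f}/day")
print(f"baseline median: ${median(baseline):.2f}/day")
print(f"Apr 6 vs mean:   {daily[6] / mean(baseline):.1f}x")
print(f"Apr 6 vs median: {daily[6] / median(baseline):.1f}x")
print(f"Apr 6-9 total:   ${sum(spike):.2f}")
```

By either measure, April 6 is more than 10x the prior baseline. The median is less sensitive to the one-off Apr 3 spike.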

Org-Wide Bedrock Breakdown (April 1–9)

| Model | Spend | % of Bedrock |
|---|---|---|
| Claude Opus 4.6 | $419.45 | 64% |
| Claude Sonnet 4.5 | $170.22 | 26% |
| Claude Sonnet 4.6 | $46.49 | 7% |
| Claude Haiku 4.5 | $9.00 | 1% |
| Claude Opus 4.5 | $7.33 | 1% |
| Total | $652 | 100% |

Active Anomalies (43 detected since March 1)

| Anomaly | Dates | Impact | Service | Account | Root Cause |
|---|---|---|---|---|---|
| Polaris Dev Bedrock spike | Apr 6–present | ~$283 | Bedrock | Polaris Dev | Likely runaway agent loop or batch job |
| S3 storage growth | Mar 3–present (ongoing 6+ wks) | $45+ | S3 TimedStorage | Multiple | Growing data, no lifecycle policies |
| EC2 c6a.2xlarge spikes | Recurring (6+ times in 45 days) | $22/event | EC2 | mgmt acct | Periodic high-CPU workload |
| Lambda spike | Apr 7 | $24 (single day) | Lambda | Multiple | Traffic spike or runaway invocation |
| CloudFront blow-up | Mar 30–31 | $243 | CloudFront | ErikBethkeMoM | CA-region HTTPS proxy, likely bot |
| SQS volume | Recurring (7+ times) | $4–8/event | SQS | b4m-dev + Polaris Dev | High polling frequency |
| Aurora RDS spike | Apr 6–9 | $8 | Aurora Serverless v2 | Polaris | Scaling events |

Service-Level Deep Dives

Lambda — $401 Across All Accounts

  • Provisioned Concurrency alone: $74.67 in 9 days = ~$249/month — charged 24/7 regardless of traffic
  • Accounts running Provisioned Concurrency: b4m-prod, b4m-dev, AcmePrivateCo
  • Lambda GB-Second execution ($324.54) implies high-duration or memory-heavy functions
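The ~$249/month figure is the 9-day Provisioned Concurrency charge scaled to a 30-day month. That charge accrues for configured concurrency whether or not it is invoked, so a flat-rate extrapolation suits it well. The traffic-driven GB-second line is less predictable. Below is a sketch assuming a 30-day month.

```python
# Extrapolate a 9-day charge to a 30-day month at the same daily rate.
def monthly_pace(spend: float, window_days: int = 9, month_days: int = 30) -> float:
    return spend / window_days * month_days

provisioned_concurrency = 74.67   # 9-day Provisioned Concurrency charge
gb_second_compute = 324.54        # 9-day GB-second execution charge

print(f"PC monthly pace:       ${monthly_pace(provisioned_concurrency):,.2f}")
print(f"Compute monthly pace:  ${monthly_pace(gb_second_compute):,.2f}")
# Together these two lines make up nearly all of the ~$401 Lambda total.
print(f"PC + compute (9 days): ${provisioned_concurrency + gb_second_compute:,.2f}")
```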

S3 — Growing Across All Accounts (6+ Weeks, No Lifecycle Policies)

  • b4m-dev: 4,428 GB-months of storage at $101.86 in 9 days
  • Anomaly has been ongoing since March 3 — no cleanup policies active
  • Affected: Polaris Dev, b4m-dev, b4m-prod, AcmePrivateCo, Polaris Prod
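The b4m-dev figure matches S3 Standard pricing if the usage quantity is read as GB-months. At $0.023 per GB-month, 4,428 GB-months comes to about $101.84, within rounding of the reported $101.86. Below is a sketch of that check and of the implied average volume stored. It assumes the us-east S3 Standard first-50-TB rate and a 30-day April.

```python
# S3 Standard storage cost from a TimedStorage usage quantity.
# Assumption: usage is in GB-months, billed at $0.023/GB-month
# (S3 Standard, first 50 TB, us-east); other tiers and regions differ.

PRICE_PER_GB_MONTH = 0.023

gb_months = 4428                 # b4m-dev TimedStorage usage, April 1-9
window_days, month_days = 9, 30

cost = gb_months * PRICE_PER_GB_MONTH
# GB-months accrue as (average GB stored) * (days elapsed / days in month),
# so the average volume stored over the window is:
avg_gb_stored = gb_months * month_days / window_days

print(f"estimated cost: ${cost:,.2f}")
print(f"avg stored:     {avg_gb_stored:,.0f} GB (~{avg_gb_stored / 1000:.1f} TB)")
```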

CloudWatch — Q Portal Prod Anomalous

  • Q Portal Prod (191072691096): $77.96 on CloudWatch in 9 days — 46% of its total bill
  • Root cause: USE2-Application-Signals-Bytes — Application Signals enabled with no byte-limit controls
  • Likely indexing verbose logs with no sampling rate

SQS — 82 Million Requests on b4m-dev in 9 Days

  • 82.7M standard + 1.9M FIFO SQS requests on b4m-dev
  • At $0.40 per million standard requests ($0.50 per million FIFO), that is ~$34 in 9 days, a ~$110/month pace
  • Likely Lambda-based queue polling with short polling intervals or high-frequency processing loops
  • Recurring anomaly flagged 7+ times since March 1
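Below is the request-cost arithmetic behind the ~$110/month pace, plus the average request rate it implies. The sketch assumes us-east list prices of $0.40 per million standard requests and $0.50 per million FIFO requests. It ignores the 1M-request monthly free tier.

```python
# SQS request pricing per million requests (assumed us-east list prices;
# the 1M requests/month free tier is ignored).
PRICE_PER_MILLION = {"standard": 0.40, "fifo": 0.50}

requests_9d = {"standard": 82.7e6, "fifo": 1.9e6}   # b4m-dev, April 1-9

cost_9d = sum(requests_9d[k] / 1e6 * PRICE_PER_MILLION[k] for k in requests_9d)
monthly_pace = cost_9d / 9 * 30
per_second = sum(requests_9d.values()) / (9 * 86_400)

print(f"9-day cost:   ${cost_9d:,.2f}")
print(f"monthly pace: ${monthly_pace:,.2f}")
print(f"request rate: {per_second:,.0f} req/s on average")
```

A sustained average of roughly 100+ requests per second is high for a dev account. This supports the polling-frequency hypothesis.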

Root Cause Classification

| Issue | Classification | Financial Impact |
|---|---|---|
| Polaris Dev Bedrock spike (Apr 6+) | New infra / runaway agent process | $46–119/day since Apr 6 (currently $46–73/day) |
| Bedrock Opus 4.6 org-wide | Traffic growth / model selection | $419 in 9 days |
| Lambda Provisioned Concurrency | Configuration (always-on charge) | ~$249/month, structural |
| S3 storage growth | No lifecycle cleanup | ~$50+/month and growing |
| Q Portal CloudWatch App Signals | Misconfiguration | $78 in 9 days on one account |
| EC2 c6a.2xlarge recurring | Scheduled/periodic, unoptimized | $10–12/event × 6+ events |
| SQS high volume | Architecture (polling frequency) | ~$110/month, recurring |
| Budget itself | Structural (~3x underallocated vs actual) | $7,334 actual in March |

Recommendations (Priority-Ranked — Verification Only, No Changes Made)

P0 — Immediate Investigation Required

P0-1: Identify what changed on Polaris Dev on or around April 6
Check deployments, agent runs, or batch jobs started on account 209418081336 around April 5–6. Bedrock spend jumped from $2–3/day to $118/day and is still running at $46–73/day.

  • Estimated savings if resolved: ~$1,500–2,100/month

P0-2: Review Bedrock model selection (Opus 4.6 is 64% of Bedrock spend)
Evaluate whether Claude Sonnet 4.5/4.6, which has a lower per-token price than Opus 4.6, can handle most requests. A model routing strategy (Haiku for classification, Sonnet for generation, Opus only for complex reasoning) could reduce Bedrock costs by 40–60%.

  • Estimated savings: $700–1,200/month
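The routing strategy in P0-2 can be sketched as a lookup from task type to model tier. The task categories, the fallback choice, and the placeholder model IDs below are all hypothetical. The real mapping should come from evaluating output quality per feature.

```python
from enum import Enum

class Tier(Enum):
    HAIKU = "haiku"
    SONNET = "sonnet"
    OPUS = "opus"

# Hypothetical task categories; the real split depends on the product.
ROUTES = {
    "classification": Tier.HAIKU,
    "extraction": Tier.HAIKU,
    "generation": Tier.SONNET,
    "summarization": Tier.SONNET,
    "complex_reasoning": Tier.OPUS,
}

# Placeholder identifiers: replace with the Bedrock model or
# inference-profile IDs the accounts actually use.
MODEL_IDS = {
    Tier.HAIKU: "<haiku-4.5-model-id>",
    Tier.SONNET: "<sonnet-4.6-model-id>",
    Tier.OPUS: "<opus-4.6-model-id>",
}

def pick_model(task_type: str, default: Tier = Tier.SONNET) -> str:
    """Route a request to the cheapest tier assumed adequate for its task.

    Unknown task types fall back to Sonnet rather than Opus, so new call
    sites do not silently default to the most expensive model.
    """
    return MODEL_IDS[ROUTES.get(task_type, default)]

print(pick_model("classification"))
print(pick_model("complex_reasoning"))
print(pick_model("something_new"))
```

Defaulting unknown work to the middle tier is a deliberate choice here. Opus then has to be opted into per call site, which keeps it from being reached by accident.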

P1 — Fix This Week

P1-1: Disable Lambda Provisioned Concurrency on the dev stage
Provisioned Concurrency is charged 24/7 regardless of traffic, which rarely pays off outside production. Unless dev has hard cold-start SLAs, remove it there.

  • Estimated savings: $60–90/month

P1-2: Fix Q Portal Prod CloudWatch Application Signals
Disable Application Signals or add sampling/filtering on account 191072691096.

  • Estimated savings: $200–250/month

P1-3: Implement S3 lifecycle policies across all accounts
Transition objects older than 30–90 days to Intelligent-Tiering or Glacier. Delete old (noncurrent) versions and incomplete multipart uploads.

  • Accounts to prioritize: b4m-dev, Polaris Dev, b4m-prod, AcmePrivateCo
  • Estimated savings: $50–150/month
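Below is a minimal sketch of a lifecycle configuration covering the three P1-3 actions. It assumes a single whole-bucket rule; the rule ID and the default day counts are hypothetical.

```python
import json

def lifecycle_rules(transition_days: int = 30,
                    storage_class: str = "INTELLIGENT_TIERING",
                    noncurrent_days: int = 30,
                    abort_multipart_days: int = 7) -> dict:
    """Build an S3 lifecycle configuration for the P1-3 cleanup actions."""
    return {
        "Rules": [
            {
                "ID": "cost-cleanup",          # hypothetical rule name
                "Filter": {"Prefix": ""},      # empty prefix = whole bucket
                "Status": "Enabled",
                # Move current objects to a cheaper storage class after N days.
                "Transitions": [
                    {"Days": transition_days, "StorageClass": storage_class}
                ],
                # Expire old (noncurrent) versions on versioned buckets.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
                # Remove parts left behind by multipart uploads that never finished.
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_multipart_days
                },
            }
        ]
    }

print(json.dumps(lifecycle_rules(), indent=2))
# Colder variant for archival buckets:
print(json.dumps(lifecycle_rules(90, "GLACIER"), indent=2))
```

The resulting dict has the shape boto3's `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...)` expects. Applying it replaces any lifecycle rules already on the bucket, so merge with existing rules first.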

P1-4: Add per-account Bedrock anomaly alerts
Create Cost Anomaly Detection monitors for Bedrock with daily thresholds of $20 on dev accounts and $50 on prod accounts.
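The proposed thresholds can be sanity-checked against this month's data with a plain daily-cap check. This is a stand-in for the alert logic only, not AWS Cost Anomaly Detection itself, which flags deviations from a learned spending pattern rather than a fixed cap.

```python
# Fixed daily thresholds proposed in P1-4 (USD/day).
THRESHOLDS = {"dev": 20.0, "prod": 50.0}

def breaches(daily_spend: dict[str, float], env: str) -> list[str]:
    """Return the days whose Bedrock spend exceeds the environment's threshold."""
    limit = THRESHOLDS[env]
    return [day for day, spend in daily_spend.items() if spend > limit]

# Polaris Dev Bedrock daily spend, April 1-9.
polaris_dev = {
    "Apr 1": 10.69, "Apr 2": 3.84, "Apr 3": 25.65, "Apr 4": 2.18, "Apr 5": 0.38,
    "Apr 6": 118.62, "Apr 7": 91.22, "Apr 8": 73.01, "Apr 9": 46.26,
}

print(breaches(polaris_dev, "dev"))
```

Against the Polaris Dev figures, the $20 dev threshold flags the April 3 Opus spike as well as every day from April 6 onward.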

P2 — Fix Within the Next Sprint

P2-1: Investigate SQS polling frequency on b4m-dev and Polaris Dev
Review Lambda event source mappings. Consider longer polling intervals, larger batch sizes, and dead-letter queue (DLQ) configurations to prevent retry storms.

  • Estimated savings: $30–60/month
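Empty receives are billed like any other request, so the wait per poll largely sets the request floor for an idle queue. Below is a rough sketch comparing short and long polling. The consumer count and the 1-second sleep are hypothetical. The 20-second figure is the SQS maximum long-polling wait.

```python
SECONDS_PER_DAY = 86_400

def receives_per_day(consumers: int, seconds_per_receive: float) -> float:
    """ReceiveMessage calls per day for idle consumers polling continuously.

    seconds_per_receive is how long each empty receive cycle takes: the
    sleep between calls for short polling, or up to 20 s with long polling
    (WaitTimeSeconds / ReceiveMessageWaitTimeSeconds).
    """
    return consumers * SECONDS_PER_DAY / seconds_per_receive

# Hypothetical idle queue with 5 consumers.
short = receives_per_day(5, 1.0)    # short polling with a 1 s sleep
long_ = receives_per_day(5, 20.0)   # long polling at the 20 s maximum
print(f"short polling: {short:,.0f} requests/day")
print(f"long polling:  {long_:,.0f} requests/day ({short / long_:.0f}x fewer)")
```

For queues consumed through Lambda event source mappings, Lambda runs the pollers itself. There, the tunable knobs are the mapping's `BatchSize` and `MaximumBatchingWindowInSeconds` rather than a poll interval.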

P2-2: Investigate recurring EC2 c6a.2xlarge usage on the management account
6 anomaly events in 45 days. Convert to Spot Instances or right-size if the workload allows.

  • Estimated savings: $40–60/month

P2-3: Review OpenSearch sizing
Polaris Dev ($17) and Polaris Prod ($58): confirm the clusters are appropriately sized and not running idle indexes.

P3 — Structural / Strategic

P3-1: Right-size the budget
Actual spend is ~$7,000–8,000/month. The $2,700 budget triggers alerts constantly and creates noise. Either raise the budget to reflect reality or create per-account/per-project budgets for meaningful attribution.
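One way to seed per-account budgets is to scale each account's month-to-date run rate to a full month and add headroom. Below is a sketch of that approach. The 20% headroom is a hypothetical parameter, and only the top four accounts are shown. Polaris Dev's run rate includes the April 6 spike, so a cleaner baseline (such as March spend per account) would be preferable where it is available.

```python
def per_account_budgets(spend_9d: dict[str, float], headroom: float = 0.20,
                        window_days: int = 9, month_days: int = 30) -> dict[str, int]:
    """Monthly budget per account: 9-day run rate scaled to a month, plus headroom."""
    return {
        account: round(spend / window_days * month_days * (1 + headroom))
        for account, spend in spend_9d.items()
    }

april_1_9 = {
    "Polaris Dev": 819.01,
    "Polaris Prod": 410.98,
    "Bike4Mind Dev": 389.74,
    "MilliononMars (mgmt)": 307.58,
}

for account, budget in per_account_budgets(april_1_9).items():
    print(f"{account:<22} ${budget:>6,}/month")
```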

P3-2: Implement Bedrock cost allocation tags
Tag Bedrock usage with project, team, and environment (for example, by invoking models through tagged application inference profiles) to understand model usage by feature/product.

P3-3: Consider Bedrock Provisioned Throughput for consistent base load
If Sonnet-tier usage is consistent (which it appears to be for b4m-prod), explore Bedrock Provisioned Throughput for baseline workloads.


Investigation performed: April 10, 2026 | No changes made (verification only)
Tools: AWS Cost Explorer, Cost Anomaly Detection, CloudWatch Metrics
