
@arubis
Last active April 10, 2026 21:04
Review: oncall_overbroad_silence_audit v10 — APPROVE


UUID: f276673e-1f49-423a-aee3-45e7b5c940ee
Author: marianvladoi_41454
Category: DevOps / Hard


Verdict: APPROVE


Eval Summary

| Backend | Runs | Mean Score | Threshold | Status |
| --- | --- | --- | --- | --- |
| Teapot (nighthawk) | 15 | 0.690 | ≤ 0.85 | PASSES |
| Docker (biggie-nebula) | 5 | 0.802 | < 0.50 | N/A (teapot takes precedence) |

Score distribution (teapot): 0.0 × 3, 0.67 × 5, 1.0 × 7
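The reported mean follows directly from the cluster counts above; a minimal sanity check (all numbers taken from this review):

```python
# Teapot score clusters as reported: 3 runs at 0.0, 5 at 0.67, 7 at 1.0.
clusters = {0.0: 3, 0.67: 5, 1.0: 7}

total_runs = sum(clusters.values())
mean = sum(score * count for score, count in clusters.items()) / total_runs

print(total_runs)      # 15
print(round(mean, 3))  # 0.69
```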

Per-Check Breakdown (Teapot, 15 runs)

| Subscore | Weight | Pass Rate | Notes |
| --- | --- | --- | --- |
| silence_durably_remediated | 0.33 | 12/15 (80%) | Has variance |
| rule_loaded_and_sane | 0.33 | 7/15 (47%) | Main differentiator |
| rule_survives_restart | 0.34 | 12/15 (80%) | Has variance |

All three subscores show variance. No dead weights.


Failure Analysis

Failure Mode Classification (Teapot, 15 runs)

| Category | Runs | Scores | Root Cause |
| --- | --- | --- | --- |
| Infra failure (DiskPressure) | 3 | 0, 0, 0 | Boot-time image GC, unrecoverable |
| Genuine: wrong expression | 5 | 0.67 × 5 | Invented phantom metrics, custom exporters |
| Perfect score | 7 | 1.0 × 7 | Found both reconcilers + used grafana_alerting_silences |

Genuine Failures

1. Missing second reconciler (silence_durably_remediated=0). Agents find the primary reconciler (platform-policy-sync in monitoring) but miss the guard (cluster-config-monitor in kube-system). The RBAC design elegantly constrains discovery: agents can't kubectl get pods -n kube-system but CAN read ConfigMaps there. Winning agents used lateral techniques (IP reverse lookup, brute-force ConfigMap listing).

2. Wrong expression for alerting rule (rule_loaded_and_sane=0). This is the main challenge. Agents split into two camps:

  • Successful: Use grafana_alerting_silences{state="active"} > 0 with for: 4h — the for: clause encodes the threshold. Rule immediately enters "pending" state.
  • Failing: Invent phantom metrics (grafana_silence_age_seconds, grafana_silence_start_timestamp_seconds) that don't exist, or build custom exporters with per-silence duration metrics that return empty because no silence is actually 4h old at grading time. Rule stays "inactive".

The task hint ("think about how Prometheus itself can encode the duration threshold") nudges toward for:, but many agents still try timestamp arithmetic.
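The winning approach can be sketched as a Prometheus rule file. Only the alert name, metric, comparison, and `for:` value come from this review; the group name and comments are illustrative:

```yaml
groups:
  - name: oncall-silence-audit   # group name is illustrative
    rules:
      - alert: OnCallSilenceDurationTooLong
        # Fires once some active Grafana silence has existed for 4h:
        # the for: clause encodes the duration threshold, so no
        # timestamp arithmetic (and no phantom metric) is needed.
        expr: grafana_alerting_silences{state="active"} > 0
        for: 4h
```

Because an active silence already exists at grading time, the rule enters "pending" immediately and fires once the `for:` window elapses.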

3. Complete failures (score=0). Three teapot runs scored 0.0, all with the same root cause: boot-time DiskPressure from reserved filesystem blocks (tune2fs -m 5 reserves ~200GB). Kubelet triggers image GC; since the cluster is air-gapped, Grafana/Prometheus images can't be re-pulled. Agents correctly diagnose the issue but can't recover. These are infrastructure failures, not task failures.
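The ~200 GB figure is consistent with a 5% reserve on a roughly 4 TB filesystem; the disk size here is inferred from those two numbers, not stated in the review:

```python
# tune2fs -m 5 reserves 5% of filesystem blocks for root.
# Working backwards from the ~200 GB reserve (disk size inferred):
reserved_fraction = 0.05
reserved_gb = 200
filesystem_gb = reserved_gb / reserved_fraction
print(filesystem_gb)  # 4000.0
```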

Artificial Failures

None definitively identified beyond the infra DiskPressure issue.


Grader Quality

Weights

0.33/0.33/0.34 — proper thirds rounding. PASS

Wait Times

  • Silence durability: 2 × 90s = 180s (covers two reconciler cycles at 30s/35s). Appropriate
  • Rule loaded: 120s polling (covers multiple Prometheus evaluation intervals). Appropriate
  • Restart survival: 90s polling after pod restart. Appropriate

Alignment with task.yaml

Every grader check maps directly to an explicit task.yaml requirement. No undisclosed requirements. The for: >= 60s check from earlier versions has been removed in v10. PASS

Partial Grading

Each subscore is a working milestone, not a prerequisite check. Agent cannot score > 0.33 without real progress. PASS
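The 0.33/0.33/0.34 weighting reproduces the three observed score clusters; a minimal sketch (weights from the grader, helper function is illustrative):

```python
# Subscore weights from the v10 grader.
WEIGHTS = {
    "silence_durably_remediated": 0.33,
    "rule_loaded_and_sane": 0.33,
    "rule_survives_restart": 0.34,
}

def total_score(passed):
    """Sum the weights of the subscores a run passed."""
    return round(sum((WEIGHTS[name] for name in passed), 0.0), 2)

# The three observed clusters:
print(total_score([]))                                 # 0.0  (infra failure)
print(total_score(["silence_durably_remediated",
                   "rule_survives_restart"]))          # 0.67 (wrong expression)
print(total_score(WEIGHTS))                            # 1.0  (perfect)
```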


Task Design

Strengths:

  • Dual reconciler pattern with RBAC-gated discovery is an elegant difficulty mechanism
  • Decoy silences (19 + regex matchers) and decoy deployments (5 in monitoring) create genuine investigation noise
  • Disabled /-/reload forces agents to understand pod lifecycle for Prometheus reloading
  • The for: clause insight separates agents who understand Prometheus semantics from those who don't
  • Task prompt is clear about objectives without revealing grading criteria

Scope: Substantial (≥4h). Agent must: investigate silences, discover reconcilers across namespaces, neutralize both, understand Prometheus scraping/metrics, write correct rule expression, persist to ConfigMap, handle pod restart for reload.


Bot Review Findings vs v10 Reality

The Discord bot ran six reviews and flagged several issues; all have been resolved in v10:

| Bot Finding | Status in v10 |
| --- | --- |
| Unequal weights (0.4/0.3/0.3) | Fixed → 0.33/0.33/0.34 |
| Undisclosed for: >= 60s requirement | Fixed → removed entirely |
| Dead weight rule_loaded_and_sane (always 0) | Fixed → 47% pass rate with variance |
| Contradictory requirements (removing silence kills rule data) | Fixed → decoys + scrape annotation provide data |
| Zero variance across all runs | Fixed → 3 distinct score clusters |

Criteria Checklist

| Criterion | Status | Evidence |
| --- | --- | --- |
| Solvable | PASS | Multiple 1.0 scores on both backends |
| Challenging | PASS | Teapot mean 0.690 ≤ 0.85 |
| Substantial | PASS | Multi-phase task requiring investigation, reconciler discovery, RBAC-gated namespace traversal, Prometheus metric selection, ConfigMap persistence |
| Grader quality | PASS | Proper wait times, aligned checks, equal weights (0.33/0.33/0.34) |
| Variance | PASS | All 3 subscores show variance; 3 distinct score clusters |
| No undisclosed requirements | PASS | for: check removed in v10; all grader checks trace to task.yaml |

Grader Robustness

Timing

No issues. The three checks run sequentially (silence 180s → rule 120s → restart 210s, ~8.5 min worst case). Check 1's 180s sleep provides implicit grace time for Prometheus to scrape and evaluate the rule before check 2 starts. The pod termination race (a reconciler's final POST landing during check 1's durability window) is genuine difficulty, not a timing bug — agents should ensure pods are fully gone before deleting silences.

Agent Fairness

The pending/firing state check favors for:-based approaches over expression-level time comparisons (e.g., (time() - start_ts) > 14400), since the eval environment doesn't contain 4h-old silences. This is disclosed: the task hint says "think about how Prometheus itself can encode the duration threshold," and the prompt says the lead "is going to check that it actually fires." Agents who probe Prometheus for available metrics succeed; agents who assume a metric exists without checking fail. That's a genuine skill gap, not unfairness.
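The probing step is cheap: Prometheus exposes all known metric names at the standard endpoint GET /api/v1/label/__name__/values. The helper below only filters a parsed response (the HTTP fetch, e.g. curl from inside the cluster, is left out); the example response contents are illustrative:

```python
def silence_metrics(label_values_response):
    """Return metric names mentioning 'silence' from a parsed
    /api/v1/label/__name__/values response."""
    return [name for name in label_values_response["data"]
            if "silence" in name.lower()]

# Illustrative response: an agent who runs this filter sees which
# silence-related metrics actually exist instead of inventing one.
response = {"status": "success",
            "data": ["grafana_alerting_silences", "up",
                     "prometheus_build_info"]}
print(silence_metrics(response))  # ['grafana_alerting_silences']
```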

Hardcoded String Checks

v10 removed the problematic checks from earlier versions (threshold pattern matching for 14400/4*3600/4h/240m, and the for: >= 60s duration requirement). The remaining string checks are:

  • Rule name (== "OnCallSilenceDurationTooLong") — exact match, task-specified verbatim
  • Expression contains "silence" — case-insensitive substring, very lenient (any silence-related metric passes; even a self-referential expression containing the alert name would match)
  • Rule state (pending/firing) — standard Prometheus API enum values, not a string hack

The grader now relies on functional outcome tests (does the expression evaluate? does the rule fire?) rather than pattern-matching the expression text. This is approach-agnostic within the constraint that the expression must be semantically meaningful against current data.
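A sketch of what the three remaining checks might look like against a rule object in the shape Prometheus's /api/v1/rules endpoint returns (the grader's actual code is not shown in this review; this is an assumed reconstruction):

```python
PENDING_OR_FIRING = {"pending", "firing"}

def rule_check(rule):
    """Apply the three remaining checks to one alerting-rule dict
    as returned by /api/v1/rules (sketch, not the real grader)."""
    return (rule.get("name") == "OnCallSilenceDurationTooLong"  # exact match
            and "silence" in rule.get("query", "").lower()      # lenient substring
            and rule.get("state") in PENDING_OR_FIRING)         # functional outcome

# Example rule object with the real API's field names:
rule = {"name": "OnCallSilenceDurationTooLong",
        "query": 'grafana_alerting_silences{state="active"} > 0',
        "state": "pending"}
print(rule_check(rule))  # True
```

A rule whose expression never evaluates against live data stays "inactive" and fails the state check, which is what makes this functional rather than a string hack.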


Advisory Notes (non-blocking)

  1. Information isolation (platform): 9/15 teapot runs accessed solution.sh/grader.py from the filesystem. Clean teapot mean (no leakage) is 0.558 with 0 perfect scores. This is a teapot harness issue, not a task issue — the Dockerfile does not COPY these files.

  2. Optional hint improvement: Consider adding a nudge toward discovering existing metrics ("check what metrics Grafana already exposes to Prometheus") to reduce the phantom-metric failure mode. The current 47% pass rate on rule_loaded_and_sane is reasonable, and the hint about "how Prometheus itself can encode the duration threshold" is already good. This is a judgment call.
