@arubis
Last active April 2, 2026 17:20
Review: alertmanager-webhook-routing-failure v10 (3802d778)

Verdict: NEEDS_WORK | Mean score 0.96 (threshold: <0.50 docker backend) | 8 of 9 scored runs perfect 1.0


Why It's Too Easy

Every fault follows the same diagnostic loop:

kubectl describe → see obvious mismatch → fix → next

Wrong port, wrong selector, wrong ConfigMap key, paused deployment, archived repo — each resolves in 1-2 tool calls. Twenty-eight of these is a long task, but not a difficult one. Quantity of shallow faults doesn't compound into difficulty.
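For the wrong-port fault, for instance, the loop is mechanical — describe, spot the mismatch, patch. A sketch (resource names, labels, and ports here are illustrative, not taken from the task):

```shell
# Shallow-fault loop: `describe` hands over the answer directly.
# All names and ports below are hypothetical examples.
kubectl -n monitoring describe svc alertmanager            # shows targetPort: 9094
kubectl -n monitoring get pod -l app=alertmanager \
  -o jsonpath='{.items[0].spec.containers[0].ports}'       # container actually listens on 9093

# One patch closes the fault; on to the next one.
kubectl -n monitoring patch svc alertmanager --type=json \
  -p '[{"op":"replace","path":"/spec/ports/0/targetPort","value":9093}]'
```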

What's needed is 2-3 faults where kubectl describe doesn't hand the agent the answer — problems requiring multi-step correlation. The config-policy-enforcer (F24) is already the best example: the agent must discover why fixes keep reverting. More faults at that level will move the score.


Clarification Needed

RBAC subject mismatch (setup.sh lines 63-66) — is this intentional?

The agent RBAC block creates RoleBindings with kind: User, name: ubuntu-user, but the agent authenticates as a ServiceAccount (system:serviceaccount:default:ubuntu-user). This isn't labeled as one of the F1-F28 faults.

  • If unintentional: It's a setup bug. Run 9 hit it — the agent was locked out of all task namespaces for 69 minutes. The other 9 runs worked (k3s sometimes treats User and ServiceAccount subjects interchangeably), making this non-deterministic. Fix: kind: ServiceAccount, namespace: default, apiGroup: "".
  • If intentional: RBAC subject-type troubleshooting is a legitimate DevOps skill and would be a good challenge. But it needs a fault label, and the non-determinism needs addressing — either it should reliably block access (so every run tests it) or reliably work (so it's not a lottery).
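If the mismatch is unintentional, the fix from the first bullet looks roughly like this. Only the subject fields (`kind: ServiceAccount`, `name: ubuntu-user`, `namespace: default`) come from the review; the RoleBinding name, Role name, and target namespace are hypothetical placeholders:

```shell
# Corrected per-namespace binding: bind the ServiceAccount the agent
# actually authenticates as, not a User. "task-namespace", "agent-access",
# and "agent-role" are placeholders for the real setup.sh values.
kubectl apply -n task-namespace -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-access
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: agent-role
subjects:
  - kind: ServiceAccount   # was kind: User, which never matches a SA token
    name: ubuntu-user
    namespace: default     # ServiceAccount subjects take a namespace, no apiGroup
EOF
```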

Note on Run 1 (stability check): The agent fixed HPA oscillation but left 1/3 replicas ready, and the grader failed it. The prompt says "remain stable... replicas should not oscillate" — a senior engineer seeing 1/3 Ready would likely keep investigating, so this is within fair bounds. But if you restructure subscores, consider whether the stability language could be slightly more explicit (e.g., "healthy with all desired replicas running") to remove any ambiguity.


Recommendations

1. Add deeper bugs on top of the existing ones

The shallow bugs are worth keeping — they're realistic texture and consume agent turns. The problem is they're the only kind of difficulty. Layer 2-3 faults that require multi-step reasoning:

| Deeper fault | Why it's harder (fairly) |
| --- | --- |
| Sidecar re-injects stale credential on pod restart — fixing the Secret and restarting isn't enough | Agent must trace the credential lifecycle across resources |
| Chained config: fixing matchers surfaces an inhibition rule → clearing inhibition reveals group_wait → reducing group_wait exposes a catch-all route | Agent must iterate and retest; can't batch-apply |
| NetworkPolicy passes for the service FQDN but DNS resolution itself is blocked by a different egress policy | Agent must reason about dependency chains |
| PodDisruptionBudget referencing pods in another namespace blocks a deployment | Agent must correlate across namespace boundaries |

Key principle: difficulty should come from diagnostic depth, not from hidden gotchas. A senior engineer should find these problems challenging for the same reasons.
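As one concrete illustration, the DNS-egress fault could be seeded with a policy like the following (all names and ports are hypothetical). Because any egress policy selecting the pod implicitly denies everything it doesn't allow, omitting a UDP/53 rule silently breaks name resolution even though the "allowed" path looks correct:

```shell
# Sketch of a DNS-egress fault. The policy permits the webhook port,
# but selecting the pod for Egress at all means DNS (UDP/53) is now
# denied by default -- the FQDN never resolves. Names are hypothetical.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-webhook-egress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: alertmanager
  policyTypes: ["Egress"]
  egress:
    - ports:
        - protocol: TCP
          port: 8080       # webhook receiver port (assumed)
      # no rule for UDP/53 to kube-dns, so lookups are dropped
EOF
```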

2. Harden the runbook check

Current check: .md file contains "alertmanager" and the names of both receivers. An agent can pass with a single line containing those keywords.

The fix isn't hidden grader expectations — it's making task.yaml ask for content that requires real investigation to produce. For example, task.yaml could say:

Your runbook must document: (1) the alert routing paths from firing to each notification channel, including severity-to-receiver mappings; (2) the exact commands to verify each channel is delivering; (3) what broke and how you resolved it.

Then the grader checks for things the agent would only know from actually doing the work:

  • The real receiver names and endpoints from the AlertManager config
  • Working curl or amtool verification commands with correct endpoints
  • References to specific faults discovered (e.g., the config-policy-enforcer, the DNS egress policy)

This is fair — the prompt tells the agent exactly what to write. It's difficult — a boilerplate runbook won't contain environment-specific details, and an agent that didn't fully investigate can't produce them.
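A hardened check along those lines can stay simple. In this sketch the receiver names, port, and fault identifier are hypothetical placeholders; a real grader would source them from the task's setup artifacts:

```shell
# Sketch of a hardened runbook grader (POSIX sh). Each check looks for
# something the agent could only know from doing the work. The strings
# "pagerduty-critical", "slack-warnings", ":9093", and
# "config-policy-enforcer" are placeholders for the real environment.
grade_runbook() {
  runbook="$1"
  # 1. Environment-specific receiver names, not just the word "alertmanager"
  grep -q "pagerduty-critical" "$runbook" || { echo "FAIL: missing critical receiver"; return 1; }
  grep -q "slack-warnings"     "$runbook" || { echo "FAIL: missing warning receiver"; return 1; }
  # 2. A runnable verification command against the real AlertManager endpoint
  grep -Eq "(curl|amtool) .*9093" "$runbook" || { echo "FAIL: no verification command"; return 1; }
  # 3. A reference to a specific fault the agent had to discover
  grep -q "config-policy-enforcer" "$runbook" || { echo "FAIL: no root-cause reference"; return 1; }
  echo "PASS"
}
```

A single keyword line still fails checks 2 and 3, which is the point: boilerplate passes the current grader but not this one.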

3. Make the enforcer a real admission controller

The config-policy-enforcer is the task's best fault, but it's discoverable too quickly as a named Deployment. Consider implementing it as a ValidatingWebhookConfiguration or MutatingAdmissionWebhook — realistic (OPA/Gatekeeper/Kyverno), and harder to trace because interference doesn't appear in kubectl get deployments. If you add more complex faults, verify the delivery poll wait times (currently 10 × 2s) still suffice.
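A minimal sketch of that registration, with every name hypothetical (the real enforcer's service, rules, and CA bundle would differ):

```shell
# Sketch: the enforcer as a validating admission webhook instead of a
# visible Deployment. A denied UPDATE on the AlertManager ConfigMap keeps
# "fixes" from sticking, and nothing shows up in `kubectl get deployments`.
# All names below are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: config-policy
webhooks:
  - name: configmaps.policy.internal
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore          # fail open so the cluster stays usable
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["UPDATE"]
        resources: ["configmaps"]
    clientConfig:
      service:
        name: policy-svc           # backing Service hidden in another namespace
        namespace: kube-policy
        path: /validate
EOF
```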


Eval Data

| Run | notification_pipeline | runbook | stability | Score |
| --- | --- | --- | --- | --- |
| 1 | pass | pass | fail | 0.67 |
| 2–8 | pass | pass | pass | 1.0 |
| 9 | RBAC lockout — see Clarification Needed | — | — | None |
| 10 | pass | pass | pass | 1.0 |

notification_pipeline and runbook never fail across any run. The only variance is stability (1 failure in 9 scored runs). Effective score range: [0.67, 1.0].


What Already Works

Preserve these through revisions:

  • Mock webhook receivers with real auth validation — receivers check Bearer tokens and routing keys, so the agent must source correct credentials rather than guess
  • Discovery endpoints — mock receivers expose API-compatible listing endpoints; rewards methodical investigation
  • End-to-end delivery grading — actual alerts through the pipeline, not file-existence checks
  • No answer leakage — setup artifacts include realistic noise without revealing solutions
  • Config policy enforcer concept — non-obvious interference; template for deeper faults
  • HPA deletion accepted as remediation — the grader correctly treats this as valid (fixed per v4 feedback)

Summary

| Criterion | Status | Detail |
| --- | --- | --- |
| Solvable | Pass (assumed) | solution.sh validates per author |
| Challenging | Fail | Mean 0.96, needs <0.50 (docker backend) |
| Substantial | Pass | 28-fault task with runbook deliverable is ≥4H scope |