Verdict: NEEDS_WORK | Mean score 0.96 (threshold: <0.50 docker backend) | 8 of 9 scored runs perfect 1.0
Every fault follows the same diagnostic loop:
kubectl describe → see obvious mismatch → fix → next
Wrong port, wrong selector, wrong ConfigMap key, paused deployment, archived repo — each resolves in 1-2 tool calls. Twenty-eight of these is a long task, but not a difficult one. Quantity of shallow faults doesn't compound into difficulty.
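To make the shallow class concrete, here is a sketch of a typical one-look fault (the Service name, labels, and ports are hypothetical, not taken from the actual setup.sh). Comparing `kubectl describe svc web` (TargetPort) with the pod's containerPort exposes the mismatch immediately:

```yaml
# Hypothetical shallow fault: Service targetPort doesn't match the port the
# container actually listens on. Endpoints exist, but connections fail; the
# diagnosis is a single describe + cross-check, and the fix is a one-line patch.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080   # container listens on 8000 — visible in one look
```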
What's needed is 2-3 faults where kubectl describe doesn't hand the agent the answer — problems requiring multi-step correlation. The config-policy-enforcer (F24) is already the best example: the agent must discover why fixes keep reverting. More faults at that level will move the score.
RBAC subject mismatch (setup.sh lines 63-66) — is this intentional?
The agent RBAC block creates RoleBindings with kind: User, name: ubuntu-user, but the agent authenticates as a ServiceAccount (system:serviceaccount:default:ubuntu-user). This isn't labeled as one of the F1-F28 faults.
- If unintentional: It's a setup bug. Run 9 hit it — the agent was locked out of all task namespaces for 69 minutes. The other 9 runs worked (k3s sometimes treats User and ServiceAccount subjects interchangeably), making this non-deterministic. Fix: change the subject to `kind: ServiceAccount`, `namespace: default`, `apiGroup: ""`.
- If intentional: RBAC subject-type troubleshooting is a legitimate DevOps skill and would be a good challenge. But it needs a fault label, and the non-determinism needs addressing — either it should reliably block access (so every run tests it) or reliably work (so it's not a lottery).
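A minimal sketch of the corrected binding, assuming the "if unintentional" reading (the binding name, namespace, and roleRef are illustrative, not the values in setup.sh):

```yaml
# Sketch of the fix: bind the ServiceAccount the agent actually authenticates
# as (system:serviceaccount:default:ubuntu-user), not a User subject.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-access         # hypothetical name
  namespace: task-namespace  # repeat per task namespace
subjects:
  - kind: ServiceAccount
    name: ubuntu-user
    namespace: default
    apiGroup: ""             # ServiceAccount subjects use the core ("") group
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: task-editor          # hypothetical
```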
Note on Run 1 (stability check): The agent fixed HPA oscillation but left 1/3 replicas ready, and the grader failed it. The prompt says "remain stable... replicas should not oscillate" — a senior engineer seeing 1/3 Ready would likely keep investigating, so this is within fair bounds. But if you restructure subscores, consider whether the stability language could be slightly more explicit (e.g., "healthy with all desired replicas running") to remove any ambiguity.
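If the stability wording is tightened to something like "healthy with all desired replicas running," the grader check can be made equally explicit. A minimal sketch, assuming the ready/desired counts are extracted beforehand (e.g. via `kubectl get deploy <name> -o jsonpath='{.status.readyReplicas} {.spec.replicas}'`):

```shell
# replicas_stable READY DESIRED
# Echoes "stable" only when every desired replica is ready and at least one
# replica is desired — mirroring an explicit "all desired replicas running"
# grading rule, so a 1/3 Ready deployment like Run 1's fails unambiguously.
replicas_stable() {
  ready="${1:-0}"
  desired="${2:-0}"
  if [ "$ready" -eq "$desired" ] && [ "$desired" -gt 0 ]; then
    echo stable
  else
    echo unstable
  fi
}
```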
The shallow bugs are worth keeping — they're realistic texture and consume agent turns. The problem is that they're the only kind of difficulty present. Layer in 2-3 faults that require multi-step reasoning:
| Deeper fault | Why it's harder (fairly) |
|---|---|
| Sidecar re-injects stale credential on pod restart — fixing the Secret and restarting isn't enough | Agent must trace the credential lifecycle across resources |
| Chained config: fixing matchers surfaces inhibition rule → clearing inhibition reveals group_wait → reducing group_wait exposes catch-all route | Agent must iterate and retest, can't batch-apply |
| NetworkPolicy passes for service FQDN but DNS resolution itself is blocked by a different egress policy | Agent must reason about dependency chains |
| PodDisruptionBudget referencing pods in another namespace blocks a deployment | Agent must correlate across namespace boundaries |
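The DNS-vs-FQDN fault from the table can be sketched as follows (namespace, labels, and policy name are hypothetical). Egress to the backend pods is allowed, but nothing permits port 53 to kube-dns, so the service FQDN never resolves — the agent must notice that resolution, not reachability, is what's blocked:

```yaml
# Hypothetical fault sketch: the app's egress policy allows traffic to the
# backend by selector, but omits DNS, so connections to the FQDN time out
# during resolution even though the backend itself is reachable by IP.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-egress
  namespace: app             # hypothetical
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes: ["Egress"]
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: backend
    # The missing piece a correct fix would add:
    # - to:
    #     - namespaceSelector:
    #         matchLabels:
    #           kubernetes.io/metadata.name: kube-system
    #   ports:
    #     - protocol: UDP
    #       port: 53
    #     - protocol: TCP
    #       port: 53
```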
Key principle: difficulty should come from diagnostic depth, not from hidden gotchas. A senior engineer should find these problems challenging for the same reasons.
Current check: .md file contains "alertmanager" and the names of both receivers. An agent can pass with a single line containing those keywords.
The fix isn't hidden grader expectations — it's making task.yaml ask for content that requires real investigation to produce. For example, task.yaml could say:
Your runbook must document: (1) the alert routing paths from firing to each notification channel, including severity-to-receiver mappings; (2) the exact commands to verify each channel is delivering; (3) what broke and how you resolved it.
Then the grader checks for things the agent would only know from actually doing the work:
- The real receiver names and endpoints from the AlertManager config
- Working `curl` or `amtool` verification commands with correct endpoints
- References to specific faults discovered (e.g., the config-policy-enforcer, the DNS egress policy)
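A sketch of what such a content check could look like (the receiver names and the `amtool` token below are placeholders, not the task's real values — the grader would substitute the actual environment-specific strings):

```shell
# check_runbook FILE: pass only if the runbook contains environment-specific
# details the agent could only know from doing the work. Tokens are placeholders.
check_runbook() {
  rb="$1"
  for token in pagerduty-critical slack-warnings amtool; do
    grep -q "$token" "$rb" || { echo fail; return 1; }
  done
  echo pass
}
```

A boilerplate runbook fails this check because it cannot name the real receivers or reproduce working verification commands.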
This is fair — the prompt tells the agent exactly what to write. It's difficult — a boilerplate runbook won't contain environment-specific details, and an agent that didn't fully investigate can't produce them.
The config-policy-enforcer is the task's best fault, but it's discoverable too quickly as a named Deployment. Consider implementing it as a ValidatingWebhookConfiguration or MutatingAdmissionWebhook — realistic (OPA/Gatekeeper/Kyverno), and harder to trace because interference doesn't appear in kubectl get deployments. If you add more complex faults, verify the delivery poll wait times (currently 10 × 2s) still suffice.
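A sketch of the webhook variant (the webhook name suffix, backing service, and rules are hypothetical). Because the interference now happens at admission rather than via a visible controller, `kubectl get deployments` shows nothing; the trail runs through `kubectl get mutatingwebhookconfigurations` instead:

```yaml
# Hypothetical sketch: replace the config-policy-enforcer Deployment with a
# mutating admission webhook that rewrites ConfigMap updates as they are
# applied — the agent's fixes appear to succeed but land mutated.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: config-policy-enforcer    # reuses the existing fault's name
webhooks:
  - name: enforce.config-policy.example.com   # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["configmaps"]
        operations: ["CREATE", "UPDATE"]
    clientConfig:
      service:
        name: policy-webhook      # hypothetical backing service
        namespace: kube-system
        path: /mutate
```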
| Run | notification_pipeline | runbook | stability | Score |
|---|---|---|---|---|
| 1 | pass | pass | fail | 0.67 |
| 2–8 | pass | pass | pass | 1.0 |
| 9 | (RBAC lockout — see "RBAC subject mismatch" above) | — | — | not scored |
| 10 | pass | pass | pass | 1.0 |
notification_pipeline and runbook never fail across any run. The only variance is stability (1 failure in 9 scored runs). Effective score range: [0.67, 1.0].
Preserve these through revisions:
- Mock webhook receivers with real auth validation — receivers check Bearer tokens and routing keys, so the agent must source correct credentials rather than guess
- Discovery endpoints — mock receivers expose API-compatible listing endpoints; rewards methodical investigation
- End-to-end delivery grading — actual alerts through the pipeline, not file-existence checks
- No answer leakage — setup artifacts include realistic noise without revealing solutions
- Config policy enforcer concept — non-obvious interference; template for deeper faults
- HPA deletion accepted as valid remediation — grader correctly treats this as valid (fixed from v4 feedback)
| Criterion | Status | Detail |
|---|---|---|
| Solvable | Pass (assumed) | solution.sh validates per author |
| Challenging | Fail | Mean 0.96, needs <0.50 (docker backend) |
| Substantial | Pass | 28-fault task with runbook deliverable is ≥4H scope |