Verdict: NEEDS_WORK | Mean score 0.96 (threshold: <0.50 docker backend) | 8 of 9 scored runs perfect 1.0
Every fault follows the same diagnostic loop:
kubectl describe → see obvious mismatch → fix → next
Wrong port, wrong selector, wrong ConfigMap key, paused deployment, archived repo — each resolves in 1-2 tool calls. Twenty-eight of these is a long task, but not a difficult one. Quantity of shallow faults doesn't compound into difficulty.
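To make the shallow class concrete, here is a sketch of a typical one-look fault (the Service name, labels, and ports are hypothetical, not taken from the actual setup.sh). Comparing `kubectl describe svc web` (TargetPort) with the pod's containerPort exposes the mismatch immediately:

```yaml
# Hypothetical shallow fault: Service targetPort doesn't match the port the
# container actually listens on. Endpoints exist, but connections fail; the
# diagnosis is a single describe + cross-check, and the fix is a one-line patch.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080   # container listens on 8000 — visible in one look
```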
What's needed is 2-3 faults where kubectl describe doesn't hand the agent the answer — problems requiring multi-step correlation. The config-policy-enforcer (F24) is already the best example: the agent must discover why fixes keep reverting. More faults at that level will move the score.
RBAC subject mismatch (setup.sh lines 63-66) — is this intentional?
The agent RBAC block creates RoleBindings with kind: User, name: ubuntu-user, but the agent authenticates as a ServiceAccount (system:serviceaccount:default:ubuntu-user). This isn't labeled as one of the F1-F28 faults.
- If unintentional: It's a setup bug. Run 9 hit it — the agent was locked out of all task namespaces for 69 minutes. The other 9 runs worked (k3s sometimes treats User and ServiceAccount subjects interchangeably), making this non-deterministic. Fix: change the subject to `kind: ServiceAccount`, `namespace: default`, `apiGroup: ""`.
- If intentional: RBAC subject-type troubleshooting is a legitimate DevOps skill and would be a good challenge. But it needs a fault label, and the non-determinism needs addressing — either it should reliably block access (so every run tests it) or reliably work (so it's not a lottery).
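A minimal sketch of the corrected binding, assuming the "if unintentional" reading (the binding name, namespace, and roleRef are illustrative, not the values in setup.sh):

```yaml
# Sketch of the fix: bind the ServiceAccount the agent actually authenticates
# as (system:serviceaccount:default:ubuntu-user), not a User subject.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-access         # hypothetical name
  namespace: task-namespace  # repeat per task namespace
subjects:
  - kind: ServiceAccount
    name: ubuntu-user
    namespace: default
    apiGroup: ""             # ServiceAccount subjects use the core ("") group
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: task-editor          # hypothetical
```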
Note on Run 1 (stability check): The agent fixed HPA oscillation but left 1/3 replicas ready, and the grader failed it. The prompt says "remain stable... replicas should not oscillate" — a senior engineer seeing 1/3 Ready would likely keep investigating, so this is within fair bounds. But if you restructure subscores, consider whether the stability language could be slightly more explicit (e.g., "healthy with all desired replicas running") to remove any ambiguity.
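If the stability wording is tightened to something like "healthy with all desired replicas running," the grader check can be made equally explicit. A minimal sketch, assuming the ready/desired counts are extracted beforehand (e.g. via `kubectl get deploy <name> -o jsonpath='{.status.readyReplicas} {.spec.replicas}'`):

```shell
# replicas_stable READY DESIRED
# Echoes "stable" only when every desired replica is ready and at least one
# replica is desired — mirroring an explicit "all desired replicas running"
# grading rule, so a 1/3 Ready deployment like Run 1's fails unambiguously.
replicas_stable() {
  ready="${1:-0}"
  desired="${2:-0}"
  if [ "$ready" -eq "$desired" ] && [ "$desired" -gt 0 ]; then
    echo stable
  else
    echo unstable
  fi
}
```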
The shallow bugs are worth keeping — they're realistic texture and consume agent turns. The problem is that they're the only kind of difficulty present. Layer in 2-3 faults that require multi-step reasoning:
| Deeper fault | Why it's harder (fairly) |
|---|---|
| Sidecar re-injects stale credential on pod restart — fixing the Secret and restarting isn't enough | Agent must trace the credential lifecycle across resources |
| Chained config: fixing matchers surfaces inhibition rule → clearing inhibition reveals group_wait → reducing group_wait exposes catch-all route | Agent must iterate and retest, can't batch-apply |
| NetworkPolicy passes for service FQDN but DNS resolution itself is blocked by a different egress policy | Agent must reason about dependency chains |
| PodDisruptionBudget referencing pods in another namespace blocks a deployment | Agent must correlate across namespace boundaries |
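The DNS-vs-FQDN fault from the table can be sketched as follows (namespace, labels, and policy name are hypothetical). Egress to the backend pods is allowed, but nothing permits port 53 to kube-dns, so the service FQDN never resolves — the agent must notice that resolution, not reachability, is what's blocked:

```yaml
# Hypothetical fault sketch: the app's egress policy allows traffic to the
# backend by selector, but omits DNS, so connections to the FQDN time out
# during resolution even though the backend itself is reachable by IP.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-egress
  namespace: app             # hypothetical
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes: ["Egress"]
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: backend
    # The missing piece a correct fix would add:
    # - to:
    #     - namespaceSelector:
    #         matchLabels:
    #           kubernetes.io/metadata.name: kube-system
    #   ports:
    #     - protocol: UDP
    #       port: 53
    #     - protocol: TCP
    #       port: 53
```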
Key principle: difficulty should come from diagnostic depth, not from hidden gotchas. A senior engineer should find these problems challenging for the same reasons.
Current check: .md file contains "alertmanager" and the names of both receivers. An agent can pass with a single line containing those keywords.
The fix isn't hidden grader expectations — it's making task.yaml ask for content that requires real investigation to produce. For example, task.yaml could say:
Your runbook must document: (1) the alert routing paths from firing to each notification channel, including severity-to-receiver mappings; (2) the exact commands to verify each channel is delivering; (3) what broke and how you resolved it.
Then the grader checks for things the agent would only know from actually doing the work:
- The real receiver names and endpoints from the AlertManager config
- Working `curl` or `amtool` verification commands with correct endpoints
- References to specific faults discovered (e.g., the config-policy-enforcer, the DNS egress policy)
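A sketch of what such a content check could look like (the receiver names and the `amtool` token below are placeholders, not the task's real values — the grader would substitute the actual environment-specific strings):

```shell
# check_runbook FILE: pass only if the runbook contains environment-specific
# details the agent could only know from doing the work. Tokens are placeholders.
check_runbook() {
  rb="$1"
  for token in pagerduty-critical slack-warnings amtool; do
    grep -q "$token" "$rb" || { echo fail; return 1; }
  done
  echo pass
}
```

A boilerplate runbook fails this check because it cannot name the real receivers or reproduce working verification commands.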
This is fair — the prompt tells the agent exactly what to write. It's difficult — a boilerplate runbook won't contain environment-specific details, and an agent that didn't fully investigate can't produce them.
The config-policy-enforcer is the task's best fault, but it's discoverable too quickly as a named Deployment. Consider implementing it as a ValidatingWebhookConfiguration or MutatingAdmissionWebhook — realistic (OPA/Gatekeeper/Kyverno), and harder to trace because interference doesn't appear in kubectl get deployments. If you add more complex faults, verify the delivery poll wait times (currently 10 × 2s) still suffice.
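A sketch of the webhook variant (the webhook name suffix, backing service, and rules are hypothetical). Because the interference now happens at admission rather than via a visible controller, `kubectl get deployments` shows nothing; the trail runs through `kubectl get mutatingwebhookconfigurations` instead:

```yaml
# Hypothetical sketch: replace the config-policy-enforcer Deployment with a
# mutating admission webhook that rewrites ConfigMap updates as they are
# applied — the agent's fixes appear to succeed but land mutated.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: config-policy-enforcer    # reuses the existing fault's name
webhooks:
  - name: enforce.config-policy.example.com   # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["configmaps"]
        operations: ["CREATE", "UPDATE"]
    clientConfig:
      service:
        name: policy-webhook      # hypothetical backing service
        namespace: kube-system
        path: /mutate
```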
| Run | notification_pipeline | runbook | stability | Score |
|---|---|---|---|---|
| 1 | pass | pass | fail | 0.67 |
| 2–8 | pass | pass | pass | 1.0 |
| 9 | (RBAC lockout — see "RBAC subject mismatch" above) | — | — | not scored |
| 10 | pass | pass | pass | 1.0 |
notification_pipeline and runbook never fail across any run. The only variance is stability (1 failure in 9 scored runs). Effective score range: [0.67, 1.0].
Preserve these through revisions:
- Mock webhook receivers with real auth validation — receivers check Bearer tokens and routing keys, so the agent must source correct credentials rather than guess
- Discovery endpoints — mock receivers expose API-compatible listing endpoints; rewards methodical investigation
- End-to-end delivery grading — actual alerts through the pipeline, not file-existence checks
- No answer leakage — setup artifacts include realistic noise without revealing solutions
- Config policy enforcer concept — non-obvious interference; template for deeper faults
- HPA deletion accepted as valid remediation — grader correctly treats this as valid (fixed from v4 feedback)
| Criterion | Status | Detail |
|---|---|---|
| Solvable | Pass (assumed) | solution.sh validates per author |
| Challenging | Fail | Mean 0.96, needs <0.50 (docker backend) |
| Substantial | Pass | 28-fault task with runbook deliverable is ≥4H scope |