UUID: dfbb871c-3e00-4a93-b7ee-95f950b35d8b
Author: hafis_83579
Category: cloud-ops
Difficulty: hard
Reviewer: Dylan (@daltoris)
The core task design is strong — the diagnosis chain, decoy pattern, and RBAC-gated Secret discovery are excellent. Two issues block acceptance:
- Grader bug: `check_grafana_dashboard_panels()` has an early-exit bug that fails 7/10 teapot runs whose dashboards are correct. Fixing it alone would push the mean from 0.67 to ~0.87 (above the 0.85 ceiling).
- Observability too easy: with the bug fixed, the dashboard check passes ~10/10 because metric matching is overly loose. This group needs genuine difficulty to keep the score in range.
This is a grader fix + observability tightening, not a redesign.
| Criterion | Threshold | Current | Post-fix (est.) | Status |
|---|---|---|---|---|
| Solvable | solution.sh passes | Author confirms | — | PASS |
| Challenging (teapot) | mean ≤0.85 | 0.67 (10 runs) | ~0.87 | NEEDS_WORK |
| Challenging (docker) | mean <0.50 | 0.33 (20 runs) | ~0.40 | PASS |
| Substantial | ≥4H | Multi-phase diagnosis + remediation + observability | — | PASS |
The bug (grader.py lines 296-303): The grader iterates all ConfigMaps with label `grafana_dashboard=1` in alphabetical order. The pre-existing Nebula dashboard `grafana-dashboard-k8s-cluster` contains `kube_pod_container_status_restarts_total` but no node-condition metric, so the grader evaluates `has_pod=True, has_node=False` and hits `return False` — never checking the agent's dashboard that sorts later alphabetically.
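The failure mode can be reproduced in miniature. This is a hypothetical reconstruction, not the real grader: the function name, the dict fixtures, and the simplified matchers are illustrative, but the control flow mirrors the early-exit described above.

```python
def check_dashboards_buggy(items):
    """Reconstruction of the early-exit bug: the function returns on the
    FIRST ConfigMap that fails, never reaching dashboards sorting later."""
    for cm in sorted(items, key=lambda c: c["metadata"]["name"]):
        text = " ".join(cm["data"].values()).lower()
        has_pod = "kube_pod_container_status_restarts" in text
        has_node = "kube_node_status_condition" in text
        if has_pod and has_node:
            return True
        return False  # bug: exits on the first miss instead of continuing
    return False

# Pre-existing Nebula dashboard: pod metric only, name sorts before "node-*"
items = [
    {"metadata": {"name": "grafana-dashboard-k8s-cluster"},
     "data": {"d.json": "kube_pod_container_status_restarts_total"}},
    # Agent's correct dashboard with both metric categories, sorts after "k"
    {"metadata": {"name": "node-health"},
     "data": {"d.json": "kube_node_status_condition "
                        "kube_pod_container_status_restarts_total"}},
]
assert check_dashboards_buggy(items) is False  # correct dashboard never checked
```

Renaming the agent's dashboard so it sorts before `k` (e.g. `bleater-health`) makes the same fixture pass, which matches the perfect name/outcome correlation in the run data.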
The proof: The correlation is perfect across all 30 runs (10 teapot, 20 docker):
- Agent names dashboard `bleater-*` → sorts before `k` → checked first → always passes (3/3 teapot)
- Agent names dashboard `node-*` → sorts after `k` → pre-existing CM checked first → always fails (7/7 teapot)
All 7 failing agents created dashboards with both required metric categories. Verified in transcripts.
The fix: Don't return False inside the loop. Accumulate failures and keep checking:
```python
best_failure = None
for cm in items:
    name = (cm.get("metadata") or {}).get("name", "")
    for val in (cm.get("data") or {}).values():
        if not val:
            continue
        vl = val.lower()
        has_node = (...)
        has_pod = (...)
        if has_node and has_pod:
            return True, f"Dashboard '{name}' has node-condition + pod-restart panels"
        if not best_failure:
            if has_node and not has_pod:
                best_failure = f"Dashboard '{name}': has node panel but missing pod eviction/restart metric"
            elif has_pod and not has_node:
                best_failure = f"Dashboard '{name}': has pod panel but missing node condition metric"
return False, best_failure or "No dashboard found with both required metrics"
```

With the bug fixed, observability passes ~10/10 because `"restart" in val.lower()` matches any dashboard containing the word "restart" in a title or comment. The task asks for "panels that correlate node condition transitions with pod eviction and restart counts" but the grader only checks that both keyword categories appear somewhere in the ConfigMap text.
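The looseness is easy to demonstrate with the grader's current substring predicates. A sketch (the JSON string is an invented example of a metric-free dashboard):

```python
# A dashboard that queries no pod or node metric at all still satisfies the
# loose check: "restart" appears only in a panel title, "node_condition"
# only in a description string.
val = '{"title": "Restart overview", "description": "tracks node_condition drift"}'
vl = val.lower()
has_pod = "restart" in vl          # matches the title text, not a metric
has_node = "node_condition" in vl  # matches the description, not a metric
assert has_pod and has_node        # passes despite containing no PromQL
```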
Recommended: Drop the overly generic substring matches and require actual kube-state-metrics metric names:
```python
# Pod metric — drop bare "restart" (matches titles/comments, not metrics)
has_pod = ("kube_pod_container_status_restarts" in vl
           or "pod_restarts" in vl
           or "eviction" in vl)
# Node metric — drop bare "unreachable" and "node_condition" (too generic)
has_node = ("kube_node_spec_taint" in vl
            or "kube_node_status_condition" in vl)
```

Fairness: `kube_node_spec_taint` and `kube_node_status_condition` are standard kube-state-metrics names, discoverable via the Prometheus UI or metric API. No insider knowledge required — this is what a senior DevOps engineer would use.
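Applied to representative strings, the tightened predicates separate real queries from incidental keyword hits. A quick sanity sketch (the sample PromQL and JSON strings are illustrative):

```python
def tight_match(vl: str) -> bool:
    """Proposed tightened check: require actual kube-state-metrics names."""
    has_pod = ("kube_pod_container_status_restarts" in vl
               or "pod_restarts" in vl
               or "eviction" in vl)
    has_node = ("kube_node_spec_taint" in vl
                or "kube_node_status_condition" in vl)
    return has_pod and has_node

# A genuine pair of kube-state-metrics queries passes:
good = ('sum(rate(kube_pod_container_status_restarts_total[5m])) '
        'kube_node_status_condition{condition="Ready"}')
# Title/comment-only text no longer passes:
bad = '{"title": "Restart overview", "description": "node_condition drift"}'
assert tight_match(good.lower())
assert not tight_match(bad.lower())
```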
Fairness guardrails (things NOT to tighten):
- Don't require `rate()`/`increase()` — raw counters in Stat/Table panels are valid
- Don't require both metrics in the same panel — side-by-side panels with aligned time axes are standard Grafana practice (the task itself says "panels", plural)
- Don't require specific panel types or JSON structure
If metric tightening alone isn't enough: Consider adding a check for a templated time-range variable ($__range, timeFrom, or a dashboard variable) — standard Grafana practice, discoverable, doesn't constrain structural choices.
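If that check is added, it can stay as small as the metric checks. A sketch, where the token list is illustrative (the review names `$__range`, `timeFrom`, and dashboard variables; the `"templating"` key is how dashboard variables appear in Grafana dashboard JSON):

```python
def has_time_template(text: str) -> bool:
    """Loose check for any common Grafana time-templating signal."""
    vl = text.lower()
    # $__range: built-in range variable; timefrom: panel time override;
    # "templating": dashboard-variable section in the dashboard JSON
    return any(tok in vl for tok in ("$__range", "timefrom", '"templating"'))

assert has_time_template('rate(kube_node_status_condition[$__range])')
assert not has_time_template('rate(kube_node_status_condition[5m])')
```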
Fallback: If tightening doesn't create enough variance after re-eval, merging observability with remediation into a single scoring group is an acceptable path to bring the mean into threshold. This sacrifices one dimension of scoring variance, so prefer tightening first.
| Run | Score | Diagnosis | Remediation | Observability | Notes |
|---|---|---|---|---|---|
| 1 | 1.00 | PASS | PASS | PASS | dashboard `bleater-*` (sorts before `k`) |
| 2 | 0.67 | PASS | PASS | FAIL | dashboard `node-*` (sorts after `k`) |
| 3 | 0.67 | PASS | PASS | FAIL | dashboard `node-*` |
| 4 | 0.33 | FAIL | PASS | FAIL | `node-*`; also missed Secret |
| 5 | 0.67 | PASS | PASS | FAIL | `node-*` |
| 6 | 0.67 | PASS | PASS | FAIL | `node-*` |
| 7 | 1.00 | PASS | PASS | PASS | dashboard `bleater-*` |
| 8 | 0.66 | FAIL | PASS | PASS | `bleater-*`; missed Secret |
| 9 | 0.67 | PASS | PASS | FAIL | `node-*` |
| 10 | 0.33 | FAIL | PASS | FAIL | `node-*`; missed Secret |
Group pass rates: diagnosis 7/10, remediation 10/10, observability 3/10 (all 7 failures are the grader bug).
The bottleneck is `secret_cap_raised`: discovering the kube-system Secret `toleration-enforcer-config`. RBAC grants `get` (not `list`) on this specific Secret name, forcing agents to infer its existence.
- Passing agents (7/10): use `kubectl auth can-i --list` to discover available resources
- Failing agents (3/10): try `kubectl get secrets -n kube-system` → 403 → abandon kube-system entirely
- Near-miss (run 4): found the Secret name in RBAC role output but didn't connect it to trying a direct `get`
This is the task's best difficulty signal — real K8s expertise, diverse failure modes, healthy 70% pass rate.
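The RBAC shape that produces this behavior is worth recording for reviewers unfamiliar with the pattern. A sketch of the Role (the Role name is invented; the Secret name is from the task; the actual manifest is not shown in this review):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: toleration-enforcer-reader   # hypothetical name
  namespace: kube-system
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["toleration-enforcer-config"]
  verbs: ["get"]   # no "list": the agent must already know the name
```

Because `resourceNames` cannot meaningfully restrict `list` (listing returns a collection, not a named object), `get`-only grants like this are the standard way to expose a single Secret, which is exactly why blind `kubectl get secrets` 403s while a named `get` succeeds.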
All agents patch both Deployment templates with both NoExecute tolerations.
Every agent creates a structurally valid dashboard with correct metrics. The 7 failures are entirely caused by the grader bug. With the bug fixed, this group needs tightened checks to carry genuine difficulty.
- Secret discovery via RBAC `resourceNames` — the best part of this task. `get`-without-`list` is a genuinely hard K8s pattern with diverse agent failure modes.
- Decoy/real-source pattern — the bleater ConfigMap (300s) vs kube-system Secret (180s) is well-designed. All agents see through the decoy but it adds realistic investigation depth.
- Investigation chain — webhook annotations → service name → infer Secret → RBAC-gated `get`. Multiple valid discovery paths.
- Webhook ambiguity as difficulty — the ValidatingWebhookConfiguration (`failurePolicy: Fail`, no backing service) forces agents to reason about what "admission control enforcement" means before acting. All 10 agents resolve it, but the 15-25 turns of deliberation test real K8s judgment.
- Clean setup — no information leaks, good misdirection events, clean Dockerfile.
- Cohesive narrative — natural SRE arc from diagnosis through remediation to observability.
- Task says "fires when taint active for more than 120 seconds" but the grader accepts `>= 120s` and the solution uses `for: 2m` (= 120s). Borderline but not blocking.
- All 9 grader checks map to explicit task.yaml requirements; weights are equal (0.34 + 0.33 + 0.33); agents can't score >0.3 without meaningful progress. The grader is solid aside from the dashboard bug.