UUID: dfbb871c-3e00-4a93-b7ee-95f950b35d8b
Author: hafis_83579
Category: cloud-ops
Difficulty: hard
Reviewer: Dylan (@daltoris)
The core task design is strong — the diagnosis chain, decoy pattern, and RBAC-gated Secret discovery are excellent. Two issues block acceptance:
- Grader bug: `check_grafana_dashboard_panels()` has an early-exit bug that fails 7/10 teapot runs whose dashboards are correct. Fixing it alone would push the mean from 0.67 to ~0.87 (above the 0.85 ceiling).
- Observability too easy: with the bug fixed, the dashboard check passes ~10/10 because metric matching is overly loose. This group needs genuine difficulty to keep the score in range.
This is a grader fix + observability tightening, not a redesign.
| Criterion | Threshold | Current | Post-fix (est.) | Status |
|---|---|---|---|---|
| Solvable | solution.sh passes | Author confirms | — | PASS |
| Challenging (teapot) | mean ≤0.85 | 0.67 (10 runs) | ~0.87 | NEEDS_WORK |
| Challenging (docker) | mean <0.50 | 0.33 (20 runs) | ~0.40 | PASS |
| Substantial | ≥4H | Multi-phase diagnosis + remediation + observability | — | PASS |
The bug (grader.py lines 296-303): The grader iterates all ConfigMaps with label `grafana_dashboard=1` in alphabetical order. The pre-existing Nebula dashboard `grafana-dashboard-k8s-cluster` contains `kube_pod_container_status_restarts_total` but no node-condition metric, so the grader evaluates `has_pod=True, has_node=False` and hits `return False` — never checking the agent's dashboard that sorts later alphabetically.
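The failure mode can be reproduced in miniature. This is a hypothetical reconstruction, not the real grader: the function name, the dict fixtures, and the simplified matchers are illustrative, but the control flow mirrors the early-exit described above.

```python
def check_dashboards_buggy(items):
    """Reconstruction of the early-exit bug: the function returns on the
    FIRST ConfigMap that fails, never reaching dashboards sorting later."""
    for cm in sorted(items, key=lambda c: c["metadata"]["name"]):
        text = " ".join(cm["data"].values()).lower()
        has_pod = "kube_pod_container_status_restarts" in text
        has_node = "kube_node_status_condition" in text
        if has_pod and has_node:
            return True
        return False  # bug: exits on the first miss instead of continuing
    return False

# Pre-existing Nebula dashboard: pod metric only, name sorts before "node-*"
items = [
    {"metadata": {"name": "grafana-dashboard-k8s-cluster"},
     "data": {"d.json": "kube_pod_container_status_restarts_total"}},
    # Agent's correct dashboard with both metric categories, sorts after "k"
    {"metadata": {"name": "node-health"},
     "data": {"d.json": "kube_node_status_condition "
                        "kube_pod_container_status_restarts_total"}},
]
assert check_dashboards_buggy(items) is False  # correct dashboard never checked
```

Renaming the agent's dashboard so it sorts before `k` (e.g. `bleater-health`) makes the same fixture pass, which matches the perfect name/outcome correlation in the run data.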
The proof: The correlation is perfect across all 30 runs (10 teapot, 20 docker):
- Agent names dashboard `bleater-*` → sorts before `k` → checked first → always passes (3/3 teapot)
- Agent names dashboard `node-*` → sorts after `k` → pre-existing CM checked first → always fails (7/7 teapot)
All 7 failing agents created dashboards with both required metric categories. Verified in transcripts.
The fix: Don't return False inside the loop. Accumulate failures and keep checking:
```python
best_failure = None
for cm in items:
    name = (cm.get("metadata") or {}).get("name", "")
    for val in (cm.get("data") or {}).values():
        if not val:
            continue
        vl = val.lower()
        has_node = (...)
        has_pod = (...)
        if has_node and has_pod:
            return True, f"Dashboard '{name}' has node-condition + pod-restart panels"
        if not best_failure:
            if has_node and not has_pod:
                best_failure = f"Dashboard '{name}': has node panel but missing pod eviction/restart metric"
            elif has_pod and not has_node:
                best_failure = f"Dashboard '{name}': has pod panel but missing node condition metric"
return False, best_failure or "No dashboard found with both required metrics"
```

With the bug fixed, observability passes ~10/10 because `"restart" in val.lower()` matches any dashboard containing the word "restart" in a title or comment. The task asks for "panels that correlate node condition transitions with pod eviction and restart counts" but the grader only checks that both keyword categories appear somewhere in the ConfigMap text.
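The looseness is easy to demonstrate with the grader's current substring predicates. A sketch (the JSON string is an invented example of a metric-free dashboard):

```python
# A dashboard that queries no pod or node metric at all still satisfies the
# loose check: "restart" appears only in a panel title, "node_condition"
# only in a description string.
val = '{"title": "Restart overview", "description": "tracks node_condition drift"}'
vl = val.lower()
has_pod = "restart" in vl          # matches the title text, not a metric
has_node = "node_condition" in vl  # matches the description, not a metric
assert has_pod and has_node        # passes despite containing no PromQL
```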
Recommended: Drop the overly generic substring matches and require actual kube-state-metrics metric names:
```python
# Pod metric — drop bare "restart" (matches titles/comments, not metrics)
has_pod = ("kube_pod_container_status_restarts" in vl
           or "pod_restarts" in vl
           or "eviction" in vl)
# Node metric — drop bare "unreachable" and "node_condition" (too generic)
has_node = ("kube_node_spec_taint" in vl
            or "kube_node_status_condition" in vl)
```

Fairness: `kube_node_spec_taint` and `kube_node_status_condition` are standard kube-state-metrics names, discoverable via the Prometheus UI or metric API. No insider knowledge required — this is what a senior DevOps engineer would use.
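Applied to representative strings, the tightened predicates separate real queries from incidental keyword hits. A quick sanity sketch (the sample PromQL and JSON strings are illustrative):

```python
def tight_match(vl: str) -> bool:
    """Proposed tightened check: require actual kube-state-metrics names."""
    has_pod = ("kube_pod_container_status_restarts" in vl
               or "pod_restarts" in vl
               or "eviction" in vl)
    has_node = ("kube_node_spec_taint" in vl
                or "kube_node_status_condition" in vl)
    return has_pod and has_node

# A genuine pair of kube-state-metrics queries passes:
good = ('sum(rate(kube_pod_container_status_restarts_total[5m])) '
        'kube_node_status_condition{condition="Ready"}')
# Title/comment-only text no longer passes:
bad = '{"title": "Restart overview", "description": "node_condition drift"}'
assert tight_match(good.lower())
assert not tight_match(bad.lower())
```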
Fairness guardrails (things NOT to tighten):
- Don't require `rate()`/`increase()` — raw counters in Stat/Table panels are valid
- Don't require both metrics in the same panel — side-by-side panels with aligned time axes are standard Grafana practice (the task itself says "panels", plural)
- Don't require specific panel types or JSON structure
If metric tightening alone isn't enough: Consider adding a check for a templated time-range variable ($__range, timeFrom, or a dashboard variable) — standard Grafana practice, discoverable, doesn't constrain structural choices.
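If that check is added, it can stay as small as the metric checks. A sketch, where the token list is illustrative (the review names `$__range`, `timeFrom`, and dashboard variables; the `"templating"` key is how dashboard variables appear in Grafana dashboard JSON):

```python
def has_time_template(text: str) -> bool:
    """Loose check for any common Grafana time-templating signal."""
    vl = text.lower()
    # $__range: built-in range variable; timefrom: panel time override;
    # "templating": dashboard-variable section in the dashboard JSON
    return any(tok in vl for tok in ("$__range", "timefrom", '"templating"'))

assert has_time_template('rate(kube_node_status_condition[$__range])')
assert not has_time_template('rate(kube_node_status_condition[5m])')
```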
Fallback: If tightening doesn't create enough variance after re-eval, merging observability with remediation into a single scoring group is an acceptable path to bring the mean into threshold. This sacrifices one dimension of scoring variance, so prefer tightening first.
| Run | Score | Diagnosis | Remediation | Observability | Notes |
|---|---|---|---|---|---|
| 1 | 1.00 | PASS | PASS | PASS | dashboard `bleater-*` (sorts before `k`) |
| 2 | 0.67 | PASS | PASS | FAIL | dashboard `node-*` (sorts after `k`) |
| 3 | 0.67 | PASS | PASS | FAIL | dashboard `node-*` |
| 4 | 0.33 | FAIL | PASS | FAIL | `node-*`; also missed Secret |
| 5 | 0.67 | PASS | PASS | FAIL | `node-*` |
| 6 | 0.67 | PASS | PASS | FAIL | `node-*` |
| 7 | 1.00 | PASS | PASS | PASS | dashboard `bleater-*` |
| 8 | 0.66 | FAIL | PASS | PASS | `bleater-*`; missed Secret |
| 9 | 0.67 | PASS | PASS | FAIL | `node-*` |
| 10 | 0.33 | FAIL | PASS | FAIL | `node-*`; missed Secret |
Group pass rates: diagnosis 7/10, remediation 10/10, observability 3/10 (all 7 failures are the grader bug).
The bottleneck is `secret_cap_raised`: discovering the kube-system Secret `toleration-enforcer-config`. RBAC grants `get` (not `list`) on this specific Secret name, forcing agents to infer its existence.
- Passing agents (7/10): use `kubectl auth can-i --list` to discover available resources
- Failing agents (3/10): try `kubectl get secrets -n kube-system` → 403 → abandon kube-system entirely
- Near-miss (run 4): found the Secret name in RBAC role output but didn't connect it to trying a direct `get`
This is the task's best difficulty signal — real K8s expertise, diverse failure modes, healthy 70% pass rate.
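The RBAC shape that produces this behavior is worth recording for reviewers unfamiliar with the pattern. A sketch of the Role (the Role name is invented; the Secret name is from the task; the actual manifest is not shown in this review):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: toleration-enforcer-reader   # hypothetical name
  namespace: kube-system
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["toleration-enforcer-config"]
  verbs: ["get"]   # no "list": the agent must already know the name
```

Because `resourceNames` cannot meaningfully restrict `list` (listing returns a collection, not a named object), `get`-only grants like this are the standard way to expose a single Secret, which is exactly why blind `kubectl get secrets` 403s while a named `get` succeeds.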
All agents patch both Deployment templates with both NoExecute tolerations.
Every agent creates a structurally valid dashboard with correct metrics. The 7 failures are entirely caused by the grader bug. With the bug fixed, this group needs tightened checks to carry genuine difficulty.
- Secret discovery via RBAC `resourceNames` — the best part of this task. `get`-without-`list` is a genuinely hard K8s pattern with diverse agent failure modes.
- Decoy/real-source pattern — the bleater ConfigMap (300s) vs kube-system Secret (180s) is well-designed. All agents see through the decoy but it adds realistic investigation depth.
- Investigation chain — webhook annotations → service name → infer Secret → RBAC-gated `get`. Multiple valid discovery paths.
- Webhook ambiguity as difficulty — the ValidatingWebhookConfiguration (`failurePolicy: Fail`, no backing service) forces agents to reason about what "admission control enforcement" means before acting. All 10 agents resolve it, but the 15-25 turns of deliberation test real K8s judgment.
- Clean setup — no information leaks, good misdirection events, clean Dockerfile.
- Cohesive narrative — natural SRE arc from diagnosis through remediation to observability.
- Task says "fires when taint active for more than 120 seconds" but the grader accepts `>= 120s` and the solution uses `for: 2m` (= 120s). Borderline but not blocking.
- All 9 grader checks map to explicit task.yaml requirements; weights are equal (0.34 + 0.33 + 0.33); agents can't score >0.3 without meaningful progress. The grader is solid aside from the dashboard bug.