UUID: 16a06e6e-2bd9-4da1-8930-34ab1aef5f59
Author: @hafis_83579
Reviewer: Dylan Fitzgerald
Date: 2026-04-13
Discord: https://discord.com/channels/1427397917685321919/1485734279571832883
Solid task concept with a genuine multi-layered puzzle (COALESCE-based DB overrides → disabled service → wrong method/status → suspended CronJob → NetworkPolicy → incident doc). Solvable locally at 1.0. Two fixable design issues and one platform anomaly block acceptance. See also: difficulty-ratcheting recommendations at the bottom, which anticipate the post-fix pass rate.
| Backend | Rollouts | Avg | Full-pass | Threshold | Meets? |
|---|---|---|---|---|---|
| biggie-nebula | 5 | 0.46 | 1/5 (20%) | <0.50 docker | ✅ |
| nighthawk (teapot) | 5 | 0.33 | 0/5 (0%) | ≤0.85 teapot | ✅ nominally — see caveat |
All failures across both backends share the same pattern: `compliance=0, core_db=0, restoration=1`. There is zero intra-failure subscore variance; compliance and core_db always move together, confirming they're gated by the same `np_resolved()` function.
- All 4 failing runs took the pod-labeling shortcut: they added `role: monitor-approved` to the statping-ng pod rather than patching the NetworkPolicy.
- Run 4 (the one pass, 4 min, 93 msgs): used `kubectl patch networkpolicy ... --type=json` to append `app: statping-ng` to `ingress.from`. Clean and methodical.
- These are borderline-genuine failures: the grader's rejection of pod-labeling is defensible, but the prompt doesn't clearly signal "modify the object, not the pod."
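The append that run 4 performed can be sketched in plain Python. This is a hedged reconstruction: the selector shapes are taken from the tool output quoted in this review, not from the task's actual manifests.

```python
import copy

# Hypothetical pre-fix ingress rules; selector shapes assumed from the
# tool output quoted in this review, not from the task's real YAML.
ingress = [{"from": [
    {"podSelector": {"matchLabels": {"role": "monitor-approved"}}},
]}]

def apply_json_patch_add(rules, index, entry):
    """Mimic the JSON-patch 'add' op run 4 used: append an allow-rule
    to ingress[index].from without touching the existing rule."""
    patched = copy.deepcopy(rules)
    patched[index]["from"].append(entry)
    return patched

patched = apply_json_patch_add(
    ingress, 0, {"podSelector": {"matchLabels": {"app": "statping-ng"}}},
)
```

The original `role: monitor-approved` rule survives, which is exactly the property ratchet A below would start enforcing.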
- Runs 1, 3, 5: correctly patched the NP (tool output confirms `ingress[0].from = [{role: monitor-approved}, {app: statping-ng}]`) yet scored `compliance=0`. The grader's substring check, `"statping-ng" in ingress_str`, should have matched.
- Runs 2, 4: pod-labeled only; genuine failures.
- Local `horizon test-solution` scores 1.0, so the grader itself is correct. The evidence points to nighthawk backend grader-environment isolation (the agent modifies one cluster's state while the grader reads another). Platform issue, not a task defect.
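A minimal reconstruction of the grader's substring check against the ingress state runs 1, 3, and 5 produced. The `json.dumps` serialization is an assumption; the real grader's `ingress_str` construction isn't shown in this review.

```python
import json

def np_resolved(ingress_rules):
    # Reconstruction of the grader's substring check; how the real
    # grader builds ingress_str is an assumption.
    ingress_str = json.dumps(ingress_rules)
    return "statping-ng" in ingress_str

# The ingress state runs 1/3/5 produced, per their tool output:
patched = [{"from": [
    {"podSelector": {"matchLabels": {"role": "monitor-approved"}}},
    {"podSelector": {"matchLabels": {"app": "statping-ng"}}},
]}]
# np_resolved(patched) returns True, so a grader reading this cluster
# state would award credit; compliance=0 implies it read different state.
```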
`np_resolved()` is called in both `check_compliance` (0.34) AND `check_core_db` (0.33), so a pod-labeling agent loses 0.67 points for one architectural choice instead of 0.34. The comment in `check_core_db` justifies the gate as "prevents k3s gaming," but the penalty is disproportionate: no rollout ever achieved `compliance=0, core_db=1` or vice versa. The two subscores always move together, confirming the gate is redundant signal.
Fix: remove `np_resolved()` from `check_core_db` (leave only `online=1`). This keeps the 3-subscore structure and eliminates the double penalty.
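The double penalty and the proposed fix can be sketched side by side. This is a reconstruction of the scoring shape only, with the restoration subscore omitted.

```python
def score_current(np_ok, online):
    # Current shape (reconstruction): np_resolved() gates BOTH subscores,
    # so one architectural choice moves 0.67 points.
    compliance = 0.34 if np_ok else 0.0
    core_db = 0.33 if (online and np_ok) else 0.0
    return round(compliance + core_db, 2)

def score_fixed(np_ok, online):
    # Proposed fix: core_db depends only on online=1.
    compliance = 0.34 if np_ok else 0.0
    core_db = 0.33 if online else 0.0
    return round(compliance + core_db, 2)
```

A pod-labeling agent with healthy services (`np_ok=False, online=True`) loses 0.67 today but only 0.34 after the fix.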
Task says: "Ensure namespace network policies do not restrict statping-ng's access." This is outcome-oriented, and pod-labeling literally satisfies it (the NP was designed with a `role: monitor-approved` allow-rule, making pod-labeling a natural, idiomatic K8s pattern), yet the grader demands modification of the policy object itself.
Fix (preferred, preserves difficulty): Add to task.yaml:
"Resolve the NetworkPolicy itself — modify or delete the NetworkPolicy resource; do not work around it by labeling pods."
This subscore passes 10/10 across both backends: a CronJob unsuspend plus "any ConfigMap with keyword 'override'" is trivially achievable. See the ratcheting section below for a principled tightening.
3/5 nighthawk runs correctly patched the NP per grader spec yet scored 0. Local docker solvability confirms the grader is correct. This looks like a nighthawk-backend platform defect (grader doesn't see agent's kubectl changes).
Action: Do not block task on this. Flag to platform team with transcript references:
eval_transcripts/16a06e6e-nighthawk/run{1,3,5}-batch1.json
- Well-instrumented `setup.sh` with clean DB seeding via the `env DB_SQL=... python3 -c` helper pattern
- Multi-root-cause diagnosis (DB overrides + wrong service fields + NP + CronJob + docs) is substantive
- `solution.sh` is clear and passes locally at 1.0
- Grader uses partial credit appropriately (0.00 / 0.33 / 0.34 / 0.66 / 0.67 / 1.00 scoring ladder)
- Equal-ish weights (0.34 / 0.33 / 0.33), no answer leakage in grader, no hardcoded values
- 4H+ scope easily met
After fixes #1 and #2 above, nighthawk would likely jump to 3/5, then 5/5 passes: the 3 currently-correctly-patched nighthawk runs would pass once the platform issue is resolved, and the 2 pod-labelers would convert once the prompt is clear. You therefore need genuine ratchets, not just disambiguation, or teapot will full-pass.
Currently the grader accepts both "delete the NP" and "patch the NP to add statping-ng"; `solution.sh` takes the easy path (`kubectl delete`). Change the grader to reject deletion and require the policy to remain, with `role: monitor-approved` intact PLUS `app: statping-ng` appended.
Add to prompt:
"The security team's NetworkPolicy must remain in force for the rest of the namespace. Modify it to allow statping-ng while preserving the existing `role: monitor-approved` allow-rule; do not delete it, and do not work around it by labeling pods."
Why fair: explicit constraint in prompt, defensible rationale (prod security posture).
Why hard: requires JSON-patch skills, tests production discipline vs. "nuke and pave".
Implementation: `np_resolved()` already checks for `"statping-ng"` in the ingress; also verify that `role: monitor-approved` is still present, and drop the "deleted" acceptance branch.
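A hedged sketch of the tightened check, assuming the grader receives the NetworkPolicy as a dict (or `None` when it was deleted); the field access mirrors the standard NetworkPolicy schema, not the task's actual grader code.

```python
import json

def np_resolved_strict(np_obj):
    # Ratchet-A sketch: deletion no longer passes, and BOTH the original
    # role: monitor-approved rule and the new statping-ng rule must be
    # present in spec.ingress. np_obj is the policy as a dict, or None.
    if np_obj is None:
        return False
    ingress_str = json.dumps(np_obj.get("spec", {}).get("ingress", []))
    return "monitor-approved" in ingress_str and "statping-ng" in ingress_str
```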
Current `check_restoration`: any single keyword (override, authoritative, coalesce, check_adjustments) passes. Gameable, and it provides zero variance.
Change to N-of-M root-cause markers (e.g., ≥4 of 5):
- Identifies the `check_adjustments` table specifically (not just "override")
- Mentions `enabled=0` or "disabled service" for timeline-service
- Mentions the wrong `method`/`expected_status` for fanout-service
- Identifies SQLite as the authoritative source (vs. ConfigMap decoys)
- References the NetworkPolicy fix
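A sketch of the N-of-M check; the marker regexes here are illustrative stand-ins, not the grader's actual patterns.

```python
import re

# Illustrative marker patterns for the 5 root causes; the real list
# would live in the grader and likely be more precise.
MARKERS = [
    r"check_adjustments",
    r"enabled\s*=\s*0|disabled service",
    r"expected_status|wrong method",
    r"sqlite",
    r"network\s*policy|networkpolicy",
]

def incident_documentation(doc, required=4):
    # N-of-M: at least `required` of the M markers must appear.
    doc = doc.lower()
    hits = sum(1 for pat in MARKERS if re.search(pat, doc))
    return 1.0 if hits >= required else 0.0
```

Unlike the single-keyword check, a generic "we restarted things" report scores 0 while a report naming the actual layered causes scores 1.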
Rename the subscore to `incident_documentation`.
Why fair: the prompt already asks for "root cause(s) for each failing service" and "which source was authoritative"; you're just grading what you asked for.
Why hard: tests whether the agent actually understood the layered failure vs. wrote a generic report.
Add a second suspended resource: e.g., a Deployment named statping-reconciler scaled to `replicas=0` (a DaemonSet has no replica count, so a Deployment is the cleaner decoy), or a second CronJob like statping-alert-resolver that clears stale alerts. The prompt currently says "A workload has been suspended"; change it to "Workloads have been suspended" (plural) and check both.
Why fair: prompt signals plurality; agents should inventory thoroughly.
Why hard: cheap to miss one, easy to find if you sweep `kubectl get cronjobs,deployments,daemonsets,statefulsets`.
Currently `core_db` passes on the first cycle in which both services show `online=1`. Add: both services must remain `online=1` for 3 consecutive check cycles (~90s), storing observations between grader polls.
Why fair: the prompt says "mark as UP in statping-ng's check results"; stability is implicit.
Why hard: flaky fixes (service-selector thrashing, which several nighthawk runs exhibited) would fail. This forces agents to verify before declaring victory.
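One way to sketch the stability window, assuming the grader can persist scratch state between polls; the state-file path is hypothetical.

```python
import json
import os

STATE = "/tmp/core_db_streak.json"  # hypothetical scratch path

def stable_online(both_online, needed=3):
    # Stability-window sketch: keep a consecutive-success counter across
    # grader polls; pass only after `needed` consecutive cycles with both
    # services online. Any down cycle resets the streak to zero.
    streak = 0
    if os.path.exists(STATE):
        with open(STATE) as f:
            streak = json.load(f).get("streak", 0)
    streak = streak + 1 if both_online else 0
    with open(STATE, "w") as f:
        json.dump({"streak": streak}, f)
    return streak >= needed
```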
- Don't add hidden traps (triggers that revert fixes, ConfigMap webhooks that re-break things). Agents have no prompt signal to look for these → unfair.
- Don't require the CronJob to be the fix mechanism for `check_adjustments`. A manual `DELETE` is a valid workflow, and the prompt doesn't forbid it.
- Don't add more label-collision puzzles: transcripts show agents already burn significant time on the `bleater-*` shadow deployments. Adding more hurts signal quality.
Apply ratchets A and B together first. That alone probably pushes the teapot pass rate from a hypothetical 5/5 back down to ~2/5 (≈0.5 average), landing in the productive challenge zone. Hold C and D in reserve if teapot still aces it after round 1.
- Remove the NP gate from `check_core_db` (fix #1); eliminates the 0.67-point double penalty
- Add explicit "preserve the NP, do not pod-label" prompt language (fix #2 + ratchet A combined)
- Tighten `restoration` → `incident_documentation` with N-of-M root-cause markers (fix #3 + ratchet B combined)
- Report the nighthawk eval anomaly to the platform team separately (issue #4)
- (Reserve): second suspended workload + stability window if the pass rate is still too high after items 1–3
After these, expected teapot pass rate lands in the 0.4–0.7 range — difficult but fair, with subscore variance that produces useful training signal.