
Review: statping-check-cascade (v49)

UUID: 16a06e6e-2bd9-4da1-8930-34ab1aef5f59
Author: @hafis_83579
Reviewer: Dylan Fitzgerald
Date: 2026-04-13
Discord: https://discord.com/channels/1427397917685321919/1485734279571832883

Verdict: 🟡 NEEDS_WORK

Solid task concept with a genuine multi-layered puzzle (COALESCE-based DB overrides → disabled service → wrong method/status → suspended CronJob → NetworkPolicy → incident doc). Solvable locally at 1.0. Two fixable design issues and one platform anomaly block acceptance. See also: difficulty-ratcheting recommendations at the bottom, which anticipate the post-fix pass rate.


Eval Summary

Backend            | Rollouts | Avg  | Full-pass | Threshold      | Meets?
biggie-nebula      | 5        | 0.46 | 1/5 (20%) | <0.50 (docker) |
nighthawk (teapot) | 5        | 0.33 | 0/5 (0%)  | ≤0.85 (teapot) | ✅ nominally — see caveat

All failures across both backends share the same pattern: compliance=0, core_db=0, restoration=1. Zero intra-failure subscore variance — compliance and core_db always move together, confirming they're gated by the same np_resolved() function.

Failure Analysis

Biggie-nebula (4/5 fail)

  • All 4 failing runs took the pod-labeling shortcut — added role: monitor-approved to the statping-ng pod rather than patching the NetworkPolicy.
  • Run4 (the one pass, 4 min, 93 msgs): used kubectl patch networkpolicy ... --type=json to append app: statping-ng to ingress.from. Clean, methodical.
  • These are borderline-genuine failures — the grader's rejection of pod-labeling is defensible, but the prompt doesn't clearly signal "modify the object, not the pod."

Nighthawk (5/5 fail) — the surprising result

  • runs 1, 3, 5: correctly patched the NP (tool output confirms ingress[0].from = [{role:monitor-approved}, {app:statping-ng}]) — yet scored compliance=0. Grader's substring check "statping-ng" in ingress_str should have matched.
  • runs 2, 4: pod-labeled only — genuine failures.
  • Local horizon test-solution scores 1.0 → grader is correct. Evidence points to nighthawk backend grader-environment isolation (agent modifies one cluster state, grader reads another). Platform issue, not a task defect.

Issues (task-level — fixable by author)

1. Double-counting NP resolution (Grader defect, medium severity)

np_resolved() is called in both check_compliance (0.34) AND check_core_db (0.33). A pod-labeling agent loses 0.67 points for one architectural choice instead of 0.34. The comment in check_core_db justifies it as "prevents k3s gaming," but the penalty is disproportionate — no rollout ever achieved compliance=0, core_db=1 or vice versa. They always move together, confirming the gate is redundant signal.

Fix: Remove np_resolved() from check_core_db (leave only online=1). Keeps 3-subscore structure; eliminates double-penalty.
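
To make the intent concrete, a minimal sketch of the change. The boolean parameters are a simplification for illustration; the real grader computes the online flags and NP state internally, and only the function names check_core_db, check_compliance, and np_resolved come from the task.

```python
def check_core_db(both_online: bool) -> float:
    # Before: 0.33 also required np_resolved(); after: gate only on both
    # services actually reporting online=1 in statping-ng.
    return 0.33 if both_online else 0.0

def check_compliance(np_is_resolved: bool) -> float:
    # NetworkPolicy resolution is scored here, and only here (0.34).
    return 0.34 if np_is_resolved else 0.0
```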

2. Prompt-grader alignment on NP resolution (Task design, medium severity)

Task says: "Ensure namespace network policies do not restrict statping-ng's access." This is outcome-oriented; pod-labeling literally satisfies it (the NP was designed with a role: monitor-approved allow-rule, making pod-labeling a natural idiomatic K8s pattern). Grader demands policy-object modification.

Fix (preferred, preserves difficulty): Add to task.yaml:

"Resolve the NetworkPolicy itself — modify or delete the NetworkPolicy resource; do not work around it by labeling pods."

3. restoration has low variance (Subscore quality, low severity)

Passes 10/10 across both backends. CronJob unsuspend + "any ConfigMap with keyword 'override'" is trivially achievable. See ratcheting section below for a principled tightening.


Issue (platform-level — escalate separately)

4. Nighthawk grader-environment isolation

3/5 nighthawk runs correctly patched the NP per grader spec yet scored 0. Local docker solvability confirms the grader is correct. This looks like a nighthawk-backend platform defect (grader doesn't see agent's kubectl changes).

Action: Do not block task on this. Flag to platform team with transcript references:

  • eval_transcripts/16a06e6e-nighthawk/run{1,3,5}-batch1.json

Positives

  • Well-instrumented setup.sh with clean DB seeding via env DB_SQL=... python3 -c helper pattern
  • Multi-root-cause diagnosis (DB overrides + wrong service fields + NP + CronJob + docs) is substantive
  • Solution.sh is clear, passes locally at 1.0
  • Grader uses partial credit appropriately (0.00 / 0.33 / 0.34 / 0.66 / 0.67 / 1.00 scoring ladder)
  • Equal-ish weights (0.34 / 0.33 / 0.33), no answer leakage in grader, no hardcoded values
  • 4H+ scope easily met

Difficulty Ratcheting (anticipating post-fix trajectory)

Post-fix pass-rate projection

After #1 and #2 above, nighthawk would likely jump from 0/5 to somewhere between 3/5 and 5/5 passes: the 3 runs that already patch the NP correctly would pass once the platform issue is resolved, and the 2 pod-labelers would convert once the prompt is explicit. You need genuine ratchets, not just disambiguation, or teapot will full-pass.

Four fair ratchets, ranked by ROI

A. Require NP preservation, not deletion (biggest single win)

Currently the grader accepts both "delete NP" and "patch NP to add statping-ng"; solution.sh takes the easy path (kubectl delete). Change the grader to reject deletion and require the policy to remain in place with role: monitor-approved intact PLUS app: statping-ng appended.

Add to prompt:

"The security team's NetworkPolicy must remain in force for the rest of the namespace. Modify it to allow statping-ng while preserving the existing role: monitor-approved allow-rule — do not delete it, and do not work around it by labeling pods."

Why fair: explicit constraint in prompt, defensible rationale (prod security posture). Why hard: requires JSON-patch skills, tests production discipline vs. "nuke and pave". Implementation: np_resolved() already checks for "statping-ng" in ingress; also verify role: monitor-approved is still present, and drop the "deleted" acceptance branch. A sketch follows.
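
A hedged sketch of what the tightened np_resolved() could look like. The namespace, policy name, and kubectl-via-subprocess approach are placeholders for illustration; only the "statping-ng"-in-ingress substring check is confirmed by the existing grader.

```python
import json
import subprocess

def np_resolved(namespace: str = "statping", policy: str = "deny-by-default") -> bool:
    """Sketch: accept only a modified policy, never a deleted one.
    Namespace and policy names here are placeholders, not the task's real values."""
    proc = subprocess.run(
        ["kubectl", "-n", namespace, "get", "networkpolicy", policy, "-o", "json"],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        # Policy missing or unreadable: the old "deleted" acceptance branch is gone.
        return False
    spec = json.loads(proc.stdout).get("spec", {})
    ingress_str = json.dumps(spec.get("ingress", []))
    # The original allow-rule must survive AND statping-ng must now be allowed.
    return "monitor-approved" in ingress_str and "statping-ng" in ingress_str
```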

B. Tighten incident report check: keyword-bingo → enumerated root causes

Current check_restoration: any single keyword (override, authoritative, coalesce, check_adjustments) passes. Gameable and provides zero variance.

Change to N-of-M root-cause markers (e.g., ≥4 of 5), as sketched after this list:

  • Identifies check_adjustments table specifically (not just "override")
  • Mentions enabled=0 or "disabled service" for timeline-service
  • Mentions wrong method / expected_status for fanout-service
  • Identifies SQLite as authoritative source (vs. ConfigMap decoys)
  • References the NetworkPolicy fix

Rename to incident_documentation.

Why fair: the prompt already asks for "root cause(s) for each failing service" and "which source was authoritative" — you're just grading what you asked for. Why hard: tests whether the agent actually understood the layered failure vs. wrote a generic report.
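
A sketch of the N-of-M check under those assumptions. The report path and regex patterns are illustrative, not the task's real values; the marker list mirrors the five bullets above.

```python
import re
from pathlib import Path

# Illustrative patterns; exact phrasings and the report location are assumptions.
ROOT_CAUSE_MARKERS = [
    r"check_adjustments",                                   # names the override table
    r"enabled\s*=\s*0|disabled service",                    # timeline-service disabled
    r"(wrong|incorrect)\s+method|expected_status",          # fanout-service check config
    r"sqlite[^.]*authoritative|authoritative[^.]*sqlite",   # DB over ConfigMap decoys
    r"network\s*policy",                                    # the NP fix
]

def check_incident_documentation(report: str = "/root/incident-report.md") -> float:
    path = Path(report)
    text = path.read_text().lower() if path.exists() else ""
    hits = sum(bool(re.search(pattern, text)) for pattern in ROOT_CAUSE_MARKERS)
    return 0.33 if hits >= 4 else 0.0  # require at least 4 of 5 root-cause markers
```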

C. Second suspended workload (inventory thoroughness)

Add a second suspended resource: e.g., a Deployment named statping-reconciler scaled to replicas=0, or a second CronJob like statping-alert-resolver that clears stale alerts. The prompt currently says "A workload has been suspended" — change it to "Workloads have been suspended" (plural) and check both.

Why fair: prompt signals plurality; agents should inventory thoroughly. Why hard: cheap to miss one, easy to find if you sweep kubectl get cronjobs,deployments,daemonsets,statefulsets.

D. Stability window on core_db

Currently core_db passes on the first cycle in which both services show online=1. Add: both services must remain online=1 for 3 consecutive check cycles (~90s). Store observations between grader polls.

Why fair: prompt says "mark as UP in statping-ng's check results" — stability is implicit. Why hard: flaky fixes (service-selector thrashing, which several nighthawk runs exhibited) would fail. Forces agents to verify before declaring victory.
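
A sketch of the persistence this needs, assuming the grader is re-invoked on each poll and can keep scratch state on disk; the state-file path and cycle count are illustrative.

```python
import json
from pathlib import Path

STATE_FILE = Path("/tmp/core_db_streak.json")  # assumed scratch location
REQUIRED_CYCLES = 3                            # ~3 check cycles ≈ 90s

def stable_online(both_online: bool) -> bool:
    """Track consecutive online observations across grader polls; pass only once
    both services have stayed online=1 for REQUIRED_CYCLES polls in a row."""
    streak = 0
    if STATE_FILE.exists():
        streak = json.loads(STATE_FILE.read_text()).get("streak", 0)
    streak = streak + 1 if both_online else 0
    STATE_FILE.write_text(json.dumps({"streak": streak}))
    return streak >= REQUIRED_CYCLES
```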

What NOT to do

  • Don't add hidden traps (triggers that revert fixes, ConfigMap webhooks that re-break things). Agents have no prompt signal to look for these → unfair.
  • Don't require the CronJob to be the fix mechanism for check_adjustments. Manual DELETE is a valid workflow, and the prompt doesn't forbid it.
  • Don't add more label-collision puzzles — transcripts show agents already burn significant time on the bleater-* shadow deployments. Adding more hurts signal quality.

Recommended stack

Apply A + B together first. That alone probably pushes teapot pass rate from hypothetical 5/5 back down to ~2/5 (0.5-ish average), landing in the productive challenge zone. Hold C and D in reserve if teapot still aces it after round 1.


Recommendations (priority order)

  1. Remove NP gate from check_core_db (fix #1) — eliminates 0.67-point double penalty
  2. Add explicit "preserve the NP, do not pod-label" prompt language (fix #2 + ratchet A combined)
  3. Rename restoration to incident_documentation and tighten it with N-of-M root-cause markers (fix #3 + ratchet B combined)
  4. Report nighthawk eval anomaly to platform team separately (issue #4)
  5. (Reserve): second suspended workload + stability window if pass rate still too high after 1–3

After these, expected teapot pass rate lands in the 0.4–0.7 range — difficult but fair, with subscore variance that produces useful training signal.
