UUID: 16a06e6e-2bd9-4da1-8930-34ab1aef5f59
Author: @hafis_83579
Reviewer: Dylan Fitzgerald
Date: 2026-04-13
Discord: https://discord.com/channels/1427397917685321919/1485734279571832883
Solid task concept with a genuine multi-layered puzzle (COALESCE-based DB overrides → disabled service → wrong method/status → suspended CronJob → NetworkPolicy → incident doc). Solvable locally at 1.0. Two fixable design issues and one platform anomaly block acceptance. See also: difficulty-ratcheting recommendations at the bottom, which anticipate the post-fix pass rate.
| Backend | Rollouts | Avg | Full-pass | Threshold | Meets? |
|---|---|---|---|---|---|
| biggie-nebula | 5 | 0.46 | 1/5 (20%) | <0.50 docker | ✅ |
| nighthawk (teapot) | 5 | 0.33 | 0/5 (0%) | ≤0.85 teapot | ✅ nominally — see caveat |
All failures across both backends share the same pattern: `compliance=0, core_db=0, restoration=1`. There is zero intra-failure subscore variance; compliance and core_db always move together, confirming they're gated by the same `np_resolved()` function.
- All 4 failing runs took the pod-labeling shortcut: they added `role: monitor-approved` to the statping-ng pod rather than patching the NetworkPolicy.
- Run 4 (the one pass, 4 min, 93 msgs): used `kubectl patch networkpolicy ... --type=json` to append `app: statping-ng` to `ingress.from`. Clean and methodical.
- These are borderline-genuine failures: the grader's rejection of pod-labeling is defensible, but the prompt doesn't clearly signal "modify the object, not the pod."
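The append that run 4 performed can be sketched in plain Python. This is a hedged reconstruction: the selector shapes are taken from the tool output quoted in this review, not from the task's actual manifests.

```python
import copy

# Hypothetical pre-fix ingress rules; selector shapes assumed from the
# tool output quoted in this review, not from the task's real YAML.
ingress = [{"from": [
    {"podSelector": {"matchLabels": {"role": "monitor-approved"}}},
]}]

def apply_json_patch_add(rules, index, entry):
    """Mimic the JSON-patch 'add' op run 4 used: append an allow-rule
    to ingress[index].from without touching the existing rule."""
    patched = copy.deepcopy(rules)
    patched[index]["from"].append(entry)
    return patched

patched = apply_json_patch_add(
    ingress, 0, {"podSelector": {"matchLabels": {"app": "statping-ng"}}},
)
```

The original `role: monitor-approved` rule survives, which is exactly the property ratchet A below would start enforcing.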
- Runs 1, 3, 5: correctly patched the NP (tool output confirms `ingress[0].from = [{role: monitor-approved}, {app: statping-ng}]`) yet scored `compliance=0`. The grader's substring check, `"statping-ng" in ingress_str`, should have matched.
- Runs 2, 4: pod-labeled only; genuine failures.
- Local `horizon test-solution` scores 1.0, so the grader itself is correct. The evidence points to nighthawk backend grader-environment isolation (the agent modifies one cluster's state while the grader reads another). Platform issue, not a task defect.
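A minimal reconstruction of the grader's substring check against the ingress state runs 1, 3, and 5 produced. The `json.dumps` serialization is an assumption; the real grader's `ingress_str` construction isn't shown in this review.

```python
import json

def np_resolved(ingress_rules):
    # Reconstruction of the grader's substring check; how the real
    # grader builds ingress_str is an assumption.
    ingress_str = json.dumps(ingress_rules)
    return "statping-ng" in ingress_str

# The ingress state runs 1/3/5 produced, per their tool output:
patched = [{"from": [
    {"podSelector": {"matchLabels": {"role": "monitor-approved"}}},
    {"podSelector": {"matchLabels": {"app": "statping-ng"}}},
]}]
# np_resolved(patched) returns True, so a grader reading this cluster
# state would award credit; compliance=0 implies it read different state.
```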
`np_resolved()` is called in both `check_compliance` (0.34) AND `check_core_db` (0.33), so a pod-labeling agent loses 0.67 points for one architectural choice instead of 0.34. The comment in `check_core_db` justifies the gate as "prevents k3s gaming," but the penalty is disproportionate: no rollout ever achieved `compliance=0, core_db=1` or vice versa. The two subscores always move together, confirming the gate is redundant signal.
Fix: remove `np_resolved()` from `check_core_db` (leave only `online=1`). This keeps the 3-subscore structure and eliminates the double penalty.
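The double penalty and the proposed fix can be sketched side by side. This is a reconstruction of the scoring shape only, with the restoration subscore omitted.

```python
def score_current(np_ok, online):
    # Current shape (reconstruction): np_resolved() gates BOTH subscores,
    # so one architectural choice moves 0.67 points.
    compliance = 0.34 if np_ok else 0.0
    core_db = 0.33 if (online and np_ok) else 0.0
    return round(compliance + core_db, 2)

def score_fixed(np_ok, online):
    # Proposed fix: core_db depends only on online=1.
    compliance = 0.34 if np_ok else 0.0
    core_db = 0.33 if online else 0.0
    return round(compliance + core_db, 2)
```

A pod-labeling agent with healthy services (`np_ok=False, online=True`) loses 0.67 today but only 0.34 after the fix.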
Task says: "Ensure namespace network policies do not restrict statping-ng's access." This is outcome-oriented, and pod-labeling literally satisfies it (the NP was designed with a `role: monitor-approved` allow-rule, making pod-labeling a natural, idiomatic K8s pattern), yet the grader demands modification of the policy object itself.
Fix (preferred, preserves difficulty): Add to task.yaml:
"Resolve the NetworkPolicy itself — modify or delete the NetworkPolicy resource; do not work around it by labeling pods."
This subscore passes 10/10 across both backends: a CronJob unsuspend plus "any ConfigMap with keyword 'override'" is trivially achievable. See the ratcheting section below for a principled tightening.
3/5 nighthawk runs correctly patched the NP per grader spec yet scored 0. Local docker solvability confirms the grader is correct. This looks like a nighthawk-backend platform defect (grader doesn't see agent's kubectl changes).
Action: Do not block task on this. Flag to platform team with transcript references:
eval_transcripts/16a06e6e-nighthawk/run{1,3,5}-batch1.json
- Well-instrumented `setup.sh` with clean DB seeding via the `env DB_SQL=... python3 -c` helper pattern
- Multi-root-cause diagnosis (DB overrides + wrong service fields + NP + CronJob + docs) is substantive
- `solution.sh` is clear and passes locally at 1.0
- Grader uses partial credit appropriately (0.00 / 0.33 / 0.34 / 0.66 / 0.67 / 1.00 scoring ladder)
- Equal-ish weights (0.34 / 0.33 / 0.33), no answer leakage in grader, no hardcoded values
- 4H+ scope easily met
After fixes #1 and #2 above, nighthawk would likely jump to 3/5, then 5/5 passes: the 3 currently-correctly-patched nighthawk runs would pass once the platform issue is resolved, and the 2 pod-labelers would convert once the prompt is clear. You therefore need genuine ratchets, not just disambiguation, or teapot will full-pass.
Currently the grader accepts both "delete the NP" and "patch the NP to add statping-ng"; `solution.sh` takes the easy path (`kubectl delete`). Change the grader to reject deletion and require the policy to remain, with `role: monitor-approved` intact PLUS `app: statping-ng` appended.
Add to prompt:
"The security team's NetworkPolicy must remain in force for the rest of the namespace. Modify it to allow statping-ng while preserving the existing `role: monitor-approved` allow-rule; do not delete it, and do not work around it by labeling pods."
Why fair: explicit constraint in prompt, defensible rationale (prod security posture).
Why hard: requires JSON-patch skills, tests production discipline vs. "nuke and pave".
Implementation: `np_resolved()` already checks for `"statping-ng"` in the ingress; also verify that `role: monitor-approved` is still present, and drop the "deleted" acceptance branch.
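A hedged sketch of the tightened check, assuming the grader receives the NetworkPolicy as a dict (or `None` when it was deleted); the field access mirrors the standard NetworkPolicy schema, not the task's actual grader code.

```python
import json

def np_resolved_strict(np_obj):
    # Ratchet-A sketch: deletion no longer passes, and BOTH the original
    # role: monitor-approved rule and the new statping-ng rule must be
    # present in spec.ingress. np_obj is the policy as a dict, or None.
    if np_obj is None:
        return False
    ingress_str = json.dumps(np_obj.get("spec", {}).get("ingress", []))
    return "monitor-approved" in ingress_str and "statping-ng" in ingress_str
```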
Current `check_restoration`: any single keyword (override, authoritative, coalesce, check_adjustments) passes. Gameable, and it provides zero variance.
Change to N-of-M root-cause markers (e.g., ≥4 of 5):
- Identifies the `check_adjustments` table specifically (not just "override")
- Mentions `enabled=0` or "disabled service" for timeline-service
- Mentions the wrong `method`/`expected_status` for fanout-service
- Identifies SQLite as the authoritative source (vs. ConfigMap decoys)
- References the NetworkPolicy fix
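A sketch of the N-of-M check; the marker regexes here are illustrative stand-ins, not the grader's actual patterns.

```python
import re

# Illustrative marker patterns for the 5 root causes; the real list
# would live in the grader and likely be more precise.
MARKERS = [
    r"check_adjustments",
    r"enabled\s*=\s*0|disabled service",
    r"expected_status|wrong method",
    r"sqlite",
    r"network\s*policy|networkpolicy",
]

def incident_documentation(doc, required=4):
    # N-of-M: at least `required` of the M markers must appear.
    doc = doc.lower()
    hits = sum(1 for pat in MARKERS if re.search(pat, doc))
    return 1.0 if hits >= required else 0.0
```

Unlike the single-keyword check, a generic "we restarted things" report scores 0 while a report naming the actual layered causes scores 1.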
Rename the subscore to `incident_documentation`.
Why fair: the prompt already asks for "root cause(s) for each failing service" and "which source was authoritative"; you're just grading what you asked for.
Why hard: tests whether the agent actually understood the layered failure vs. wrote a generic report.
Add a second suspended resource: e.g., a Deployment named statping-reconciler scaled to `replicas=0` (a DaemonSet has no replica count, so a Deployment is the cleaner decoy), or a second CronJob like statping-alert-resolver that clears stale alerts. The prompt currently says "A workload has been suspended"; change it to "Workloads have been suspended" (plural) and check both.
Why fair: prompt signals plurality; agents should inventory thoroughly.
Why hard: cheap to miss one, easy to find if you sweep `kubectl get cronjobs,deployments,daemonsets,statefulsets`.
Currently `core_db` passes on the first cycle in which both services show `online=1`. Add: both services must remain `online=1` for 3 consecutive check cycles (~90s), storing observations between grader polls.
Why fair: the prompt says "mark as UP in statping-ng's check results"; stability is implicit.
Why hard: flaky fixes (service-selector thrashing, which several nighthawk runs exhibited) would fail. This forces agents to verify before declaring victory.
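One way to sketch the stability window, assuming the grader can persist scratch state between polls; the state-file path is hypothetical.

```python
import json
import os

STATE = "/tmp/core_db_streak.json"  # hypothetical scratch path

def stable_online(both_online, needed=3):
    # Stability-window sketch: keep a consecutive-success counter across
    # grader polls; pass only after `needed` consecutive cycles with both
    # services online. Any down cycle resets the streak to zero.
    streak = 0
    if os.path.exists(STATE):
        with open(STATE) as f:
            streak = json.load(f).get("streak", 0)
    streak = streak + 1 if both_online else 0
    with open(STATE, "w") as f:
        json.dump({"streak": streak}, f)
    return streak >= needed
```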
- Don't add hidden traps (triggers that revert fixes, ConfigMap webhooks that re-break things). Agents have no prompt signal to look for these → unfair.
- Don't require the CronJob to be the fix mechanism for `check_adjustments`. A manual `DELETE` is a valid workflow, and the prompt doesn't forbid it.
- Don't add more label-collision puzzles: transcripts show agents already burn significant time on the `bleater-*` shadow deployments. Adding more hurts signal quality.
Apply ratchets A and B together first. That alone probably pushes the teapot pass rate from a hypothetical 5/5 back down to ~2/5 (≈0.5 average), landing in the productive challenge zone. Hold C and D in reserve if teapot still aces it after round 1.
- Remove the NP gate from `check_core_db` (fix #1); eliminates the 0.67-point double penalty
- Add explicit "preserve the NP, do not pod-label" prompt language (fix #2 + ratchet A combined)
- Tighten `restoration` → `incident_documentation` with N-of-M root-cause markers (fix #3 + ratchet B combined)
- Report the nighthawk eval anomaly to the platform team separately (issue #4)
- (Reserve): second suspended workload + stability window if the pass rate is still too high after items 1–3
After these, expected teapot pass rate lands in the 0.4–0.7 range — difficult but fair, with subscore variance that produces useful training signal.