Task: bd24c35b-157b-400b-bcdb-88e539b2467c
Version: 18 · Category: SRE · Difficulty: hard
Verdict: NEEDS_WORK
Solution passes (1.0). Mean score 0.60 across 8 biggie-nebula runs — below 0.70. Every subscore has variance. The task is well-designed and close to approval, but has one grader defect that produces non-deterministic failures unrelated to agent skill.
Scores: [0.6, 0.6, 0.6, 0.6, 0.4, 0.8, 0.4, 0.8] · Mean: 0.60
| Subscore (0.20 each) | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | Pass Rate |
|---|---|---|---|---|---|---|---|---|---|
| rotation_and_uptime | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 3/8 (38%) |
| token_validation | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 6/8 (75%) |
| prometheus_alerting | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 5/8 (63%) |
| security_hardening | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 3/8 (38%) |
| session_revocation | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 7/8 (88%) |
The problem. rotation_and_uptime merges key_rotation + auth_error_rate with AND semantics. Key rotation passes in all 8 runs. But auth_error_rate fails 5/8 due to a test-harness artifact: after agents reduce accessTokenLifespan to ≤300s (required for security_hardening), the auth-load-test's periodic saved-token revalidation logs AUTH_FAILURE: old_token_invalid once the saved token expires. The grader's --since=1m window catches 1-3 of these events in ~28 samples. Even a single event is roughly 3.6% (1/28), far above the 0.1% threshold.
Why this is artificial, not genuine. I analyzed all 8 transcripts to check whether passing agents behaved differently from failing agents. They don't — all 8 perform the same operations in similar order. The most telling case: Run 6 (best overall at 0.8) restarts auth-load-test late with a clean 102/102 log, but the grader catches 1/28 stale-token events. Meanwhile Run 1 (0.6 overall) passes with 0/29. Same strategy, different outcome. The check measures when the grader fires relative to a periodic background event, not agent competence.
The fix. Filter old_token_invalid alongside the existing connection_error exclusion (grader.py line 702):

    # Current:
    auth_failure_lines = [l for l in lines if "AUTH_FAILURE" in l and "connection_error" not in l]
    # Fixed:
    auth_failure_lines = [l for l in lines if "AUTH_FAILURE" in l
                          and "connection_error" not in l
                          and "old_token_invalid" not in l]

This preserves the check's intent (detect real outages during rotation) while filtering the harness artifact. The check still catches agents who break authentication for new requests. It just stops penalizing agents whose rotation succeeded but whose saved token expired as a downstream effect of correct security hardening.
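To make the effect of the filter concrete, here is a minimal sketch applied to a synthetic 28-line window. The log line format, the `auth_error_rate` helper, and the choice of total lines as the denominator are assumptions for illustration. Only the substring tests mirror the grader snippet above.

```python
def auth_error_rate(lines, ignore=("connection_error", "old_token_invalid")):
    """Fraction of log lines that are AUTH_FAILURE and not an ignored artifact.

    Assumption: the rate is failures / total lines in the window.
    """
    if not lines:
        return 0.0
    failures = [
        l for l in lines
        if "AUTH_FAILURE" in l and not any(tag in l for tag in ignore)
    ]
    return len(failures) / len(lines)


THRESHOLD = 0.001  # the 0.1% threshold described in this review

# Synthetic window (hypothetical line format): 27 successes plus one
# stale saved-token revalidation event.
window = ["AUTH_SUCCESS: token_ok"] * 27 + ["AUTH_FAILURE: old_token_invalid"]

rate_current = auth_error_rate(window, ignore=("connection_error",))
rate_fixed = auth_error_rate(window)

# A genuine failure for a new request is still counted after the fix.
broken = window + ["AUTH_FAILURE: invalid_signature"]
rate_broken = auth_error_rate(broken)
```

With the current filter the single stale-token line alone puts the window over the threshold. With the proposed filter the same window is clean, while a failure carrying any other reason string still trips the check.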
Score impact. All 5 failing runs have key_rotation passing, so rotation_and_uptime flips to 8/8. Projected new mean: ~0.725 (over 0.70 threshold). See next section.
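The projection can be reproduced from the subscore table. The sketch below is only arithmetic over the pass/fail values reported above. The helper name `mean_score` is illustrative, and it assumes the fix changes nothing except the rotation_and_uptime row.

```python
SUBSCORE_WEIGHT = 0.20

# Pass (1) / fail (0) per run R1..R8, copied from the subscore table.
results = {
    "rotation_and_uptime": [1, 1, 0, 0, 1, 0, 0, 0],
    "token_validation":    [1, 1, 1, 1, 0, 1, 0, 1],
    "prometheus_alerting": [0, 0, 1, 1, 0, 1, 1, 1],
    "security_hardening":  [0, 0, 0, 1, 0, 1, 0, 1],
    "session_revocation":  [1, 1, 1, 0, 1, 1, 1, 1],
}


def mean_score(table):
    """Mean run score: each passed subscore contributes SUBSCORE_WEIGHT."""
    runs = len(next(iter(table.values())))
    total_passes = sum(sum(row) for row in table.values())
    return round(total_passes * SUBSCORE_WEIGHT / runs, 3)


current_mean = mean_score(results)

# Assumption: after the filter fix, rotation_and_uptime passes in all 8 runs.
projected = dict(results, rotation_and_uptime=[1] * 8)
projected_mean = mean_score(projected)
```

Five flipped subscores add 5 × 0.20 = 1.0 to the summed scores, which is +0.125 on the 8-run mean.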
The auth_error_rate fix will push the mean above 0.70. Here are agent-fair options to bring it back down, in priority order. These are recommendations — apply after the fix, re-eval, and adjust based on actual scores.
Option 1: offline session revocation. What: In setup.sh, create offline sessions for legacy-api-service using scope=openid offline_access. In the grader, add a sub-check querying GET /admin/realms/bleater/users/{userId}/offline-sessions/{clientId} and requiring 0 offline sessions.
Why it's fair: task.yaml says "some appear to have no hard expiry at all." Offline sessions are literally the Keycloak sessions with no hard expiry — especially with offlineSessionMaxLifespanEnabled=false, which setup already configures. Regular session revocation (DELETE /users/{id}/sessions) does NOT remove offline sessions; agents must discover the separate API. This is real incident-response knowledge: knowing that offline tokens survive session logout is exactly the kind of thing that lets compromised access persist.
Estimated impact: session_revocation drops from 7/8 to ~4-5/8. Mean drops ~0.05.
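A possible shape for the new sub-check, as a sketch rather than the grader's actual code. The function names, the `fetch` injection point, and the `base_url`/`token` parameters are hypothetical. As far as I know, the last path segment of this Keycloak endpoint is the client's internal id rather than its clientId string, which is worth verifying against the deployed Keycloak version.

```python
import json
import urllib.request


def fetch_json(url, token):
    """GET an Admin API URL with a bearer token and decode the JSON body."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def offline_sessions_revoked(base_url, token, user_id, client_uuid, fetch=fetch_json):
    """True when the user has no offline sessions left for the given client.

    client_uuid is assumed to be the client's internal id, not its clientId.
    """
    url = (f"{base_url}/admin/realms/bleater/users/{user_id}"
           f"/offline-sessions/{client_uuid}")
    sessions = fetch(url, token)
    # Anything other than an empty list (including an error payload) fails.
    return isinstance(sessions, list) and len(sessions) == 0
```

Passing `fetch` as a parameter keeps the check testable without a live Keycloak, for example `offline_sessions_revoked("http://kc", "t", "u1", "c1", fetch=lambda url, tok: [])`.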
Option 2: user-disabled guard rail. What: In check_session_revocation, verify that the legacy-api-service user has enabled=false. All 8 current agents already do this, so it doesn't affect current scores, but it prevents future agents from getting credit by only revoking sessions without disabling the account. The task says "access must be revoked", and an enabled account can create new sessions.
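The guard rail reduces to one field on the Keycloak user record. A minimal sketch, assuming the grader already has the user representation (the JSON object returned by GET /admin/realms/bleater/users/{userId}) as a dict. The function name is hypothetical.

```python
def account_disabled(user_representation):
    """True only when the user record explicitly carries enabled=false.

    A missing or non-boolean field is treated as "not disabled" so the
    check fails closed.
    """
    return user_representation.get("enabled") is False


# Example records (illustrative, not taken from a real run).
disabled_user = {"username": "legacy-api-service", "enabled": False}
active_user = {"username": "legacy-api-service", "enabled": True}
```

Requiring an explicit `False` avoids passing on a truncated or malformed response.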
- Prometheus scrape target timing (R1, R5 fail): Consider 2-3 retry attempts with 15s delays in check_prometheus_alerting before reporting health: down. Minor timing race after Prometheus restarts.
- rate()/increase() PromQL requirement (R2 fails): Hidden requirement. The task doesn't specify PromQL function choice. Low impact (1/8) but worth noting.
- Minor typo in task.yaml line 49: "access any other there doesn't" (missing punctuation or word).
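The retry suggested for the scrape-target check could look like the sketch below. `fetch_targets` is a hypothetical callable that returns the active targets list from the Prometheus /api/v1/targets response, each entry a dict with `labels` and `health`. The `job` argument and the injectable `sleep` are also assumptions made so the logic can be exercised without a live Prometheus.

```python
import time


def target_healthy(fetch_targets, job, attempts=3, delay_s=15, sleep=time.sleep):
    """Poll scrape-target health, retrying to ride out a Prometheus restart."""
    for attempt in range(attempts):
        targets = fetch_targets()
        if any(
            t.get("labels", {}).get("job") == job and t.get("health") == "up"
            for t in targets
        ):
            return True
        # Wait only between attempts, not after the final one.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

With the defaults the check waits at most 30 seconds in total before reporting the target as down, which bounds the added grader runtime.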
- Grader quality is excellent. Live-service checks (Keycloak Admin API, Prometheus API, real token issuance). Strong anti-gaming.
- Prior feedback fully addressed. Grafana→Mattermost accepted. Issuer hint added (was 0/10, now 3/8). Offline session thresholds relaxed. Always-passing checks merged.
- Genuine failure variety. token_validation (audience mapper), security_hardening (revokeRefreshToken), prometheus_alerting (scrape target + PromQL) — three independent failure modes producing real discrimination.
- Good scope and framing. Realistic incident report, investigation-based design, appropriate for hard/SRE.
- Apply the auth_error_rate filter fix
- Add offline session creation in setup + revocation check in grader (option 1)
- Add user-disabled guard rail (option 2)
- Re-eval 8 runs on the new version
- Assess whether mean is below 0.70 — if not, we can iterate from there