Review: Keycloak OIDC Token Signing Key Rotation (v18)

Task: bd24c35b-157b-400b-bcdb-88e539b2467c
Version: 18 · Category: SRE · Difficulty: hard
Verdict: NEEDS_WORK

Solution passes (1.0). Mean score is 0.60 across 8 biggie-nebula runs, below the 0.70 threshold, and every subscore shows variance. The task is well designed and close to approval, but one grader defect produces non-deterministic failures unrelated to agent skill.

Eval Data

Scores: [0.6, 0.6, 0.6, 0.6, 0.4, 0.8, 0.4, 0.8] · Mean: 0.60

Subscore (0.20 each)	R1	R2	R3	R4	R5	R6	R7	R8	Pass Rate
rotation_and_uptime	1	1	0	0	1	0	0	0	3/8 (38%)
token_validation	1	1	1	1	0	1	0	1	6/8 (75%)
prometheus_alerting	0	0	1	1	0	1	1	1	5/8 (63%)
security_hardening	0	0	0	1	0	1	0	1	3/8 (38%)
session_revocation	1	1	1	0	1	1	1	1	7/8 (88%)

Blocking: Fix `auth_error_rate` timing artifact

The problem. rotation_and_uptime merges key_rotation + auth_error_rate with AND semantics. Key rotation passes in all 8 runs, but auth_error_rate fails in 5/8 because of a test-harness artifact: after agents reduce accessTokenLifespan to ≤300s (required for security_hardening), the auth-load-test's periodic saved-token revalidation logs AUTH_FAILURE: old_token_invalid once the saved token expires. The grader's --since=1m window catches 1-3 of these events in ~28 samples, and even a single event (1/28 ≈ 3.6%) exceeds the 0.1% threshold.

Why this is artificial, not genuine. I analyzed all 8 transcripts to check whether passing agents behaved differently from failing agents. They don't — all 8 perform the same operations in similar order. The most telling case: Run 6 (best overall at 0.8) restarts auth-load-test late with a clean 102/102 log, but the grader catches 1/28 stale-token events. Meanwhile Run 1 (0.6 overall) passes with 0/29. Same strategy, different outcome. The check measures when the grader fires relative to a periodic background event, not agent competence.

The fix. Filter old_token_invalid alongside the existing connection_error exclusion (grader.py line 702):

# Current:
auth_failure_lines = [l for l in lines if "AUTH_FAILURE" in l and "connection_error" not in l]

# Fixed:
auth_failure_lines = [l for l in lines if "AUTH_FAILURE" in l
                      and "connection_error" not in l
                      and "old_token_invalid" not in l]

This preserves the check's intent (detect real outages during rotation) while filtering the harness artifact. The check still catches agents who break authentication for new requests — it just stops penalizing agents whose rotation succeeded but whose saved-token expired as a downstream effect of correct security hardening.
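To make the effect of the extra exclusion concrete, here is a minimal sketch of the rate computation with and without it. Only the `AUTH_FAILURE`, `connection_error`, and `old_token_invalid` strings and the 0.1% threshold come from this review; the `AUTH_OK` marker, the helper name, and the choice of denominator are assumptions, and the real grader may count samples differently.

```python
THRESHOLD = 0.001  # the 0.1% threshold cited above


def auth_error_rate(lines, excluded_reasons):
    """Fraction of log lines that are AUTH_FAILUREs not matching an excluded reason.

    Hypothetical helper: the real grader may use a different denominator.
    """
    if not lines:
        return 0.0
    failures = [
        line for line in lines
        if "AUTH_FAILURE" in line
        and not any(reason in line for reason in excluded_reasons)
    ]
    return len(failures) / len(lines)


# Hypothetical 28-sample window: 27 successes plus one stale saved-token event.
window = ["AUTH_OK"] * 27 + ["AUTH_FAILURE: old_token_invalid"]

current = auth_error_rate(window, ["connection_error"])
fixed = auth_error_rate(window, ["connection_error", "old_token_invalid"])

print(f"current filter: {current:.3%} -> {'FAIL' if current > THRESHOLD else 'PASS'}")
print(f"fixed filter:   {fixed:.3%} -> {'FAIL' if fixed > THRESHOLD else 'PASS'}")
```

A genuine failure line (any `AUTH_FAILURE` without one of the two excluded reasons) still counts under the fixed filter, so a real outage during rotation would continue to fail the check.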

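Score impact. All 5 failing runs have key_rotation passing, so rotation_and_uptime flips from 3/8 to 8/8. Each flipped run gains 0.20, so the mean rises by 5 × 0.20 / 8 = 0.125, giving a projected new mean of ~0.725 (over the 0.70 threshold), assuming the other subscores are unchanged. See next section.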
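The projection can be reproduced from the Eval Data table. This is a rough sketch that assumes every other subscore stays exactly as observed, which a re-eval may not reproduce; the helper name is illustrative.

```python
# Per-run subscore results copied from the Eval Data table (R1..R8).
results = {
    "rotation_and_uptime": [1, 1, 0, 0, 1, 0, 0, 0],
    "token_validation":    [1, 1, 1, 1, 0, 1, 0, 1],
    "prometheus_alerting": [0, 0, 1, 1, 0, 1, 1, 1],
    "security_hardening":  [0, 0, 0, 1, 0, 1, 0, 1],
    "session_revocation":  [1, 1, 1, 0, 1, 1, 1, 1],
}
WEIGHT = 0.20  # each subscore is worth 0.20


def mean_score(table):
    """Mean of per-run totals, where each passing subscore adds WEIGHT."""
    runs = len(next(iter(table.values())))
    totals = [sum(table[name][i] for name in table) * WEIGHT for i in range(runs)]
    return sum(totals) / runs


# Projection: the filter fix makes rotation_and_uptime pass in every run.
projected = dict(results, rotation_and_uptime=[1] * 8)

current_mean = mean_score(results)
projected_mean = mean_score(projected)
print(f"current mean:   {current_mean:.3f}")
print(f"projected mean: {projected_mean:.3f}")
```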

Difficulty Restoration

The auth_error_rate fix will push the mean above 0.70. Here are agent-fair options to bring it back down, in priority order. These are recommendations — apply after the fix, re-eval, and adjust based on actual scores.

1. Add offline session revocation (targets `session_revocation`, currently 88% pass)

What: In setup.sh, create offline sessions for legacy-api-service using scope=openid offline_access. In the grader, add a sub-check querying GET /admin/realms/bleater/users/{userId}/offline-sessions/{clientId} and requiring 0 offline sessions.

Why it's fair: task.yaml says "some appear to have no hard expiry at all." Offline sessions are literally the Keycloak sessions with no hard expiry — especially with offlineSessionMaxLifespanEnabled=false, which setup already configures. Regular session revocation (DELETE /users/{id}/sessions) does NOT remove offline sessions; agents must discover the separate API. This is real incident-response knowledge: knowing that offline tokens survive session logout is exactly the kind of thing that lets compromised access persist.

Estimated impact: session_revocation drops from 7/8 to ~4-5/8, i.e. 2-3 more failing runs at 0.20 each, so the mean drops by roughly 0.05-0.075.
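As a starting point, the offline-session sub-check described above might look roughly like the following. The endpoint path and the `bleater` realm come from this review; the function names, the base URL, and the token handling are hypothetical. As far as I know, the last path segment in Keycloak's Admin API is the client's internal UUID rather than its `clientId` string, so treat that as an assumption to verify against the Keycloak version in the task image.

```python
import json
import urllib.request


def http_get_json(url, token):
    """Fetch a JSON document with a bearer token (stdlib only)."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def count_offline_sessions(base_url, realm, user_id, client_uuid, token,
                           get_json=http_get_json):
    """Return how many offline sessions the user holds for one client.

    get_json is injectable so the check can be exercised without a live Keycloak.
    """
    url = (f"{base_url}/admin/realms/{realm}/users/{user_id}"
           f"/offline-sessions/{client_uuid}")
    sessions = get_json(url, token)
    return len(sessions)


def offline_sessions_revoked(base_url, realm, user_id, client_uuid, token,
                             get_json=http_get_json):
    """Sub-check passes only when no offline sessions remain."""
    count = count_offline_sessions(base_url, realm, user_id, client_uuid, token, get_json)
    return count == 0
```

In the grader this would presumably sit next to the existing regular-session check, with both required to pass.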

2. Add user-disabled guard rail (zero current impact)

What: In check_session_revocation, verify legacy-api-service user has enabled=false. All 8 current agents already do this, so it doesn't affect current scores — but it prevents future agents from getting credit by only revoking sessions without disabling the account. The task says "access must be revoked" — an enabled account can create new sessions.
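A minimal sketch of the guard rail, assuming the grader has already fetched the user as a dict shaped like Keycloak's user representation (with a boolean `enabled` field) and already knows the remaining session count. The function name and arguments are hypothetical, not the existing `check_session_revocation` signature.

```python
def revocation_guard(user, active_session_count):
    """Pass only if the account is explicitly disabled AND no sessions remain.

    A missing 'enabled' field is treated as not disabled, so an incomplete
    API response cannot earn credit by accident.
    """
    account_disabled = user.get("enabled") is False
    return account_disabled and active_session_count == 0
```

For example, `revocation_guard({"username": "legacy-api-service", "enabled": True}, 0)` fails even though all sessions are gone, which is the case this guard is meant to close.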

Non-Blocking Notes

Prometheus scrape target timing (R1, R5 fail): Consider 2-3 retry attempts with 15s delays in check_prometheus_alerting before reporting health: down. Minor timing race after Prometheus restarts.
rate()/increase() PromQL requirement (R2 fails): Hidden requirement — task doesn't specify PromQL function choice. Low impact (1/8) but worth noting.
Minor typo in task.yaml line 49: "access any other there doesn't" — missing punctuation/word.
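For the Prometheus scrape-target note, the retry could be a small wrapper around whatever the current health probe is. This is a sketch under assumptions: `retry_check` and its parameters are hypothetical names, and `check` stands in for the existing target-health probe inside check_prometheus_alerting.

```python
import time


def retry_check(check, attempts=3, delay_s=15, sleep=time.sleep):
    """Call check() until it returns True or the attempts run out.

    Sleeps delay_s seconds between attempts, but not after the last one.
    sleep is injectable so the wrapper can be tested without waiting.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            sleep(delay_s)
    return False
```

With the defaults, the worst case adds 30 seconds to a failing run (two sleeps of 15s), which seems an acceptable cost for removing the post-restart race.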

What's Working Well

Grader quality is excellent. Live-service checks (Keycloak Admin API, Prometheus API, real token issuance). Strong anti-gaming.
Prior feedback fully addressed. Grafana→Mattermost accepted. Issuer hint added (was 0/10, now 3/8). Offline session thresholds relaxed. Always-passing checks merged.
Genuine failure variety. token_validation (audience mapper), security_hardening (revokeRefreshToken), prometheus_alerting (scrape target + PromQL) — three independent failure modes producing real discrimination.
Good scope and framing. Realistic incident report, investigation-based design, appropriate for hard/SRE.

Suggested Path Forward

Apply the auth_error_rate filter fix
Add offline session creation in setup + revocation check in grader (option 1)
Add user-disabled guard rail (option 2)
Re-eval 8 runs on the new version
Assess whether mean is below 0.70 — if not, we can iterate from there

arubis/keycloak-rotation-review-v18.md
