@arubis · Last active March 25, 2026 20:41

Review: Keycloak OIDC Token Signing Key Rotation (v18)

Task: bd24c35b-157b-400b-bcdb-88e539b2467c
Version: 18 · Category: SRE · Difficulty: hard
Verdict: NEEDS_WORK

Solution passes (1.0). Mean score 0.60 across 8 biggie-nebula runs, below the 0.70 threshold. Every subscore shows variance. The task is well-designed and close to approval, but one grader defect produces non-deterministic failures unrelated to agent skill.


Eval Data

Scores: [0.6, 0.6, 0.6, 0.6, 0.4, 0.8, 0.4, 0.8] · Mean: 0.60

Subscore (0.20 each)   R1  R2  R3  R4  R5  R6  R7  R8  Pass Rate
rotation_and_uptime     1   1   0   0   1   0   0   0  3/8 (38%)
token_validation        1   1   1   1   0   1   0   1  6/8 (75%)
prometheus_alerting     0   0   1   1   0   1   1   1  5/8 (63%)
security_hardening      0   0   0   1   0   1   0   1  3/8 (38%)
session_revocation      1   1   1   0   1   1   1   1  7/8 (88%)
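As a sanity check on the table (a minimal sketch; the subscore rows above are the only input), the per-run totals reproduce the reported score list and mean:

```python
# Recompute each run's total from the subscore table (0.20 per subscore).
subscores = {
    "rotation_and_uptime": [1, 1, 0, 0, 1, 0, 0, 0],
    "token_validation":    [1, 1, 1, 1, 0, 1, 0, 1],
    "prometheus_alerting": [0, 0, 1, 1, 0, 1, 1, 1],
    "security_hardening":  [0, 0, 0, 1, 0, 1, 0, 1],
    "session_revocation":  [1, 1, 1, 0, 1, 1, 1, 1],
}
# Transpose rows into per-run columns, then score each run.
runs = [round(0.20 * sum(col), 2) for col in zip(*subscores.values())]
print(runs)                              # [0.6, 0.6, 0.6, 0.6, 0.4, 0.8, 0.4, 0.8]
print(round(sum(runs) / len(runs), 2))   # 0.6
```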

Blocking: Fix auth_error_rate timing artifact

The problem. rotation_and_uptime merges key_rotation + auth_error_rate with AND semantics. Key rotation passes in all 8 runs. But auth_error_rate fails 5/8 due to a test-harness artifact: after agents reduce accessTokenLifespan to ≤300s (required for security_hardening), the auth-load-test's periodic saved-token revalidation logs AUTH_FAILURE: old_token_invalid once the saved token expires. The grader's --since=1m window catches 1-3 of these events in ~28 samples, exceeding the 0.1% threshold.
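To see why even a single stale-token event trips the check, a quick back-of-envelope (assuming the error rate is simply failures over samples in the window):

```python
threshold = 0.001           # 0.1% allowed auth error rate
samples = 28                # ~28 samples in the grader's --since=1m window
for failures in (1, 2, 3):  # 1-3 stale-token events observed per failing run
    rate = failures / samples
    print(f"{failures}/{samples} = {rate:.1%} -> {'FAIL' if rate > threshold else 'PASS'}")
```

One event is already ~3.6%, more than 30x over the threshold, so the check is effectively binary on whether the window overlaps a revalidation tick.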

Why this is artificial, not genuine. I analyzed all 8 transcripts to check whether passing agents behaved differently from failing agents. They don't — all 8 perform the same operations in similar order. The most telling case: Run 6 (best overall at 0.8) restarts auth-load-test late with a clean 102/102 log, but the grader catches 1/28 stale-token events. Meanwhile Run 1 (0.6 overall) passes with 0/29. Same strategy, different outcome. The check measures when the grader fires relative to a periodic background event, not agent competence.

The fix. Filter old_token_invalid alongside the existing connection_error exclusion (grader.py line 702):

# Current:
auth_failure_lines = [l for l in lines if "AUTH_FAILURE" in l and "connection_error" not in l]

# Fixed:
auth_failure_lines = [l for l in lines if "AUTH_FAILURE" in l
                      and "connection_error" not in l
                      and "old_token_invalid" not in l]

This preserves the check's intent (detect real outages during rotation) while filtering the harness artifact. The check still catches agents who break authentication for new requests — it just stops penalizing agents whose rotation succeeded but whose saved-token expired as a downstream effect of correct security hardening.
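To make the behavior concrete, here is the fixed filter run over a few hypothetical log lines (the harness's exact log format is an assumption; only the three markers come from the grader):

```python
# Hypothetical auth-load-test log lines; format is illustrative only.
lines = [
    "AUTH_FAILURE: old_token_invalid (saved token past accessTokenLifespan)",
    "AUTH_FAILURE: connection_error (Keycloak restarting)",
    "AUTH_FAILURE: invalid_signature",  # real outage symptom -> still counted
    "AUTH_OK: token issued",
]
auth_failure_lines = [l for l in lines if "AUTH_FAILURE" in l
                      and "connection_error" not in l
                      and "old_token_invalid" not in l]
print(auth_failure_lines)  # ['AUTH_FAILURE: invalid_signature']
```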

Score impact. All 5 failing runs have key_rotation passing, so rotation_and_uptime flips to 8/8. Projected new mean: ~0.725 (over 0.70 threshold). See next section.
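The projection follows directly from the table: each run that failed rotation_and_uptime gains one 0.20 subscore.

```python
current  = [0.6, 0.6, 0.6, 0.6, 0.4, 0.8, 0.4, 0.8]
rotation = [1, 1, 0, 0, 1, 0, 0, 0]   # rotation_and_uptime row
projected = [round(s + (0.2 if r == 0 else 0.0), 2)
             for s, r in zip(current, rotation)]
print(projected)                                    # [0.6, 0.6, 0.8, 0.8, 0.4, 1.0, 0.6, 1.0]
print(round(sum(projected) / len(projected), 3))    # 0.725
```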


Difficulty Restoration

The auth_error_rate fix will push the mean above 0.70. Here are agent-fair options to bring it back down, in priority order. These are recommendations — apply after the fix, re-eval, and adjust based on actual scores.

1. Add offline session revocation (targets session_revocation, currently 88% pass)

What: In setup.sh, create offline sessions for legacy-api-service using scope=openid offline_access. In the grader, add a sub-check querying GET /admin/realms/bleater/users/{userId}/offline-sessions/{clientId} and requiring 0 offline sessions.

Why it's fair: task.yaml says "some appear to have no hard expiry at all." Offline sessions are literally the Keycloak sessions with no hard expiry — especially with offlineSessionMaxLifespanEnabled=false, which setup already configures. Regular session revocation (DELETE /users/{id}/sessions) does NOT remove offline sessions; agents must discover the separate API. This is real incident-response knowledge: knowing that offline tokens survive session logout is exactly the kind of thing that lets compromised access persist.

Estimated impact: session_revocation drops from 7/8 to ~4-5/8. Mean drops ~0.05.
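A sketch of what the proposed sub-check could look like. The endpoint is Keycloak's Admin REST API for per-client offline sessions; the realm and client names come from the task, while wiring it to an authenticated request (and resolving the user/client UUIDs) is assumed to live elsewhere in grader.py:

```python
def offline_sessions_url(base_url, realm, user_id, client_uuid):
    """Admin REST endpoint returning a JSON list of this user's offline
    sessions for one client."""
    return (f"{base_url}/admin/realms/{realm}/users/{user_id}"
            f"/offline-sessions/{client_uuid}")

def offline_sessions_revoked(sessions):
    """Pass only if the legacy-api-service user has zero offline sessions left.
    Regular DELETE /users/{id}/sessions does not clear these, which is the
    whole point of the check."""
    return len(sessions) == 0
```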

2. Add user-disabled guard rail (zero current impact)

What: In check_session_revocation, verify legacy-api-service user has enabled=false. All 8 current agents already do this, so it doesn't affect current scores — but it prevents future agents from getting credit by only revoking sessions without disabling the account. The task says "access must be revoked" — an enabled account can create new sessions.
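The guard rail could be as small as this (a sketch, assuming the grader already fetches the user representation via the Admin REST users endpoint and the user's active-session list):

```python
def user_fully_revoked(user_repr, active_sessions):
    """Sessions gone AND account disabled -- revoking sessions alone is not
    enough, since an enabled account can simply create new ones."""
    return user_repr.get("enabled") is False and len(active_sessions) == 0
```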


Non-Blocking Notes

  • Prometheus scrape target timing (R1, R5 fail): Consider 2-3 retry attempts with 15s delays in check_prometheus_alerting before reporting health: down. Minor timing race after Prometheus restarts.
  • rate()/increase() PromQL requirement (R2 fails): Hidden requirement — task doesn't specify PromQL function choice. Low impact (1/8) but worth noting.
  • Minor typo in task.yaml line 49: "access any other there doesn't" — missing punctuation/word.
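The retry suggestion in the first note above could be a small wrapper like this (a sketch; `scrape_target_healthy` stands in for whatever Prometheus targets query check_prometheus_alerting already performs):

```python
import time

def target_up_with_retries(scrape_target_healthy, attempts=3, delay_s=15):
    """Re-poll the Prometheus targets check before reporting 'health: down',
    absorbing the brief race right after a Prometheus restart."""
    for attempt in range(attempts):
        if scrape_target_healthy():
            return True
        if attempt < attempts - 1:   # no sleep after the final attempt
            time.sleep(delay_s)
    return False
```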

What's Working Well

  • Grader quality is excellent. Live-service checks (Keycloak Admin API, Prometheus API, real token issuance). Strong anti-gaming.
  • Prior feedback fully addressed. Grafana→Mattermost accepted. Issuer hint added (was 0/10, now 3/8). Offline session thresholds relaxed. Always-passing checks merged.
  • Genuine failure variety. token_validation (audience mapper), security_hardening (revokeRefreshToken), prometheus_alerting (scrape target + PromQL) — three independent failure modes producing real discrimination.
  • Good scope and framing. Realistic incident report, investigation-based design, appropriate for hard/SRE.

Suggested Path Forward

  1. Apply the auth_error_rate filter fix
  2. Add offline session creation in setup + revocation check in grader (option 1)
  3. Add user-disabled guard rail (option 2)
  4. Re-eval 8 runs on the new version
  5. Assess whether mean is below 0.70 — if not, we can iterate from there