@arubis
Last active March 19, 2026 04:03
Task Review: oidc-token-replay-mitigation v25 — NEEDS_WORK (Keycloak realm must be specified)

Task Review: oidc-token-replay-mitigation (v25)

UUID 3b31f3b1-2f04-4033-acbe-e3f9dd8c6343
Version 25
Category Security & Authentication
Difficulty Hard
Reviewer Dylan Fitzgerald
Status NEEDS_WORK — one required change (Keycloak realm specification), plus minor recommendations
Discord thread https://discord.com/channels/1427397917685321919/1479911866171130008

Verdict: NEEDS_WORK

The task is well-designed with genuine difficulty from the decoy Redis trap, NetworkPolicy reconciliation loop, and persistence requirements. Score distribution (mean 0.55) is healthy. Solution passes. One required change: the Keycloak realm must be specified — the current design punishes agents for correctly applying Keycloak best practice.

Required before approval

  1. Specify the Keycloak realm. The task must tell agents which realm to configure — either in task.yaml or via a discoverable in-environment artifact. See Keycloak Realm: Decision and Rationale for full analysis.

Recommended (non-blocking)

  1. grader.py line 631 — fix early return in check_keycloak_backchannel_logout (latent bug, not hitting agents yet)
  2. setup.sh line 182 — use git checkout origin/master -- instead of HEAD~1 for revert robustness
  3. solution.sh lines 757-767 — self-tests send 1 replay request; grader sends 12; closer mirroring would catch flaky solutions
  4. Task ID typo: oidc-token-reply-mitigation → oidc-token-replay-mitigation
  5. Consider consolidating the two zero-variance checks (malformed_logout_handling, service_stable) in a future iteration
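
Recommendation 3 can be illustrated with a small sketch. It assumes a hypothetical `send_replay` callable that replays a revoked token and returns the HTTP status code, and it assumes 401 is the rejection status; neither detail is taken from solution.sh or grader.py. Only the attempt counts (1 versus 12) come from this review.

```python
from typing import Callable

REJECTED = 401  # assumed rejection status; the grader's real criterion is not shown here


def replay_always_rejected(send_replay: Callable[[], int], attempts: int) -> bool:
    """Return True only if every replay attempt is rejected."""
    return all(send_replay() == REJECTED for _ in range(attempts))


def make_flaky_service(accept_on_call: int) -> Callable[[], int]:
    """Hypothetical service that wrongly accepts the replay on one specific call."""
    calls = {"n": 0}

    def send_replay() -> int:
        calls["n"] += 1
        return 200 if calls["n"] == accept_on_call else REJECTED

    return send_replay


# A single-shot self-test misses the flake; a 12-attempt loop catches it.
single_shot = replay_always_rejected(make_flaky_service(accept_on_call=5), attempts=1)
twelve_shot = replay_always_rejected(make_flaky_service(accept_on_call=5), attempts=12)
print(single_shot, twelve_shot)  # True False
```

A self-test that mirrors the grader's attempt count would fail locally on the same intermittent acceptance the grader would catch.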

Eval Summary

| Metric | Value |
| --- | --- |
| Solution passes | ✅ Yes (1.0, all 8 checks) |
| Mean score | 0.55 (threshold: <0.70 ✅) |
| Score range | 0.25 – 1.0 |
| Full pass (1.0) | 1/10 (10%) |
| Scores | 0.25, 0.25, 0.5, 0.5, 0.5, 0.625, 0.625, 0.625, 0.625, 1.0 |

xychart-beta
  title "Score Distribution (v25, 10 rollouts)"
  x-axis ["Run 6", "Run 10", "Run 3", "Run 5", "Run 8", "Run 1", "Run 2", "Run 4", "Run 7", "Run 9"]
  y-axis "Score" 0 --> 1.0
  bar [0.25, 0.25, 0.5, 0.5, 0.5, 0.625, 0.625, 0.625, 0.625, 1.0]

Per-Subscore Breakdown

| Subscore | Pass Rate | Variance | Analysis |
| --- | --- | --- | --- |
| logout_works | 7/10 (70%) | yes | TODO hint working well |
| replay_rejected | 7/10 (70%) | yes | Correlates with logout_works |
| redis_revocation_ttl | 1/10 (10%) | yes | Decoy Redis trap: genuine difficulty |
| malformed_logout_handling | 10/10 (100%) | ❌ none | Always passes (see note) |
| service_stable | 10/10 (100%) | ❌ none | Always passes (see note) |
| keycloak_jti_mapper | 2/10 (20%) | yes | Realm specification gap; see required change |
| keycloak_backchannel_logout | 2/10 (20%) | yes | Realm specification gap; see required change |
| survives_restart | 5/10 (50%) | yes | Good discriminator |

6/8 subscores have variance — healthy discrimination overall.
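
As a consistency check, the per-subscore pass rates should reproduce the reported mean score. The sketch below assumes all eight checks are weighted equally at 0.125, which matches the weight quoted later for the zero-variance checks and the 0.125-step scores, but is not confirmed from grader.py.

```python
from fractions import Fraction

# Pass counts out of 10 rollouts, copied from the per-subscore breakdown.
passes = {
    "logout_works": 7,
    "replay_rejected": 7,
    "redis_revocation_ttl": 1,
    "malformed_logout_handling": 10,
    "service_stable": 10,
    "keycloak_jti_mapper": 2,
    "keycloak_backchannel_logout": 2,
    "survives_restart": 5,
}
ROLLOUTS = 10
WEIGHT = Fraction(1, len(passes))  # assumed equal weighting: 1/8 per check

# Mean score = sum over checks of (weight * pass rate).
mean_score = sum(WEIGHT * Fraction(n, ROLLOUTS) for n in passes.values())
print(float(mean_score))  # 0.55
```

The result agrees with the 0.55 mean in the Eval Summary, so the two tables are mutually consistent under equal weighting.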

Full Score Matrix

Run Score logout replay redis_ttl malformed stable kc_jti kc_bcl restart
1 0.625
2 0.625
3 0.500
4 0.625
5 0.500
6 0.250
7 0.625
8 0.500
9 1.000
10 0.250

Failure Mode Analysis

flowchart TD
    A[Agent starts task] --> B[Reads password manager & secrets]
    B --> C{Which Redis?}
    C -->|9/10 agents| D[redis-decoy<br/>redis.bleater.svc<br/>password: redis-default-password]
    C -->|1/10 agents| E[bleater-redis-0<br/>via CACHE_BACKEND_AUTH ConfigMap<br/>password: r3d1s-s3cur3-p@ss!2024]
    D --> F[Replay works BUT<br/>grader checks bleater-redis-0<br/>→ redis_revocation_ttl FAILS]
    E --> G[redis_revocation_ttl PASSES]

    A --> H[Reads source code]
    H --> I{Logout endpoint interface?}
    I -->|7/10 agents| J[POST /logout with<br/>Authorization: Bearer header]
    I -->|3/10 agents| K[POST /logout with<br/>JSON request body]
    J --> L[logout_works PASSES]
    K --> M[422 Field Required<br/>→ logout_works FAILS<br/>→ cascading failures]

    A --> N[Configures Keycloak]
    N --> O{Which realm?}
    O -->|8/10 agents| P[nebula realm<br/>nebula-app client]
    O -->|2/10 agents| Q[master realm<br/>admin-cli client]
    P --> R[Grader only checks master<br/>→ keycloak checks FAIL]
    Q --> S[keycloak checks PASS]

    style D fill:#f96,stroke:#333
    style E fill:#6f6,stroke:#333
    style K fill:#f96,stroke:#333
    style J fill:#6f6,stroke:#333
    style P fill:#f96,stroke:#333
    style Q fill:#6f6,stroke:#333

Failure Categorization

| Failure Mode | Runs Affected | Category | Notes |
| --- | --- | --- | --- |
| Decoy Redis trap | 1,2,3,4,5,6,7,8,10 (9/10) | Genuine | Deliberate misdirection. Real Redis discoverable via CACHE_BACKEND_AUTH ConfigMap. Decoy is non-durable (--save "" --appendonly no), violating task's durability requirement. |
| Keycloak realm confusion | 1,2,3,4,6,7,8,10 (8/10) | Specification gap | Grader hardcodes master; agents follow Keycloak best practice toward nebula realm. See required change. |
| Logout body vs header | 5,6,10 (3/10) | Genuine + minor spec gap | TODO hint improved from prior versions (was 6/10). Remaining agents implement body-based API. |
| Non-persistent changes | 3,5,6,8,10 (5/10) | Genuine | Agents modify pods directly instead of Helm/Git. Tests real DevOps skill. |
| NetworkPolicy reconciliation | Various | Genuine | 30s reconciliation loop. All agents discover it; best agents counteract it. |

Key Behavioral Insight: The Decoy Redis "False Success" Signal

Agents using the decoy Redis observe working revocation during their own testing — tokens actually get rejected. But the grader checks bleater-redis-0, which is empty. This explains why replay_rejected passes at 70% while redis_revocation_ttl passes at only 10%. These are not independent failures — they're both consequences of the decoy trap, but only the Redis TTL check catches it.
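
The false-success signal can be reproduced in miniature. This is a sketch under stated assumptions: `FakeRedis` is a dict-backed stand-in rather than a real client, and the `revoked:<jti>` key format and the `revoke` helper are hypothetical, not taken from the auth service or grader.py. Only the two-instance setup (decoy versus bleater-redis-0) comes from the review.

```python
class FakeRedis:
    """Dict-backed stand-in that models only SETEX and TTL."""

    def __init__(self) -> None:
        self._ttl = {}

    def setex(self, name: str, seconds: int, value: str) -> None:
        self._ttl[name] = int(seconds)

    def ttl(self, name: str) -> int:
        # Real Redis reports -2 for a key that does not exist.
        return self._ttl.get(name, -2)


def revoke(store: FakeRedis, jti: str, exp: int, now: int) -> bool:
    """Record a revoked token id, expiring when the token itself would expire."""
    remaining = exp - now
    if remaining <= 0:
        return False  # token already expired; nothing worth storing
    store.setex(f"revoked:{jti}", remaining, "1")  # hypothetical key format
    return True


decoy, real = FakeRedis(), FakeRedis()

# The agent's service writes revocations to the decoy instance...
revoke(decoy, "abc123", exp=1_000_600, now=1_000_000)

agent_view = decoy.ttl("revoked:abc123")   # what the agent's own tests observe
grader_view = real.ttl("revoked:abc123")   # what a check against the real instance observes
print(agent_view, grader_view)  # 600 -2
```

The agent sees a live key with a sensible TTL, so replay rejection works end to end, while a check against the other instance finds nothing.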

Run 9: The Perfect Score

The sole 1.0 run (run 9, 760 messages) distinguished itself by:

  1. Reading CACHE_BACKEND_AUTH in the auth service ConfigMap early → found real Redis
  2. Exploring both Keycloak realms → configured admin-cli in master
  3. Running a counter-deletion loop for the NetworkPolicy AND updating Helm charts
  4. Persisting all changes through Git/Helm (not direct pod patching)

Keycloak Realm: Decision and Rationale

The Problem

The grader hardcodes KC_REALM = "master" (grader.py line 38). The task says "configured on a client" without specifying a realm. 8/10 agents configure the nebula realm (which has nebula-app, described as "Main Nebula application client"). 2/10 configure the master realm. The grader rejects all nebula-realm configurations.

Why the Realm Must Be Specified

We evaluated three positions through adversarial debate (three Opus subagents arguing each side, with cross-examination). Position B — specify the realm — prevailed, for these reasons:

1. There is no reasonable default realm to assume

Keycloak realms are always environment-specific. Every deployment creates custom realms for its applications. The only universal realm (master) is explicitly documented as not for application use:

"It is recommended that you do not use the master realm to manage the users and applications in your organization. Keep the master realm as a place for super admins to create and manage the realms in your system." (Keycloak Server Administration Guide)

"The Master realm has elevated privileges that are necessary for administering the entire Keycloak instance. By adding regular users to this realm, you could inadvertently grant them access to sensitive areas." (Skycloak: Secure Your Keycloak's Master Realm)

"Avoid using the master realm for direct integration with the application, keep it for management only, and create a realm for all application integrations." (How to Configure Keycloak Realms and Clients)

An agent that chooses nebula over master is not guessing — it is applying correct domain knowledge. The auth service's use of the master realm (via a Python source code default os.getenv("KEYCLOAK_REALM", "master")) is an anti-pattern, not a discoverable clue.

2. Every in-environment signal points to nebula

We audited every source of Keycloak information available to agents inside the environment:

| Source | What it shows | Points to |
| --- | --- | --- |
| Password manager (passwords.devops.local) | Keycloak URL, admin creds; no realm mentioned | Neither |
| Gitea wiki (Infrastructure-Services.md) | "Keycloak - SSO and identity provider"; no realm | Neither |
| Keycloak admin console | nebula realm with nebula-app ("Main Nebula application client"), users, etc. | nebula |
| Keycloak admin console | master realm with default system clients (admin-cli, account) | Administration only |
| Pod environment (kubectl exec -- env) | No KEYCLOAK_REALM env var present | Neither |
| Auth service source (/app/main.py line 54) | os.getenv("KEYCLOAK_REALM", "master"), a Python default | master (buried, anti-pattern) |

The only signal pointing to master is a Python code default in a file the agent reads for a different purpose. It is not set as a Kubernetes env var, ConfigMap entry, or Secret — kubectl exec -- env shows nothing for KEYCLOAK_REALM. Meanwhile, the Keycloak admin UI prominently presents the nebula realm with a client explicitly described as the main application client.
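
The behaviour of that source-code default is easy to demonstrate. The `os.getenv` call below is the one quoted from /app/main.py; the surrounding setup is only a sketch that mimics a pod where KEYCLOAK_REALM is absent, and the value "nebula" is used purely as an illustrative override.

```python
import os

# Mimic the pod environment described above: the variable is not set at all.
os.environ.pop("KEYCLOAK_REALM", None)
realm_when_unset = os.getenv("KEYCLOAK_REALM", "master")

# Illustrative only: if a deployment did set the variable, the default is ignored.
os.environ["KEYCLOAK_REALM"] = "nebula"
realm_when_set = os.getenv("KEYCLOAK_REALM", "master")
os.environ.pop("KEYCLOAK_REALM")  # leave the process environment clean

print(realm_when_unset, realm_when_set)  # master nebula
```

Because the fallback lives only in the call's second argument, nothing in `kubectl exec -- env`, a ConfigMap, or a Secret reveals the effective realm; an agent has to read that specific line to learn it.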

3. The rubric requires discoverability, and this doesn't meet the bar

From the Task Author Rubric:

"If the grader hardcodes resource names, ports, or labels, those must be discoverable by the agent. The agent should never have to guess a name to pass a check."

"For every grader check, you should be able to point to where the agent would learn about that requirement — from the prompt, from Gitea issues, from wiki docs, from cluster state, or because the E2E test implies it."

The grader hardcodes KC_REALM = "master". The only discoverable path to this value is a Python source code default that contradicts Keycloak's own documentation. One competing signal in one file, weighed against an entire realm architecture pointing elsewhere, does not constitute adequate discoverability — especially when the "correct" answer violates the software's documented best practice.

4. The 80/20 failure pattern matches "task unclear"

From the Task Review Guide failure analysis flow:

"Agent does wrong thing → Task unclear → Clarify task.yaml deliverables"

8/10 agents demonstrated full Keycloak competency — they found Keycloak, understood protocol mappers, configured back-channel logout — but chose the wrong realm because the environment's strongest signals pointed there. This is the textbook pattern for a specification gap, not a skill gap.

Recommended Fix

Add to task.yaml, in the Keycloak requirements:

"The authentication service currently authenticates against Keycloak's master realm."

This:

  • Names the realm (satisfying "agent should never have to guess a name to pass a check")
  • Still requires the agent to find the correct client, configure the jti mapper, set up back-channel logout, and verify the configuration
  • Does not enumerate specific client settings, ports, or implementation steps
  • Preserves all genuine engineering difficulty

Alternative: Seed this information into a Gitea issue or wiki page instead of task.yaml, which would preserve more investigation difficulty while still making the realm discoverable through a documented in-environment channel.

Note on score impact: Even with a realm hint, the mean score is estimated at ~0.62-0.69 — still below the 0.70 threshold. The decoy Redis trap (1/10 pass), persistence requirements (5/10 pass), and logout interface (7/10 pass) provide ample genuine difficulty.

Eliminated Position: Fix Grader to Accept Both Realms

This was considered and rejected. The authentication service resolves KEYCLOAK_REALM to master through its source-code default (no environment variable overrides it), so a jti mapper on nebula-app in the nebula realm is functionally inert and would never affect tokens the auth service issues. Accepting nebula-realm configurations would credit agents for non-functional changes, undermining evaluation integrity.


Prior Review Resolution

Dylan's review on 2026-03-17 identified one blocking issue and several non-blocking recommendations.

Blocking Issue: Specify POST /logout endpoint path

Original requirement: "task.yaml line 25 — specify POST /logout endpoint path. Blocking because 6/10 eval agents chose /revoke due to ambiguity."

Author's approach (v25): Indirect specification via TODO injection:

  • setup.sh lines 478-495: Injects # TODO: Implement POST /logout endpoint for token revocation into auth service source code
  • task.yaml line 25: "Users can explicitly revoke their tokens. There was somewhere that developer left To do in the source code regarding that. (logout)"

Result: Pass rate on logout_works improved to 7/10. The 3 remaining failures are agents who built body-based endpoints (not a naming issue), so the /logout path discovery is largely solved.

Assessment: The TODO injection approach is effective but the task.yaml phrasing is awkward. The original suggestion of explicit specification ("users can explicitly revoke their tokens by POSTing to /logout") would be cleaner, but the current approach is functional. The remaining 3/10 failures are from interface convention mismatch (body vs header), not endpoint naming.
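
The header-based convention that passes can be sketched as a small helper. This is a framework-neutral illustration, not the auth service's code: `bearer_token` is a hypothetical name, and the assumption that the grader sends `Authorization: Bearer <token>` comes from the failure-mode flowchart rather than from grader.py itself.

```python
from typing import Mapping, Optional


def bearer_token(headers: Mapping[str, str]) -> Optional[str]:
    """Extract the token from an 'Authorization: Bearer <token>' header, or None."""
    # Header names are case-insensitive, so normalise before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("authorization", "")
    scheme, _, token = value.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


# A header-based request carries the token even when the body is empty.
print(bearer_token({"Authorization": "Bearer eyJhbGciOi.example.sig"}))
# A body-only client sends no Authorization header, so nothing is found.
print(bearer_token({"Content-Type": "application/json"}))
```

An endpoint that instead declares a required JSON body field rejects the header-only request before any token logic runs, which is consistent with the 422 responses seen in the three failing runs.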

Non-Blocking Recommendations

| Recommendation | Status |
| --- | --- |
| Pin redis wheel version in Dockerfile | ✅ Fixed (redis==5.2.1) |
| Remove unused PyJWT/python-jose from Dockerfile | ✅ Fixed |
| Hint toward Keycloak master realm | ❌ Now a required change (see above) |
| grader.py line 631 early return bug | ❌ Not addressed (latent, not seen in evals) |
| setup.sh line 182 git checkout robustness | ❌ Not addressed (non-blocking) |
| solution.sh self-test intensity (1 replay vs grader's 12) | ❌ Not addressed (non-blocking) |

Additional Notes

Zero-Variance Checks

malformed_logout_handling and service_stable both pass 10/10 with zero variance:

  • malformed_logout_handling passes because any endpoint returning 422 for missing body satisfies "4xx and not 500"
  • service_stable passes because the base platform stays healthy regardless of agent changes

At 0.125 weight each, these don't enable meaningful score inflation (max free score = 0.25 < 0.3 threshold). Non-blocking but worth noting as candidates for consolidation in a future iteration. If the Keycloak realm fix pushes the mean score too close to 0.70, we can collapse these two always-passing subscores into a single subscore — reducing from 8 checks to 7 and lowering the free-score floor from 0.25 to ~0.14.
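
The free-score arithmetic above works out as follows. The sketch assumes checks stay equally weighted after consolidation, which the review implies but does not state outright; `free_score_floor` is a name introduced here for illustration.

```python
from fractions import Fraction


def free_score_floor(total_checks: int, always_passing: int) -> Fraction:
    """Score earned from checks that pass regardless of the agent's work,
    assuming every check carries equal weight."""
    return Fraction(always_passing, total_checks)


current = free_score_floor(total_checks=8, always_passing=2)       # two zero-variance checks
consolidated = free_score_floor(total_checks=7, always_passing=1)  # merged into one

print(float(current), round(float(consolidated), 2))  # 0.25 0.14
```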

Grader Line 631 Latent Bug

In check_keycloak_backchannel_logout, if the first client with a matching backchannel.logout.url has session.required != "true", the function returns False immediately without checking other clients. Not hitting agents in current evals. Non-blocking but worth fixing for robustness.
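
A minimal sketch of the bug pattern and one possible fix follows. It is not the grader's code: the function names, the equality match on the URL, and the client dictionaries are hypothetical, and the attribute key `backchannel.logout.session.required` is assumed to be the full name behind the `session.required` shorthand used above.

```python
from typing import Dict, List

Client = Dict[str, Dict[str, str]]

URL_KEY = "backchannel.logout.url"
SESSION_KEY = "backchannel.logout.session.required"  # assumed full attribute name


def check_early_return(clients: List[Client], expected_url: str) -> bool:
    """Bug pattern: the verdict is decided by the first URL match only."""
    for client in clients:
        attrs = client.get("attributes", {})
        if attrs.get(URL_KEY) == expected_url:
            return attrs.get(SESSION_KEY) == "true"  # returns even when False
    return False


def check_all_clients(clients: List[Client], expected_url: str) -> bool:
    """Possible fix: pass if any client satisfies both conditions."""
    return any(
        client.get("attributes", {}).get(URL_KEY) == expected_url
        and client.get("attributes", {}).get(SESSION_KEY) == "true"
        for client in clients
    )


url = "http://auth.example/backchannel-logout"  # placeholder URL
clients = [
    {"attributes": {URL_KEY: url, SESSION_KEY: "false"}},  # stale client listed first
    {"attributes": {URL_KEY: url, SESSION_KEY: "true"}},   # correctly configured client
]
print(check_early_return(clients, url), check_all_clients(clients, url))  # False True
```

With the early return, the outcome depends on the order in which clients are listed, so a correct configuration can fail whenever an unrelated client with the same URL appears first.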

Task ID Typo

oidc-token-reply-mitigation should be oidc-token-replay-mitigation. Cosmetic but reflects poorly. Non-blocking.
