Task Review: `oidc-token-replay-mitigation` (v25)


UUID	`3b31f3b1-2f04-4033-acbe-e3f9dd8c6343`
Version	25
Category	Security & Authentication
Difficulty	Hard
Reviewer	Dylan Fitzgerald
Status	NEEDS_WORK — one required change (Keycloak realm specification), plus minor recommendations
Discord thread	https://discord.com/channels/1427397917685321919/1479911866171130008

Verdict: NEEDS_WORK

The task is well-designed with genuine difficulty from the decoy Redis trap, NetworkPolicy reconciliation loop, and persistence requirements. Score distribution (mean 0.55) is healthy. Solution passes. One required change: the Keycloak realm must be specified — the current design punishes agents for correctly applying Keycloak best practice.

Required before approval

Specify the Keycloak realm. The task must tell agents which realm to configure — either in task.yaml or via a discoverable in-environment artifact. See Keycloak Realm: Decision and Rationale for full analysis.

Recommended (non-blocking)

grader.py line 631 — fix early return in check_keycloak_backchannel_logout (latent bug, not hitting agents yet)
setup.sh line 182 — use git checkout origin/master -- instead of HEAD~1 for revert robustness
solution.sh lines 757-767 — self-tests send 1 replay request; grader sends 12; closer mirroring would catch flaky solutions
Task ID typo: oidc-token-reply-mitigation → oidc-token-replay-mitigation
Consider consolidating the two zero-variance checks (malformed_logout_handling, service_stable) in a future iteration

Eval Summary

Metric	Value
Solution passes	✅ Yes (1.0, all 8 checks)
Mean score	0.55 (threshold: <0.70 ✅)
Score range	0.25 – 1.0
Full pass (1.0)	1/10 (10%)
Scores	0.25, 0.25, 0.5, 0.5, 0.5, 0.625, 0.625, 0.625, 0.625, 1.0

xychart-beta
  title "Score Distribution (v25, 10 rollouts)"
  x-axis ["Run 6", "Run 10", "Run 3", "Run 5", "Run 8", "Run 1", "Run 2", "Run 4", "Run 7", "Run 9"]
  y-axis "Score" 0 --> 1.0
  bar [0.25, 0.25, 0.5, 0.5, 0.5, 0.625, 0.625, 0.625, 0.625, 1.0]

Per-Subscore Breakdown

Subscore	Pass Rate	Variance	Analysis
`logout_works`	7/10 (70%)	✅	TODO hint working well
`replay_rejected`	7/10 (70%)	✅	Correlates with logout_works
`redis_revocation_ttl`	1/10 (10%)	✅	Decoy Redis trap — genuine difficulty
`malformed_logout_handling`	10/10 (100%)	❌ none	Always passes (see note)
`service_stable`	10/10 (100%)	❌ none	Always passes (see note)
`keycloak_jti_mapper`	2/10 (20%)	✅	Realm specification gap — see required change
`keycloak_backchannel_logout`	2/10 (20%)	✅	Realm specification gap — see required change
`survives_restart`	5/10 (50%)	✅	Good discriminator

6/8 subscores have variance — healthy discrimination overall.

Full Score Matrix

Run	Score	logout	replay	redis_ttl	malformed	stable	kc_jti	kc_bcl	restart
1	0.625	✅	✅	❌	✅	✅	❌	❌	✅
2	0.625	✅	✅	❌	✅	✅	❌	❌	✅
3	0.500	✅	✅	❌	✅	✅	❌	❌	❌
4	0.625	✅	✅	❌	✅	✅	❌	❌	✅
5	0.500	❌	❌	❌	✅	✅	✅	✅	❌
6	0.250	❌	❌	❌	✅	✅	❌	❌	❌
7	0.625	✅	✅	❌	✅	✅	❌	❌	✅
8	0.500	✅	✅	❌	✅	✅	❌	❌	❌
9	1.000	✅	✅	✅	✅	✅	✅	✅	✅
10	0.250	❌	❌	❌	✅	✅	❌	❌	❌
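The per-run scores and per-subscore pass rates can be re-derived from this matrix as a consistency check. A minimal Python sketch, assuming all eight checks carry equal weight (0.125 each, as noted under Zero-Variance Checks); the `MATRIX` encoding and helper names are illustrative and are not taken from grader.py:

```python
# Pass/fail matrix transcribed from the table above (1 = pass, 0 = fail).
# Column order matches the table header.
CHECKS = ["logout", "replay", "redis_ttl", "malformed",
          "stable", "kc_jti", "kc_bcl", "restart"]
MATRIX = {
    1:  [1, 1, 0, 1, 1, 0, 0, 1],
    2:  [1, 1, 0, 1, 1, 0, 0, 1],
    3:  [1, 1, 0, 1, 1, 0, 0, 0],
    4:  [1, 1, 0, 1, 1, 0, 0, 1],
    5:  [0, 0, 0, 1, 1, 1, 1, 0],
    6:  [0, 0, 0, 1, 1, 0, 0, 0],
    7:  [1, 1, 0, 1, 1, 0, 0, 1],
    8:  [1, 1, 0, 1, 1, 0, 0, 0],
    9:  [1, 1, 1, 1, 1, 1, 1, 1],
    10: [0, 0, 0, 1, 1, 0, 0, 0],
}


def run_score(results):
    """Score for one run: fraction of checks passed (equal weights assumed)."""
    return sum(results) / len(results)


def pass_counts(matrix):
    """Number of runs passing each check, keyed by check name."""
    return {name: sum(row[i] for row in matrix.values())
            for i, name in enumerate(CHECKS)}


scores = {run: run_score(results) for run, results in MATRIX.items()}
mean_score = sum(scores.values()) / len(scores)
```

Under that equal-weight assumption the matrix reproduces the reported mean of 0.55 and the pass counts in the Per-Subscore Breakdown.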

Failure Mode Analysis

flowchart TD
    A[Agent starts task] --> B[Reads password manager & secrets]
    B --> C{Which Redis?}
    C -->|9/10 agents| D[redis-decoy<br/>redis.bleater.svc<br/>password: redis-default-password]
    C -->|1/10 agents| E[bleater-redis-0<br/>via CACHE_BACKEND_AUTH ConfigMap<br/>password: r3d1s-s3cur3-p@ss!2024]
    D --> F[Replay works BUT<br/>grader checks bleater-redis-0<br/>→ redis_revocation_ttl FAILS]
    E --> G[redis_revocation_ttl PASSES]

    A --> H[Reads source code]
    H --> I{Logout endpoint interface?}
    I -->|7/10 agents| J[POST /logout with<br/>Authorization: Bearer header]
    I -->|3/10 agents| K[POST /logout with<br/>JSON request body]
    J --> L[logout_works PASSES]
    K --> M[422 Field Required<br/>→ logout_works FAILS<br/>→ cascading failures]

    A --> N[Configures Keycloak]
    N --> O{Which realm?}
    O -->|8/10 agents| P[nebula realm<br/>nebula-app client]
    O -->|2/10 agents| Q[master realm<br/>admin-cli client]
    P --> R[Grader only checks master<br/>→ keycloak checks FAIL]
    Q --> S[keycloak checks PASS]

    style D fill:#f96,stroke:#333
    style E fill:#6f6,stroke:#333
    style K fill:#f96,stroke:#333
    style J fill:#6f6,stroke:#333
    style P fill:#f96,stroke:#333
    style Q fill:#6f6,stroke:#333

Failure Categorization

Failure Mode	Runs Affected	Category	Notes
Decoy Redis trap	1,2,3,4,5,6,7,8,10 (9/10)	Genuine	Deliberate misdirection. Real Redis discoverable via `CACHE_BACKEND_AUTH` ConfigMap. Decoy is non-durable (`--save "" --appendonly no`), violating task's durability requirement.
Keycloak realm confusion	1,2,3,4,6,7,8,10 (8/10)	Specification gap	Grader hardcodes `master`; agents follow Keycloak best practice toward `nebula` realm. See required change.
Logout body vs header	5,6,10 (3/10)	Genuine + minor spec gap	TODO hint improved from prior versions (was 6/10). Remaining agents implement body-based API.
Non-persistent changes	3,5,6,8,10 (5/10)	Genuine	Agents modify pods directly instead of Helm/Git. Tests real DevOps skill.
NetworkPolicy reconciliation	Various	Genuine	30s reconciliation loop. All agents discover it; best agents counteract it.

Key Behavioral Insight: The Decoy Redis "False Success" Signal

Agents using the decoy Redis observe working revocation during their own testing: tokens actually get rejected. But the grader checks bleater-redis-0, which is empty. This explains why replay_rejected passes at 70% while redis_revocation_ttl passes at only 10%. The two results are not independent. Both follow from the decoy trap: the behavioral replay check still passes against the decoy, and only the Redis TTL check inspects the real instance and detects the mistake.
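A toy model makes the mechanism concrete. The sketch below uses plain dictionaries as stand-ins for the two Redis instances; the `revoke` and `is_revoked` helpers and the key format are hypothetical and do not reflect the actual auth service or grader code:

```python
# Stand-ins for the two Redis instances (hypothetical; the real ones are network services).
decoy_redis = {}  # the instance most agents wired the auth service to
real_redis = {}   # bleater-redis-0, the instance the grader inspects


def revoke(store, jti):
    """Record a revoked token ID in the given store (key format is an assumption)."""
    store[f"revoked:{jti}"] = "1"


def is_revoked(store, jti):
    """Replay check: consults whichever store the service was wired to."""
    return f"revoked:{jti}" in store


# The agent wires the service to the decoy, then logs a token out.
service_store = decoy_redis
revoke(service_store, "token-123")

# Behavioral check (what the agent observes): the replayed token is rejected.
replay_rejected = is_revoked(service_store, "token-123")
# Storage check (what the grader inspects): nothing was written to the real instance.
revocation_in_real_redis = is_revoked(real_redis, "token-123")
```

Here `replay_rejected` is true while `revocation_in_real_redis` is false, which is the same split seen between the two subscores.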

Run 9: The Perfect Score

The sole 1.0 run (run 9, 760 messages) distinguished itself by:

Reading CACHE_BACKEND_AUTH in the auth service ConfigMap early → found real Redis
Exploring both Keycloak realms → configured admin-cli in master
Running a counter-deletion loop for the NetworkPolicy AND updating Helm charts
Persisting all changes through Git/Helm (not direct pod patching)

Keycloak Realm: Decision and Rationale

The Problem

The grader hardcodes KC_REALM = "master" (grader.py line 38). The task says "configured on a client" without specifying a realm. 8/10 agents configure the nebula realm (which has nebula-app, described as "Main Nebula application client"). 2/10 configure the master realm. The grader rejects all nebula-realm configurations.

Why the Realm Must Be Specified

We evaluated three positions through adversarial debate (three Opus subagents arguing each side, with cross-examination). Position B — specify the realm — prevailed, for these reasons:

1. There is no reasonable default realm to assume

Keycloak realms are always environment-specific. Every deployment creates custom realms for its applications. The only universal realm (master) is explicitly documented as not for application use:

"It is recommended that you do not use the master realm to manage the users and applications in your organization. Keep the master realm as a place for super admins to create and manage the realms in your system." — Keycloak Server Administration Guide

"The Master realm has elevated privileges that are necessary for administering the entire Keycloak instance. By adding regular users to this realm, you could inadvertently grant them access to sensitive areas." — Skycloak: Secure Your Keycloak's Master Realm

"Avoid using the master realm for direct integration with the application, keep it for management only, and create a realm for all application integrations." — How to Configure Keycloak Realms and Clients

An agent that chooses nebula over master is not guessing — it is applying correct domain knowledge. The auth service's use of the master realm (via a Python source code default os.getenv("KEYCLOAK_REALM", "master")) is an anti-pattern, not a discoverable clue.

2. Every in-environment signal points to `nebula`

We audited every source of Keycloak information available to agents inside the environment:

Source	What it shows	Points to
Password manager (`passwords.devops.local`)	Keycloak URL, admin creds — no realm mentioned	Neither
Gitea wiki (`Infrastructure-Services.md`)	"Keycloak - SSO and identity provider" — no realm	Neither
Keycloak admin console	`nebula` realm with `nebula-app` ("Main Nebula application client"), users, etc.	`nebula`
Keycloak admin console	`master` realm with default system clients (`admin-cli`, `account`)	Administration only
Pod environment (`kubectl exec -- env`)	No `KEYCLOAK_REALM` env var present	Neither
Auth service source (`/app/main.py` line 54)	`os.getenv("KEYCLOAK_REALM", "master")` — Python default	`master` (buried, anti-pattern)

The only signal pointing to master is a Python code default in a file the agent reads for a different purpose. It is not set as a Kubernetes env var, ConfigMap entry, or Secret — kubectl exec -- env shows nothing for KEYCLOAK_REALM. Meanwhile, the Keycloak admin UI prominently presents the nebula realm with a client explicitly described as the main application client.
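The fallback behavior is easy to reproduce. A minimal sketch of how the default applies when the variable is absent from the pod environment; the `resolve_realm` wrapper is illustrative, and only the `os.getenv("KEYCLOAK_REALM", "master")` call is taken from `/app/main.py`:

```python
import os


def resolve_realm(environ=None):
    """Mirror os.getenv("KEYCLOAK_REALM", "master") against an explicit mapping."""
    env = os.environ if environ is None else environ
    return env.get("KEYCLOAK_REALM", "master")


# Pod environment without KEYCLOAK_REALM: the code default applies.
realm_when_unset = resolve_realm({"PATH": "/usr/bin"})          # "master"
# If the variable were set explicitly, it would take precedence.
realm_when_set = resolve_realm({"KEYCLOAK_REALM": "nebula"})    # "nebula"
```

Because the value never appears in the environment, an agent inspecting the pod sees no realm at all; the effective realm is visible only by reading the source.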

3. The rubric requires discoverability, and this doesn't meet the bar

From the Task Author Rubric:

"If the grader hardcodes resource names, ports, or labels, those must be discoverable by the agent. The agent should never have to guess a name to pass a check."

"For every grader check, you should be able to point to where the agent would learn about that requirement — from the prompt, from Gitea issues, from wiki docs, from cluster state, or because the E2E test implies it."

The grader hardcodes KC_REALM = "master". The only discoverable path to this value is a Python source code default that contradicts Keycloak's own documentation. One competing signal in one file, weighed against an entire realm architecture pointing elsewhere, does not constitute adequate discoverability — especially when the "correct" answer violates the software's documented best practice.

4. The 80/20 failure pattern matches "task unclear"

From the Task Review Guide failure analysis flow:

"Agent does wrong thing → Task unclear → Clarify task.yaml deliverables"

8/10 agents demonstrated full Keycloak competency — they found Keycloak, understood protocol mappers, configured back-channel logout — but chose the wrong realm because the environment's strongest signals pointed there. This is the textbook pattern for a specification gap, not a skill gap.

Recommended Fix

Add to task.yaml, in the Keycloak requirements:

"The authentication service currently authenticates against Keycloak's master realm."

This:

Names the realm (satisfying "agent should never have to guess a name to pass a check")
Still requires the agent to find the correct client, configure the jti mapper, set up back-channel logout, and verify the configuration
Does not enumerate specific client settings, ports, or implementation steps
Preserves all genuine engineering difficulty

Alternative: Seed this information into a Gitea issue or wiki page instead of task.yaml, which would preserve more investigation difficulty while still making the realm discoverable through a documented in-environment channel.

Note on score impact: Even with a realm hint, the mean score is estimated at ~0.62-0.69 — still below the 0.70 threshold. The decoy Redis trap (1/10 pass), persistence requirements (5/10 pass), and logout interface (7/10 pass) provide ample genuine difficulty.

Eliminated Position: Fix Grader to Accept Both Realms

This was considered and rejected. The authentication service resolves its realm to master (KEYCLOAK_REALM is unset in the pod, so the source code default applies). A jti mapper on nebula-app in the nebula realm is therefore functionally inert and would never affect tokens the auth service issues. Accepting nebula-realm configurations would credit agents for non-functional changes, undermining evaluation integrity.

Prior Review Resolution

Dylan's review on 2026-03-17 identified one blocking issue and several non-blocking recommendations.

Blocking Issue: Specify POST /logout endpoint path

Original requirement: "task.yaml line 25 — specify POST /logout endpoint path. Blocking because 6/10 eval agents chose /revoke due to ambiguity."

Author's approach (v25): Indirect specification via TODO injection:

setup.sh lines 478-495: Injects # TODO: Implement POST /logout endpoint for token revocation into auth service source code
task.yaml line 25: "Users can explicitly revoke their tokens. There was somewhere that developer left To do in the source code regarding that. (logout)"

Result: Pass rate on logout_works improved to 7/10. The 3 remaining failures are agents who built body-based endpoints (not a naming issue), so the /logout path discovery is largely solved.

Assessment: The TODO injection approach is effective but the task.yaml phrasing is awkward. The original suggestion of explicit specification ("users can explicitly revoke their tokens by POSTing to /logout") would be cleaner, but the current approach is functional. The remaining 3/10 failures are from interface convention mismatch (body vs header), not endpoint naming.
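For reference, the header-based convention used by the passing agents amounts to reading the token from the `Authorization` header rather than from a JSON body. A framework-agnostic sketch, with `extract_bearer_token` as a hypothetical helper (the actual endpoint implementation is not shown in this review):

```python
def extract_bearer_token(headers):
    """Return the token from an 'Authorization: Bearer <token>' header, else None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    value = normalized.get("authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()
```

An endpoint built this way accepts a POST with no body at all, which is the request shape the body-based implementations rejected with 422.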

Non-Blocking Recommendations

Recommendation	Status
Pin redis wheel version in Dockerfile	✅ Fixed (`redis==5.2.1`)
Remove unused PyJWT/python-jose from Dockerfile	✅ Fixed
Hint toward Keycloak master realm	❌ Now a required change (see above)
grader.py line 631 early return bug	❌ Not addressed (latent, not seen in evals)
setup.sh line 182 git checkout robustness	❌ Not addressed (non-blocking)
solution.sh self-test intensity (1 replay vs grader's 12)	❌ Not addressed (non-blocking)

Additional Notes

Zero-Variance Checks

malformed_logout_handling and service_stable both pass 10/10 with zero variance:

malformed_logout_handling passes because any endpoint returning 422 for missing body satisfies "4xx and not 500"
service_stable passes because the base platform stays healthy regardless of agent changes

At 0.125 weight each, these don't enable meaningful score inflation (max free score = 0.25 < 0.3 threshold). Non-blocking but worth noting as candidates for consolidation in a future iteration. If the Keycloak realm fix pushes the mean score too close to 0.70, we can collapse these two always-passing subscores into a single subscore — reducing from 8 checks to 7 and lowering the free-score floor from 0.25 to ~0.14.
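The floor arithmetic can be checked directly. A small sketch, assuming the consolidated rubric would keep equal weights across its seven checks (the review does not specify the post-merge weighting):

```python
# Current rubric: eight equally weighted checks, two of which always pass.
weight_now = 1 / 8                      # 0.125 per check
free_floor_now = 2 * weight_now         # 0.25

# Consolidated rubric (assumed equal weights): seven checks, one always passes.
weight_merged = 1 / 7
free_floor_merged = 1 * weight_merged   # about 0.143
```

Merging the two checks would also raise the weight of every remaining check from 0.125 to roughly 0.143, so the discriminating subscores would count slightly more.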

Grader Line 631 Latent Bug

In check_keycloak_backchannel_logout, if the first client with a matching backchannel.logout.url has session.required != "true", the function returns False immediately without checking other clients. Not hitting agents in current evals. Non-blocking but worth fixing for robustness.
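The pattern described above can be sketched as follows. This is a reconstruction from the description, not the grader's actual code: the client dictionary shape, the exact-match URL comparison, and the function names `check_buggy` and `check_fixed` are assumptions.

```python
URL_KEY = "backchannel.logout.url"
SESSION_KEY = "backchannel.logout.session.required"


def check_buggy(clients, expected_url):
    """Early-return pattern: the first client with a matching URL decides the result."""
    for client in clients:
        attrs = client.get("attributes", {})
        if attrs.get(URL_KEY) == expected_url:
            if attrs.get(SESSION_KEY) != "true":
                return False  # returns without examining later clients
            return True
    return False


def check_fixed(clients, expected_url):
    """Pass if any client has both the matching URL and session.required == 'true'."""
    return any(
        client.get("attributes", {}).get(URL_KEY) == expected_url
        and client.get("attributes", {}).get(SESSION_KEY) == "true"
        for client in clients
    )
```

With two clients sharing the same logout URL, where only the second has session.required set to "true", the early-return version fails while the exhaustive version passes.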

Task ID Typo

oidc-token-reply-mitigation should be oidc-token-replay-mitigation. Cosmetic but reflects poorly. Non-blocking.
