| UUID | 3b31f3b1-2f04-4033-acbe-e3f9dd8c6343 |
| Version | 25 |
| Category | Security & Authentication |
| Difficulty | Hard |
| Reviewer | Dylan Fitzgerald |
| Status | NEEDS_WORK — one required change (Keycloak realm specification), plus minor recommendations |
| Discord thread | https://discord.com/channels/1427397917685321919/1479911866171130008 |
The task is well-designed with genuine difficulty from the decoy Redis trap, NetworkPolicy reconciliation loop, and persistence requirements. Score distribution (mean 0.55) is healthy. Solution passes. One required change: the Keycloak realm must be specified — the current design punishes agents for correctly applying Keycloak best practice.
- Specify the Keycloak realm. The task must tell agents which realm to configure — either in task.yaml or via a discoverable in-environment artifact. See Keycloak Realm: Decision and Rationale for full analysis.
- `grader.py` line 631: fix early return in `check_keycloak_backchannel_logout` (latent bug, not hitting agents yet)
- `setup.sh` line 182: use `git checkout origin/master --` instead of `HEAD~1` for revert robustness
- `solution.sh` lines 757-767: self-tests send 1 replay request; grader sends 12; closer mirroring would catch flaky solutions
- Task ID typo: `oidc-token-reply-mitigation` → `oidc-token-replay-mitigation`
- Consider consolidating the two zero-variance checks (`malformed_logout_handling`, `service_stable`) in a future iteration
| Metric | Value |
|---|---|
| Solution passes | ✅ Yes (1.0, all 8 checks) |
| Mean score | 0.55 (threshold: <0.70 ✅) |
| Score range | 0.25 – 1.0 |
| Full pass (1.0) | 1/10 (10%) |
| Scores | 0.25, 0.25, 0.5, 0.5, 0.5, 0.625, 0.625, 0.625, 0.625, 1.0 |
xychart-beta
title "Score Distribution (v25, 10 rollouts)"
x-axis ["Run 6", "Run 10", "Run 3", "Run 5", "Run 8", "Run 1", "Run 2", "Run 4", "Run 7", "Run 9"]
y-axis "Score" 0 --> 1.0
bar [0.25, 0.25, 0.5, 0.5, 0.5, 0.625, 0.625, 0.625, 0.625, 1.0]
| Subscore | Pass Rate | Variance | Analysis |
|---|---|---|---|
| `logout_works` | 7/10 (70%) | ✅ | TODO hint working well |
| `replay_rejected` | 7/10 (70%) | ✅ | Correlates with `logout_works` |
| `redis_revocation_ttl` | 1/10 (10%) | ✅ | Decoy Redis trap: genuine difficulty |
| `malformed_logout_handling` | 10/10 (100%) | ❌ none | Always passes (see note) |
| `service_stable` | 10/10 (100%) | ❌ none | Always passes (see note) |
| `keycloak_jti_mapper` | 2/10 (20%) | ✅ | Realm specification gap; see required change |
| `keycloak_backchannel_logout` | 2/10 (20%) | ✅ | Realm specification gap; see required change |
| `survives_restart` | 5/10 (50%) | ✅ | Good discriminator |
6/8 subscores have variance — healthy discrimination overall.
| Run | Score | logout | replay | redis_ttl | malformed | stable | kc_jti | kc_bcl | restart |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.625 | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
| 2 | 0.625 | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
| 3 | 0.500 | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| 4 | 0.625 | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
| 5 | 0.500 | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
| 6 | 0.250 | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| 7 | 0.625 | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
| 8 | 0.500 | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| 9 | 1.000 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| 10 | 0.250 | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
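The per-run matrix above is enough to recompute the headline numbers. The following is a minimal sketch, assuming all eight checks carry equal weight (0.125 each, as stated for the two always-passing checks); the `RUNS` encoding and variable names are illustrative and not taken from the grader.

```python
# Pass/fail matrix transcribed from the per-run table, one bit per check
# in the column order of that table (1 = pass, 0 = fail).
CHECKS = ["logout", "replay", "redis_ttl", "malformed",
          "stable", "kc_jti", "kc_bcl", "restart"]
RUNS = {
    1: "11011001", 2: "11011001", 3: "11011000", 4: "11011001",
    5: "00011110", 6: "00011000", 7: "11011001", 8: "11011000",
    9: "11111111", 10: "00011000",
}

# Assumption: every check is weighted equally (1/8 = 0.125).
WEIGHT = 1 / len(CHECKS)

# Score per run = number of passed checks times the per-check weight.
scores = {run: sum(int(bit) for bit in bits) * WEIGHT
          for run, bits in RUNS.items()}

# Pass count per check = column sum across all runs.
pass_counts = {name: sum(int(bits[i]) for bits in RUNS.values())
               for i, name in enumerate(CHECKS)}

mean_score = sum(scores.values()) / len(scores)

print(f"mean score: {mean_score:.2f}")
for name, count in pass_counts.items():
    print(f"{name}: {count}/{len(RUNS)}")
```

Under that equal-weight assumption, the recomputed scores and pass counts agree with the summary and subscore tables.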
flowchart TD
A[Agent starts task] --> B[Reads password manager & secrets]
B --> C{Which Redis?}
C -->|9/10 agents| D[redis-decoy<br/>redis.bleater.svc<br/>password: redis-default-password]
C -->|1/10 agents| E[bleater-redis-0<br/>via CACHE_BACKEND_AUTH ConfigMap<br/>password: r3d1s-s3cur3-p@ss!2024]
D --> F[Replay works BUT<br/>grader checks bleater-redis-0<br/>→ redis_revocation_ttl FAILS]
E --> G[redis_revocation_ttl PASSES]
A --> H[Reads source code]
H --> I{Logout endpoint interface?}
I -->|7/10 agents| J[POST /logout with<br/>Authorization: Bearer header]
I -->|3/10 agents| K[POST /logout with<br/>JSON request body]
J --> L[logout_works PASSES]
K --> M[422 Field Required<br/>→ logout_works FAILS<br/>→ cascading failures]
A --> N[Configures Keycloak]
N --> O{Which realm?}
O -->|8/10 agents| P[nebula realm<br/>nebula-app client]
O -->|2/10 agents| Q[master realm<br/>admin-cli client]
P --> R[Grader only checks master<br/>→ keycloak checks FAIL]
Q --> S[keycloak checks PASS]
style D fill:#f96,stroke:#333
style E fill:#6f6,stroke:#333
style K fill:#f96,stroke:#333
style J fill:#6f6,stroke:#333
style P fill:#f96,stroke:#333
style Q fill:#6f6,stroke:#333
| Failure Mode | Runs Affected | Category | Notes |
|---|---|---|---|
| Decoy Redis trap | 1,2,3,4,5,6,7,8,10 (9/10) | Genuine | Deliberate misdirection. Real Redis discoverable via CACHE_BACKEND_AUTH ConfigMap. Decoy is non-durable (--save "" --appendonly no), violating the task's durability requirement. |
| Keycloak realm confusion | 1,2,3,4,6,7,8,10 (8/10) | Specification gap | Grader hardcodes master; agents follow Keycloak best practice toward nebula realm. See required change. |
| Logout body vs header | 5,6,10 (3/10) | Genuine + minor spec gap | TODO hint improved from prior versions (was 6/10). Remaining agents implement body-based API. |
| Non-persistent changes | 3,5,6,8,10 (5/10) | Genuine | Agents modify pods directly instead of Helm/Git. Tests real DevOps skill. |
| NetworkPolicy reconciliation | Various | Genuine | 30s reconciliation loop. All agents discover it; best agents counteract it. |
Agents using the decoy Redis observe working revocation during their own testing: tokens actually get rejected. But the grader checks bleater-redis-0, which is empty. This explains why replay_rejected passes at 70% while redis_revocation_ttl passes at only 10%. The two results are not independent: both follow from the decoy trap, but only the Redis TTL check catches it.
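The mechanism can be illustrated with a toy model. This is a sketch only: `FakeRedis` is an in-memory stand-in rather than a real Redis client, and the `revoked:<jti>` key format, the 300-second TTL, and the variable names are hypothetical rather than taken from the task or grader.

```python
import math
import time


class FakeRedis:
    """In-memory stand-in for one Redis instance holding keys with expiry."""

    def __init__(self):
        self._data = {}

    def setex(self, key, ttl_seconds, value):
        # Store the value together with its absolute expiry time.
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def ttl(self, key):
        # Like Redis, report -2 when the key does not exist (or has expired).
        item = self._data.get(key)
        if item is None or item[1] <= time.monotonic():
            return -2
        return math.ceil(item[1] - time.monotonic())


decoy = FakeRedis()  # stands in for redis-decoy
real = FakeRedis()   # stands in for bleater-redis-0

# The agent wires the auth service to the decoy, so logout writes there.
auth_service_store = decoy
auth_service_store.setex("revoked:example-jti", 300, "1")  # hypothetical key

# Replay check: the auth service consults its own store, so the token is rejected.
replay_rejected = auth_service_store.ttl("revoked:example-jti") > 0

# Grader-style check: inspects the real instance, which never received the key.
grader_sees_entry = real.ttl("revoked:example-jti") > 0

print(replay_rejected, grader_sees_entry)
```

In this model the replay check succeeds while the TTL check on the real instance finds nothing, which is the pattern seen in the runs that used the decoy.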
The sole 1.0 run (run 9, 760 messages) distinguished itself by:
- Reading `CACHE_BACKEND_AUTH` in the auth service ConfigMap early → found real Redis
- Exploring both Keycloak realms → configured `admin-cli` in master
- Running a counter-deletion loop for the NetworkPolicy AND updating Helm charts
- Persisting all changes through Git/Helm (not direct pod patching)
The grader hardcodes KC_REALM = "master" (grader.py line 38). The task says "configured on a client" without specifying a realm. 8/10 agents configure the nebula realm (which has nebula-app, described as "Main Nebula application client"). 2/10 configure the master realm. The grader rejects all nebula-realm configurations.
We evaluated three positions through adversarial debate (three Opus subagents arguing each side, with cross-examination). Position B — specify the realm — prevailed, for these reasons:
Keycloak realms are always environment-specific. Every deployment creates custom realms for its applications. The only universal realm (master) is explicitly documented as not for application use:
"It is recommended that you do not use the master realm to manage the users and applications in your organization. Keep the master realm as a place for super admins to create and manage the realms in your system." — Keycloak Server Administration Guide
"The Master realm has elevated privileges that are necessary for administering the entire Keycloak instance. By adding regular users to this realm, you could inadvertently grant them access to sensitive areas." — Skycloak: Secure Your Keycloak's Master Realm
"Avoid using the master realm for direct integration with the application, keep it for management only, and create a realm for all application integrations." — How to Configure Keycloak Realms and Clients
An agent that chooses nebula over master is not guessing — it is applying correct domain knowledge. The auth service's use of the master realm (via a Python source code default os.getenv("KEYCLOAK_REALM", "master")) is an anti-pattern, not a discoverable clue.
We audited every source of Keycloak information available to agents inside the environment:
| Source | What it shows | Points to |
|---|---|---|
| Password manager (`passwords.devops.local`) | Keycloak URL, admin creds; no realm mentioned | Neither |
| Gitea wiki (`Infrastructure-Services.md`) | "Keycloak - SSO and identity provider"; no realm | Neither |
| Keycloak admin console | nebula realm with nebula-app ("Main Nebula application client"), users, etc. | nebula |
| Keycloak admin console | master realm with default system clients (admin-cli, account) | Administration only |
| Pod environment (`kubectl exec -- env`) | No `KEYCLOAK_REALM` env var present | Neither |
| Auth service source (`/app/main.py` line 54) | `os.getenv("KEYCLOAK_REALM", "master")`, a Python default | master (buried, anti-pattern) |
The only signal pointing to master is a Python code default in a file the agent reads for a different purpose. It is not set as a Kubernetes env var, ConfigMap entry, or Secret — kubectl exec -- env shows nothing for KEYCLOAK_REALM. Meanwhile, the Keycloak admin UI prominently presents the nebula realm with a client explicitly described as the main application client.
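To make the point about the code default concrete, here is a minimal sketch of how that lookup behaves. `resolve_realm` is a hypothetical helper that mirrors the quoted `os.getenv("KEYCLOAK_REALM", "master")` pattern; it takes an explicit mapping so the example does not modify the real process environment.

```python
import os


def resolve_realm(environ=None):
    """Return KEYCLOAK_REALM from the environment, else the 'master' default.

    Hypothetical helper mirroring os.getenv("KEYCLOAK_REALM", "master").
    """
    env = os.environ if environ is None else environ
    return env.get("KEYCLOAK_REALM", "master")


# Pod as observed: no KEYCLOAK_REALM variable, so the code default applies.
print(resolve_realm({}))  # master

# If the variable were set, it would override the default and be visible in `env`.
print(resolve_realm({"KEYCLOAK_REALM": "nebula"}))  # nebula
```

Because the default lives only in source, inspecting the pod environment shows nothing about the realm; an agent learns it only by reading that line of code.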
From the Task Author Rubric:
"If the grader hardcodes resource names, ports, or labels, those must be discoverable by the agent. The agent should never have to guess a name to pass a check."
"For every grader check, you should be able to point to where the agent would learn about that requirement — from the prompt, from Gitea issues, from wiki docs, from cluster state, or because the E2E test implies it."
The grader hardcodes KC_REALM = "master". The only discoverable path to this value is a Python source code default that contradicts Keycloak's own documentation. One competing signal in one file, weighed against an entire realm architecture pointing elsewhere, does not constitute adequate discoverability — especially when the "correct" answer violates the software's documented best practice.
From the Task Review Guide failure analysis flow:
"Agent does wrong thing → Task unclear → Clarify task.yaml deliverables"
8/10 agents demonstrated full Keycloak competency — they found Keycloak, understood protocol mappers, configured back-channel logout — but chose the wrong realm because the environment's strongest signals pointed there. This is the textbook pattern for a specification gap, not a skill gap.
Add to task.yaml, in the Keycloak requirements:
"The authentication service currently authenticates against Keycloak's master realm."
This:
- Names the realm (satisfying "agent should never have to guess a name to pass a check")
- Still requires the agent to find the correct client, configure the jti mapper, set up back-channel logout, and verify the configuration
- Does not enumerate specific client settings, ports, or implementation steps
- Preserves all genuine engineering difficulty
Alternative: Seed this information into a Gitea issue or wiki page instead of task.yaml, which would preserve more investigation difficulty while still making the realm discoverable through a documented in-environment channel.
Note on score impact: Even with a realm hint, the mean score is estimated at ~0.62-0.69 — still below the 0.70 threshold. The decoy Redis trap (1/10 pass), persistence requirements (5/10 pass), and logout interface (7/10 pass) provide ample genuine difficulty.
Accepting configurations in either realm was considered and rejected. The authentication service resolves its realm to master (the os.getenv("KEYCLOAK_REALM", "master") default, with no KEYCLOAK_REALM set in the pod), so a jti mapper on nebula-app in the nebula realm is functionally inert and would never affect tokens the auth service issues. Accepting nebula-realm configurations would credit agents for non-functional changes, undermining evaluation integrity.
Dylan's review on 2026-03-17 identified one blocking issue and several non-blocking recommendations.
Original requirement: "task.yaml line 25 — specify POST /logout endpoint path. Blocking because 6/10 eval agents chose /revoke due to ambiguity."
Author's approach (v25): Indirect specification via TODO injection:
- `setup.sh` lines 478-495: Injects `# TODO: Implement POST /logout endpoint for token revocation` into auth service source code
- `task.yaml` line 25: "Users can explicitly revoke their tokens. There was somewhere that developer left To do in the source code regarding that. (logout)"
Result: Pass rate on logout_works improved to 7/10. The 3 remaining failures are agents who built body-based endpoints (not a naming issue), so the /logout path discovery is largely solved.
Assessment: The TODO injection approach is effective, but the task.yaml phrasing is awkward. The original suggestion of explicit specification ("users can explicitly revoke their tokens by POSTing to /logout") would be cleaner, but the current approach is functional. The remaining 3/10 failures stem from an interface convention mismatch (body vs header), not endpoint naming.
| Recommendation | Status |
|---|---|
| Pin redis wheel version in Dockerfile | ✅ Fixed (redis==5.2.1) |
| Remove unused PyJWT/python-jose from Dockerfile | ✅ Fixed |
| Hint toward Keycloak master realm | ❌ Now a required change (see above) |
| grader.py line 631 early return bug | ❌ Not addressed (latent, not seen in evals) |
| setup.sh line 182 git checkout robustness | ❌ Not addressed (non-blocking) |
| solution.sh self-test intensity (1 replay vs grader's 12) | ❌ Not addressed (non-blocking) |
malformed_logout_handling and service_stable both pass 10/10 with zero variance:
- `malformed_logout_handling` passes because any endpoint returning 422 for missing body satisfies "4xx and not 500"
- `service_stable` passes because the base platform stays healthy regardless of agent changes
At 0.125 weight each, these don't enable meaningful score inflation (max free score = 0.25 < 0.3 threshold). Non-blocking but worth noting as candidates for consolidation in a future iteration. If the Keycloak realm fix pushes the mean score too close to 0.70, we can collapse these two always-passing subscores into a single subscore — reducing from 8 checks to 7 and lowering the free-score floor from 0.25 to ~0.14.
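The free-score arithmetic can be checked with a short sketch, assuming every check is weighted equally (as the 0.125 figure implies for eight checks). `free_score_floor` is a hypothetical helper name.

```python
from fractions import Fraction


def free_score_floor(total_checks, always_passing):
    """Score an agent gets for free when checks are equally weighted."""
    return Fraction(always_passing, total_checks)


# Current layout: 8 checks, 2 of which always pass.
current = free_score_floor(8, 2)

# After consolidation: 7 checks, 1 of which always passes.
consolidated = free_score_floor(7, 1)

print(f"current floor: {float(current):.3f}")
print(f"consolidated floor: {float(consolidated):.3f}")
```

This gives 1/4 = 0.25 for the current layout and 1/7 ≈ 0.143 after merging the two checks, matching the figures quoted above.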
In check_keycloak_backchannel_logout, if the first client with a matching backchannel.logout.url has session.required != "true", the function returns False immediately without checking other clients. Not hitting agents in current evals. Non-blocking but worth fixing for robustness.
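The difference between the reported behaviour and a more robust version can be sketched as follows. This is not the grader's actual code: the function names, the client dictionaries, and the example URLs are hypothetical, and the sketch assumes the session flag is stored under Keycloak's `backchannel.logout.session.required` client attribute.

```python
URL_KEY = "backchannel.logout.url"
SESSION_KEY = "backchannel.logout.session.required"  # assumed attribute name


def check_early_return(clients):
    """Mimics the reported bug: the first client with a logout URL decides."""
    for client in clients:
        attrs = client.get("attributes", {})
        if attrs.get(URL_KEY):
            if attrs.get(SESSION_KEY) != "true":
                return False  # stops here without examining later clients
            return True
    return False


def check_all_clients(clients):
    """Passes if any client has a logout URL and session.required == 'true'."""
    for client in clients:
        attrs = client.get("attributes", {})
        if attrs.get(URL_KEY) and attrs.get(SESSION_KEY) == "true":
            return True
    return False


# Hypothetical realm state: an incomplete client listed before a correct one.
clients = [
    {"clientId": "first", "attributes": {
        URL_KEY: "http://example.invalid/a", SESSION_KEY: "false"}},
    {"clientId": "second", "attributes": {
        URL_KEY: "http://example.invalid/b", SESSION_KEY: "true"}},
]

print(check_early_return(clients))  # False
print(check_all_clients(clients))   # True
```

With this ordering the early-return version rejects a realm that does contain a correctly configured client, while the version that scans every client accepts it.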
oidc-token-reply-mitigation should be oidc-token-replay-mitigation. Cosmetic, but it reflects poorly on the task. Non-blocking.