Verdict: APPROVE
The task concept is strong and the v20 execution delivers on it. Projected-volume source ordering maps to real incidents, the multi-key ConfigMap conflict creates genuine reasoning difficulty, and the GitOps reconciliation requirement (~28 handoff files across 5 directories in JSON, env, YAML, INI, and TSV formats) adds meaningful scope. The grader uses 8 equally-weighted subscores that represent working milestones. Setup, solution, and Dockerfile are all clean.
| Version | Config | Runs | Scores | Mean | Threshold |
|---|---|---|---|---|---|
| v12 | 16vCPU, default turns | 5 | 1.0, 1.0, 1.0, 1.0, 1.0 | 1.0 | Fails |
| v13 | 16vCPU, default turns | 5 | 1.0, 0.14, 1.0, 1.0, 0.14 | 0.66 | Fails |
| v15 | 16vCPU, default turns | 10 | 0.14, 1.0, 0.14, 1.0, 0.14, 0.14, 1.0, 1.0, 0.14, 0.14 | 0.49 | Fails (fairness) |
| v20 | 16vCPU, default turns | 5 | 0.25, 0.75, 1.0, 0.0, 0.0 | 0.40 | Passes |
Docker difficulty threshold: mean < 0.50. v20 mean: 0.40.
v12-v13 had no key conflict between ConfigMaps — ordering was functionally arbitrary. v15 flipped the grader's expected ordering to game the threshold but still had the fairness issue. v20 addresses the root cause.
The author implemented the three recommendations from our prior review:
1. Key conflict added. `like-service-config` now defines
`telemetry.enabled=false` alongside `database.connection.timeout=5`. This makes
only one source ordering correct:
   - `[platform-defaults, shared-hotfix, like-service-config, runtime-flags]` → `timeout=5`, `telemetry=true` (correct)
   - `[platform-defaults, shared-hotfix, runtime-flags, like-service-config]` → `timeout=5`, `telemetry=false` (wrong)
Agents that get the ordering wrong see observably broken pod behavior.
2. auth-service-projected nerfed. Now uses completely different ConfigMaps:
[platform-defaults, auth-service-config, auth-service-runtime-flags]. No
shared-hotfix, no like-service-config, no bleater-runtime-flags. The ordering
pattern is non-transferable. Agents that check it learn about projected volumes
generally but can't copy the specific ordering.
3. Free floor eliminated. deployment_structure_preserved is now called
inside live_runtime_behavior_fixed rather than being a separate subscore. The
anti-cheat checks still run, but an agent that gets the ordering wrong scores 0.0
instead of 0.14.
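The corrected source ordering from point 1 can be sketched as a projected-volume spec. The volume name and overall manifest context are assumptions for illustration; the ConfigMap names come from the review, and per the review's semantics, later sources override earlier ones on conflicting keys:

```yaml
# Sketch only -- volume name "merged-config" is assumed, not from the task.
volumes:
  - name: merged-config
    projected:
      sources:
        - configMap:
            name: platform-defaults      # baseline values (e.g. timeout=30)
        - configMap:
            name: shared-hotfix
        - configMap:
            name: like-service-config    # timeout=5, telemetry=false
        - configMap:
            name: runtime-flags          # telemetry=true must land last
```

Swapping the last two sources is exactly the "wrong" ordering above: `like-service-config`'s `telemetry.enabled=false` would then win.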
Additional improvements:
- Two new subscores: `winner_metadata_refreshed` (YAML winner matrix + INI release digest) and `resolution_matrix_refreshed` (TSV resolution matrix + env current-order file). These require agents to record which ConfigMap wins per key -- demonstrating understanding of the conflict, not just the ordering.
- task.yaml now says "reads the intended timeout value" instead of revealing the exact value "5". Agents must discover target values from the ConfigMaps.
- Broken-state metadata files (winner-matrix.yaml, resolution-matrix.tsv) are seeded with descriptions of the current wrong state, creating a realistic but misleading signal that caught 2 of 5 agents.
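The per-key resolution that the winner matrix and resolution matrix record can be sketched as a last-wins merge over the ordered sources. This is a minimal sketch: the two keys in `like-service-config` and the broken-state timeout value 30 come from the review; the remaining ConfigMap contents are assumptions.

```python
# Assumed ConfigMap contents -- only like-service-config's keys and
# platform-defaults' timeout=30 are stated in the review.
CONFIGMAPS = {
    "platform-defaults":   {"database.connection.timeout": "30"},
    "shared-hotfix":       {},  # contents not stated in the review
    "like-service-config": {"database.connection.timeout": "5",
                            "telemetry.enabled": "false"},
    "runtime-flags":       {"telemetry.enabled": "true"},
}

def resolve(order):
    """Merge sources in order; later sources override earlier ones.

    Returns (merged values, per-key winning ConfigMap) -- the latter is
    the information the winner/resolution matrices are meant to capture.
    """
    merged, winners = {}, {}
    for name in order:
        for key, value in CONFIGMAPS[name].items():
            merged[key] = value
            winners[key] = name
    return merged, winners

merged_ok, winners_ok = resolve(
    ["platform-defaults", "shared-hotfix", "like-service-config", "runtime-flags"])
merged_bad, winners_bad = resolve(
    ["platform-defaults", "shared-hotfix", "runtime-flags", "like-service-config"])
```

Only the first ordering yields both `timeout=5` and `telemetry=true`; the second gets the timeout right but lets `like-service-config` win `telemetry.enabled`.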
Three independent failure modes -- this is a healthy task with multi-dimensional difficulty:
| Run | Score | Failure mode |
|---|---|---|
| 1 | 0.25 | Correct ordering, incomplete artifact coverage (~11 of ~28 files) |
| 2 | 0.75 | Correct ordering, near-complete coverage, TSV tool selection error |
| 3 | 0.0 | Wrong ordering -- misread broken-state metadata as target state |
| 4 | 1.0 | Perfect -- explicit two-constraint reasoning, comprehensive coverage |
| 5 | 0.0 | Wrong ordering -- same metadata trap as run 3 |
Both agents read the seeded resolution-matrix.tsv and winner-matrix.yaml
(which describe the broken state: timeout winner = platform-defaults, value 30)
and concluded "the intended state is what these files say." This led them to
reorder the non-service ConfigMaps while leaving like-service-config in the
wrong position. Neither agent traced both keys through like-service-config to
realize the telemetry conflict forces a specific ordering.
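For concreteness, the seeded broken-state resolution matrix plausibly looked something like the fragment below. The column layout is an assumption; the timeout winner and value are from the review:

```
key	winner	value
database.connection.timeout	platform-defaults	30
```

A file like this truthfully describes the broken cluster, which is precisely why agents that treated it as the target state reordered toward the wrong configuration.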
This is genuine difficulty: the broken-state metadata is realistic (GitOps artifacts that describe the current state are standard practice), and the two-constraint reasoning (both timeout AND telemetry must be correct) requires understanding projected-volume semantics across multiple ConfigMaps.
Run 1 (0.25): Got the ordering right and verified the pod, but updated only ~11 of
~28 handoff files before declaring done, without enumerating all files before
starting updates. Passed `live_runtime_behavior_fixed` and
`canonical_manifest_fixed` (2/8 = 0.25).
Run 2 (0.75): Correct ordering and near-complete coverage, but used
`str_replace_editor` for TSV files, which wrote literal `\t` characters instead
of real tab bytes (ASCII 0x09). The grader's `csv.DictReader(delimiter='\t')`
couldn't parse them, failing `release_bundle_artifacts_refreshed` and
`resolution_matrix_refreshed` (6/8 = 0.75). The passing run used a bash heredoc
(`cat > file << 'EOF'`), which preserves real tabs.
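The TSV failure is easy to reproduce. A minimal sketch with in-memory strings (the file contents are illustrative, not the task's real TSVs): `csv.DictReader(delimiter='\t')` splits on the real tab byte 0x09, not on the two-character sequence backslash-`t` that a naive editor write produces.

```python
import csv
import io

# Genuine tab bytes (0x09) between fields -- what the heredoc produced.
real_tabs = "key\tvalue\ntimeout\t5\n"

# Literal backslash + "t" characters -- what str_replace_editor wrote.
literal_backslash_t = "key\\tvalue\ntimeout\\t5\n"

def parse(text):
    """Parse TSV text the way the grader does."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

good = parse(real_tabs)            # one row with separate key/value fields
bad = parse(literal_backslash_t)   # one row with a single unsplit field
```

With real tabs the row parses into two named columns; with literal `\t` the entire header collapses into one field name and every row has a single unsplit value, so per-column checks fail.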
No issues found. The grader is well-constructed:
- 8 equal-weight subscores, all testing working milestones
- `deployment_structure_preserved` correctly folded into `live_runtime_behavior_fixed`
- The two new checks (`winner_metadata_refreshed`, `resolution_matrix_refreshed`) correctly validate that agents understand the per-key conflict resolution
- `EXPECTED_SOURCE_ORDER` is the only ordering that produces both `timeout=5` and `telemetry=true` -- the grader is correct to enforce exactly this
- No free floor -- wrong ordering produces 0.0
- Anti-cheat checks (deployment UIDs, ConfigMap resourceVersions) still active
When re-running evals:
- Model: biggie-nebula
- Agent type: meteor
- Machine type: `--machine-type e2-custom-16-32768` (16vCPU/32GiB)
- Max turns: default (800) -- do not set `--max-turns 15`