Task: 9cd2b86e-15bf-4c76-8685-efffd1114c8f
Version: v16
Task ID: argocd_hook_reconciliation_deadlock
Category: platform-eng
Author: hafis_83579
Reviewer: Dylan (primary)
Discord: https://discord.com/channels/1427397917685321919/1483534668413407262
This review is based on code analysis and v11 transcripts only — no v16 evaluations exist yet. Before requesting the next review, please run 8+ biggie-nebula evals on the updated version so the reviewer has fresh data to work with.
What needs to happen:
- Add a wiki page documenting configmap conventions. task.yaml references /tmp/wiki.md, but nothing creates it — and the existing Gitea wiki doesn't cover these configmaps either. Without documentation, the decoy configmap check is undiscoverable (0/10 agents pass it). Add a page to data/gitea/wikis/ and point the task prompt at the Gitea wiki instead. (Details below.)
- Run 8+ biggie-nebula evals on the fixed version before requesting the next review.
What's working well: The Dockerfile answer-leakage is fixed, the core troubleshooting concept is strong, and the migration job misdirection is clever. The critical group shows genuine 40% variance from a real behavioral split.
| Metric | Value |
|---|---|
| Mean score | 0.49 (15 rollouts) |
| Full passes | 0 |
| Scores | 8× 0.333, 7× 0.667 |
| Group | Pass rate | Signal |
|---|---|---|
| system (5 config checks) | 10/10 | No variance |
| discipline (decoys + pod stability) | 0/10 | No variance — permanent penalty |
| critical (migration + orphan + deploys) | 4/10 | Only source of variance |
Scores will shift significantly after the blocking issue is fixed.
task.yaml says "All required conventions are documented in /tmp/wiki.md." Nothing creates this file — not setup.sh, not the Dockerfile, not the base image. The existing Gitea wiki pages contain zero mentions of any of these configmaps.
This matters because the grader checks that two "decoy" configmaps (rollback-policy, sync-override) remain untouched. But without documentation, agents have no way to distinguish them from the five configmaps that need fixing. All seven sit in the namespace with no distinguishing labels, annotations, or pod references:
| ConfigMap | Current value | Grader expects | Action | Referenced by pods? |
|---|---|---|---|---|
| appproject-config.syncWindow | deny | allow | Fix | No |
| webhook-policy.dryRun | true | false | Fix | No |
| feature-flags.migrationEnabled | false | true | Fix | No |
| rollback-policy.autoRollback | disabled | disabled | Don't touch | No |
| sync-override.enabled | false | false | Don't touch | No |
Every agent patches the decoys because they look like broken config — autoRollback=disabled and enabled=false are reasonable things to "fix" in an ArgoCD deployment recovery. This is an artificial failure, not genuine difficulty.
Fix:
- Add a new wiki page to data/gitea/wikis/ (e.g., Deployment-Conventions.md) documenting the ArgoCD simulation conventions — which configmaps are deployment-controlling vs informational/read-only. It doesn't need to reveal exact values, just classify them.
- Link it from data/gitea/wikis/Home.md.
- In task.yaml, replace the /tmp/wiki.md reference with: "Deployment conventions are documented in the Bleater wiki at gitea.devops.local. Review the wiki before making changes."
This uses existing Nebula infrastructure (wiki pages are seeded from data/gitea/wikis/*.md via the bootstrap pipeline) and is more realistic — a real SRE checks docs before patching configmaps. Once the wiki classifies configmaps, the decoy check becomes a legitimate test of whether the agent reads documentation before acting.
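A minimal sketch of what such a wiki page could look like — the file name, headings, and wording are hypothetical; the classifications are taken from the configmap table above:

```markdown
# Deployment Conventions

## ConfigMap conventions

Deployment-controlling configmaps (may be modified during recovery):
- appproject-config
- webhook-policy
- feature-flags

Informational / read-only configmaps (must not be modified):
- rollback-policy
- sync-override
```

This stays at the classification level, so it steers agents away from the decoys without revealing the exact values the grader expects.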
1. migration_valid design ambiguity. The grader accepts the original no-op job (already succeeded=1) as valid if left in place. Agents who correctly diagnose it as broken and delete it — intending to recreate — score worse than agents who leave it alone. Consider requiring explicit job recreation (e.g., check job logs) or decoupling from job existence.
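One way to decouple the check from mere job existence is to require a success marker in the job's logs; a sketch under the assumption that a genuinely re-run migration emits some identifiable output (the job name, namespace, and marker string here are all hypothetical):

```python
import subprocess

# Hypothetical success line a real (non-no-op) migration job would emit.
MARKER = "migration applied"

def logs_confirm_migration(log_text: str) -> bool:
    """True only if the job's logs contain the expected success marker.

    The original no-op job exits 0 without doing any work, so its logs
    lack the marker and it fails this check even though succeeded=1.
    """
    return MARKER in log_text.lower()

def migration_job_valid(namespace: str = "argocd", job: str = "db-migration") -> bool:
    # Fetch the job's logs via kubectl and require the marker, instead
    # of accepting any job object with succeeded=1.
    out = subprocess.run(
        ["kubectl", "logs", f"job/{job}", "-n", namespace],
        capture_output=True, text=True,
    )
    return out.returncode == 0 and logs_confirm_migration(out.stdout)
```

With a check like this, deleting and correctly recreating the job scores better than leaving the no-op in place, which fixes the inverted incentive.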
2. Grader wait times. time.sleep(5) before checks; review guide minimum is 30s for pod readiness, 1 min for deployment rollout. The critical group's 40% variance may partly reflect timing. Consider a polling loop for deployment readiness.
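A polling loop for the grader could look like the sketch below — a generic helper that takes any readiness predicate (the kubectl invocation in the comment is illustrative, not the task's actual check):

```python
import time

def wait_for(check, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse.

    Unlike a fixed time.sleep(5), this returns as soon as the condition
    holds and only gives up after the full timeout, so slow rollouts
    don't get penalized and fast ones don't waste wall-clock time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return check()  # one final attempt at the deadline

# Example predicate (hypothetical deployment name):
# ready = wait_for(
#     lambda: subprocess.run(
#         ["kubectl", "rollout", "status", "deploy/argocd-server",
#          "-n", "argocd", "--timeout=1s"],
#         capture_output=True,
#     ).returncode == 0,
#     timeout=60,
# )
```

Per the review guide, timeout would be at least 30s for pod readiness and 60s for deployment rollout.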
3. Scoring granularity. All-or-nothing groups produce only 4 possible scores (0, 0.333, 0.667, 1.0). This will matter more or less depending on how the score distribution looks after the wiki fix. Revisit after re-eval.
4. Trivial inversion checks. Three of the five system checks (sync_fixed, webhook_fixed, feature_enabled) are currently trivial — the agent just flips the obvious wrong value with no cluster-state confirmation. The other two (schema_applied, migration_marked) require real root-cause tracing through deployment env vars. The wiki page may change this — if the wiki documents expected states, the agent needs to find and read it first. If these checks still show 100% pass after re-eval, consider collapsing the three trivial ones into a single subscore.
- Dockerfile fixed — v16 no longer COPYs solution.sh or grader.py. No v11 transcripts show agents reading these files.
- Core concept is strong — Multi-layered failure scenario (orphaned job + broken migration + misconfigured policies) is realistic and cohesive.
- Migration misdirection is clever — A job that exits 0 but doesn't apply changes tests genuine understanding. The "delete-then-forget" pattern it reveals is a real agent capability gap.
- critical group shows real variance — 40% pass rate from a legitimate behavioral split, not timing or grader issues.
| Criterion | Threshold | Status |
|---|---|---|
| Solvable | solution.sh passes | PASS (1.0) |
| Challenging | Mean score < 0.70 | PASS (0.49 on v11) — driven by artificial failures; must re-eval after fixes |
| Substantial | ≥ 4 hours | PASS |