Task UUID: a4bc3f9c-fe11-4473-960c-2717c39f2417
Version: v30 (2026-04-06)
Backend: biggie-nebula (teapot)
Reviewer: Dylan Fitzgerald
Prior review: v18 (2026-04-02)
Both v18 blockers are resolved. The task produces genuine discriminative signal — a clean
60/40 split on email_e2e driven by whether agents correctly discover the DKIM key_path
from the mounted Secret. Mean is 0.80 against a ≤0.85 teapot threshold. Task design is
strong: layered sabotage, ArgoCD GitOps pressure, and diverse failure modes across runs.
Two non-blocking suggestions for future improvement are noted below.
| Metric | Value |
|---|---|
| Rollouts | 10 |
| Scores | 1.0, 0.5, 1.0, 1.0, 0.5, 1.0, 0.5, 1.0, 0.5, 1.0 |
| Mean | 0.80 (threshold ≤0.85 — passes) |
| Subscore | Weight | Pass rate | Runs |
|---|---|---|---|
| email_e2e | 0.50 | 6/10 (60%) | 1,0,1,1,0,1,0,1,0,1 |
| mail_security | 0.50 | 10/10 (100%) | 1,1,1,1,1,1,1,1,1,1 |
mail_security passes all 10 runs via structural proxy checks (string presence in config,
pod status). We investigated making it functional (send through the DKIM relay) but found all
10 agents correctly fix the relay's key material — a functional test would also be 10/10. The
score distribution is effectively a bimodal 0.5/1.0 split gated by email_e2e, which is where
the real discrimination lives.
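The aggregate figures can be re-derived from the per-run subscore vectors. The following is a minimal sketch, assuming each run's score is the weighted sum of its two binary subscores; the variable names are illustrative and not taken from the grader.

```python
from collections import Counter

# Per-run subscore results as reported in the subscore table.
email_e2e = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
mail_security = [1] * 10
weights = {"email_e2e": 0.50, "mail_security": 0.50}

# Assumption: a run's score is the weighted sum of its subscores.
scores = [
    weights["email_e2e"] * e + weights["mail_security"] * m
    for e, m in zip(email_e2e, mail_security)
]
mean = sum(scores) / len(scores)
distribution = Counter(scores)

print(scores)
print(f"mean={mean:.2f}")
print(dict(distribution))
```

Because mail_security is constant across runs, every score is either 0.5 or 1.0, and the split is determined entirely by email_e2e.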
Blocker 1: Missing DKIM Discovery Signal — Resolved. Setup now writes a broken DKIM block
with wrong selector (legacy) and wrong key_path (old-signing.key), giving agents a clear
breadcrumb. The correct fix (selector default, key_path dkim-private.pem) is exactly where
failing agents diverge from passing ones.
Blocker 2: Grafana Dashboard Subscore — Resolved. Removed entirely. Grader now has two clean subscores at equal 0.50 weight.
The DKIM key_path discriminator is genuine and diverse. The three genuine email_e2e
failures aren't the same mistake repeated — each agent chose a different creative-but-wrong
key management strategy:
| Run | Strategy | Why it fails |
|---|---|---|
| 2 | Generated new keypair, used dkim-private.key | Non-standard filename; also reordered auth entries, confusing grader's sender selection |
| 5 | Generated new keypair, added default.key to Secret | Non-standard name; ArgoCD annotation diff triggered reconcile race |
| 7 | Correct key, but copied to PVC via init container | PVC copy goes stale vs. direct Secret mount under ArgoCD reconciliation |
Run 9 used the correct approach (same as all passing runs) but experienced a tool restart mid-run that caused ArgoCD to revert configs. The agent recovered, but the grader likely hit a transient timing window. This is environmental, not a task design issue.
Other design strengths:
- Layered sabotage (DNS poison, selector breakage, domain config, DKIM keys, TLS) requires roughly sequential peeling: you can't test SMTP until DNS and selectors work.
- ArgoCD selfHeal with 10-second reconciliation forces agents into the GitOps workflow. kubectl apply gets reverted; they must discover the Gitea repo and push there.
- The relay's obfuscated deployment (base64+zlib in a ConfigMap) prevents agents from casually reading relay.py, while _ck() provides a real cryptographic gate.
- The inbound policy allowlist ($(authorized_senders) = legacy.bleater.internal in a separate ConfigMap) is a subtle second-order dependency that rewards thorough investigation.
These are improvements worth considering for a future iteration, not blockers to acceptance.
- Instrument grader feedback on email_e2e failure. The feedback field is currently null for all failing runs, which makes it hard to distinguish skill-based failures from timing flakiness. Capturing which sub-step failed (SMTP connect, SMTP send, IMAP auth, IMAP read, security@ check) would improve maintainability and inform future grader refinement.
- Increase IMAP retry window. The grader waits 5s then retries 3x at 3s intervals (~14s total). The multi-hop delivery path (submission → DKIM sign → relay → verify → smtp → mailbox) crosses two pods. Bumping to 8-10s initial wait or adding a 4th retry would reduce the timing flakiness that likely caused run 9's false failure.
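
Both suggestions can be sketched together. The code below is a hypothetical illustration, not the grader's actual implementation: `retry_budget`, `poll_mailbox`, and `first_failure` are invented names, and the schedule assumes one check after the initial wait followed by one check after each retry interval, which is one reading of "5s then 3x at 3s".

```python
import time
from typing import Callable, List, Optional, Tuple


def retry_budget(initial_wait: float, retries: int, interval: float) -> float:
    """Total seconds spent sleeping before the last mailbox check."""
    return initial_wait + retries * interval


def poll_mailbox(
    check: Callable[[], bool],
    initial_wait: float = 5.0,
    retries: int = 3,
    interval: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Check once after the initial wait, then once after each retry interval."""
    sleep(initial_wait)
    if check():
        return True
    for _ in range(retries):
        sleep(interval)
        if check():
            return True
    return False


def first_failure(steps: List[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run named sub-steps in order and report the first one that fails.

    Returns None when every step passes, so the result can be written
    straight into a feedback field.
    """
    for name, step in steps:
        try:
            ok = step()
        except Exception as exc:  # record the error instead of losing it
            return f"{name}: {type(exc).__name__}: {exc}"
        if not ok:
            return name
    return None


# Current schedule versus the two proposed adjustments.
print(retry_budget(5, 3, 3))   # current window
print(retry_budget(8, 3, 3))   # longer initial wait
print(retry_budget(5, 4, 3))   # one extra retry
```

Under these assumptions the current window is 14s, and either an 8s initial wait or a 4th retry extends it to 17s. Passing `sleep` as a parameter lets the polling logic be tested without real delays.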
| Run | Score | email_e2e | mail_security | key_path | Notes |
|---|---|---|---|---|---|
| 1 | 1.0 | 1 | 1 | dkim-private.pem | Clean pass |
| 2 | 0.5 | 0 | 1 | dkim-private.key | New keypair, wrong filename, auth entry reorder |
| 3 | 1.0 | 1 | 1 | dkim-private.pem | Clean pass |
| 4 | 1.0 | 1 | 1 | dkim-private.pem | Clean pass |
| 5 | 0.5 | 0 | 1 | default.key | New keypair, non-standard Secret entries |
| 6 | 1.0 | 1 | 1 | dkim-private.pem | Clean pass (security@ placed first; still works) |
| 7 | 0.5 | 0 | 1 | /data/dkim/dkim-private.pem | PVC path via init container copy |
| 8 | 1.0 | 1 | 1 | dkim-private.pem | Clean pass |
| 9 | 0.5 | 0 | 1 | dkim-private.pem | Correct approach; tool restart + ArgoCD timing |
| 10 | 1.0 | 1 | 1 | dkim-private.pem | Clean pass |