Review: maddy-spf-dkim-domain-migration v30

Task UUID: a4bc3f9c-fe11-4473-960c-2717c39f2417
Version: v30 (2026-04-06)
Backend: biggie-nebula (teapot)
Reviewer: Dylan Fitzgerald
Prior review: v18 (2026-04-02)

Verdict: APPROVE

Both v18 blockers are resolved. The task produces genuine discriminative signal — a clean 60/40 split on email_e2e driven by whether agents correctly discover the DKIM key_path from the mounted Secret. Mean is 0.80 against a ≤0.85 teapot threshold. Task design is strong: layered sabotage, ArgoCD GitOps pressure, and diverse failure modes across runs.

Two non-blocking suggestions for future improvement are noted below.

Eval Results

| Metric | Value |
| --- | --- |
| Rollouts | 10 |
| Scores | 1.0, 0.5, 1.0, 1.0, 0.5, 1.0, 0.5, 1.0, 0.5, 1.0 |
| Mean | 0.80 (threshold ≤0.85; passes) |

| Subscore | Weight | Pass rate | Runs |
| --- | --- | --- | --- |
| `email_e2e` | 0.50 | 6/10 (60%) | 1,0,1,1,0,1,0,1,0,1 |
| `mail_security` | 0.50 | 10/10 (100%) | 1,1,1,1,1,1,1,1,1,1 |

mail_security passes all 10 runs via structural proxy checks (string presence in config, pod status). We investigated making it functional (send through the DKIM relay) but found all 10 agents correctly fix the relay's key material — a functional test would also be 10/10. The score distribution is effectively a bimodal 0.5/1.0 split gated by email_e2e, which is where the real discrimination lives.
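The arithmetic behind the tables above can be reproduced with a short sketch. It assumes the grader combines the binary subscores as a plain weighted sum, which matches the reported scores but is not confirmed from the grader source; the names `WEIGHTS`, `SUBSCORE_RUNS`, and `rollout_scores` are illustrative only.

```python
# Subscore outcomes and weights copied from the tables above.
WEIGHTS = {"email_e2e": 0.50, "mail_security": 0.50}
SUBSCORE_RUNS = {
    "email_e2e":     [1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
    "mail_security": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}
TEAPOT_THRESHOLD = 0.85  # the mean must be at or below this value to pass


def rollout_scores(weights, runs):
    """Assumed aggregation: weighted sum of binary subscores, one value per rollout."""
    n = len(next(iter(runs.values())))
    return [sum(weights[name] * runs[name][i] for name in weights) for i in range(n)]


scores = rollout_scores(WEIGHTS, SUBSCORE_RUNS)
mean = sum(scores) / len(scores)
print(scores)
print(f"mean={mean:.2f} passes={mean <= TEAPOT_THRESHOLD}")
```

Because mail_security is 1 in every run, each score is 0.5 plus half of the email_e2e result, which is why only the values 0.5 and 1.0 appear.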

Previous Blockers — Status

Blocker 1: Missing DKIM Discovery Signal (resolved). Setup now writes a broken DKIM block with the wrong selector (`legacy`) and the wrong key_path (`old-signing.key`), giving agents a clear breadcrumb. The correct fix (selector `default`, key_path `dkim-private.pem`) is exactly where failing agents diverge from passing ones.

Blocker 2: Grafana Dashboard Subscore (resolved). The subscore was removed entirely. The grader now has two clean subscores at equal 0.50 weight.

What's Working Well

The DKIM key_path discriminator is genuine, and the failures it produces are diverse. The three genuine email_e2e failures are not one mistake repeated: each agent chose a different creative-but-wrong key management strategy:

| Run | Strategy | Why it fails |
| --- | --- | --- |
| 2 | Generated new keypair, used `dkim-private.key` | Non-standard filename; also reordered auth entries, confusing grader's sender selection |
| 5 | Generated new keypair, added `default.key` to Secret | Non-standard name; ArgoCD annotation diff triggered reconcile race |
| 7 | Correct key, but copied to PVC via init container | PVC copy goes stale vs. direct Secret mount under ArgoCD reconciliation |

Run 9 used the correct approach (same as all passing runs) but experienced a tool restart mid-run that caused ArgoCD to revert configs. The agent recovered, but the grader likely hit a transient timing window. This is environmental, not a task design issue.

Other design strengths:

- Layered sabotage (DNS poison, selector breakage, domain config, DKIM keys, TLS) requires roughly sequential peeling: you can't test SMTP until DNS and selectors work.
- ArgoCD selfHeal with 10-second reconciliation forces agents into the GitOps workflow. `kubectl apply` gets reverted; they must discover the Gitea repo and push there.
- The relay's obfuscated deployment (base64+zlib in a ConfigMap) prevents agents from casually reading `relay.py`, while `_ck()` provides a real cryptographic gate.
- The inbound policy allowlist (`$(authorized_senders) = legacy.bleater.internal` in a separate ConfigMap) is a subtle second-order dependency that rewards thorough investigation.
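For readers unfamiliar with the base64+zlib packing mentioned above, the following is a minimal sketch of how such a blob round-trips. It assumes the common layering (zlib-compress first, then base64-encode); the task's actual ConfigMap layout is not shown in this review, and `pack`, `unpack`, and the sample text are hypothetical stand-ins.

```python
import base64
import zlib


def pack(source: str) -> str:
    """Compress with zlib, then base64-encode so the blob fits in a text field."""
    return base64.b64encode(zlib.compress(source.encode("utf-8"))).decode("ascii")


def unpack(blob: str) -> str:
    """Reverse of pack(): base64-decode, then zlib-decompress."""
    return zlib.decompress(base64.b64decode(blob)).decode("utf-8")


# Stand-in text; the real relay.py contents are not reproduced here.
sample = "def handle(msg):\n    return msg\n"
blob = pack(sample)
restored = unpack(blob)
print(blob)
print(restored == sample)
```

The packing is reversible with two standard-library calls, which is consistent with the review's framing: it deters casual reading, while the cryptographic check is what actually gates behavior.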

Non-Blocking Suggestions

These are improvements worth considering for a future iteration, not blockers to acceptance.

1. Instrument grader feedback on email_e2e failure. The feedback field is currently null for all failing runs, which makes it hard to distinguish skill-based failures from timing flakiness. Capturing which sub-step failed (SMTP connect, SMTP send, IMAP auth, IMAP read, security@ check) would improve maintainability and inform future grader refinement.
2. Increase IMAP retry window. The grader waits 5s then retries 3x at 3s intervals (~14s total). The multi-hop delivery path (submission → DKIM sign → relay → verify → smtp → mailbox) crosses two pods. Bumping to 8-10s initial wait or adding a 4th retry would reduce the timing flakiness that likely caused run 9's false failure.
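Both suggestions can be sketched together. This is an illustration under stated assumptions, not the grader's code: the step callables are placeholders, the function names (`run_steps`, `poll`, `wait_budget`) are hypothetical, and the retry loop assumes each retry is preceded by one interval sleep, which reproduces the ~14s total quoted above.

```python
import time
from typing import Callable, Optional, Sequence, Tuple

Step = Tuple[str, Callable[[], bool]]


def wait_budget(initial_wait: float, retries: int, interval: float) -> float:
    """Worst-case seconds spent sleeping before the poll gives up."""
    return initial_wait + retries * interval


def poll(check: Callable[[], bool], initial_wait: float, retries: int,
         interval: float, sleep: Callable[[float], None] = time.sleep) -> bool:
    """Sleep once, check, then re-check up to `retries` times, sleeping before each retry."""
    sleep(initial_wait)
    if check():
        return True
    for _ in range(retries):
        sleep(interval)
        if check():
            return True
    return False


def run_steps(steps: Sequence[Step]) -> Optional[str]:
    """Run steps in order; return feedback naming the first failing step, or None."""
    for name, step in steps:
        try:
            ok = step()
        except Exception as exc:  # surface the error instead of a null feedback field
            return f"{name}: {type(exc).__name__}: {exc}"
        if not ok:
            return name
    return None


def no_sleep(seconds: float) -> None:
    """Test double so the demo below does not actually wait."""


# Placeholder steps named after the sub-steps listed above; imap_read never succeeds here.
steps = [
    ("smtp_connect", lambda: True),
    ("smtp_send", lambda: True),
    ("imap_auth", lambda: True),
    ("imap_read", lambda: poll(lambda: False, 5, 3, 3, sleep=no_sleep)),
    ("security_check", lambda: True),
]
feedback = run_steps(steps)
print(feedback)
print(wait_budget(5, 3, 3), wait_budget(8, 3, 3), wait_budget(5, 4, 3))
```

Under this model, the current settings give a 14s worst case, and either an 8s initial wait or a fourth retry raises it to 17s. Returning the failing step name would let a run like run 9 be separated from key_path failures without reading transcripts.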

Appendix: Per-Run Detail

| Run | Score | email_e2e | mail_security | key_path | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | 1.0 | 1 | 1 | `dkim-private.pem` | Clean pass |
| 2 | 0.5 | 0 | 1 | `dkim-private.key` | New keypair, wrong filename, auth entry reorder |
| 3 | 1.0 | 1 | 1 | `dkim-private.pem` | Clean pass |
| 4 | 1.0 | 1 | 1 | `dkim-private.pem` | Clean pass |
| 5 | 0.5 | 0 | 1 | `default.key` | New keypair, non-standard Secret entries |
| 6 | 1.0 | 1 | 1 | `dkim-private.pem` | Clean pass (security@ placed first; still works) |
| 7 | 0.5 | 0 | 1 | `/data/dkim/dkim-private.pem` | PVC path via init container copy |
| 8 | 1.0 | 1 | 1 | `dkim-private.pem` | Clean pass |
| 9 | 0.5 | 0 | 1 | `dkim-private.pem` | Correct approach; tool restart + ArgoCD timing |
| 10 | 1.0 | 1 | 1 | `dkim-private.pem` | Clean pass |
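
The key_path column in the per-run table can be cross-checked against email_e2e with a few lines. This sketch uses only values from the table; treating an exact match on `dkim-private.pem` as the predictor is a simplification for illustration, and the names `PER_RUN` and `predicted_pass` are hypothetical.

```python
# (run, key_path, email_e2e) copied from the per-run table.
PER_RUN = [
    (1, "dkim-private.pem", 1),
    (2, "dkim-private.key", 0),
    (3, "dkim-private.pem", 1),
    (4, "dkim-private.pem", 1),
    (5, "default.key", 0),
    (6, "dkim-private.pem", 1),
    (7, "/data/dkim/dkim-private.pem", 0),
    (8, "dkim-private.pem", 1),
    (9, "dkim-private.pem", 0),
    (10, "dkim-private.pem", 1),
]
EXPECTED_KEY_PATH = "dkim-private.pem"


def predicted_pass(key_path: str) -> bool:
    """Simplified predictor: the run passes email_e2e iff key_path is the expected filename."""
    return key_path == EXPECTED_KEY_PATH


# Runs where the key_path prediction disagrees with the observed email_e2e result.
mismatches = [run for run, key_path, e2e in PER_RUN if predicted_pass(key_path) != bool(e2e)]
agreement = (len(PER_RUN) - len(mismatches)) / len(PER_RUN)
print(mismatches, agreement)
```

The only disagreement is run 9, the run attributed to the tool restart and ArgoCD timing rather than to the agent's key_path choice.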
