@arubis
Last active April 6, 2026 22:51

Review: maddy-spf-dkim-domain-migration v30 (follow-up)

Task UUID: a4bc3f9c-fe11-4473-960c-2717c39f2417
Version: v30 (2026-04-06)
Backend: biggie-nebula (teapot)
Reviewer: Dylan Fitzgerald
Prior review: v18 (2026-04-02)


Verdict: APPROVE

Both v18 blockers are resolved. The task produces genuine discriminative signal: a 60/40 pass/fail split on email_e2e. Three of the four failures come down to whether the agent correctly discovers the DKIM key_path from the mounted Secret; the fourth (run 9) looks environmental. The mean score is 0.80 against a ≤0.85 teapot threshold. Task design is strong: layered sabotage, ArgoCD GitOps pressure, and diverse failure modes across runs.

Two non-blocking suggestions for future improvement are noted below.


Eval Results

| Metric | Value |
|---|---|
| Rollouts | 10 |
| Scores | 1.0, 0.5, 1.0, 1.0, 0.5, 1.0, 0.5, 1.0, 0.5, 1.0 |
| Mean | 0.80 (threshold ≤0.85; passes) |

| Subscore | Weight | Pass rate | Runs |
|---|---|---|---|
| email_e2e | 0.50 | 6/10 (60%) | 1,0,1,1,0,1,0,1,0,1 |
| mail_security | 0.50 | 10/10 (100%) | 1,1,1,1,1,1,1,1,1,1 |

mail_security passes all 10 runs via structural proxy checks (string presence in config, pod status). We investigated making it functional (send through the DKIM relay) but found all 10 agents correctly fix the relay's key material — a functional test would also be 10/10. The score distribution is effectively a bimodal 0.5/1.0 split gated by email_e2e, which is where the real discrimination lives.
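The arithmetic behind the reported scores can be reproduced in a few lines. This is a minimal sketch, assuming each run's score is the weighted sum of its two subscores (which matches the reported values); the names `run_scores`, `WEIGHTS`, and `THRESHOLD` are illustrative and not taken from the grader.

```python
# Per-run subscore outcomes copied from the Eval Results table.
email_e2e = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
mail_security = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Assumption: the run score is a plain weighted sum of the subscores.
WEIGHTS = {"email_e2e": 0.50, "mail_security": 0.50}
THRESHOLD = 0.85  # teapot ceiling quoted in this review


def run_scores(e2e, security, weights=WEIGHTS):
    """Combine the two binary subscores into one score per run."""
    return [
        weights["email_e2e"] * e + weights["mail_security"] * s
        for e, s in zip(e2e, security)
    ]


scores = run_scores(email_e2e, mail_security)
mean = sum(scores) / len(scores)

print(scores)             # only 0.5 and 1.0 appear, gated by email_e2e
print(round(mean, 2))     # 0.8
print(mean <= THRESHOLD)  # True
```

Because mail_security is constant at 1, every run scores either 0.5 or 1.0, and the mean moves only with the email_e2e pass rate.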


Previous Blockers — Status

Blocker 1: Missing DKIM Discovery Signal — Resolved. Setup now writes a broken DKIM block with wrong selector (legacy) and wrong key_path (old-signing.key), giving agents a clear breadcrumb. The correct fix (selector default, key_path dkim-private.pem) is exactly where failing agents diverge from passing ones.

Blocker 2: Grafana Dashboard Subscore — Resolved. Removed entirely. Grader now has two clean subscores at equal 0.50 weight.


What's Working Well

The DKIM key_path discriminator is genuine and diverse. Of the four email_e2e failures, three are skill-based, and they aren't the same mistake repeated: each agent chose a different creative-but-wrong key management strategy.

| Run | Strategy | Why it fails |
|---|---|---|
| 2 | Generated new keypair, used dkim-private.key | Non-standard filename; also reordered auth entries, confusing grader's sender selection |
| 5 | Generated new keypair, added default.key to Secret | Non-standard name; ArgoCD annotation diff triggered reconcile race |
| 7 | Correct key, but copied to PVC via init container | PVC copy goes stale vs. direct Secret mount under ArgoCD reconciliation |

Run 9 used the correct approach (same as all passing runs) but experienced a tool restart mid-run that caused ArgoCD to revert configs. The agent recovered, but the grader likely hit a transient timing window. This is environmental, not a task design issue.

Other design strengths:

  • Layered sabotage (DNS poison, selector breakage, domain config, DKIM keys, TLS) requires roughly sequential peeling — you can't test SMTP until DNS and selectors work.
  • ArgoCD selfHeal with 10-second reconciliation forces agents into the GitOps workflow. kubectl apply gets reverted; they must discover the Gitea repo and push there.
  • The relay's obfuscated deployment (base64+zlib in a ConfigMap) prevents agents from casually reading relay.py, while _ck() provides a real cryptographic gate.
  • The inbound policy allowlist ($(authorized_senders) = legacy.bleater.internal in a separate ConfigMap) is a subtle second-order dependency that rewards thorough investigation.

Non-Blocking Suggestions

These are improvements worth considering for a future iteration, not blockers to acceptance.

  • Instrument grader feedback on email_e2e failure. The feedback field is currently null for all failing runs, which makes it hard to distinguish skill-based failures from timing flakiness. Capturing which sub-step failed (SMTP connect, SMTP send, IMAP auth, IMAP read, security@ check) would improve maintainability and inform future grader refinement.

  • Increase IMAP retry window. The grader waits 5s then retries 3x at 3s intervals (~14s total). The multi-hop delivery path (submission → DKIM sign → relay → verify → smtp → mailbox) crosses two pods. Bumping to 8-10s initial wait or adding a 4th retry would reduce the timing flakiness that likely caused run 9's false failure.
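Both suggestions can be sketched together. The code below is an illustration, not the grader's implementation: it assumes the grader sleeps once, tries, and then retries at a fixed interval, and every function name here is hypothetical. The stage names come from the sub-steps listed above.

```python
import time
from typing import Callable, List, Optional, Tuple


def wait_budget(initial_wait: float, retries: int, interval: float) -> float:
    """Worst-case seconds spent sleeping before the poll gives up."""
    return initial_wait + retries * interval


def poll(check: Callable[[], bool], initial_wait: float, retries: int,
         interval: float, sleep: Callable[[float], None] = time.sleep) -> bool:
    """Sleep, try once, then retry up to `retries` times at `interval` seconds."""
    sleep(initial_wait)
    if check():
        return True
    for _ in range(retries):
        sleep(interval)
        if check():
            return True
    return False


def first_failed_stage(stages: List[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run stages in order; return the name of the first failure, or None."""
    for name, stage in stages:
        if not stage():
            return name
    return None


print(wait_budget(5, 3, 3))   # 14: current window
print(wait_budget(8, 3, 3))   # 17: longer initial wait
print(wait_budget(10, 3, 3))  # 19
print(wait_budget(5, 4, 3))   # 17: one extra retry

# Hypothetical stage results; a real grader would call SMTP/IMAP here.
stages = [
    ("SMTP connect", lambda: True),
    ("SMTP send", lambda: True),
    ("IMAP auth", lambda: True),
    ("IMAP read", lambda: False),
    ("security@ check", lambda: True),
]
print(first_failed_stage(stages))  # IMAP read
```

Writing the result of `first_failed_stage` into the feedback field would separate a delivery that never arrived (IMAP read) from an earlier SMTP or auth failure, which is the distinction needed to tell flakiness from skill-based failure.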


Appendix: Per-Run Detail

| Run | Score | email_e2e | mail_security | key_path | Notes |
|---|---|---|---|---|---|
| 1 | 1.0 | 1 | 1 | dkim-private.pem | Clean pass |
| 2 | 0.5 | 0 | 1 | dkim-private.key | New keypair, wrong filename, auth entry reorder |
| 3 | 1.0 | 1 | 1 | dkim-private.pem | Clean pass |
| 4 | 1.0 | 1 | 1 | dkim-private.pem | Clean pass |
| 5 | 0.5 | 0 | 1 | default.key | New keypair, non-standard Secret entries |
| 6 | 1.0 | 1 | 1 | dkim-private.pem | Clean pass (security@ placed first; still works) |
| 7 | 0.5 | 0 | 1 | /data/dkim/dkim-private.pem | PVC path via init container copy |
| 8 | 1.0 | 1 | 1 | dkim-private.pem | Clean pass |
| 9 | 0.5 | 0 | 1 | dkim-private.pem | Correct approach; tool restart + ArgoCD timing |
| 10 | 1.0 | 1 | 1 | dkim-private.pem | Clean pass |
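
A quick cross-tabulation of the per-run data shows how closely email_e2e tracks the chosen key_path. This is a sketch over the values in the appendix only; `RUNS` and `pass_rate_by_key_path` are illustrative names.

```python
from collections import defaultdict

# (run, key_path, email_e2e) copied from the per-run detail above.
RUNS = [
    (1, "dkim-private.pem", 1),
    (2, "dkim-private.key", 0),
    (3, "dkim-private.pem", 1),
    (4, "dkim-private.pem", 1),
    (5, "default.key", 0),
    (6, "dkim-private.pem", 1),
    (7, "/data/dkim/dkim-private.pem", 0),
    (8, "dkim-private.pem", 1),
    (9, "dkim-private.pem", 0),
    (10, "dkim-private.pem", 1),
]


def pass_rate_by_key_path(runs):
    """Map each key_path to (email_e2e passes, total runs using that path)."""
    tally = defaultdict(lambda: [0, 0])
    for _run, key_path, passed in runs:
        tally[key_path][0] += passed
        tally[key_path][1] += 1
    return {path: tuple(counts) for path, counts in tally.items()}


for path, (passed, total) in pass_rate_by_key_path(RUNS).items():
    print(f"{path}: {passed}/{total}")
```

Every run that used a key_path other than dkim-private.pem fails email_e2e, and the only failure among the seven runs that used it is run 9, the suspected timing flake.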