Review: `maddy-spf-dkim-domain-migration`


Task UUID	`a4bc3f9c-fe11-4473-960c-2717c39f2417`
Version	19
Thread	https://discord.com/channels/1427397917685321919/1487685828166553737
Review date	2026-04-02

Verdict: NEEDS_WORK

Criterion	Status
Solvable	PASS — `test-solution` scores 1.0
Challenging	PASS on paper (mean 0.41), but artificially deflated — see blockers
Substantial	PASS — multi-layer sabotage + GitOps + DKIM + dashboard ≥ 4H

Two blocking fixes, then re-eval:

Priority	What	Where
Blocking	Add broken DKIM block (`modify { dkim { ... } }` with wrong key path/old domain) to initial maddy.conf — agents need a starting signal to discover the signing requirement	`setup.sh`
Blocking	Widen dashboard name fallback to also match `"mail"` (7/10 valid dashboards rejected)	`grader.py`
After fixes	Re-run 8+ biggie-nebula evals. Difficulty threshold: mean < 0.50 (docker backend) or ≤ 0.80 (teapot). The teapot number is our current working threshold as reviewers — no hard guidance has been issued yet, so expect it may shift.	—

The task design is excellent — ArgoCD self-healing, Maddy variable scoping, and multi-path Gitea discovery are high-quality challenges. Details and additional suggestions below.

Check	Pass rate	Weight	Signal?
`email_e2e`	10/10	0.34	None — too easy
`mail_security`	0/10	0.33	None — breadcrumbs exist but too indirect (see blocker 1)
`grafana_dashboard`	2/10	0.33	Only real signal (but 7 false negatives from name-match bug)

Note: 0.34/0.33/0.33 is standard thirds rounding — the bot flags this incorrectly.

Blocker 1: `mail_security` — discovery path needs a stronger starting signal (0/10 runs, 33%)

Dead weight across every version since introduced (5+ versions, 30+ rollouts total). The check requires a modify { dkim { key_path ... } } block in maddy.conf.

Breadcrumbs that do exist:

The DKIM private key is mounted into the Maddy pod at /etc/maddy/dkim (setup.sh patches the statefulset). A senior engineer inspecting volume mounts should ask: "Why is this key here if nothing in the config references it?"
The relay code only verifies, never signs. A senior DKIM practitioner would recognize: verification is at the relay, so signing must happen at the MTA.
The task says "signed and verified through it" — signing is the MTA's job, verification is the relay's.

Why agents still fail (0/30+ across all versions):

The initial maddy.conf has zero DKIM content — no block, no comment, nothing. Agents conclude DKIM was never part of Maddy's config, which is wrong but reasonable when the config shows no trace of it.
The three-step inference chain (mounted key → verify-only relay → Maddy must sign) requires connecting signals across pod specs, relay source code, and DKIM domain knowledge — then producing correct Maddy-specific syntax from that inference alone.
The task prompt says the relay handles signing: "outgoing mail was being signed and verified through it." Agents take this at face value.

key_path appears zero times across all 10 transcripts. 4/10 agents add DKIM + relay references but miss key_path, getting 0 due to all-or-nothing grading.

What agents build vs what the grader expects

What agents build (10/10 runs):
  submission:587 → deliver_to &local_mailboxes     ← email_e2e passes ✓

What the grader expects for mail_security:
  submission:587 → modify { dkim { key_path ... } } → dkim_relay → relay:2525 → maddy:25
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                   This block doesn't exist in the initial config.
                   No agent ever adds it. 0/10 pass.

Fix — restore fairness

These recommendations are about making the requirement fair to discover, not about making it easy to pass. The agent should still have to do real work — just with a viable starting point.

Add a broken DKIM block to the initial maddy.conf in setup.sh — e.g., with the old domain in the selector and wrong key path. This is also more realistic: in a real migration, the old DKIM config would still be there, just pointing at the wrong domain. Agents see the pattern; they still have to fix it, figure out the correct key path, and wire up relay routing.
Optionally clarify in task.yaml: "Before the migration, Maddy was configured to sign outgoing emails with DKIM before the relay forwarded them — this configuration was part of what broke." (This may not be needed if the broken block in the config is clear enough.)
Consider splitting into two sub-checks (relay routing vs DKIM signing config) so the 4/10 agents who get relay routing right earn partial credit.

Impact on difficulty

Scores will likely rise, but significant difficulty is currently dormant in the relay pipeline — agents have never engaged with it because they never add the DKIM block. Once the broken block gives agents a starting signal, they'll encounter:

relay.py's _ck() key consistency check — setup puts the wrong public key in maddy-dkim-dns. The relay derives the public key from the private key and compares; mismatch → 550 rejected. The agent must decode the relay (base64+zlib obfuscated in a ConfigMap), understand the check, derive the correct public key from the secret, and update the ConfigMap.
...via Gitea, because ArgoCD reverts kubectl changes every 10 seconds.

This is already a multi-step chain (decode relay → understand crypto check → derive key → commit via GitOps) that no agent has been tested against. It may provide sufficient difficulty on its own.

Re-eval after the fairness fixes to verify. If scores are still too high, two clean reserve levers:

Lever	What	Why it's fair
NetworkPolicy	Replace the possibly-nonfunctional iptables rule (see Warning 3) with a NetworkPolicy blocking relay→maddy:25	`kubectl get networkpolicy` is standard discovery; survives kube-proxy NAT unlike pod-IP iptables rules
DKIM selector mismatch	Use a wrong selector (e.g., `legacy`) in the broken DKIM block, while `maddy-dkim-dns` has `selector: "default"`	Both resources are in the `maddy` namespace; cross-referencing selectors is core DKIM knowledge. Realistic migration artifact

These add discoverable, domain-relevant complexity without hiding information. Apply only if post-fix evals show scores above threshold (< 0.50 docker, ≤ 0.80 teapot — see note on thresholds above).

Blocker 2: `grafana_dashboard` — name-matching fallback rejects valid dashboards

The grader has two discovery paths: (1) find ConfigMaps with label grafana_dashboard=1, (2) fall back to matching "maddy" in the ConfigMap name. Path 1 is fair — the grafana_dashboard=1 label is standard Grafana sidecar provisioning practice, and a senior DevOps engineer should know to check existing dashboard ConfigMaps for the convention (run 3 did exactly this). Path 2 is a grader bug.

Dashboard outcome	Runs	Grader result
Valid dashboard, name contains "maddy"	1, 3	PASS
Valid dashboard, name contains "mail" (not "maddy")	2, 4, 5, 6, 7, 9, 10	FALSE NEGATIVE
Dashboard created but missing datasource	8	Genuine fail

7/10 agents create valid dashboards that fail solely because their ConfigMap name says "mail-rejections" instead of "maddy-rejections." This isn't testing engineering skill — it's testing whether you happened to use a specific word in a name.

Fix

The label path is fine. Fix the name fallback:

Widen to: "maddy" in name.lower() or "mail" in name.lower()
Or move the content check earlier: the grader already checks data content for "maddy" after finding a match — do it for all ConfigMaps in the monitoring namespace upfront

Warnings

1. `email_e2e` bypasses the relay

The e2e check sends directly via submission:587 → local delivery. It doesn't verify mail routes through the relay, even though the task requires it. This is a design gap, not unfairness — it resolves naturally once mail_security is fixed.

Optional enhancement: Check for a DKIM-Signature header on received messages in email_e2e. This ties the two requirements together.

2. `email_e2e` zero variance (10/10 pass)

The core mail fix (DNS + selectors + domain config) is found in the first 25 turns by every agent. 34% of the score provides no signal. Revisit after mail_security fix — overall variance should improve naturally.

3. iptables rule may not be functioning

Setup adds iptables -I FORWARD -s ${RELAY_POD_IP} -d ${MADDY_POD_IP} -p tcp --dport 25 -j DROP, but relay logs show successful forwarding. The relay connects via service DNS (maddy-smtp.maddy.svc.cluster.local:25) — kube-proxy NATs through the service IP, bypassing the pod-IP-based rule.

Investigate: Verify whether the rule actually blocks traffic. If it doesn't, either remove it from setup.sh (dead code) or replace it with a NetworkPolicy (survives NAT). See the difficulty ratcheting section under Blocker 1 for how a NetworkPolicy could serve as a reserve difficulty lever if post-fix scores are too high.

Where this review diverges from nebula-reviewer

The bot reviewed this task 8+ times across versions 14–19. Several of its flags are consistent across runs. Where we agree or disagree:

Bot finding	Our assessment
`mail_security` is dead weight (0/10)	Agree — but we diagnose the root cause differently. The bot calls it "overly strict grading" and "undisclosed requirements." We find the grader checks are reasonable if agents have a starting signal; the problem is the missing DKIM block in the initial config, not the grader's string matching. The fix is in `setup.sh`, not `grader.py`.
Unequal weights (0.34/0.33/0.33)	Disagree — this is standard thirds rounding and sums to 1.0. The bot flags this as blocking in every review. It is not an issue.
"Undisclosed requirements" for `mail_security`	Partially disagree — the task prompt does reference the relay pipeline ("make sure whatever pipeline was in place before still works... outgoing mail was being signed and verified through it"). The relay pod and dkim-verifier service are discoverable in the environment. The real gap is not disclosure but starting signal — a broken DKIM block in the config would make the requirement fair to discover.
`email_e2e` zero variance	Agree it's low-signal now — but expect this to resolve naturally once `mail_security` is fixed, because agents will need to wire DKIM signing + relay routing to get e2e delivery working through the full pipeline.
Prompt clarity	Partially agree — the task prompt is vague about DKIM, but adding a broken DKIM block to the config is a better fix than spelling it out in the prompt. The task should reward investigation, not provide a checklist.
Dashboard Loki requirement (early versions)	No longer relevant — the grader now accepts Loki or Prometheus. The bot flagged this correctly in v14 reviews; the author fixed it.
`grafana_dashboard` name matching	Agree but with different scope — the bot notes "naming convention" failures but doesn't quantify them. We found 7/10 valid dashboards rejected solely for using "mail" instead of "maddy" in the ConfigMap name.

What works well

These are genuinely strong task design elements worth preserving:

ArgoCD self-healing — agents apply kubectl patches, watch them revert, discover GitOps. Creates an authentic "why does my fix keep disappearing?" moment.
Maddy import variable scoping — $(authorized_senders) defined in an imported file isn't available in the parent config. Agents spend 30–60 turns on real debugging with test pods. High-quality investigative challenge.
Multi-path Gitea discovery — some agents exec into pods to reach Gitea port 3000, others find the bare git repo on the node filesystem. Both work. Rewards creative problem-solving.
Layered sabotage — DNS + selectors + config + keys is coherent and narrative-driven.

Additional suggestions (non-blocking)

What	Where	Why
Add sentence about DKIM signing to task description	`task.yaml`	May not be needed if broken config block is clear enough
Split `mail_security` into relay-routing + DKIM-signing sub-checks	`grader.py`	Rewards the 4/10 agents who get relay routing right
Check for `DKIM-Signature` header in `email_e2e`	`grader.py`	Ties relay requirement into the e2e test
Reserve lever: Replace iptables rule with NetworkPolicy	`setup.sh`	Only if post-fix scores exceed threshold; see Blocker 1 ratcheting section
Reserve lever: DKIM selector mismatch in broken block	`setup.sh`	Only if post-fix scores exceed threshold; see Blocker 1 ratcheting section

arubis/maddy-spf-dkim-review.md

Select an option

No results found

Select an option

No results found

Review: `maddy-spf-dkim-domain-migration`

Verdict: NEEDS_WORK

Blocker 1: `mail_security` — discovery path needs a stronger starting signal (0/10 runs, 33%)

What agents build vs what the grader expects

Fix — restore fairness

Impact on difficulty

Blocker 2: `grafana_dashboard` — name-matching fallback rejects valid dashboards

Fix

Warnings

1. `email_e2e` bypasses the relay

2. `email_e2e` zero variance (10/10 pass)

3. iptables rule may not be functioning

Where this review diverges from nebula-reviewer

What works well

Additional suggestions (non-blocking)

arubis/maddy-spf-dkim-review.md

Review: maddy-spf-dkim-domain-migration

Verdict: NEEDS_WORK

Blocker 1: mail_security — discovery path needs a stronger starting signal (0/10 runs, 33%)

What agents build vs what the grader expects

Fix — restore fairness

Impact on difficulty

Blocker 2: grafana_dashboard — name-matching fallback rejects valid dashboards

Fix

Warnings

1. email_e2e bypasses the relay

2. email_e2e zero variance (10/10 pass)

3. iptables rule may not be functioning

Where this review diverges from nebula-reviewer

What works well

Additional suggestions (non-blocking)

Review: `maddy-spf-dkim-domain-migration`

Blocker 1: `mail_security` — discovery path needs a stronger starting signal (0/10 runs, 33%)

Blocker 2: `grafana_dashboard` — name-matching fallback rejects valid dashboards

1. `email_e2e` bypasses the relay

2. `email_e2e` zero variance (10/10 pass)