| Task UUID | a4bc3f9c-fe11-4473-960c-2717c39f2417 |
|---|---|
| Version | 19 |
| Thread | https://discord.com/channels/1427397917685321919/1487685828166553737 |
| Review date | 2026-04-02 |
| Criterion | Status |
|---|---|
| Solvable | PASS — test-solution scores 1.0 |
| Challenging | PASS on paper (mean 0.41), but artificially deflated — see blockers |
| Substantial | PASS — multi-layer sabotage + GitOps + DKIM + dashboard ≥ 4H |
Two blocking fixes, then re-eval:
| Priority | What | Where |
|---|---|---|
| Blocking | Add a broken DKIM block (`modify { dkim { ... } }` with wrong key path/old domain) to the initial maddy.conf — agents need a starting signal to discover the signing requirement | setup.sh |
| Blocking | Widen the dashboard name fallback to also match "mail" (7/10 valid dashboards rejected) | grader.py |
| After fixes | Re-run 8+ biggie-nebula evals. Difficulty threshold: mean < 0.50 (docker backend) or ≤ 0.80 (teapot). The teapot number is our current working threshold as reviewers — no hard guidance has been issued yet, so expect it may shift. | — |
The task design is excellent — ArgoCD self-healing, Maddy variable scoping, and multi-path Gitea discovery are high-quality challenges. Details and additional suggestions below.
| Check | Pass rate | Weight | Signal? |
|---|---|---|---|
| email_e2e | 10/10 | 0.34 | None — too easy |
| mail_security | 0/10 | 0.33 | None — breadcrumbs exist but too indirect (see blocker 1) |
| grafana_dashboard | 2/10 | 0.33 | Only real signal (but 7 false negatives from name-match bug) |
Note: 0.34/0.33/0.33 is standard thirds rounding — the bot flags this incorrectly.
Dead weight across every version since it was introduced (5+ versions, 30+ rollouts total). The check requires a `modify { dkim { key_path ... } }` block in maddy.conf.
Breadcrumbs that do exist:
- The DKIM private key is mounted into the Maddy pod at `/etc/maddy/dkim` (setup.sh patches the statefulset). A senior engineer inspecting volume mounts should ask: "Why is this key here if nothing in the config references it?"
- The relay code only verifies, never signs. A senior DKIM practitioner would recognize: verification is at the relay, so signing must happen at the MTA.
- The task says "signed and verified through it" — signing is the MTA's job, verification is the relay's.
Why agents still fail (0/30+ across all versions):
- The initial maddy.conf has zero DKIM content — no block, no comment, nothing. Agents conclude DKIM was never part of Maddy's config, which is wrong but reasonable when the config shows no trace of it.
- The three-step inference chain (mounted key → verify-only relay → Maddy must sign) requires connecting signals across pod specs, relay source code, and DKIM domain knowledge — then producing correct Maddy-specific syntax from that inference alone.
- The task prompt says the relay handles signing: "outgoing mail was being signed and verified through it." Agents take this at face value.
`key_path` appears zero times across all 10 transcripts. 4/10 agents add DKIM and relay references but miss `key_path`, scoring 0 under the all-or-nothing grading.
What agents build (10/10 runs):

```
submission:587 → deliver_to &local_mailboxes        ← email_e2e passes ✓
```

What the grader expects for mail_security:

```
submission:587 → modify { dkim { key_path ... } } → dkim_relay → relay:2525 → maddy:25
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```

This block doesn't exist in the initial config. No agent ever adds it. 0/10 pass.
These recommendations are about making the requirement fair to discover, not about making it easy to pass. The agent should still have to do real work — just with a viable starting point.
- Add a broken DKIM block to the initial maddy.conf in setup.sh — e.g., with the old domain in the selector and the wrong key path. This is also more realistic: in a real migration, the old DKIM config would still be there, just pointing at the wrong domain. Agents see the pattern; they still have to fix it, figure out the correct key path, and wire up relay routing.
- Optionally clarify in task.yaml: "Before the migration, Maddy was configured to sign outgoing emails with DKIM before the relay forwarded them — this configuration was part of what broke." (This may not be needed if the broken block in the config is clear enough.)
- Consider splitting into two sub-checks (relay routing vs DKIM signing config) so the 4/10 agents who get relay routing right earn partial credit.
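A minimal sketch of what the broken leftover block could look like. The `modify { dkim { ... } }` shape matches what the grader checks for; the domain, selector, and key path here are hypothetical placeholders, not values from the task:

```
# Pre-migration leftover — points at the old domain and a key path
# that no longer exists. Agents must fix both and wire in the relay.
modify {
    dkim {
        domains old-corp.example
        selector legacy
        key_path /etc/maddy/old-keys/{domain}-{selector}.key
    }
}
```

The broken block carries the signal ("Maddy signs here") without giving away the answer: the correct domain, selector, key path, and relay routing all still have to be discovered.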
Scores will likely rise, but significant difficulty is currently dormant in the relay pipeline — agents have never engaged with it because they never add the DKIM block. Once the broken block gives agents a starting signal, they'll encounter:
- relay.py's `_ck()` key consistency check — setup puts the wrong public key in `maddy-dkim-dns`. The relay derives the public key from the private key and compares; on mismatch, mail is rejected with a `550`. The agent must decode the relay (base64+zlib obfuscated in a ConfigMap), understand the check, derive the correct public key from the secret, and update the ConfigMap.
- ...via Gitea, because ArgoCD reverts `kubectl` changes every 10 seconds.
This is already a multi-step chain (decode relay → understand crypto check → derive key → commit via GitOps) that no agent has been tested against. It may provide sufficient difficulty on its own.
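The decode step of that chain is small once spotted — a sketch, assuming (per the description above) that the ConfigMap payload is zlib-compressed then base64-encoded; the helper name and stand-in payload are illustrative:

```python
import base64
import zlib

def decode_relay_source(blob: str) -> str:
    """Reverse the base64+zlib obfuscation described above."""
    return zlib.decompress(base64.b64decode(blob)).decode()

# Round-trip demonstration with a stand-in payload:
payload = base64.b64encode(zlib.compress(b"def _ck(): ...")).decode()
print(decode_relay_source(payload))  # → def _ck(): ...
```

The difficulty is not the decode itself but knowing to look: agents only open the ConfigMap once the DKIM trail leads them to the relay.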
Re-eval after the fairness fixes to verify. If scores are still too high, two clean reserve levers:
| Lever | What | Why it's fair |
|---|---|---|
| NetworkPolicy | Replace the possibly-nonfunctional iptables rule (see Warning 3) with a NetworkPolicy blocking relay→maddy:25 | kubectl get networkpolicy is standard discovery; survives kube-proxy NAT unlike pod-IP iptables rules |
| DKIM selector mismatch | Use a wrong selector (e.g., legacy) in the broken DKIM block, while maddy-dkim-dns has selector: "default" | Both resources are in the maddy namespace; cross-referencing selectors is core DKIM knowledge. Realistic migration artifact |
These add discoverable, domain-relevant complexity without hiding information. Apply only if post-fix evals show scores above threshold (< 0.50 docker, ≤ 0.80 teapot — see note on thresholds above).
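If the NetworkPolicy lever is exercised, it could take roughly this shape — a sketch only; the policy name, namespace, and pod labels are assumptions, not taken from the task manifests:

```yaml
# Hypothetical reserve lever: deny relay-pod ingress to maddy:25
# while leaving other in-cluster clients unaffected.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-relay-to-maddy-smtp   # name is illustrative
  namespace: maddy
spec:
  podSelector:
    matchLabels:
      app: maddy                    # assumed label on the Maddy pod
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchExpressions:
              - key: app
                operator: NotIn
                values: ["relay"]   # assumed label on the relay pod
      ports:
        - protocol: TCP
          port: 25
```

Unlike the pod-IP iptables rule, a NetworkPolicy is enforced by the CNI on actual pod-to-pod traffic after kube-proxy's service NAT, so routing via the service DNS name would not bypass it.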
The grader has two discovery paths: (1) find ConfigMaps with label grafana_dashboard=1, (2) fall back to matching "maddy" in the ConfigMap name. Path 1 is fair — the grafana_dashboard=1 label is standard Grafana sidecar provisioning practice, and a senior DevOps engineer should know to check existing dashboard ConfigMaps for the convention (run 3 did exactly this). Path 2 is a grader bug.
| Dashboard outcome | Runs | Grader result |
|---|---|---|
| Valid dashboard, name contains "maddy" | 1, 3 | PASS |
| Valid dashboard, name contains "mail" (not "maddy") | 2, 4, 5, 6, 7, 9, 10 | FALSE NEGATIVE |
| Dashboard created but missing datasource | 8 | Genuine fail |
7/10 agents create valid dashboards that fail solely because their ConfigMap name says "mail-rejections" instead of "maddy-rejections." This isn't testing engineering skill — it's testing whether you happened to use a specific word in a name.
The label path is fine. Fix the name fallback:
- Widen to: `"maddy" in name.lower() or "mail" in name.lower()`
- Or move the content check earlier: the grader already checks data content for "maddy" after finding a match — do it for all ConfigMaps in the monitoring namespace upfront.
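A sketch of the widened fallback with the content check folded in — the function and parameter names are illustrative, not the grader's actual API:

```python
def dashboard_configmap_matches(name: str, data: str = "") -> bool:
    """Widened fallback: accept 'maddy' or 'mail' in the ConfigMap
    name, or 'maddy' anywhere in the dashboard JSON content."""
    n = name.lower()
    return "maddy" in n or "mail" in n or "maddy" in data.lower()

# The 7 false-negative runs would now pass:
print(dashboard_configmap_matches("mail-rejections"))         # → True
print(dashboard_configmap_matches("maddy-rejections"))        # → True
print(dashboard_configmap_matches("smtp", '{"title":"Maddy"}'))  # → True
```

The content-check branch is the more robust of the two: it rewards any naming scheme as long as the dashboard actually monitors Maddy.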
The e2e check sends directly via submission:587 → local delivery. It doesn't verify mail routes through the relay, even though the task requires it. This is a design gap, not unfairness — it resolves naturally once mail_security is fixed.
Optional enhancement: Check for a DKIM-Signature header on received messages in email_e2e. This ties the two requirements together.
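The enhancement could be as small as a stdlib presence check — a sketch; the function name and where the raw message comes from are assumptions:

```python
import email

def has_dkim_signature(raw_message: bytes) -> bool:
    """Presence check for a DKIM-Signature header on a received
    message — not cryptographic verification, which the relay
    already performs."""
    msg = email.message_from_bytes(raw_message)
    return msg["DKIM-Signature"] is not None
```

A presence check keeps email_e2e cheap while still forcing mail through the signing path; full signature verification stays where it belongs, in the relay.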
The core mail fix (DNS + selectors + domain config) is found in the first 25 turns by every agent. 34% of the score provides no signal. Revisit after mail_security fix — overall variance should improve naturally.
Setup adds `iptables -I FORWARD -s ${RELAY_POD_IP} -d ${MADDY_POD_IP} -p tcp --dport 25 -j DROP`, but relay logs show successful forwarding. The relay connects via service DNS (`maddy-smtp.maddy.svc.cluster.local:25`) — kube-proxy NATs through the service IP, bypassing the pod-IP-based rule.
Investigate: Verify whether the rule actually blocks traffic. If it doesn't, either remove it from setup.sh (dead code) or replace it with a NetworkPolicy (survives NAT). See the difficulty ratcheting section under Blocker 1 for how a NetworkPolicy could serve as a reserve difficulty lever if post-fix scores are too high.
The bot reviewed this task 8+ times across versions 14–19. Several of its flags are consistent across runs. Where we agree or disagree:
| Bot finding | Our assessment |
|---|---|
| mail_security is dead weight (0/10) | Agree — but we diagnose the root cause differently. The bot calls it "overly strict grading" and "undisclosed requirements." We find the grader checks are reasonable if agents have a starting signal; the problem is the missing DKIM block in the initial config, not the grader's string matching. The fix is in setup.sh, not grader.py. |
| Unequal weights (0.34/0.33/0.33) | Disagree — this is standard thirds rounding and sums to 1.0. The bot flags this as blocking in every review. It is not an issue. |
| "Undisclosed requirements" for mail_security | Partially disagree — the task prompt does reference the relay pipeline ("make sure whatever pipeline was in place before still works... outgoing mail was being signed and verified through it"). The relay pod and dkim-verifier service are discoverable in the environment. The real gap is not disclosure but starting signal — a broken DKIM block in the config would make the requirement fair to discover. |
| email_e2e zero variance | Agree it's low-signal now — but expect this to resolve naturally once mail_security is fixed, because agents will need to wire DKIM signing + relay routing to get e2e delivery working through the full pipeline. |
| Prompt clarity | Partially agree — the task prompt is vague about DKIM, but adding a broken DKIM block to the config is a better fix than spelling it out in the prompt. The task should reward investigation, not provide a checklist. |
| Dashboard Loki requirement (early versions) | No longer relevant — the grader now accepts Loki or Prometheus. The bot flagged this correctly in v14 reviews; the author fixed it. |
| grafana_dashboard name matching | Agree but with different scope — the bot notes "naming convention" failures but doesn't quantify them. We found 7/10 valid dashboards rejected solely for using "mail" instead of "maddy" in the ConfigMap name. |
These are genuinely strong task design elements worth preserving:
- ArgoCD self-healing — agents apply kubectl patches, watch them revert, discover GitOps. Creates an authentic "why does my fix keep disappearing?" moment.
- Maddy import variable scoping — `$(authorized_senders)` defined in an imported file isn't available in the parent config. Agents spend 30–60 turns on real debugging with test pods. High-quality investigative challenge.
- Multi-path Gitea discovery — some agents exec into pods to reach Gitea port 3000, others find the bare git repo on the node filesystem. Both work. Rewards creative problem-solving.
- Layered sabotage — DNS + selectors + config + keys is coherent and narrative-driven.
| What | Where | Why |
|---|---|---|
| Add sentence about DKIM signing to task description | task.yaml | May not be needed if broken config block is clear enough |
| Split mail_security into relay-routing + DKIM-signing sub-checks | grader.py | Rewards the 4/10 agents who get relay routing right |
| Check for DKIM-Signature header in email_e2e | grader.py | Ties relay requirement into the e2e test |
| Reserve lever: Replace iptables rule with NetworkPolicy | setup.sh | Only if post-fix scores exceed threshold; see Blocker 1 ratcheting section |
| Reserve lever: DKIM selector mismatch in broken block | setup.sh | Only if post-fix scores exceed threshold; see Blocker 1 ratcheting section |