Magellan Pilot 12 — magellan-backups (Sonnet-Manager + Opus-planner architecture, first head-to-head vs Pilot 11's full-Opus Manager)
Run ID: 2026-04-27T10-49-19_magellan-backups
Plugin: magellan-backups (same regression-test plugin as Pilot 11; ISSUES.md stripped — blind greybox)
Kind: plugin
Ecosystem: core
Driver: Chrome DevTools MCP (headed — pilot run of the dialog-auto-dismiss work from 72aa07a)
Wallclock: 44 min (~10 min lost to a parallel-provision port-collision race; bug-adjusted ~34 min)
Total cost: $45.30 (vs Pilot 11's $102.90 — −56%)
Recall: 8/10 (vs Pilot 11's 10/10)
This was the FIRST validation of the Sonnet-default Manager + Opus-planner subagent architecture from b95af9e:
- Manager runs on Sonnet 4.6 throughout the conversation (all the mechanical work — file IO, prompt assembly, jq merging, dispatch).
- Planner-opus is a new subagent pinned to `model: opus` via frontmatter, dispatched ONLY for Phase 1.5 (static analysis) and Phase 3 (charter generation), the high-cognition phases where Opus's nuance is load-bearing.
- Testers stay on Sonnet (no change vs Pilot 11).
- Stacked on top of: 5 process fixes from `da3e894` and dialog auto-dismiss flags for both headed and headless modes from `72aa07a`.
Pre-locked targets:
| Target | Goal |
|---|---|
| Recall | 10/10 (no regression vs Pilot 11) |
| Total cost | ≤ $75 |
| Manager cost | ≤ $45 |
| Wallclock | ≤ 30 min |
| Schema validation failures | 0 |
| CWD misroutes | 0 |
| Metric | Pilot 11 (Opus Mgr) | Pilot 12 (Sonnet Mgr + Opus planner) | Δ |
|---|---|---|---|
| Recall | 10/10 | 8/10 | −2 |
| Total cost | $102.90 | $45.30 | −$57.60 (−56%) |
| Manager cost | $70.81 | $11.31 | −$59.50 |
| Subagent cost | $32.08 | $33.99 | +$1.91 |
| Schema validation failures | yes (jq-patched) | 0 | win |
| CWD misroutes | 6/6 wave Testers | 0 | win |
| Wallclock | 36 min | ~44 min | +8 min (~10 min lost to provisioning bug) |
The 10-min provisioning loss came from concurrent `studio site create` invocations racing on port assignment (3/7 sites failed and had to be retried serially). The `studio-provision.sh` lock fix in c523d94 closes this.
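The retro notes describe the fix as a portable mkdir-based lock. A minimal sketch of that technique follows, assuming nothing about the real `studio-provision.sh`: `with_lock`, `LOCK_DIR`, and `LOCK_MAX_TRIES` are hypothetical names chosen for illustration. It relies on `mkdir` being atomic, so only one concurrent caller can create the lock directory.

```shell
#!/usr/bin/env bash
# Sketch of a mkdir-based mutex (hypothetical names, not the real script).
# mkdir either creates the directory or fails, atomically, so exactly one
# concurrent caller wins the lock.
LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/studio-provision.lock}"

with_lock() {
  local tries=0 max_tries="${LOCK_MAX_TRIES:-60}"
  # Retry once per second until we own the lock directory or give up.
  until mkdir "$LOCK_DIR" 2>/dev/null; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      echo "with_lock: timed out waiting for $LOCK_DIR" >&2
      return 1
    fi
    sleep 1
  done
  # Run the protected command, then release the lock and pass its status on.
  "$@"
  local status=$?
  rmdir "$LOCK_DIR"
  return "$status"
}
```

A caller would prefix the racing command with `with_lock`, for example `with_lock studio site create` followed by whatever arguments the script already passes. The sketch does not handle a holder that is killed before `rmdir` runs; a real implementation would likely add a `trap` or stale-lock detection.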
The architecture was net positive on mechanics but lost recall on two specific issues:
| # | Planted issue | Verdict | Why missed |
|---|---|---|---|
| 4 | User export (selective) includes user_pass hashes | MISS | Charter set found wp_users password leakage in the FULL backup ZIP (a3 confirmed) but no charter assigned the same probe to the SELECTIVE export surface; the planner-opus didn't carry the bug class across surfaces |
| 10 | Concurrent backups corrupt zip (manual + cron same minute) | MISS | Breadth Tester probed double-click rapid-fire (F3b) and found JS button-disable protection — but the cron-vs-manual race was not chartered as a cross-feature seam |
Critically: both misses traced to the planner-opus charter generation, NOT to the Sonnet Manager's orchestration. Sonnet's mechanical phases ran cleanly; Opus's planning phases had two specific gaps. This means the architecture itself worked as designed.
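The Issue 4 miss is a coverage gap: the a3 probe ran on one artifact surface but was never assigned to the other. A check of that shape can be sketched as below. The `surface:probe` input format, the function name, and the surface labels `full-backup` and `selective-export` are all assumptions made for illustration; the harness's real charter data is structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical coverage check: list every surface that lacks the given probe.
# Arguments: probe name, space-separated surfaces, space-separated
# "surface:probe" assignments (an invented format for this sketch).
missing_probe_surfaces() {
  local probe="$1" surfaces="$2" assignments="$3" s
  for s in $surfaces; do
    case " $assignments " in
      *" $s:$probe "*) ;;      # this surface already carries the probe
      *) echo "$s" ;;          # gap: the probe was never assigned here
    esac
  done
}

# Pilot 12 as described above: a3 was chartered on the full backup only.
missing_probe_surfaces a3 "full-backup selective-export" "full-backup:a3"
```

Run against the Pilot 12 situation, the sketch reports `selective-export` as the uncovered surface, which is the gap behind the Issue 4 miss.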
| Charter | Priority | Type | Bugs | Status |
|---|---|---|---|---|
| backup-artifact-andlist | critical | andlist | 6 | complete (12/12 turns) |
| restore-destructive-andlist | critical | andlist | 7 | complete (12/12 turns) |
| db-dump-scale-andlist | high | andlist | 3 | complete (10/12 turns) |
| schedule-settings-cluster | high | hypothesis-cluster | 5 | complete (8/8 turns) |
| schedule-x-artifact-crossfeature | high | cross-feature | 4 | complete (10/10 turns) |
| full-surface-breadth | high | breadth | 5 | complete (26/30 turns) |
| ui-cross-cutting-cluster | medium | hypothesis-cluster | 4 | complete (8/8 turns) |
| Totals | | | 34 | 7/7 complete |
Plus 1 Question, 12 Improvements, 10 Praises. Novel bugs found beyond the answer key: zip-slip path traversal (critical), arbitrary SQL execution via database.sql (critical), no server-side restore confirm gate, restore self-modifies plugin files mid-execution, persistent backup directory after deactivation, single shared nonce for all 3 AJAX endpoints, sidebar URL mismatch, etc. — 9+ findings not in ISSUES.md.
| Component | Model | Cost |
|---|---|---|
| Manager (this conversation) | Sonnet 4.6 | $7.15 |
| Manager (small Opus tail) | Opus 4.7 | $4.16 |
| Planner Phase 1.5 + Phase 3 | Opus 4.7 | $4.11 |
| Recon Tester + 7 wave Testers | Sonnet 4.6 | $29.88 |
| Total | | $45.30 |
The Opus planner is about 12% of the subagent total ($4.11 of $33.99). The Manager-side cost dropped from $70.81 (Pilot 11 Opus) to $11.31 (Pilot 12 Sonnet + small Opus tail), −84% on the Manager line alone. Subagent cost was essentially flat.
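The roll-ups can be recomputed from the per-component costs in the table. The sketch below does that with `awk` inside a shell function; `cost_rollup` is a hypothetical name, and the inputs are the table figures plus the Pilot 11 totals quoted earlier.

```shell
#!/usr/bin/env bash
# Recompute the cost roll-ups from the per-component figures in this report.
# cost_rollup is a hypothetical helper name; the numbers come from the tables.
cost_rollup() {
  awk 'BEGIN {
    manager   = 7.15 + 4.16      # Sonnet Manager + small Opus tail
    subagents = 4.11 + 29.88     # Opus planner + Sonnet Testers
    total     = manager + subagents
    printf "manager=%.2f subagents=%.2f total=%.2f\n", manager, subagents, total
    printf "planner_share_of_subagents=%.1f%%\n", 100 * 4.11 / subagents
    printf "manager_reduction=%.1f%%\n", 100 * (70.81 - manager) / 70.81
    printf "total_reduction=%.1f%%\n", 100 * (102.90 - total) / 102.90
  }'
}

cost_rollup
```

With these inputs the totals come out to $11.31, $33.99, and $45.30, with a planner share of 12.1% and reductions of 84.0% (Manager) and 56.0% (total). Since the pilot's costs were projected rather than measured, small differences from quoted percentages are to be expected.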
The post-pilot retro shipped 8 harness fixes in c523d94:
- Studio port-collision lock: portable mkdir-based lock around `studio site create` so concurrent invocations don't race the port pool. Fixes the ~10-min wallclock loss this pilot saw.
- Phase 5 enforcement: loud "DO NOT WRITE final-report.md BY HAND" warning + 3-step ordered protocol (stamp `completed_at` → capture-run-tokens → aggregate-reports). Pilot 12 Manager wrote final-report freehand and never invoked `capture-run-tokens.mjs`, so cost was projected, not measured.
- `completed_at` shell-generated: `date -u +%Y-%m-%dT%H:%M:%SZ` instead of the Manager typing it freehand (Pilot 12 typo'd a future timestamp).
- PQIP report key shape documented: schema gets a `_doc` field calling out `pqip.{problems,questions,improvements,praises}` (NOT `items[]`).
- AGENTS.md MCP serialization caveat: wave wallclock = longest Tester, not (sum/N).
- Multi-surface artifact rule (Pilot 12 Issue 4 fix): if a plugin has ≥2 artifact-producing surfaces, the `a3` probe MUST run on every surface independently. Codified in `skills/tester-mindset/SKILL.md` and `planner-opus.md` Step 8.5.
- Concurrent-trigger seam rule (Pilot 12 Issue 10 fix): two trigger paths writing to a deterministic shared resource form a first-class cross-feature seam. Codified in `skills/tester-mindset/SKILL.md` and `planner-opus.md` Step 8.6.
- Recon role reframed as scout, not hunter: recon's deliverable is a briefing for the Tester army (terrain context), not a bug hit-list.
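
The shell-generated `completed_at` fix lends itself to a short sketch. The `date` invocation is the one quoted in the list; the function names and the future-timestamp guard are assumptions added for illustration, not part of the harness. Because both values share one fixed-width UTC format, plain string comparison orders them chronologically.

```shell
#!/usr/bin/env bash
# Generate completed_at from the system clock instead of typing it by hand.
stamp_completed_at() {
  date -u +%Y-%m-%dT%H:%M:%SZ
}

# Hypothetical guard: reject a timestamp that is malformed or later than now.
is_sane_completed_at() {
  local ts="$1" now
  now="$(stamp_completed_at)"
  case "$ts" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]Z) ;;
    *) return 1 ;;               # wrong shape, e.g. missing the trailing Z
  esac
  # Same fixed-width format on both sides, so string order is time order.
  if [ "$ts" \> "$now" ]; then
    return 1                     # timestamp lies in the future
  fi
  return 0
}
```

A run script could call `is_sane_completed_at "$(stamp_completed_at)"` before aggregation; a hand-typed future value such as the one Pilot 12 produced would fail the guard.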
Plus the T1 script-driven mechanical phases (4b4e0fe):
- `scripts/list-site-meta.sh`: bulk site URL/path read (1 call instead of 7).
- `scripts/provision-charters.sh`: bulk parallel provisioning with the new lock.
- `scripts/teardown-all-sites.sh`: bulk teardown.
- `scripts/generate-charter-files.mjs` + `schemas/charter-set.schema.json`: planner writes ONE structured JSON; script renders 8 markdown files deterministically.
- Slimmer Tester dispatch prompts (~12 lines, not 60+).
- Manager-side discipline: don't re-read charter files (Tester reads its own).
| Pilot | Driver | Mgr model | Recall | Cost | Notes |
|---|---|---|---|---|---|
| 11 (backups) | Chrome DevTools headed | Opus | 10/10 | $102.90 | Item E first run |
| 12 (backups, Sonnet-Mgr first run) | Chrome DevTools headed | Sonnet+Opus-planner | 8/10 | $45.30 | −56% cost; 2-bug recall regression traced to planner gaps, not Mgr routing |
- Sonnet Manager works. Mechanics ran clean: zero schema failures, zero CWD misroutes, all 7 reports validated on first write.
- Opus planner works when its rules are tight. The two misses (Issues 4 and 10) were both rule-shaped — a multi-surface gap and a missing cross-feature seam pattern — not "Opus got worse." Codifying those rules closes both miss classes for future pilots.
- Cost story is real. 56% reduction is sustainable as long as the planner stays on Opus and the Manager stays on Sonnet.
The 8 harness fixes + T1 land before the next pilot. Pilot 13 will measure whether they hold (no provisioning bug, clean Phase 5, multi-surface rule fires on Issue 4, concurrent-trigger rule fires on Issue 10).
Files:
- Final report: `runs/2026-04-27T10-49-19_magellan-backups/final-report.md`
- Token usage: `runs/2026-04-27T10-49-19_magellan-backups/token-usage.json`
- Manifest: `runs/2026-04-27T10-49-19_magellan-backups/manifest.json`