Magellan Pilot 12 — magellan-backups (Sonnet-Manager + Opus-planner architecture, first head-to-head vs Pilot 11's full-Opus Manager)

Run ID: 2026-04-27T10-49-19_magellan-backups
Plugin: magellan-backups (same regression-test plugin as Pilot 11; ISSUES.md stripped for a blind greybox run)
Kind: plugin
Ecosystem: core
Driver: Chrome DevTools MCP (headed; pilot run of the dialog-auto-dismiss work from 72aa07a)
Wallclock: 44 min (~10 min lost to a parallel-provision port-collision race; bug-adjusted ~34 min)
Total cost: $45.30 (vs Pilot 11's $102.90, −56%)
Recall: 8/10 (vs Pilot 11's 10/10)

Pre-pilot harness changes (pre-locked hypothesis from `904015e`)

This was the FIRST validation of the Sonnet-default Manager + Opus-planner subagent architecture from b95af9e:

Manager runs on Sonnet 4.6 throughout the conversation (all the mechanical work — file IO, prompt assembly, jq merging, dispatch).
Planner-opus is a new subagent pinned to model: opus via frontmatter, dispatched ONLY for Phase 1.5 (static analysis) and Phase 3 (charter generation) — the high-cognition phases where Opus's nuance is load-bearing.
Testers stay on Sonnet (no change vs Pilot 11).
Stacked on top of: 5 process fixes from da3e894 and dialog auto-dismiss flags for both headed and headless modes from 72aa07a.

Pre-locked targets:

Target	Goal
Recall	10/10 (no regression vs Pilot 11)
Total cost	≤ $75
Manager cost	≤ $45
Wallclock	≤ 30 min
Schema validation failures	0
CWD misroutes	0

TL;DR — cost won, recall regressed by 2

Metric	Pilot 11 (Opus Mgr)	Pilot 12 (Sonnet Mgr + Opus planner)	Δ
Recall	10/10	8/10	−2
Total cost	$102.90	$45.30	−$57.60 (−56%)
Manager cost	$70.81	$11.31	−$59.50
Subagent cost	$32.08	$33.99	+$1.91
Schema validation failures	yes (jq-patched)	0	win
CWD misroutes	6/6 wave Testers	0	win
Wallclock	36 min	~44 min*	* ~10 min lost to provisioning bug

The 10-min provisioning loss came from concurrent studio site create invocations racing on port assignment (3/7 sites failed and had to be retried serially). The studio-provision.sh lock fix in c523d94 closes this.
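The comparison between the pre-locked targets and the measured results can be made mechanical. The following is a minimal sketch, assuming the results are available as plain numbers; `check_min` and `check_max` are hypothetical helpers written for this illustration, not part of the harness.

```sh
#!/bin/sh
# Hypothetical helpers (not part of the Magellan harness): compare one measured
# value against a pre-locked bound and print PASS or FAIL with the metric name.

check_max() {
  # $1 = metric name, $2 = measured value, $3 = allowed maximum
  awk -v name="$1" -v got="$2" -v max="$3" 'BEGIN {
    status = (got + 0 <= max + 0) ? "PASS" : "FAIL"
    printf "%s %s (measured %s, target <= %s)\n", status, name, got, max
  }'
}

check_min() {
  # $1 = metric name, $2 = measured value, $3 = required minimum
  awk -v name="$1" -v got="$2" -v min="$3" 'BEGIN {
    status = (got + 0 >= min + 0) ? "PASS" : "FAIL"
    printf "%s %s (measured %s, target >= %s)\n", status, name, got, min
  }'
}

# Pilot 12 values and targets, copied from the tables in this report.
check_min "recall"          8     10
check_max "total_cost"      45.30 75
check_max "manager_cost"    11.31 45
check_max "wallclock_min"   44    30
check_max "schema_failures" 0     0
check_max "cwd_misroutes"   0     0
```

Under these inputs the recall and wallclock lines report FAIL and the other four report PASS, which matches the "cost won, recall regressed" summary.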

The architecture was net positive on mechanics but lost recall on two specific issues:

#	Planted issue	Verdict	Why missed
4	User export (selective) includes `user_pass` hashes	MISS	Charter set found `wp_users` password leakage in the FULL backup ZIP (`a3` confirmed) but no charter assigned the same probe to the SELECTIVE export surface — the planner-opus didn't carry the bug class across surfaces
10	Concurrent backups corrupt zip (manual + cron same minute)	MISS	Breadth Tester probed double-click rapid-fire (F3b) and found JS button-disable protection — but the cron-vs-manual race was not chartered as a cross-feature seam

Critically: both misses traced to the planner-opus charter generation, NOT to the Sonnet Manager's orchestration. Sonnet's mechanical phases ran cleanly; Opus's planning phases had two specific gaps. This means the architecture itself worked as designed.

Per-charter PQIP

Charter	Priority	Type	Bugs	Status
backup-artifact-andlist	critical	andlist	6	complete (12/12 turns)
restore-destructive-andlist	critical	andlist	7	complete (12/12 turns)
db-dump-scale-andlist	high	andlist	3	complete (10/12 turns)
schedule-settings-cluster	high	hypothesis-cluster	5	complete (8/8 turns)
schedule-x-artifact-crossfeature	high	cross-feature	4	complete (10/10 turns)
full-surface-breadth	high	breadth	5	complete (26/30 turns)
ui-cross-cutting-cluster	medium	hypothesis-cluster	4	complete (8/8 turns)
Totals			34	7/7 complete

Plus 1 Question, 12 Improvements, 10 Praises. Novel bugs found beyond the answer key: zip-slip path traversal (critical), arbitrary SQL execution via database.sql (critical), no server-side restore confirm gate, restore self-modifies plugin files mid-execution, persistent backup directory after deactivation, single shared nonce for all 3 AJAX endpoints, sidebar URL mismatch, etc. — 9+ findings not in ISSUES.md.

Token routing — exact numbers from `token-usage.json`

Component	Model	Cost
Manager (this conversation)	Sonnet 4.6	$7.15
Manager (small Opus tail)	Opus 4.7	$4.16
Planner Phase 1.5 + Phase 3	Opus 4.7	$4.11
Recon Tester + 7 wave Testers	Sonnet 4.6	$29.88
Total		$45.30

Opus planner = about 12% of subagent total ($4.11 of $33.99). The Manager-side cost dropped from $70.81 (Pilot 11 Opus) to $11.31 (Pilot 12 Sonnet + small Opus tail), a drop of 84% on the Manager line alone. Subagent cost was essentially flat (+$1.91).
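The roll-ups can be rechecked from the per-component costs in the table above. This is a sketch only: the variable names are illustrative, and the actual layout of `token-usage.json` is not shown in this report, so the figures are typed in by hand rather than parsed.

```sh
#!/bin/sh
# Recompute Manager, subagent, and total cost from the per-component figures,
# plus the deltas against Pilot 11. All names here are illustrative.
cost_rollup() {
  awk -v ms=7.15 -v mo=4.16 -v po=4.11 -v ts=29.88 \
      -v p11m=70.81 -v p11t=102.90 'BEGIN {
    manager  = ms + mo     # Manager on Sonnet plus the small Opus tail
    subagent = po + ts     # Opus planner plus Recon and wave Testers
    total    = manager + subagent
    printf "manager=%.2f\n",  manager
    printf "subagent=%.2f\n", subagent
    printf "total=%.2f\n",    total
    # Percentage change versus Pilot 11 (negative means cheaper).
    printf "manager_delta_pct=%.0f\n", (manager - p11m) / p11m * 100
    printf "total_delta_pct=%.0f\n",   (total - p11t) / p11t * 100
    # Share of subagent spend that went to the Opus planner.
    printf "planner_share_pct=%.0f\n", po / subagent * 100
  }'
}

cost_rollup
```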

Process observations shipped after Pilot 12

The post-pilot retro shipped 8 harness fixes in c523d94:

Studio port-collision lock — portable mkdir-based lock around studio site create so concurrent invocations don't race the port pool. Fixes the ~10-min wallclock loss this pilot saw.
Phase 5 enforcement — loud "DO NOT WRITE final-report.md BY HAND" warning + 3-step ordered protocol (stamp completed_at → capture-run-tokens → aggregate-reports). Pilot 12 Manager wrote final-report freehand and never invoked capture-run-tokens.mjs, so cost was projected, not measured.
completed_at shell-generated — date -u +%Y-%m-%dT%H:%M:%SZ instead of the Manager typing it freehand (Pilot 12 typo'd a future timestamp).
PQIP report key shape documented — schema gets a _doc field calling out pqip.{problems,questions,improvements,praises} (NOT items[]).
AGENTS.md MCP serialization caveat — wave wallclock = longest Tester, not (sum/N).
Multi-surface artifact rule (Pilot 12 Issue 4 fix) — if a plugin has ≥2 artifact-producing surfaces, the a3 probe MUST run on every surface independently. Codified in skills/tester-mindset/SKILL.md and planner-opus.md Step 8.5.
Concurrent-trigger seam rule (Pilot 12 Issue 10 fix) — two trigger paths writing to a deterministic shared resource form a first-class cross-feature seam. Codified in skills/tester-mindset/SKILL.md and planner-opus.md Step 8.6.
Recon role reframed as scout, not hunter — recon's deliverable is a briefing for the Tester army (terrain context), not a bug hit-list.
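The port-collision fix in the first item relies on `mkdir` being atomic: only one caller can create a given directory, so the directory itself acts as the lock. A minimal sketch of that pattern follows, assuming a fixed lock path; `LOCK_DIR` and `with_lock` are hypothetical names, and the real `studio-provision.sh` may differ.

```sh
#!/bin/sh
# Minimal sketch of a portable mkdir-based lock. LOCK_DIR and with_lock are
# hypothetical names chosen for this example, not the harness's own.
LOCK_DIR="${TMPDIR:-/tmp}/studio-provision.lock.d"

with_lock() {
  # Retry until this process creates the lock directory; give up after 60 tries.
  tries=0
  until mkdir "$LOCK_DIR" 2>/dev/null; do
    tries=$((tries + 1))
    if [ "$tries" -ge 60 ]; then
      echo "could not acquire $LOCK_DIR" >&2
      return 1
    fi
    sleep 1
  done
  # Run the wrapped command, then release the lock and pass its status through.
  "$@"
  status=$?
  rmdir "$LOCK_DIR"
  return "$status"
}

# Usage: serialize only the port-allocating step. The real site-creation
# invocation is not shown in this report, so a placeholder command stands in.
with_lock echo "site create would run here"
```

One limitation of this sketch: if the wrapped command is killed before `rmdir` runs, the lock directory is left behind and has to be removed by hand or by a `trap` handler.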

Plus the T1 script-driven mechanical phases (4b4e0fe):

scripts/list-site-meta.sh — bulk site URL/path read (1 call instead of 7).
scripts/provision-charters.sh — bulk parallel provisioning with the new lock.
scripts/teardown-all-sites.sh — bulk teardown.
scripts/generate-charter-files.mjs + schemas/charter-set.schema.json — planner writes ONE structured JSON; script renders 8 markdown files deterministically.
Slimmer Tester dispatch prompts (~12 lines, not 60+).
Manager-side discipline: don't re-read charter files (Tester reads its own).

Cross-pilot arc (updated)

Pilot	Driver	Mgr model	Recall	Cost	Notes
11 (backups)	Chrome DevTools headed	Opus	10/10	$102.90	Item E first run
12 (backups, Sonnet-Mgr first run)	Chrome DevTools headed	Sonnet+Opus-planner	8/10	$45.30	−56% cost; 2-bug recall regression traced to planner gaps, not Mgr routing

What Pilot 12 validated

Sonnet Manager works. Mechanics ran clean: zero schema failures, zero CWD misroutes, all 7 reports validated on first write.
Opus planner works when its rules are tight. The two misses (Issues 4 and 10) were both rule-shaped — a multi-surface gap and a missing cross-feature seam pattern — not "Opus got worse." Codifying those rules closes both miss classes for future pilots.
Cost story is real. 56% reduction is sustainable as long as the planner stays on Opus and the Manager stays on Sonnet.
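The multi-surface artifact rule lends itself to a mechanical coverage check before Testers are dispatched. The sketch below assumes a deliberately simple input format (one surface name per line, and whitespace-separated `charter surface probe` rows); that format, the surface names, and `missing_a3_surfaces` are all hypothetical, since the real charter-set schema is not shown in this report.

```sh
#!/bin/sh
# Hypothetical coverage check: every artifact-producing surface must have its
# own a3 probe in some charter. Input formats are assumptions for illustration.
missing_a3_surfaces() {
  # $1 = file with one artifact surface per line
  # $2 = file of "charter surface probe" rows
  awk '
    NR == FNR { if ($1 != "") want[$1] = 1; next }  # first file: required surfaces
    $3 == "a3" { covered[$2] = 1 }                  # second file: a3 probes seen
    END { for (s in want) if (!(s in covered)) print s }
  ' "$1" "$2" | sort
}

# Example shaped like the Issue 4 miss: a3 runs on the full backup ZIP only.
workdir=$(mktemp -d)
printf '%s\n' full-backup-zip selective-user-export > "$workdir/surfaces.txt"
printf '%s\n' \
  'backup-artifact-andlist full-backup-zip a3' \
  'full-surface-breadth selective-user-export smoke' > "$workdir/probes.txt"
missing_a3_surfaces "$workdir/surfaces.txt" "$workdir/probes.txt"
rm -rf "$workdir"
```

Any surface name printed by the function is a gap in the charter set; an empty result means every listed surface has its own a3 probe.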

What Pilot 12 left open

The 8 harness fixes + T1 land before the next pilot. Pilot 13 will measure whether they hold (no provisioning bug, clean Phase 5, multi-surface rule fires on Issue 4, concurrent-trigger rule fires on Issue 10).
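Whether the concurrent-trigger rule fires can also be checked mechanically, given a list of which trigger paths write to which resources. The sketch below is an illustration under assumed inputs: the `trigger resource` row format, the example names, and `shared_resource_seams` are hypothetical and not taken from the harness.

```sh
#!/bin/sh
# Hypothetical seam finder: report every resource that two or more distinct
# trigger paths write to. Reads "trigger resource" rows on stdin.
shared_resource_seams() {
  # sort -u drops duplicate rows so each (trigger, resource) pair counts once.
  sort -u | awk '
    { n[$2]++; list[$2] = (list[$2] == "" ? $1 : list[$2] "," $1) }
    END { for (r in n) if (n[r] >= 2) print r ": " list[r] }
  ' | sort
}

# Example shaped like the Issue 10 miss: manual and cron backups share a path.
printf '%s\n' \
  'manual-backup-button backup-zip-path' \
  'cron-scheduled-backup backup-zip-path' \
  'restore-upload uploads-tmp' | shared_resource_seams
```

Each line of output names a shared resource and the triggers that reach it, which is the pairing a cross-feature charter would need to cover.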

Files:

Final report: runs/2026-04-27T10-49-19_magellan-backups/final-report.md
Token usage: runs/2026-04-27T10-49-19_magellan-backups/token-usage.json
Manifest: runs/2026-04-27T10-49-19_magellan-backups/manifest.json

alopezari/pilot12-gist.md
