@alopezari
Created April 27, 2026 15:11
Magellan Pilot 12 — magellan-backups (Sonnet-Manager + Opus-planner first run; -56% cost vs Pilot 11; 8/10 recall, 2 charter-design misses traced to planner gaps not Mgr routing)

Magellan Pilot 12 — magellan-backups (Sonnet-Manager + Opus-planner architecture, first head-to-head vs Pilot 11's full-Opus Manager)

  • Run ID: 2026-04-27T10-49-19_magellan-backups
  • Plugin: magellan-backups (same regression-test plugin as Pilot 11; ISSUES.md stripped, blind greybox)
  • Kind: plugin
  • Ecosystem: core
  • Driver: Chrome DevTools MCP (headed; pilot run of the dialog-auto-dismiss work from 72aa07a)
  • Wallclock: 44 min (~10 min lost to a parallel-provision port-collision race; bug-adjusted ~34 min)
  • Total cost: $45.30 (vs Pilot 11's $102.90, −56%)
  • Recall: 8/10 (vs Pilot 11's 10/10)

Pre-pilot harness changes (pre-locked hypothesis from 904015e)

This was the FIRST validation of the Sonnet-default Manager + Opus-planner subagent architecture from b95af9e:

  • Manager runs on Sonnet 4.6 throughout the conversation (all the mechanical work — file IO, prompt assembly, jq merging, dispatch).
  • Planner-opus is a new subagent pinned to model: opus via frontmatter, dispatched ONLY for Phase 1.5 (static analysis) and Phase 3 (charter generation) — the high-cognition phases where Opus's nuance is load-bearing.
  • Testers stay on Sonnet (no change vs Pilot 11).
  • Stacked on top of: 5 process fixes from da3e894 and dialog auto-dismiss flags for both headed and headless modes from 72aa07a.

Pre-locked targets:

| Target | Goal |
|---|---|
| Recall | 10/10 (no regression vs Pilot 11) |
| Total cost | ≤ $75 |
| Manager cost | ≤ $45 |
| Wallclock | ≤ 30 min |
| Schema validation failures | 0 |
| CWD misroutes | 0 |

TL;DR — cost won, recall regressed by 2

| Metric | Pilot 11 (Opus Mgr) | Pilot 12 (Sonnet Mgr + Opus planner) | Δ |
|---|---|---|---|
| Recall | 10/10 | 8/10 | −2 |
| Total cost | $102.90 | $45.30 | −$57.60 (−56%) |
| Manager cost | $70.81 | $11.31 | −$59.50 |
| Subagent cost | $32.08 | $33.99 | +$1.91 |
| Schema validation failures | yes (jq-patched) | 0 | win |
| CWD misroutes | 6/6 wave Testers | 0 | win |
| Wallclock | 36 min | ~44 min* | * ~10 min lost to provisioning bug |

The 10-min provisioning loss came from concurrent studio site create invocations racing on port assignment (3/7 sites failed and had to be retried serially). The studio-provision.sh lock fix in c523d94 closes this.
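The locking idea can be illustrated with a short sketch. This is a minimal Python illustration of a mkdir-based lock, not the contents of studio-provision.sh (which this report does not show). The lock path, timeout, and function names are assumptions, and `fake_create_site` is a hypothetical stand-in for the real site-creation command. The technique relies on directory creation being atomic: only one caller can create the lock directory, so every other caller waits and retries.

```python
import os
import tempfile
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


@contextmanager
def mkdir_lock(lock_dir, timeout_s=60.0, poll_s=0.01):
    """Hold an exclusive lock by creating a directory (creation is atomic)."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            os.mkdir(lock_dir)  # raises FileExistsError while another holder exists
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire lock at {lock_dir}")
            time.sleep(poll_s)
    try:
        yield
    finally:
        os.rmdir(lock_dir)  # release so the next waiter can proceed


# Hypothetical stand-in for the real site-creation command. It records how
# many callers are inside the critical section at the same time.
state = {"active": 0, "max_active": 0}
state_guard = threading.Lock()


def fake_create_site(name):
    with state_guard:
        state["active"] += 1
        state["max_active"] = max(state["max_active"], state["active"])
    time.sleep(0.01)  # pretend to pick a port from the shared pool
    with state_guard:
        state["active"] -= 1
    return f"created {name}"


def create_site_locked(name, lock_dir):
    """Serialize only the creation step; callers may still run in parallel."""
    with mkdir_lock(lock_dir):
        return fake_create_site(name)


lock_path = os.path.join(tempfile.mkdtemp(), "studio-create.lock")
site_names = [f"site-{i}" for i in range(7)]
with ThreadPoolExecutor(max_workers=7) as pool:
    results = list(pool.map(lambda n: create_site_locked(n, lock_path), site_names))

print(results)
print("max concurrent creators:", state["max_active"])
```

Seven workers start together, as in the pilot's parallel provisioning, but only one at a time enters the creation step. A directory is used rather than a lock file because `mkdir` either succeeds or fails in a single step on common filesystems, which is what makes the approach portable.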

The architecture was net positive on mechanics but lost recall on two specific issues:

| # | Planted issue | Verdict | Why missed |
|---|---|---|---|
| 4 | User export (selective) includes user_pass hashes | MISS | Charter set found wp_users password leakage in the FULL backup ZIP (a3 confirmed), but no charter assigned the same probe to the SELECTIVE export surface; the planner-opus didn't carry the bug class across surfaces |
| 10 | Concurrent backups corrupt zip (manual + cron same minute) | MISS | Breadth Tester probed double-click rapid-fire (F3b) and found JS button-disable protection, but the cron-vs-manual race was not chartered as a cross-feature seam |

Critically: both misses traced to the planner-opus charter generation, NOT to the Sonnet Manager's orchestration. Sonnet's mechanical phases ran cleanly; Opus's planning phases had two specific gaps. This means the architecture itself worked as designed.

Per-charter PQIP

| Charter | Priority | Type | Bugs | Status |
|---|---|---|---|---|
| backup-artifact-andlist | critical | andlist | 6 | complete (12/12 turns) |
| restore-destructive-andlist | critical | andlist | 7 | complete (12/12 turns) |
| db-dump-scale-andlist | high | andlist | 3 | complete (10/12 turns) |
| schedule-settings-cluster | high | hypothesis-cluster | 5 | complete (8/8 turns) |
| schedule-x-artifact-crossfeature | high | cross-feature | 4 | complete (10/10 turns) |
| full-surface-breadth | high | breadth | 5 | complete (26/30 turns) |
| ui-cross-cutting-cluster | medium | hypothesis-cluster | 4 | complete (8/8 turns) |
| Totals | | | 34 | 7/7 complete |

Plus 1 Question, 12 Improvements, 10 Praises. Novel bugs found beyond the answer key: zip-slip path traversal (critical), arbitrary SQL execution via database.sql (critical), no server-side restore confirm gate, restore self-modifies plugin files mid-execution, persistent backup directory after deactivation, single shared nonce for all 3 AJAX endpoints, sidebar URL mismatch, etc. — 9+ findings not in ISSUES.md.

Token routing — exact numbers from token-usage.json

| Component | Model | Cost |
|---|---|---|
| Manager (this conversation) | Sonnet 4.6 | $7.15 |
| Manager (small Opus tail) | Opus 4.7 | $4.16 |
| Planner Phase 1.5 + Phase 3 | Opus 4.7 | $4.11 |
| Recon Tester + 7 wave Testers | Sonnet 4.6 | $29.88 |
| Total | | $45.30 |

The planner's Opus spend ($4.11) is about 12% of the $33.99 subagent total. The Manager-side cost dropped from $70.81 (Pilot 11, Opus) to $11.31 (Pilot 12, Sonnet plus a small Opus tail), an 84% reduction on the Manager line alone. Subagent cost was essentially flat.
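The roll-ups and percentages can be re-derived from the per-component costs. The sketch below is a plain arithmetic check in Python; the dictionary keys and variable names are illustrative, and the dollar amounts are copied from the tables in this report rather than read from token-usage.json.

```python
# Per-component costs in USD, copied from the token-routing table.
costs = {
    "manager_sonnet": 7.15,
    "manager_opus_tail": 4.16,
    "planner_opus": 4.11,
    "testers_sonnet": 29.88,
}
# Pilot 11 reference figures from the TL;DR table.
pilot11 = {"manager": 70.81, "total": 102.90}

manager = costs["manager_sonnet"] + costs["manager_opus_tail"]
subagents = costs["planner_opus"] + costs["testers_sonnet"]
total = manager + subagents

# Relative reductions versus Pilot 11.
manager_cut = (pilot11["manager"] - manager) / pilot11["manager"]
total_cut = (pilot11["total"] - total) / pilot11["total"]

# Share of the subagent line that went to the Opus planner.
planner_share = costs["planner_opus"] / subagents

print(f"manager ${manager:.2f}, subagents ${subagents:.2f}, total ${total:.2f}")
print(f"manager cut {manager_cut:.0%}, total cut {total_cut:.0%}")
print(f"planner-opus share of subagent cost {planner_share:.0%}")
```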

Process observations shipped after Pilot 12

The post-pilot retro shipped 8 harness fixes in c523d94:

  1. Studio port-collision lock — portable mkdir-based lock around studio site create so concurrent invocations don't race the port pool. Fixes the ~10-min wallclock loss this pilot saw.
  2. Phase 5 enforcement — loud "DO NOT WRITE final-report.md BY HAND" warning + 3-step ordered protocol (stamp completed_at → capture-run-tokens → aggregate-reports). Pilot 12 Manager wrote final-report freehand and never invoked capture-run-tokens.mjs, so cost was projected, not measured.
  3. completed_at is shell-generated with date -u +%Y-%m-%dT%H:%M:%SZ instead of the Manager typing it freehand (Pilot 12 typo'd a future timestamp).
  4. PQIP report key shape documented — schema gets a _doc field calling out pqip.{problems,questions,improvements,praises} (NOT items[]).
  5. AGENTS.md MCP serialization caveat — wave wallclock = longest Tester, not (sum/N).
  6. Multi-surface artifact rule (Pilot 12 Issue 4 fix) — if a plugin has ≥2 artifact-producing surfaces, the a3 probe MUST run on every surface independently. Codified in skills/tester-mindset/SKILL.md and planner-opus.md Step 8.5.
  7. Concurrent-trigger seam rule (Pilot 12 Issue 10 fix) — two trigger paths writing to a deterministic shared resource form a first-class cross-feature seam. Codified in skills/tester-mindset/SKILL.md and planner-opus.md Step 8.6.
  8. Recon role reframed as scout, not hunter — recon's deliverable is a briefing for the Tester army (terrain context), not a bug hit-list.
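Rules 6 and 7 above can be read as mechanical coverage checks over a charter set. The following is a minimal Python sketch under stated assumptions: the data shapes, surface names, trigger names, and resource names are all hypothetical and are not taken from planner-opus.md or charter-set.schema.json. It only illustrates how the two miss classes from this pilot could be flagged before Testers are dispatched.

```python
from itertools import combinations


def missing_surface_probes(artifact_surfaces, chartered, probe="a3"):
    """Rule 6: with two or more artifact-producing surfaces, the probe must be
    chartered on every surface independently. Returns uncovered surfaces."""
    if len(artifact_surfaces) < 2:
        return []
    return [s for s in artifact_surfaces if (probe, s) not in chartered]


def concurrent_trigger_seams(trigger_writes):
    """Rule 7: any two trigger paths that write the same shared resource form
    a cross-feature seam. Returns (trigger_a, trigger_b, resource) tuples."""
    seams = []
    for (t1, r1), (t2, r2) in combinations(sorted(trigger_writes.items()), 2):
        for resource in sorted(set(r1) & set(r2)):
            seams.append((t1, t2, resource))
    return seams


# Hypothetical inputs modeled on the two misses (Issues 4 and 10).
surfaces = ["full-backup-zip", "selective-user-export"]
chartered_probes = {("a3", "full-backup-zip")}  # a3 ran on the full ZIP only
missing = missing_surface_probes(surfaces, chartered_probes)

writes = {
    "manual-backup": ["backup-zip"],
    "cron-backup": ["backup-zip"],
    "settings-save": ["options"],
}
seams = concurrent_trigger_seams(writes)

print("surfaces missing the a3 probe:", missing)
print("seams to charter:", seams)
```

With these inputs the first check reports the selective export surface as uncovered, and the second pairs the manual and cron triggers on the shared backup file, which are the two gaps the rules are meant to close.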

Plus the T1 script-driven mechanical phases (4b4e0fe):

  • scripts/list-site-meta.sh — bulk site URL/path read (1 call instead of 7).
  • scripts/provision-charters.sh — bulk parallel provisioning with the new lock.
  • scripts/teardown-all-sites.sh — bulk teardown.
  • scripts/generate-charter-files.mjs + schemas/charter-set.schema.json — planner writes ONE structured JSON; script renders 8 markdown files deterministically.
  • Slimmer Tester dispatch prompts (~12 lines, not 60+).
  • Manager-side discipline: don't re-read charter files (Tester reads its own).
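The one-structured-JSON-in, many-markdown-files-out idea can be sketched as follows. This is a Python illustration, not the actual generate-charter-files.mjs script, and the JSON field names (`charters`, `id`, `priority`, `type`, `turns`, `probes`) and probe texts are hypothetical rather than taken from charter-set.schema.json. The point it shows is determinism: rendering is a pure function of the planner's JSON, so the same input always yields the same files.

```python
import json


def render_charter(charter):
    """Render one charter dict to markdown text (pure function of its input)."""
    lines = [
        f"# Charter: {charter['id']}",
        "",
        f"- Priority: {charter['priority']}",
        f"- Type: {charter['type']}",
        f"- Turn budget: {charter['turns']}",
        "",
        "## Probes",
        "",
    ]
    lines += [f"- {probe}" for probe in charter["probes"]]
    return "\n".join(lines) + "\n"


def render_charter_set(charter_set_json):
    """Return {filename: markdown} for every charter, ordered by charter id."""
    charter_set = json.loads(charter_set_json)
    charters = sorted(charter_set["charters"], key=lambda c: c["id"])
    return {f"{c['id']}.md": render_charter(c) for c in charters}


# Hypothetical planner output; charter ids, priorities, types, and turn
# budgets mirror the per-charter table, the probe texts are invented.
planner_json = json.dumps({
    "charters": [
        {"id": "schedule-settings-cluster", "priority": "high",
         "type": "hypothesis-cluster", "turns": 8,
         "probes": ["example probe B"]},
        {"id": "backup-artifact-andlist", "priority": "critical",
         "type": "andlist", "turns": 12,
         "probes": ["example probe A1", "example probe A2"]},
    ]
})

files = render_charter_set(planner_json)
for name in files:
    print(name)
```

Keeping the planner's output as data and the rendering as a script means the Opus planner spends tokens on charter content only, while file layout stays identical from run to run.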

Cross-pilot arc (updated)

| Pilot | Driver | Mgr model | Recall | Cost | Notes |
|---|---|---|---|---|---|
| 11 (backups) | Chrome DevTools headed | Opus | 10/10 | $102.90 | Item E first run |
| 12 (backups, Sonnet-Mgr first run) | Chrome DevTools headed | Sonnet+Opus-planner | 8/10 | $45.30 | −56% cost; 2-bug recall regression traced to planner gaps, not Mgr routing |

What Pilot 12 validated

  • Sonnet Manager works. Mechanics ran clean: zero schema failures, zero CWD misroutes, all 7 reports validated on first write.
  • Opus planner works when its rules are tight. The two misses (Issues 4 and 10) were both rule-shaped — a multi-surface gap and a missing cross-feature seam pattern — not "Opus got worse." Codifying those rules closes both miss classes for future pilots.
  • Cost story is real. 56% reduction is sustainable as long as the planner stays on Opus and the Manager stays on Sonnet.

What Pilot 12 left open

The 8 harness fixes + T1 land before the next pilot. Pilot 13 will measure whether they hold (no provisioning bug, clean Phase 5, multi-surface rule fires on Issue 4, concurrent-trigger rule fires on Issue 10).


Files:

  • Final report: runs/2026-04-27T10-49-19_magellan-backups/final-report.md
  • Token usage: runs/2026-04-27T10-49-19_magellan-backups/token-usage.json
  • Manifest: runs/2026-04-27T10-49-19_magellan-backups/manifest.json