@alopezari
Created April 27, 2026 15:11
Magellan Pilot 13 — magellan-backups (T1 + 8 harness fixes validated; recovered Issues 4 & 10; new a-anchor drift miss on Issue 5; -15% cost vs Pilot 12)

Magellan Pilot 13 — magellan-backups (T1 + 8 harness fixes validated; recall recovers Issues 4 & 10 but loses Issue 5 to a-anchor drift)

Run ID: 2026-04-27T12-28-35_magellan-backups

Plugin: magellan-backups (same regression-test plugin; ISSUES.md stripped for a blind greybox run)

Kind: plugin

Ecosystem: core

Driver: Chrome DevTools MCP (headless, for a speed comparison against Pilot 12's headed run)

Wallclock: 37 min

Total cost: $38.47 (vs Pilot 12 $45.30, −15%)

Recall: 9/10 (recovered Issues 4 + 10 from Pilot 12; lost Issue 5, a different miss class)

Pre-pilot harness changes

This pilot validated everything that landed between Pilot 12 and Pilot 13:

c523d94 — 8 harness fixes from the Pilot 12 retro

  1. Studio port-collision lock (mkdir-based, portable)
  2. Phase 5 enforcement (mandatory capture-run-tokens.mjs + aggregate-reports.mjs)
  3. Shell-generated completed_at timestamp
  4. PQIP key shape doc on the schema (pqip.{problems,questions,improvements,praises})
  5. AGENTS.md MCP serialization caveat
  6. Multi-surface artifact rule (Pilot 12 Issue 4 fix) — a3 probe MUST run on every artifact-producing surface
  7. Concurrent-trigger seam rule (Pilot 12 Issue 10 fix) — manual + cron writing to deterministic shared resource = cross-feature charter
  8. Recon reframed as scout (briefing-shaped output, not bug hunt)
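
Fix 1 relies on directory creation being atomic: when several Testers race to provision, exactly one `mkdir` succeeds and the rest see "already exists". The harness lock itself is not shown in this report, so the following is only a minimal sketch of the technique in Python; the function names, lock path, and timeout values are all assumptions made for illustration.

```python
import os
import tempfile
import time


def acquire_lock(lock_dir: str, timeout_s: float = 5.0, poll_s: float = 0.05) -> bool:
    """Try to take the lock by creating lock_dir; return True on success."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # mkdir is atomic: only one concurrent caller can create the directory.
            os.mkdir(lock_dir)
            return True
        except FileExistsError:
            # Someone else holds the lock; retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)


def release_lock(lock_dir: str) -> None:
    """Release the lock by removing the directory."""
    os.rmdir(lock_dir)


if __name__ == "__main__":
    # Hypothetical lock location; a real harness would pick a shared, stable path.
    lock_path = os.path.join(tempfile.mkdtemp(), "studio-port.lock")
    if acquire_lock(lock_path):
        try:
            print("lock held: safe to pick a port and provision")
        finally:
            release_lock(lock_path)
```

A directory is used rather than a lock file because `mkdir` fails cleanly when the target exists, which is what makes the approach portable across shells and platforms.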

4b4e0fe — T1: script-driven mechanical Manager phases

Targets the 30-40 mechanical Manager turns observed in Pilot 12:

  • scripts/list-site-meta.sh — bulk site URL/path emit (1 call vs N)
  • scripts/provision-charters.sh — bulk parallel provisioning (uses new lock)
  • scripts/teardown-all-sites.sh — bulk teardown
  • scripts/generate-charter-files.mjs + schemas/charter-set.schema.json — planner writes ONE structured JSON; script renders 8 markdown files deterministically. Reduces planner-opus from 8 file writes to 1.
  • Slimmer Tester dispatch prompts (~12 lines vs 60+)
  • Manager-side rule: don't re-read charter files (Tester reads its own)
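
The "one structured JSON in, N markdown files out" idea can be sketched as below. This is an illustration in Python, not the contents of generate-charter-files.mjs: the JSON field names (`run_id`, `charters`, `id`, `priority`, `type`, `max_turns`) and the markdown layout are assumptions, and the real schema in schemas/charter-set.schema.json may differ.

```python
import json

# Hypothetical charter-set document. Field names are illustrative only and are
# not taken from schemas/charter-set.schema.json.
CHARTER_SET_JSON = """
{
  "run_id": "2026-04-27T12-28-35_magellan-backups",
  "charters": [
    {"id": "restore-destructive-andlist", "priority": "critical",
     "type": "andlist", "max_turns": 12},
    {"id": "backup-artifact-andlist", "priority": "critical",
     "type": "andlist", "max_turns": 12}
  ]
}
"""


def render_charters(charter_set: dict) -> dict:
    """Return {filename: markdown}, ordered by charter id for stable output."""
    files = {}
    for charter in sorted(charter_set["charters"], key=lambda c: c["id"]):
        body = "\n".join([
            f"# Charter: {charter['id']}",
            "",
            f"- Run: {charter_set['run_id']}",
            f"- Priority: {charter['priority']}",
            f"- Type: {charter['type']}",
            f"- Turn budget: {charter['max_turns']}",
            "",
        ])
        files[f"{charter['id']}.md"] = body
    return files


if __name__ == "__main__":
    rendered = render_charters(json.loads(CHARTER_SET_JSON))
    for name in rendered:
        print(name)
```

Because rendering is a pure function of the JSON, the planner only has to get one structured write right, and the same input always yields the same files regardless of the order in which charters were listed.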

TL;DR — T1 delivered, but anchors regressed

| Metric | Pilot 12 | Pilot 13 | Δ |
|---|---|---|---|
| Recall | 8/10 | 9/10 | +1 (recovered Issues 4 + 10; lost Issue 5) |
| Total cost | $45.30 | $38.47 | −15% |
| Manager cost | $11.31 | $8.10 | −$3.21 (−28%) |
| Subagent cost | $33.99 | $30.37 | −$3.62 (−11%) |
| Schema validation failures | 0 | 0 | held |
| CWD misroutes | 0 | 0 | held |
| Provisioning failures | 3/7 (port race) | 0 | fixed by lock |
| Wallclock | 44 min | 37 min | −7 min |
| Charters | 7 | 8 | +1 (multi-surface rule split artifact andlist) |
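
The percentage deltas in the table follow directly from the dollar figures. As a quick check, here is a small Python sketch that recomputes them; the dictionary keys and the `pct_change` helper are names chosen for this example only.

```python
# Cost figures copied from the comparison table (USD).
PILOT_12 = {"total": 45.30, "manager": 11.31, "subagent": 33.99}
PILOT_13 = {"total": 38.47, "manager": 8.10, "subagent": 30.37}


def pct_change(before: float, after: float) -> int:
    """Relative change from before to after, rounded to a whole percent."""
    return round((after - before) / before * 100)


if __name__ == "__main__":
    for key in ("total", "manager", "subagent"):
        delta = PILOT_13[key] - PILOT_12[key]
        pct = pct_change(PILOT_12[key], PILOT_13[key])
        print(f"{key}: {delta:+.2f} USD ({pct:+d}%)")
```

Manager and subagent costs also sum to the reported totals in both pilots, so the three rows are mutually consistent.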

Recall vs the 10 planted issues

| # | Planted issue | Pilot 12 | Pilot 13 | Notes |
|---|---|---|---|---|
| 1 | Progress bar 100% | ✓ FOUND | ✓ FOUND | unchanged |
| 2 | Schedule time format mismatch | ✓ FOUND | ✓ FOUND | unchanged |
| 3 | Email option key mismatch | ✓ FOUND | ✓ FOUND | unchanged |
| 4 | User export hashed passwords | ✗ MISS | ✓ FOUND | Step 8.5 multi-surface rule fired correctly: it split out an export-artifact-andlist charter that probed user_pass independently |
| 5 | Uploads missing from "Full Backup" | ✓ FOUND | ✗ MISS | Pilot 13 regression; see "The one new miss class" below |
| 6 | No pre-restore backup | ✓ FOUND | ✓ FOUND | unchanged |
| 7 | Backups publicly accessible | ✓ FOUND | ✓ FOUND | unchanged |
| 8 | Corrupt restore truncates DB | ✓ FOUND | ✓ FOUND | unchanged |
| 9 | Large DB memory exhaustion | ✓ FOUND | ✓ FOUND | unchanged |
| 10 | Concurrent backups corrupt zip | ✗ MISS | ✓ FOUND | Step 8.6 concurrent-trigger rule fired: it emitted a concurrent-trigger-cross-feature charter; Tester confirmed via parallel studio wp eval |

Both Pilot 12 misses recovered. One new miss (Issue 5).
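
The recall bookkeeping reduces to set arithmetic over the ten planted issue numbers. The sketch below, in Python, derives the recall scores and the recovered/regressed sets from the misses reported for each pilot; the variable and function names are assumptions for this example.

```python
# Planted issues are numbered 1..10; misses are taken from the recall results.
ALL_ISSUES = set(range(1, 11))
PILOT_12_MISSED = {4, 10}
PILOT_13_MISSED = {5}


def recall(missed: set) -> str:
    """Format recall as found/total for a given set of missed issues."""
    return f"{len(ALL_ISSUES - missed)}/{len(ALL_ISSUES)}"


# Issues missed before but found now, and issues found before but missed now.
recovered = PILOT_12_MISSED - PILOT_13_MISSED
regressed = PILOT_13_MISSED - PILOT_12_MISSED

if __name__ == "__main__":
    print("Pilot 12 recall:", recall(PILOT_12_MISSED))
    print("Pilot 13 recall:", recall(PILOT_13_MISSED))
    print("Recovered:", sorted(recovered))
    print("Regressed:", sorted(regressed))
```

Tracking recovered and regressed sets separately matters here because the net change of +1 hides the fact that one previously found issue was lost.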

The one new miss class — artifact a-anchor drift

Pilot 12 used a6 = completeness (caught Issue 5: "uploads missing from full backup"). Pilot 13's planner used a6 = cleanup-on-deactivation instead. The Tester verified themes/* + plugins/* were present in the ZIP but never asked "is uploads/ here?". Same code path, different anchor framing → different finding.

Root cause: the artifact a-anchors had never been formally canonicalized. Each planner improvised the names.

Fix shipped after Pilot 13 (commit da895da): codify the canonical artifact a-anchors verbatim in skills/tester-mindset/SKILL.md:

| Anchor | Meaning |
|---|---|
| a1 | Location |
| a2 | Naming |
| a3 | Contents (leakage AND omissions, both mandatory) |
| a4 | Lifecycle (retention/rotation/cleanup-on-deactivation) |
| a5 | Default blast radius |
| a6 | Completeness against UI claim |

planner-opus.md charter-sizing table now points at this section with an explicit "Pilot 13 lost recall on Issue 5 because a6 was re-mapped" note. The next planner uses these names verbatim — cleanup-on-deactivation goes back into a4 where it belongs as a lifecycle question.
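
Once the anchor names are canonical, drift of the kind seen in Pilot 13 can in principle be detected mechanically before any Tester is dispatched. The following Python sketch shows one way to do that; it is not part of the harness described here, and the short anchor labels, the dictionary shape, and the `find_anchor_drift` helper are all assumptions for illustration.

```python
# Canonical artifact a-anchors, abbreviated from the table in this section.
CANONICAL_ANCHORS = {
    "a1": "location",
    "a2": "naming",
    "a3": "contents",
    "a4": "lifecycle",
    "a5": "default blast radius",
    "a6": "completeness against UI claim",
}


def find_anchor_drift(planner_anchors: dict) -> dict:
    """Return {anchor: (canonical, actual)} for every anchor that deviates."""
    drift = {}
    for key, canonical in CANONICAL_ANCHORS.items():
        actual = planner_anchors.get(key)  # None if the planner dropped the anchor
        if actual != canonical:
            drift[key] = (canonical, actual)
    return drift


if __name__ == "__main__":
    # Pilot 13's planner re-mapped a6, as described above.
    pilot_13 = dict(CANONICAL_ANCHORS, a6="cleanup-on-deactivation")
    for anchor, (expected, got) in find_anchor_drift(pilot_13).items():
        print(f"{anchor}: expected {expected!r}, got {got!r}")
```

A check like this compares names only; it would not catch a planner that keeps the canonical label but probes something else under it.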

Per-charter PQIP

| Charter | Priority | Type | Bugs | Status |
|---|---|---|---|---|
| backup-artifact-andlist | critical | andlist | 6 | complete (8/12 turns) |
| restore-destructive-andlist | critical | andlist | 7 | complete (12/12 turns) |
| db-dump-scale-andlist | high | andlist | 3 | complete (10/12 turns) |
| schedule-settings-cluster | high | hypothesis-cluster | 5 | complete (7/8 turns) |
| schedule-x-artifact-crossfeature | high | cross-feature | 4 | complete (10/10 turns) |
| full-surface-breadth | high | breadth | 5 | complete (18/30 turns) |
| ui-cross-cutting-cluster | medium | hypothesis-cluster | 4 | complete (8/8 turns) |
| Totals | | | 34 | 8/8 complete (incl. new export-artifact-andlist split) |

Token routing — exact

| Component | Model | Cost |
|---|---|---|
| Manager (during run) | Sonnet 4.6 | ~$5 |
| Manager (post-run review) | Opus 4.7 | ~$3 |
| Planner Phase 1.5 + Phase 3 | Opus 4.7 | $3.90 |
| Recon Tester + 7 wave Testers | Sonnet 4.6 | $26.48 |
| Total | | $38.47 |

T1 hit its target: ~30 mechanical Manager turns saved vs Pilot 12 (no per-charter file reads, no per-site jq calls, no per-charter teardown loops, no manual aggregation).

Cross-pilot arc

| Pilot | Driver | Mgr model | Recall | Cost | Notes |
|---|---|---|---|---|---|
| 11 (backups) | Chrome DevTools headed | Opus | 10/10 | $102.90 | Item E first run |
| 12 (backups, Sonnet-Mgr first run) | Chrome DevTools headed | Sonnet+Opus-planner | 8/10 | $45.30 | −56% cost; 2 charter-design misses |
| 13 (backups, T1 + 8 fixes validated) | Chrome DevTools headless | Sonnet+Opus-planner | 9/10 | $38.47 | Pilot 12 misses recovered; new a-anchor drift miss |

What Pilot 13 validated

  • T1 script-driven mechanical phases work: Manager cost down 28%, total cost down 15%
  • Studio port lock works: 0 provisioning failures across 8 concurrent self-provisioning Testers
  • Step 8.5 multi-surface rule fires correctly: Issue 4 recovered (was Pilot 12 miss)
  • Step 8.6 concurrent-trigger rule fires correctly: Issue 10 recovered (was Pilot 12 miss)
  • Headless mode saved ~7 min of wallclock vs Pilot 12's headed run (37 min vs 44 min)

What Pilot 13 surfaced

  • Improvised a-anchor names can drift between pilots → recall regression. Closed by da895da (canonical artifact AND-list anchors).
  • The Manager still has surface area to slim. Pilot 14 ran with playwright-cli-headed and exposed a Manager cc1h cache regression whose apparent cause is that multi-paragraph Tester return summaries get cached at the 1-hour TTL. That is the next layer of optimization.

Files:

  • Final report: runs/2026-04-27T12-28-35_magellan-backups/final-report.md
  • Token usage: runs/2026-04-27T12-28-35_magellan-backups/token-usage.json
  • Manifest: runs/2026-04-27T12-28-35_magellan-backups/manifest.json