Magellan Pilot 13 — magellan-backups (T1 + 8 harness fixes validated; recall recovers Issues 4 & 10 but loses Issue 5 to a-anchor drift)

Run ID: 2026-04-27T12-28-35_magellan-backups
Plugin: magellan-backups (same regression-test plugin; ISSUES.md stripped, blind greybox)
Kind: plugin
Ecosystem: core
Driver: Chrome DevTools MCP (headless, for speed comparison vs Pilot 12's headed run)
Wallclock: 37 min
Total cost: $38.47 (vs Pilot 12 $45.30, −15%)
Recall: 9/10 (recovered Issues 4 + 10 from Pilot 12; lost Issue 5, a different miss class)

Pre-pilot harness changes

This pilot validated everything that landed between Pilot 12 and Pilot 13:

`c523d94` — 8 harness fixes from the Pilot 12 retro

Studio port-collision lock (mkdir-based, portable)
Phase 5 enforcement (mandatory capture-run-tokens.mjs + aggregate-reports.mjs)
Shell-generated completed_at timestamp
PQIP key shape doc on the schema (pqip.{problems,questions,improvements,praises})
AGENTS.md MCP serialization caveat
Multi-surface artifact rule (Pilot 12 Issue 4 fix) — a3 probe MUST run on every artifact-producing surface
Concurrent-trigger seam rule (Pilot 12 Issue 10 fix) — manual + cron writing to deterministic shared resource = cross-feature charter
Recon reframed as scout (briefing-shaped output, not bug hunt)
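
The port-collision lock in the first fix works because directory creation is atomic: only one process can create a given directory, so a successful mkdir means that process holds the lock. The harness does this in shell. The following is a minimal JavaScript sketch of the same idea, not the harness code. The function names (`acquireLock`, `releaseLock`, `tryWithLock`) and the lock path passed in are hypothetical.

```javascript
const fs = require("node:fs");

// Try to take the lock. Without `recursive: true`, mkdir fails with EEXIST
// when the directory already exists, so exactly one caller can succeed.
function acquireLock(lockDir) {
  try {
    fs.mkdirSync(lockDir);
    return true;
  } catch (err) {
    if (err.code === "EEXIST") return false; // another process holds the lock
    throw err; // unexpected failure (permissions, missing parent, ...)
  }
}

// Release the lock by removing the (empty) lock directory.
function releaseLock(lockDir) {
  fs.rmdirSync(lockDir);
}

// Run `fn` only if the lock is free, and always release it afterwards.
// Hypothetical helper: callers decide whether to retry when `ran` is false.
function tryWithLock(lockDir, fn) {
  if (!acquireLock(lockDir)) return { ran: false };
  try {
    return { ran: true, result: fn() };
  } finally {
    releaseLock(lockDir);
  }
}
```

A Tester would wrap its port selection in `tryWithLock` and retry after a short delay when `ran` is false. Using a directory rather than a file avoids depending on exclusive-open flags, which is the portability the fix refers to.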

`4b4e0fe` — T1: script-driven mechanical Manager phases

Targets the 30-40 mechanical Manager turns observed in Pilot 12:

scripts/list-site-meta.sh — bulk site URL/path emit (1 call vs N)
scripts/provision-charters.sh — bulk parallel provisioning (uses new lock)
scripts/teardown-all-sites.sh — bulk teardown
scripts/generate-charter-files.mjs + schemas/charter-set.schema.json — planner writes ONE structured JSON; script renders 8 markdown files deterministically. Reduces planner-opus from 8 file writes to 1.
Slimmer Tester dispatch prompts (~12 lines vs 60+)
Manager-side rule: don't re-read charter files (Tester reads its own)
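
The "one structured JSON in, N markdown files out" step can be sketched as a pure rendering function. This is an illustration only: the charter fields used here (`id`, `priority`, `type`, `anchors`) and the markdown layout are assumptions, since the real schemas/charter-set.schema.json is not reproduced in this report.

```javascript
// Render one charter object to markdown. Field names are hypothetical.
function renderCharter(charter) {
  const lines = [
    `# Charter: ${charter.id}`,
    "",
    `- Priority: ${charter.priority}`,
    `- Type: ${charter.type}`,
    "",
    "## Anchors",
    "",
    ...charter.anchors.map((anchor) => `- ${anchor}`),
    "",
  ];
  return lines.join("\n");
}

// Render a whole charter set to a { filename: content } map.
// Sorting by id makes the output independent of the planner's ordering,
// so the same JSON always yields the same files.
function renderCharterSet(charterSet) {
  const sorted = [...charterSet.charters].sort((a, b) =>
    a.id < b.id ? -1 : a.id > b.id ? 1 : 0
  );
  const files = {};
  for (const charter of sorted) {
    files[`${charter.id}.md`] = renderCharter(charter);
  }
  return files;
}
```

Keeping the renderer free of file I/O means the planner's single JSON write is the only model-generated artifact; writing the returned map to disk is a mechanical loop.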

TL;DR — T1 delivered, but anchors regressed

Metric	Pilot 12	Pilot 13	Δ
Recall	8/10	9/10	+1 (recovered Issues 4 + 10; lost Issue 5)
Total cost	$45.30	$38.47	−15%
Manager cost	$11.31	$8.10	−$3.21 (−28%)
Subagent cost	$33.99	$30.37	−$3.62 (−11%)
Schema validation failures	0	0	held
CWD misroutes	0	0	held
Provisioning failures	3/7 (port race)	0	fixed by lock
Wallclock	44 min	37 min	−7 min
Charters	7	8	+1 (multi-surface rule split the artifact andlist charter)
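
The cost deltas in the table can be re-derived from the dollar figures alone. The sketch below copies the numbers from the table; the helper names (`percentChange`, `componentsMatchTotal`) are introduced here for illustration.

```javascript
// Figures copied from the Pilot 12 vs Pilot 13 table above (USD).
const costs = {
  total: { pilot12: 45.3, pilot13: 38.47 },
  manager: { pilot12: 11.31, pilot13: 8.1 },
  subagent: { pilot12: 33.99, pilot13: 30.37 },
};

// Relative change in percent; negative means Pilot 13 was cheaper.
function percentChange(before, after) {
  return ((after - before) / before) * 100;
}

// Manager + Subagent should reproduce the reported total, within half a cent.
function componentsMatchTotal(pilot) {
  const sum = costs.manager[pilot] + costs.subagent[pilot];
  return Math.abs(sum - costs.total[pilot]) < 0.005;
}

for (const [name, c] of Object.entries(costs)) {
  console.log(name, `${Math.round(percentChange(c.pilot12, c.pilot13))}%`);
}
```

Rounded to whole percentages this gives −15% total, −28% Manager, and −11% Subagent, and the two components sum to the reported total for both pilots.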

Recall vs the 10 planted issues

#	Planted issue	Pilot 12	Pilot 13	Notes
1	Progress bar 100%	✓	✓	unchanged
2	Schedule time format mismatch	✓	✓	unchanged
3	Email option key mismatch	✓	✓	unchanged
4	User export hashed passwords	✗ MISS	✓ FOUND	Step 8.5 multi-surface rule fired correctly — split into `export-artifact-andlist` charter that probed `user_pass` independently
5	Uploads missing from "Full Backup"	✓	✗ MISS	Pilot 13 regression — see "the one new miss class" below
6	No pre-restore backup	✓	✓	unchanged
7	Backups publicly accessible	✓	✓	unchanged
8	Corrupt restore truncates DB	✓	✓	unchanged
9	Large DB memory exhaustion	✓	✓	unchanged
10	Concurrent backups corrupt zip	✗ MISS	✓ FOUND	Step 8.6 concurrent-trigger rule fired — emitted `concurrent-trigger-cross-feature` charter; Tester confirmed via parallel `studio wp eval`

Both Pilot 12 misses recovered. One new miss (Issue 5).
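
The recovered/lost split is a set difference over the ✓ columns of the table above. A small sketch, with the issue numbers transcribed from the table and helper names (`diffRecall`, `recall`) that are illustrative only:

```javascript
// Issue numbers marked as found in each pilot's column of the recall table.
const found12 = new Set([1, 2, 3, 5, 6, 7, 8, 9]);
const found13 = new Set([1, 2, 3, 4, 6, 7, 8, 9, 10]);
const PLANTED = 10;

// Issues found now but not before are "recovered"; the reverse are "lost".
function diffRecall(previous, current) {
  const recovered = [...current].filter((n) => !previous.has(n));
  const lost = [...previous].filter((n) => !current.has(n));
  return { recovered, lost };
}

// Recall as the "found/planted" fraction used throughout this report.
function recall(found) {
  return `${found.size}/${PLANTED}`;
}

console.log(recall(found12), recall(found13), diffRecall(found12, found13));
```

Tracking the per-issue sets rather than only the headline fraction is what makes the Issue 5 regression visible: a net +1 in recall hides one loss behind two recoveries.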

The one new miss class — artifact a-anchor drift

Pilot 12 used a6 = completeness (caught Issue 5: "uploads missing from full backup"). Pilot 13's planner used a6 = cleanup-on-deactivation instead. The Tester verified themes/* + plugins/* were present in the ZIP but never asked "is uploads/ here?". Same code path, different anchor framing → different finding.

Root cause: the artifact a-anchors had never been formally canonicalized. Each planner improvised the names.

Fix shipped after Pilot 13 (commit da895da): codify the canonical artifact a-anchors verbatim in skills/tester-mindset/SKILL.md:

a1	a2	a3	a4	a5	a6
Location	Naming	Contents (leakage AND omissions, both mandatory)	Lifecycle (retention/rotation/cleanup-on-deactivation)	Default blast radius	Completeness against UI claim

planner-opus.md charter-sizing table now points at this section with an explicit "Pilot 13 lost recall on Issue 5 because a6 was re-mapped" note. The next planner uses these names verbatim — cleanup-on-deactivation goes back into a4 where it belongs as a lifecycle question.
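
Once the anchors are canonical, drift of the kind that cost Issue 5 can be detected mechanically before any Tester is dispatched. The sketch below is a hypothetical check, not part of the harness: the short slugs are an assumed encoding of the table above, and the Pilot 13 example assumes a1 through a5 matched, since only the a6 re-mapping is documented.

```javascript
// Short slugs for the canonical artifact anchors. The slugs are an assumption
// made for this sketch; skills/tester-mindset/SKILL.md holds the real wording.
const CANONICAL_ANCHORS = Object.freeze({
  a1: "location",
  a2: "naming",
  a3: "contents",
  a4: "lifecycle",
  a5: "default-blast-radius",
  a6: "completeness",
});

// Return every anchor whose planner-supplied name differs from the canonical one.
function findAnchorDrift(plannerAnchors) {
  const drift = [];
  for (const [key, expected] of Object.entries(CANONICAL_ANCHORS)) {
    const actual = plannerAnchors[key];
    if (actual !== expected) drift.push({ key, expected, actual });
  }
  return drift;
}

// Pilot 13's documented re-mapping: a6 became cleanup-on-deactivation.
const pilot13Anchors = { ...CANONICAL_ANCHORS, a6: "cleanup-on-deactivation" };
console.log(findAnchorDrift(pilot13Anchors));
```

A non-empty result would be grounds to reject the planner's charter set before provisioning, which is cheaper than discovering the gap as a recall miss after the run.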

Per-charter PQIP

Charter	Priority	Type	Bugs	Status
backup-artifact-andlist	critical	andlist	6	complete (8/12 turns)
restore-destructive-andlist	critical	andlist	7	complete (12/12 turns)
db-dump-scale-andlist	high	andlist	3	complete (10/12 turns)
schedule-settings-cluster	high	hypothesis-cluster	5	complete (7/8 turns)
schedule-x-artifact-crossfeature	high	cross-feature	4	complete (10/10 turns)
full-surface-breadth	high	breadth	5	complete (18/30 turns)
ui-cross-cutting-cluster	medium	hypothesis-cluster	4	complete (8/8 turns)
Totals			34	8/8 complete (incl. new export-artifact-andlist split)

Token routing — exact

Component	Model	Cost
Manager (during run)	Sonnet 4.6	~$5
Manager (post-run review)	Opus 4.7	~$3
Planner Phase 1.5 + Phase 3	Opus 4.7	$3.90
Recon Tester + 7 wave Testers	Sonnet 4.6	$26.48
Total		$38.47

T1 hit its target: ~30 mechanical Manager turns saved vs Pilot 12 (no per-charter file reads, no per-site jq calls, no per-charter teardown loops, no manual aggregation).

Cross-pilot arc

Pilot	Driver	Mgr model	Recall	Cost	Notes
11 (backups)	Chrome DevTools headed	Opus	10/10	$102.90	Item E first run
12 (backups, Sonnet-Mgr first run)	Chrome DevTools headed	Sonnet+Opus-planner	8/10	$45.30	−56% cost; 2 charter-design misses
13 (backups, T1 + 8 fixes validated)	Chrome DevTools headless	Sonnet+Opus-planner	9/10	$38.47	Pilot 12 misses recovered; new a-anchor drift miss

What Pilot 13 validated

T1 script-driven mechanical phases work: Manager cost down 28%, total cost down 15%
Studio port lock works: 0 provisioning failures across 8 concurrent self-provisioning Testers
Step 8.5 multi-surface rule fires correctly: Issue 4 recovered (was Pilot 12 miss)
Step 8.6 concurrent-trigger rule fires correctly: Issue 10 recovered (was Pilot 12 miss)
Headless mode: ~7 min wallclock saved vs the headed Pilot 12 run (44 → 37 min)

What Pilot 13 surfaced

Improvised a-anchor names can drift between pilots → recall regression. Closed by da895da (canonical artifact AND-list anchors).
The Manager still has surface area to slim: Pilot 14 ran with playwright-cli-headed and exposed a Manager cc1h cache regression, which points at multi-paragraph Tester return summaries being cached at the 1-hour TTL. That's the next layer of optimization.

Files:

Final report: runs/2026-04-27T12-28-35_magellan-backups/final-report.md
Token usage: runs/2026-04-27T12-28-35_magellan-backups/token-usage.json
Manifest: runs/2026-04-27T12-28-35_magellan-backups/manifest.json

alopezari/pilot13-gist.md