Magellan Pilot 14 — magellan-backups (first playwright-cli-headed pilot, 10/10 recall, but Manager cost regressed via cache-creation jump)

Run ID: 2026-04-27T14-14-00_magellan-backups
Plugin: magellan-backups (same regression-test plugin; ISSUES.md stripped, blind greybox)
Kind: plugin
Ecosystem: core
Driver: Playwright CLI (headed). First pilot of the third browser-driver tier (custom spec-file approach; superseded by f395200, which switched to Microsoft's official @playwright/cli)
Wallclock: 33 min (best of the series)
Total cost: $53.06 (vs Pilot 13 $38.47, +38%)
Recall: 10/10 (first clean recall under the new architecture)

Why a third browser driver

Pilot 13 confirmed Chrome DevTools MCP and Playwright MCP both work, but both share a single MCP server process per registered name — every Tester's tool call queues through one Node process. The hypothesis: a per-Tester process model would give true parallelism + reproducible artifacts + no MCP namespace bookkeeping.

Pilot 14 was the first head-to-head test of that hypothesis. Same plugin, same charters, same conditions as Pilot 13. By design only the driver changed, although the Manager model also differed this run (see Token routing).

Pre-pilot harness changes

da895da — Canonical artifact a-anchors a1–a6 codified (the Pilot 13 retro fix). The planner uses fixed definitions for the artifact AND-list anchors:

a1	a2	a3	a4	a5	a6
Location	Naming	Contents (leakage AND omissions)	Lifecycle	Default blast radius	Completeness against UI claim

A new playwright-cli-headed / playwright-cli-headless driver tier added to the harness: skills/browser-driver-playwright-cli/SKILL.md, scripts/playwright-cli-init.sh, driver-list updates in skills/browser-driver/SKILL.md, .claude/agents/tester.md, and AGENTS.md. Note: this initial implementation used a custom spec-file approach (probes/charter.spec.ts + npx playwright test). The Pilot 14 retro replaced it with Microsoft's bundled @playwright/cli skill (see f395200).
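As an illustration of what "fixed definitions" could look like in code, here is a minimal sketch that pins the six anchors in an enum and flags any anchor a Tester report left without a verdict. The names `ArtifactAnchor` and `missing_anchors`, and the shape of the `verdicts` mapping, are assumptions for this example and are not taken from the harness.

```python
from enum import Enum


class ArtifactAnchor(Enum):
    """Canonical artifact AND-list anchors a1-a6, as listed in the pilot notes."""

    A1 = "Location"
    A2 = "Naming"
    A3 = "Contents (leakage AND omissions)"
    A4 = "Lifecycle"
    A5 = "Default blast radius"
    A6 = "Completeness against UI claim"


def missing_anchors(verdicts: dict[str, str]) -> list[str]:
    """Return anchor ids (for example 'a6') that have no verdict yet.

    `verdicts` is a hypothetical mapping of anchor id -> verdict string.
    """
    return [a.name.lower() for a in ArtifactAnchor if a.name.lower() not in verdicts]


if __name__ == "__main__":
    # A report that skipped a6 would be flagged before it is accepted.
    partial = {"a1": "pass", "a2": "pass", "a3": "fail", "a4": "pass", "a5": "pass"}
    print(missing_anchors(partial))
```

Keeping the definitions in one place is the point of the fix: the planner cannot drift on what a6 means between runs if every charter reads the same table.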

TL;DR — recall win, but cost regressed

Metric	Pilot 13 (Chrome DevTools MCP, headless)	Pilot 14 (Playwright CLI, headed)	Δ
Recall	9/10	10/10	+1 (Issue 5 recovered via canonical a6)
Total cost	$38.47	$53.06	+$14.59 (+38%)
Manager cost	$8.10 (33 msgs)	$25.42 (58 msgs)	+$17.32 (+214%)
Subagent cost	$30.37	$27.64	−$2.73 (−9%)
`cc1h` (Manager 1h cache creation)	$0.58	$10.43	+18× ⚠
Wallclock	37 min	33 min	−4 min ✓
Schema validation failures	0	0	held
CWD misroutes	0	0	held
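The cost rows of the Δ column can be recomputed from the two pilots' figures. The sketch below does that with `Decimal` so the cent-level differences stay exact; the dictionary keys and the `delta` helper are names chosen for this example only.

```python
from decimal import Decimal

# (Pilot 13, Pilot 14) values in USD, copied from the table above.
COSTS = {
    "total": (Decimal("38.47"), Decimal("53.06")),
    "manager": (Decimal("8.10"), Decimal("25.42")),
    "subagent": (Decimal("30.37"), Decimal("27.64")),
    "cc1h": (Decimal("0.58"), Decimal("10.43")),
}


def delta(metric: str) -> tuple[Decimal, Decimal, Decimal]:
    """Return (absolute difference, percent change, ratio) for one metric."""
    before, after = COSTS[metric]
    diff = after - before
    return diff, diff / before * 100, after / before


if __name__ == "__main__":
    for name in COSTS:
        diff, pct, ratio = delta(name)
        print(f"{name:9s} diff={diff:+.2f} pct={pct:+.1f}% ratio={ratio:.1f}x")
```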

Recall recovered to 10/10 (Issue 5 caught via the canonical a6 = completeness against UI claim anchor — the Tester correctly observed that themes/* + plugins/* were in the ZIP but uploads/ was absent despite the UI claiming "full backup of wp-content"). Both Pilot 12 misses (Issues 4 and 10) stayed caught via the multi-surface and concurrent-trigger rules.

The Manager cost regression — and why it's fixable

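Manager cost roughly tripled ($8.10 to $25.42). The Manager ran on Opus this pilot, which amplified the jump (see Token routing), but the breakdown identified two root causes beyond the model switch: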

Root cause 1 — multi-paragraph Tester returns landing in conversation cache

Each of the 8 Testers returned a 500-word findings summary (anchor verdicts, problem list, file paths). Those summaries land in the Manager's conversation context and get cached at 1-hour TTL. With 8 Testers + planner + recon = ~10 long messages × ~500 words ≈ 5K cacheable words. Re-cache cycles as the conversation grows multiplied this by ~18×.

Fix shipped: f395200 — Tester role spec amended to require ONE-LINE returns (status=<status> report=<absolute-path>). All findings stay in report.json on disk; Manager reads final-report.md (produced deterministically by aggregate-reports.mjs) instead of accumulating per-Tester summaries.
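To make the one-line contract concrete, here is a minimal sketch of how a Manager-side check might parse and validate a Tester return. It assumes the line is exactly `status=<status> report=<absolute-path>` with no spaces inside either value; the function and type names are hypothetical and not part of the harness.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple


class TesterReturn(NamedTuple):
    status: str
    report: PurePosixPath


# Assumed shape: two space-separated key=value fields, values without whitespace.
_RETURN_RE = re.compile(r"status=(?P<status>\S+) report=(?P<report>\S+)")


def parse_tester_return(text: str) -> TesterReturn:
    """Parse a one-line Tester return, rejecting multi-line or malformed input."""
    line = text.strip()
    if "\n" in line:
        raise ValueError("Tester return must be a single line")
    match = _RETURN_RE.fullmatch(line)
    if match is None:
        raise ValueError(f"unexpected return shape: {line!r}")
    report = PurePosixPath(match["report"])
    if not report.is_absolute():
        raise ValueError("report path must be absolute")
    return TesterReturn(match["status"], report)


if __name__ == "__main__":
    # The path below is a placeholder, not a real run artifact.
    print(parse_tester_return("status=complete report=/tmp/example-run/report.json"))
```

Rejecting anything longer than one line at parse time keeps a verbose Tester from quietly reintroducing the cached summaries.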

Root cause 2 — Manager added belt-and-suspenders driver guidance to dispatch prompts

Looking at the Manager's actual dispatch prompts, each one had inline notes like "Driver is playwright-cli-headed — write a Playwright spec, not MCP calls. Run scripts/playwright-cli-init.sh after provisioning. See skills/browser-driver-playwright-cli/SKILL.md." That overhead × 8 dispatches inflated the conversation. The Tester role spec already auto-loads the driver skill based on the DRIVER: line — the Manager doesn't need to repeat any of it.

Fix shipped: f395200 — .claude/commands/test-plugin.md Phase 4 amended with explicit "no driver-specific addenda" rule + anti-pattern table. The dispatch prompt is the SAME shape regardless of driver; only DRIVER: differs.
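The "same shape regardless of driver" rule can be sketched as a prompt builder in which the driver is one parameter and nothing else branches on it. Only the `DRIVER:` line and the one-line return format come from the notes above; the `CHARTER:` field, the function names, and the MCP driver label are assumptions for illustration.

```python
def build_dispatch_prompt(charter_slug: str, driver: str) -> str:
    """Build a dispatch prompt whose only driver-dependent line is DRIVER:."""
    return "\n".join(
        [
            f"CHARTER: {charter_slug}",  # hypothetical field name
            f"DRIVER: {driver}",
            "Return one line: status=<status> report=<absolute-path>",
        ]
    )


def differing_lines(a: str, b: str) -> list[tuple[str, str]]:
    """Return the pairs of lines that differ between two prompts of equal length."""
    return [(x, y) for x, y in zip(a.splitlines(), b.splitlines()) if x != y]


if __name__ == "__main__":
    cli = build_dispatch_prompt("breadth-tour", "playwright-cli-headed")
    mcp = build_dispatch_prompt("breadth-tour", "chrome-devtools-mcp")  # assumed label
    print(differing_lines(cli, mcp))
```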

And: the custom spec-file approach was also a mismatch

The first Pilot 14 implementation had Testers writing probes/charter.spec.ts files and running npx playwright test. After the run, the user pointed at Microsoft's bundled @playwright/cli skill (microsoft/playwright-cli repo, skills/playwright-cli/SKILL.md) — an agentic shell-CLI tool designed for AI agents to drive the browser one command at a time, not write test files.

Fix shipped: f395200 replaces the custom skill with a thin Magellan bridge over the official skill. scripts/playwright-cli-init.sh becomes a deprecation pointer; scripts/playwright-cli-config.sh writes a per-session config with the same dialog-suppression flags as the MCP drivers. The Tester's workflow is now: playwright-cli -s=<charter-slug> open <url> → snapshot returns refs (e1, e2, ...) → click e5 → repeat. Same agentic loop as MCP, but over shell commands.
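The open, snapshot, click loop can be sketched as a sequence of shell commands assembled in Python. Only the command shapes quoted above (`-s=<charter-slug>`, `open <url>`, `snapshot`, `click <ref>`) come from these notes. Passing `-s=` on every step, the helper name, and the URL are assumptions, and the sketch only prints the commands rather than running them.

```python
import shlex


def cli_command(session: str, *args: str) -> list[str]:
    """Build argv for one playwright-cli step in the named session.

    Assumption: the session flag is repeated on every command.
    """
    return ["playwright-cli", f"-s={session}", *args]


SESSION = "backup-zip-artifact-andlist"  # a charter slug from this pilot
URL = "http://localhost:8080/wp-admin/"  # placeholder URL, not from the run

steps = [
    cli_command(SESSION, "open", URL),
    cli_command(SESSION, "snapshot"),  # returns element refs such as e1, e2, ...
    cli_command(SESSION, "click", "e5"),  # act on a ref from the last snapshot
]

if __name__ == "__main__":
    for argv in steps:
        print(shlex.join(argv))
```

Each step is a separate process invocation, which is what gives every Tester its own browser session instead of a shared MCP server queue.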

Per-charter PQIP

Charter	Priority	Type	Bugs	Status
backup-zip-artifact-andlist	critical	andlist	6	complete
selective-export-artifact-andlist	critical	andlist	5	complete
restore-destructive-andlist	critical	andlist	6	complete
db-dump-scale-andlist	high	andlist	3	complete
manual-cron-concurrent-collision	high	cross-feature	2	complete
schedule-cookie-cluster	high	hypothesis-cluster	4	complete
delete-deactivation-cluster	medium	hypothesis-cluster	3	complete
breadth-tour	medium	breadth	6	complete
Totals			35	8/8 complete

10/10 recall. Both Step 8.5 (multi-surface) and Step 8.6 (concurrent-trigger) amendments fired correctly.

Token routing — exact

Component	Model	Cost
Manager (during run)	Opus 4.7	$25.42
Planner Phase 1.5 + Phase 3	Opus 4.7	$7.22
Recon Tester + 7 wave Testers	Sonnet 4.6	$20.43
Total		$53.06
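A quick consistency check on the routing table: the sketch below sums the three components and computes each one's share of the reported total. Because each row is rounded separately, the component sum can differ from the reported total by a cent. The variable names are chosen for this example only.

```python
from decimal import Decimal

# Component costs in USD, copied from the table above.
COMPONENTS = {
    "Manager (Opus 4.7)": Decimal("25.42"),
    "Planner (Opus 4.7)": Decimal("7.22"),
    "Testers (Sonnet 4.6)": Decimal("20.43"),
}
REPORTED_TOTAL = Decimal("53.06")

component_sum = sum(COMPONENTS.values(), Decimal("0"))
rounding_gap = component_sum - REPORTED_TOTAL
shares = {name: cost / REPORTED_TOTAL * 100 for name, cost in COMPONENTS.items()}

if __name__ == "__main__":
    print(f"sum of components: {component_sum} (gap vs reported: {rounding_gap:+})")
    for name, share in shares.items():
        print(f"{name:22s} {share:.1f}%")
```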

The Manager was on Opus this pilot (user had switched models post-Pilot 13) which amplified the cache-creation regression. With the post-Pilot-14 fixes (file-backed Tester returns + dispatch discipline) and a Sonnet Manager, the same workload should project to ~$10–12 Manager cost (vs Pilot 13's $8.10 baseline).

Honest research finding — the driver tradeoff

Playwright CLI is a real third option, not a clear winner over Chrome DevTools MCP. After three pilots:

Aspect	Verdict
Recall	CLI wins (10/10 vs 9/10) — but most of the win is the canonical-anchor fix, not the driver itself
Cost (this pilot)	MCP wins (−$15) — but Manager-side overhead, fixable by `f395200`
Wallclock	CLI wins (−4 min) — true per-process isolation
Reproducibility	CLI wins — the bundled tool's session-state save/load is regression-friendly
Operator visibility (headed)	CLI wins — operator could watch the actual flows
Effort to build	MCP wins — already mature; CLI driver is a new tier

My read: Playwright CLI is better for batched deterministic charters (andlist with pre-known a1–a6 anchors, cross-feature with scripted seam probes, hypothesis-cluster with state-shape verifications). MCP stays better for breadth + recon (interactive look-think-act). The right routing is per-charter — the planner picks the driver based on charter shape; a single wave can mix drivers. skills/browser-driver/SKILL.md "When to use which driver" matrix codifies this.
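The per-charter routing described above amounts to a lookup from charter type to driver. A minimal sketch, assuming the charter type strings from the Per-charter table plus a "recon" type, and a hypothetical `chrome-devtools-mcp` label for the MCP driver (the harness's actual driver identifiers may differ):

```python
CLI = "playwright-cli-headed"
MCP = "chrome-devtools-mcp"  # hypothetical label for the MCP driver

# Batched deterministic charters go to the CLI; interactive ones stay on MCP.
DRIVER_BY_CHARTER_TYPE = {
    "andlist": CLI,
    "cross-feature": CLI,
    "hypothesis-cluster": CLI,
    "breadth": MCP,
    "recon": MCP,
}


def pick_driver(charter_type: str, default: str = MCP) -> str:
    """Pick a driver from the charter's shape, falling back to the default."""
    return DRIVER_BY_CHARTER_TYPE.get(charter_type, default)


if __name__ == "__main__":
    # A single wave can mix drivers.
    wave = {
        "backup-zip-artifact-andlist": "andlist",
        "manual-cron-concurrent-collision": "cross-feature",
        "breadth-tour": "breadth",
    }
    for slug, charter_type in wave.items():
        print(f"{slug}: {pick_driver(charter_type)}")
```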

Cross-pilot arc

Pilot	Driver	Mgr model	Recall	Cost	Notes
11 (backups)	Chrome DevTools headed	Opus	10/10	$102.90	Item E first run
12 (backups, Sonnet-Mgr first run)	Chrome DevTools headed	Sonnet+Opus-planner	8/10	$45.30	−56% cost; 2 charter-design misses
13 (backups, T1 + 8 fixes validated)	Chrome DevTools headless	Sonnet+Opus-planner	9/10	$38.47	Pilot 12 misses recovered; a-anchor drift miss
14 (backups, playwright-cli-headed)	Playwright CLI (headed)	Opus + Opus-planner	10/10	$53.06	First clean recall; Mgr cost regression (cc1h) traced to Tester returns + dispatch addenda — fixed in `f395200`

What Pilot 14 validated

Canonical a-anchors recover Issue 5 without re-introducing Pilot 12's misses. 10/10 reachable under Sonnet-default architecture.
Playwright CLI is viable as a third driver — wallclock benefit is real, recall held, reproducibility win is meaningful.
The cc1h cache jump is diagnosable and fixable — not a fundamental property of the driver, just a Tester-return + dispatch-prompt shape problem. Fixed in f395200.

What Pilot 14 surfaced for the next round

The retro fixes in f395200 won't be measured until Pilot 15. The pre-locked targets:

Recall: 10/10
Total cost: ≤ Pilot 13's $38.47 (i.e., Pilot 14 wallclock benefit + Pilot 13 cost discipline)
Manager cc1h: back to Pilot 13 levels (~$0.50)
Tester return shape: one-line per Tester (no multi-paragraph summaries in conversation context)
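The pre-locked targets can be written down as a pass/fail check so that Pilot 15 is judged against fixed numbers. This is a sketch under assumptions: the `RunSummary` fields and `check_targets` are hypothetical, the cc1h ceiling uses Pilot 13's $0.58 as a stand-in for "~$0.50", and the return-line count for Pilot 14 is a placeholder meaning "more than one line".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    recall_found: int
    recall_total: int
    total_cost_usd: float
    manager_cc1h_usd: float
    max_tester_return_lines: int  # hypothetical metric


def check_targets(
    run: RunSummary,
    cost_ceiling: float = 38.47,  # Pilot 13 total cost
    cc1h_ceiling: float = 0.58,  # assumption: Pilot 13's cc1h as the ceiling
) -> dict[str, bool]:
    """Return a pass/fail verdict per pre-locked target."""
    return {
        "recall": run.recall_found == run.recall_total,
        "total_cost": run.total_cost_usd <= cost_ceiling,
        "manager_cc1h": run.manager_cc1h_usd <= cc1h_ceiling,
        "tester_return_shape": run.max_tester_return_lines == 1,
    }


if __name__ == "__main__":
    # Pilot 14 figures; the last field is a placeholder for multi-paragraph returns.
    pilot_14 = RunSummary(10, 10, 53.06, 10.43, 2)
    print(check_targets(pilot_14))
```

Run against Pilot 14's own numbers, only the recall target passes, which matches the framing of this pilot as a recall win with a cost regression.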

Files:

Final report: runs/2026-04-27T14-14-00_magellan-backups/final-report.md
Token usage: runs/2026-04-27T14-14-00_magellan-backups/token-usage.json
Manifest: runs/2026-04-27T14-14-00_magellan-backups/manifest.json

Related commits:

da895da — canonical artifact a-anchors (Pilot 13 retro)
f395200 — Pilot 14 retro: switch to official @playwright/cli + tighten Manager↔Tester contract

alopezari/pilot14-gist.md
