@alopezari
Created April 27, 2026 15:11
Magellan Pilot 14 — magellan-backups (first playwright-cli-headed pilot, 10/10 recall, but Manager cc1h cost regression — fixed by f395200 switching to official @playwright/cli skill)

Magellan Pilot 14 — magellan-backups (first playwright-cli-headed pilot, 10/10 recall, but Manager cost regressed via cache-creation jump)

Run ID: 2026-04-27T14-14-00_magellan-backups
Plugin: magellan-backups (same regression-test plugin; ISSUES.md stripped, blind greybox)
Kind: plugin
Ecosystem: core
Driver: Playwright CLI (headed), the first pilot of the third browser-driver tier (custom spec-file approach; superseded by f395200, which switched to Microsoft's official @playwright/cli)
Wallclock: 33 min (best of the series)
Total cost: $53.06 (vs Pilot 13 $38.47, +38%)
Recall: 10/10 (first clean recall under the new architecture)

Why a third browser driver

Pilot 13 confirmed Chrome DevTools MCP and Playwright MCP both work, but both share a single MCP server process per registered name — every Tester's tool call queues through one Node process. The hypothesis: a per-Tester process model would give true parallelism + reproducible artifacts + no MCP namespace bookkeeping.

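Pilot 14 was the first head-to-head test of that hypothesis: same plugin, same charters, same conditions as Pilot 13. The driver was the only intended change, although the run was also headed rather than headless and the Manager ran on Opus (see "Token routing").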

Pre-pilot harness changes

  • da895da: Canonical artifact a-anchors a1–a6 codified (the Pilot 13 retro fix). The planner uses fixed definitions for the artifact AND-list anchors:

    | a1 | a2 | a3 | a4 | a5 | a6 |
    | --- | --- | --- | --- | --- | --- |
    | Location | Naming | Contents (leakage AND omissions) | Lifecycle | Default blast radius | Completeness against UI claim |
  • A new playwright-cli-headed / playwright-cli-headless driver tier added to the harness: skills/browser-driver-playwright-cli/SKILL.md, scripts/playwright-cli-init.sh, driver-list updates in skills/browser-driver/SKILL.md, .claude/agents/tester.md, and AGENTS.md. Note: this initial implementation used a custom spec-file approach (probes/charter.spec.ts + npx playwright test). The Pilot 14 retro replaced it with Microsoft's bundled @playwright/cli skill — see f395200.

TL;DR — recall win, but cost regressed

| Metric | Pilot 13 (Chrome DevTools MCP, headless) | Pilot 14 (Playwright CLI, headed) | Δ |
| --- | --- | --- | --- |
| Recall | 9/10 | 10/10 | +1 (Issue 5 recovered via canonical a6) |
| Total cost | $38.47 | $53.06 | +$14.59 (+38%) |
| Manager cost | $8.10 (33 msgs) | $25.42 (58 msgs) | +$17.32 (+213%) |
| Subagent cost | $30.37 | $27.64 | −$2.73 (−9%) |
| cc1h (Manager 1h cache creation) | $0.58 | $10.43 | +18× |
| Wallclock | 37 min | 33 min | −4 min |
| Schema validation failures | 0 | 0 | held |
| CWD misroutes | 0 | 0 | held |

Recall recovered to 10/10. Issue 5 was caught via the canonical a6 anchor (completeness against UI claim): the Tester correctly observed that themes/* and plugins/* were in the ZIP but uploads/ was absent, despite the UI claiming a "full backup of wp-content". Both Pilot 12 misses (Issues 4 and 10) stayed caught via the multi-surface and concurrent-trigger rules.
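The Δ column can be re-derived from the rounded figures in the table. The following is a minimal sketch of that arithmetic; the dictionary keys and function names are illustrative and not part of the harness. From the rounded inputs, the Manager increase computes to just under 214%, so the table's +213% presumably reflects truncation or unrounded source values.

```python
# Figures copied from the TL;DR table (USD, rounded to cents).
pilot13 = {"total": 38.47, "manager": 8.10, "subagent": 30.37, "cc1h": 0.58}
pilot14 = {"total": 53.06, "manager": 25.42, "subagent": 27.64, "cc1h": 10.43}


def delta(metric: str) -> tuple[float, float]:
    """Return (absolute change in USD, relative change in percent)."""
    before, after = pilot13[metric], pilot14[metric]
    change = after - before
    return round(change, 2), round(change / before * 100, 1)


def ratio(metric: str) -> float:
    """Return the Pilot 14 value as a multiple of the Pilot 13 value."""
    return round(pilot14[metric] / pilot13[metric], 1)


for name in ("total", "manager", "subagent"):
    usd, pct = delta(name)
    print(f"{name:9s} {usd:+.2f} USD ({pct:+.1f}%)")
print(f"cc1h      {ratio('cc1h')}x")
```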

The Manager cost regression — and why it's fixable

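Manager cost more than tripled ($8.10 to $25.42). The Manager ran on Opus this pilot, which amplified the increase, but the breakdown identified TWO root causes in the shape of the conversation itself: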

Root cause 1 — multi-paragraph Tester returns landing in conversation cache

Each of the 8 Testers returned a roughly 500-word findings summary (anchor verdicts, problem list, file paths). Those summaries land in the Manager's conversation context and get cached at the 1-hour TTL. With 8 Testers + planner + recon, that is ~10 long messages × ~500 words ≈ 5K cacheable words. Re-cache cycles as the conversation grew then multiplied the cache-creation cost, consistent with the ~18× cc1h jump.

Fix shipped: f395200 — Tester role spec amended to require ONE-LINE returns (status=<status> report=<absolute-path>). All findings stay in report.json on disk; Manager reads final-report.md (produced deterministically by aggregate-reports.mjs) instead of accumulating per-Tester summaries.
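To make the one-line contract concrete, here is a minimal sketch of how a Manager-side check could parse and validate such a return. Only the `status=<status> report=<absolute-path>` shape comes from the fix description; the regex, the `TesterReturn` type, the function name, and the example path are assumptions for illustration, not the harness's actual code.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath

# Assumed grammar: exactly "status=<status> report=<absolute-path>" on one line.
_RETURN_RE = re.compile(r"status=(?P<status>\S+) report=(?P<report>\S.*)")


@dataclass(frozen=True)
class TesterReturn:
    status: str
    report: PurePosixPath


def parse_tester_return(text: str) -> TesterReturn:
    """Parse a one-line Tester return; reject multi-line or malformed text."""
    stripped = text.strip()
    if "\n" in stripped:
        raise ValueError("Tester return must be a single line")
    match = _RETURN_RE.fullmatch(stripped)
    if match is None:
        raise ValueError(f"unrecognized Tester return: {stripped!r}")
    report = PurePosixPath(match["report"])
    if not report.is_absolute():
        raise ValueError("report path must be absolute")
    return TesterReturn(match["status"], report)


# Hypothetical example path; real runs would point at the charter's report.json.
result = parse_tester_return("status=complete report=/work/runs/example/report.json")
print(result.status, result.report.name)
```

Rejecting anything longer than one line is the point of the check: a multi-paragraph summary fails fast instead of silently growing the Manager's cached context.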

Root cause 2 — Manager added belt-and-suspenders driver guidance to dispatch prompts

Looking at the Manager's actual dispatch prompts, each one had inline notes like "Driver is playwright-cli-headed — write a Playwright spec, not MCP calls. Run scripts/playwright-cli-init.sh after provisioning. See skills/browser-driver-playwright-cli/SKILL.md." That overhead × 8 dispatches inflated the conversation. The Tester role spec already auto-loads the driver skill based on the DRIVER: line — the Manager doesn't need to repeat any of it.

Fix shipped: f395200, which amends .claude/commands/test-plugin.md Phase 4 with an explicit "no driver-specific addenda" rule and an anti-pattern table. The dispatch prompt has the SAME shape regardless of driver; only the DRIVER: line differs.
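The "same shape, only DRIVER: differs" rule can be sketched as a small invariant check. This is an illustration under assumptions: the `CHARTER:` and `RETURN:` field names, the helper names, and the `chrome-devtools-mcp-headless` driver id are hypothetical. Only the `DRIVER:` line and the `playwright-cli-headed` id appear in the harness description.

```python
def build_dispatch_prompt(charter_slug: str, driver: str) -> str:
    """Build a dispatch prompt whose shape never depends on the driver."""
    # Field names other than DRIVER: are hypothetical placeholders.
    return "\n".join([
        f"CHARTER: {charter_slug}",
        f"DRIVER: {driver}",
        "RETURN: status=<status> report=<absolute-path>",
    ])


def differing_lines(a: str, b: str) -> list[tuple[str, str]]:
    """Return the line pairs that differ between two prompts of equal length."""
    lines_a, lines_b = a.splitlines(), b.splitlines()
    if len(lines_a) != len(lines_b):
        raise ValueError("prompts have different shapes")
    return [(x, y) for x, y in zip(lines_a, lines_b) if x != y]


cli = build_dispatch_prompt("backup-zip-artifact-andlist", "playwright-cli-headed")
# Hypothetical MCP driver id, used only to show the comparison.
mcp = build_dispatch_prompt("backup-zip-artifact-andlist", "chrome-devtools-mcp-headless")
print(differing_lines(cli, mcp))
```

A check like this would flag any inline driver note, because an extra line changes the prompt's shape.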

And: the custom spec-file approach was also a mismatch

The first Pilot 14 implementation had Testers writing probes/charter.spec.ts files and running npx playwright test. After the run, the user pointed at Microsoft's bundled @playwright/cli skill (microsoft/playwright-cli repo, skills/playwright-cli/SKILL.md) — an agentic shell-CLI tool designed for AI agents to drive the browser one command at a time, not write test files.

Fix shipped: f395200 replaces the custom skill with a thin Magellan bridge over the official skill. scripts/playwright-cli-init.sh becomes a deprecation pointer; scripts/playwright-cli-config.sh writes a per-session config with the same dialog-suppression flags as the MCP drivers. The Tester's workflow is now: playwright-cli -s=<charter-slug> open <url> → snapshot returns refs (e1, e2, ...) → click e5 → repeat. Same agentic loop as MCP, but over shell commands.

Per-charter PQIP

| Charter | Priority | Type | Bugs | Status |
| --- | --- | --- | --- | --- |
| backup-zip-artifact-andlist | critical | andlist | 6 | complete |
| selective-export-artifact-andlist | critical | andlist | 5 | complete |
| restore-destructive-andlist | critical | andlist | 6 | complete |
| db-dump-scale-andlist | high | andlist | 3 | complete |
| manual-cron-concurrent-collision | high | cross-feature | 2 | complete |
| schedule-cookie-cluster | high | hypothesis-cluster | 4 | complete |
| delete-deactivation-cluster | medium | hypothesis-cluster | 3 | complete |
| breadth-tour | medium | breadth | 6 | complete |
| Totals | | | 35 | 8/8 complete |

10/10 recall. Both Step 8.5 (multi-surface) and Step 8.6 (concurrent-trigger) amendments fired correctly.

Token routing — exact

| Component | Model | Cost |
| --- | --- | --- |
| Manager (during run) | Opus 4.7 | $25.42 |
| Planner Phase 1.5 + Phase 3 | Opus 4.7 | $7.22 |
| Recon Tester + 7 wave Testers | Sonnet 4.6 | $20.43 |
| Total | | $53.06 |

The Manager was on Opus this pilot (user had switched models post-Pilot 13) which amplified the cache-creation regression. With the post-Pilot-14 fixes (file-backed Tester returns + dispatch discipline) and a Sonnet Manager, the same workload should project to ~$10–12 Manager cost (vs Pilot 13's $8.10 baseline).

Honest research finding — the driver tradeoff

Playwright CLI is a real third option, not a clear winner over Chrome DevTools MCP. After three pilots:

| Aspect | Verdict |
| --- | --- |
| Recall | CLI wins (10/10 vs 9/10), but most of the win is the canonical-anchor fix, not the driver itself |
| Cost (this pilot) | MCP wins (−$15), but Manager-side overhead, fixable by f395200 |
| Wallclock | CLI wins (−4 min): true per-process isolation |
| Reproducibility | CLI wins: the bundled tool's session-state save/load is regression-friendly |
| Operator visibility (headed) | CLI wins: operator could watch the actual flows |
| Effort to build | MCP wins: already mature; CLI driver is a new tier |

My read: Playwright CLI is better for batched deterministic charters (andlist with pre-known a1–a6 anchors, cross-feature with scripted seam probes, hypothesis-cluster with state-shape verifications). MCP stays better for breadth + recon (interactive look-think-act). The right routing is per-charter — the planner picks the driver based on charter shape; a single wave can mix drivers. skills/browser-driver/SKILL.md "When to use which driver" matrix codifies this.
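As a rough illustration of per-charter routing, the sketch below maps each charter type to a driver family and applies it to the Pilot 14 charters. The type-to-driver split follows the read above; the `pick_driver` function and the "cli"/"mcp" labels are hypothetical, and the real planner logic may weigh more than charter type.

```python
# Charter types assigned to each driver family, per the read above.
CLI_TYPES = {"andlist", "cross-feature", "hypothesis-cluster"}
MCP_TYPES = {"breadth", "recon"}


def pick_driver(charter_type: str) -> str:
    """Map a charter type to a driver family ("cli" or "mcp")."""
    if charter_type in CLI_TYPES:
        return "cli"
    if charter_type in MCP_TYPES:
        return "mcp"
    raise ValueError(f"no routing rule for charter type {charter_type!r}")


# Charter slugs and types copied from the per-charter table.
CHARTERS = {
    "backup-zip-artifact-andlist": "andlist",
    "selective-export-artifact-andlist": "andlist",
    "restore-destructive-andlist": "andlist",
    "db-dump-scale-andlist": "andlist",
    "manual-cron-concurrent-collision": "cross-feature",
    "schedule-cookie-cluster": "hypothesis-cluster",
    "delete-deactivation-cluster": "hypothesis-cluster",
    "breadth-tour": "breadth",
}

routing = {slug: pick_driver(kind) for slug, kind in CHARTERS.items()}
for slug, driver in routing.items():
    print(f"{slug:36s} -> {driver}")
```

Under this rule, a wave over the Pilot 14 charters would mix drivers: seven charters go to the CLI and breadth-tour stays on MCP.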

Cross-pilot arc

| Pilot | Driver | Mgr model | Recall | Cost | Notes |
| --- | --- | --- | --- | --- | --- |
| 11 (backups) | Chrome DevTools headed | Opus | 10/10 | $102.90 | Item E first run |
| 12 (backups, Sonnet-Mgr first run) | Chrome DevTools headed | Sonnet+Opus-planner | 8/10 | $45.30 | −56% cost; 2 charter-design misses |
| 13 (backups, T1 + 8 fixes validated) | Chrome DevTools headless | Sonnet+Opus-planner | 9/10 | $38.47 | Pilot 12 misses recovered; a-anchor drift miss |
| 14 (backups, playwright-cli-headed) | Playwright CLI (headed) | Opus + Opus-planner | 10/10 | $53.06 | First clean recall; Mgr cost regression (cc1h) traced to Tester returns + dispatch addenda, fixed in f395200 |

What Pilot 14 validated

  • Canonical a-anchors recover Issue 5 without re-introducing Pilot 12's misses. 10/10 reachable under Sonnet-default architecture.
  • Playwright CLI is viable as a third driver — wallclock benefit is real, recall held, reproducibility win is meaningful.
  • The cc1h cache jump is diagnosable and fixable — not a fundamental property of the driver, just a Tester-return + dispatch-prompt shape problem. Fixed in f395200.

What Pilot 14 surfaced for the next round

The retro fixes in f395200 won't be measured until Pilot 15. The pre-locked targets:

  • Recall: 10/10
  • Total cost: ≤ Pilot 13's $38.47 (i.e., Pilot 14 wallclock benefit + Pilot 13 cost discipline)
  • Manager cc1h: back to Pilot 13 levels (~$0.50)
  • Tester return shape: one-line per Tester (no multi-paragraph summaries in conversation context)
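
The pre-locked targets can be expressed as a simple pass/fail check. This is a sketch under assumptions: the `RunMetrics` fields and function name are hypothetical, and the cc1h ceiling uses Pilot 13's $0.58 as a stand-in for "~$0.50". Feeding it Pilot 14's figures shows which targets Pilot 15 still has to hit.

```python
from dataclasses import dataclass

PILOT13_TOTAL_USD = 38.47  # cost ceiling from the targets list
PILOT13_CC1H_USD = 0.58    # assumed ceiling for "back to Pilot 13 levels"


@dataclass(frozen=True)
class RunMetrics:
    recall_found: int
    recall_total: int
    total_cost_usd: float
    manager_cc1h_usd: float
    one_line_returns: bool  # True if every Tester return was a single line


def failed_targets(run: RunMetrics) -> list[str]:
    """Return the names of the pre-locked targets that a run misses."""
    failures = []
    if run.recall_found < run.recall_total:
        failures.append("recall")
    if run.total_cost_usd > PILOT13_TOTAL_USD:
        failures.append("total_cost")
    if run.manager_cc1h_usd > PILOT13_CC1H_USD:
        failures.append("manager_cc1h")
    if not run.one_line_returns:
        failures.append("tester_return_shape")
    return failures


# Pilot 14 figures from this report; its Tester returns were multi-paragraph.
pilot14 = RunMetrics(10, 10, 53.06, 10.43, one_line_returns=False)
print(failed_targets(pilot14))
```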

Files:

  • Final report: runs/2026-04-27T14-14-00_magellan-backups/final-report.md
  • Token usage: runs/2026-04-27T14-14-00_magellan-backups/token-usage.json
  • Manifest: runs/2026-04-27T14-14-00_magellan-backups/manifest.json

Related commits:

  • da895da — canonical artifact a-anchors (Pilot 13 retro)
  • f395200 — Pilot 14 retro: switch to official @playwright/cli + tighten Manager↔Tester contract