Magellan Pilot 14 — magellan-backups (first playwright-cli-headed pilot, 10/10 recall, but Manager cost regressed via cache-creation jump)
Run ID: 2026-04-27T14-14-00_magellan-backups
Plugin: magellan-backups (same regression-test plugin; ISSUES.md stripped — blind greybox)
Kind: plugin
Ecosystem: core
Driver: Playwright CLI (headed) — first pilot of the third browser-driver tier (custom spec-file approach; superseded by f395200 which switched to Microsoft's official @playwright/cli)
Wallclock: 33 min (best of the series)
Total cost: $53.06 (vs Pilot 13 $38.47 — +38%)
Recall: 10/10 (first clean recall under the new architecture)
Pilot 13 confirmed Chrome DevTools MCP and Playwright MCP both work, but both share a single MCP server process per registered name — every Tester's tool call queues through one Node process. The hypothesis: a per-Tester process model would give true parallelism + reproducible artifacts + no MCP namespace bookkeeping.
Pilot 14 was the first head-to-head of that hypothesis. Same plugin, same charters, same conditions as Pilot 13. Only the driver changed.
- da895da: Canonical artifact a-anchors a1–a6 codified (the Pilot 13 retro fix). The planner uses fixed definitions for the artifact AND-list anchors:

  | Anchor | Definition |
  |---|---|
  | a1 | Location |
  | a2 | Naming |
  | a3 | Contents (leakage AND omissions) |
  | a4 | Lifecycle |
  | a5 | Default blast radius |
  | a6 | Completeness against UI claim |
- A new playwright-cli-headed / playwright-cli-headless driver tier added to the harness: skills/browser-driver-playwright-cli/SKILL.md, scripts/playwright-cli-init.sh, driver-list updates in skills/browser-driver/SKILL.md, .claude/agents/tester.md, and AGENTS.md. Note: this initial implementation used a custom spec-file approach (probes/charter.spec.ts + npx playwright test). The Pilot 14 retro replaced it with Microsoft's bundled @playwright/cli skill; see f395200.
| Metric | Pilot 13 (Chrome DevTools MCP, headless) | Pilot 14 (Playwright CLI, headed) | Δ |
|---|---|---|---|
| Recall | 9/10 | 10/10 | +1 (Issue 5 recovered via canonical a6) |
| Total cost | $38.47 | $53.06 | +$14.59 (+38%) |
| Manager cost | $8.10 (33 msgs) | $25.42 (58 msgs) | +$17.32 (+213%) |
| Subagent cost | $30.37 | $27.64 | −$2.73 (−9%) |
| cc1h (Manager 1h cache creation) | $0.58 | $10.43 | +18× ⚠ |
| Wallclock | 37 min | 33 min | −4 min ✓ |
| Schema validation failures | 0 | 0 | held |
| CWD misroutes | 0 | 0 | held |
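
The cost rows in the table can be cross-checked with a few lines of arithmetic. The sketch below only re-derives the deltas from the figures already printed above; the dictionary keys and helper names are illustrative, not part of the harness.

```python
# Figures copied from the comparison table; nothing here is measured anew.
pilot13 = {"total": 38.47, "manager": 8.10, "subagent": 30.37, "cc1h": 0.58}
pilot14 = {"total": 53.06, "manager": 25.42, "subagent": 27.64, "cc1h": 10.43}


def usd_delta(metric: str) -> float:
    """Absolute change in USD from Pilot 13 to Pilot 14."""
    return round(pilot14[metric] - pilot13[metric], 2)


def ratio(metric: str) -> float:
    """Pilot 14 value expressed as a multiple of the Pilot 13 value."""
    return pilot14[metric] / pilot13[metric]


def components_match(pilot: dict) -> bool:
    """Manager + subagent cost should add up to the reported total."""
    return abs(pilot["manager"] + pilot["subagent"] - pilot["total"]) < 0.005


for metric in pilot13:
    print(f"{metric:9s} {usd_delta(metric):+7.2f} USD  {ratio(metric):.2f}x")

print("components add up:", components_match(pilot13), components_match(pilot14))
```

The Manager and subagent rows sum to the total in both pilots, and the cc1h ratio comes out at roughly 18, matching the table.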
Recall recovered to 10/10 (Issue 5 caught via the canonical a6 = completeness against UI claim anchor — the Tester correctly observed that themes/* + plugins/* were in the ZIP but uploads/ was absent despite the UI claiming "full backup of wp-content"). Both Pilot 12 misses (Issues 4 and 10) stayed caught via the multi-surface and concurrent-trigger rules.
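
The a6 check that recovered Issue 5 amounts to comparing what the archive contains against what the UI claims. A minimal sketch of that comparison follows, assuming the ZIP is rooted at wp-content/ and that "full backup of wp-content" implies at least themes, plugins, and uploads; both the archive layout and the claimed set are assumptions for illustration, and the sample archive is synthetic.

```python
import io
import zipfile


def top_level_dirs(zip_bytes: bytes, root: str = "wp-content/") -> set[str]:
    """Collect the first directory segment under `root` for every archive entry."""
    found = set()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.startswith(root):
                continue
            rest = name[len(root):]
            segment = rest.split("/", 1)[0]
            # Only count entries that live inside a subdirectory of root.
            if segment and "/" in rest:
                found.add(segment)
    return found


def build_sample_zip() -> bytes:
    """Synthetic archive mirroring the observed shape: themes and plugins only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("wp-content/themes/example/style.css", "/* theme */")
        archive.writestr("wp-content/plugins/example/plugin.php", "<?php")
    return buf.getvalue()


# Assumed reading of the UI claim "full backup of wp-content".
claimed = {"themes", "plugins", "uploads"}
missing = sorted(claimed - top_level_dirs(build_sample_zip()))
print(missing)  # ['uploads']
```

Any non-empty result is an omission against the UI claim, which is the shape of finding the a6 anchor is meant to surface.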
Manager cost roughly tripled, from $8.10 to $25.42. Note that the Manager ran on Opus this pilot, not Sonnet, which amplified the regression. The breakdown identified two root causes:
Root cause 1 (verbose Tester returns): each of the 8 Testers returned a ~500-word findings summary (anchor verdicts, problem list, file paths). Those summaries land in the Manager's conversation context and are cached at the 1-hour TTL. With 8 Testers + planner + recon, that is ~10 long messages × ~500 words ≈ 5K cacheable words. Re-cache cycles as the conversation grew pushed Manager cache-creation cost (cc1h) to ~18× its Pilot 13 level.
Fix shipped: f395200 — Tester role spec amended to require ONE-LINE returns (status=<status> report=<absolute-path>). All findings stay in report.json on disk; Manager reads final-report.md (produced deterministically by aggregate-reports.mjs) instead of accumulating per-Tester summaries.
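
On the Manager side, the one-line contract can be enforced with a strict parser. The sketch below assumes the return is exactly `status=<status> report=<absolute-path>` with a POSIX-style absolute path; the class name, function name, and example path are hypothetical and not taken from the harness.

```python
import re
from dataclasses import dataclass

# Assumed grammar: a status token without whitespace, then a POSIX absolute path.
RETURN_LINE = re.compile(r"^status=(?P<status>\S+) report=(?P<report>/.+)$")


@dataclass(frozen=True)
class TesterReturn:
    status: str
    report: str


def parse_tester_return(text: str) -> TesterReturn:
    """Accept exactly one `status=... report=...` line; reject anything longer."""
    stripped = text.strip()
    if "\n" in stripped:
        raise ValueError("Tester return must be a single line")
    match = RETURN_LINE.match(stripped)
    if match is None:
        raise ValueError(f"unexpected Tester return: {stripped!r}")
    return TesterReturn(match["status"], match["report"])


# Hypothetical report path, for illustration only.
result = parse_tester_return("status=complete report=/tmp/example-run/report.json\n")
print(result.status, result.report)
```

Rejecting multi-line returns outright keeps findings on disk and out of the Manager's cached context, which is the point of the fix.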
Root cause 2 (driver-specific dispatch addenda): looking at the Manager's actual dispatch prompts, each one had inline notes like "Driver is playwright-cli-headed — write a Playwright spec, not MCP calls. Run scripts/playwright-cli-init.sh after provisioning. See skills/browser-driver-playwright-cli/SKILL.md." That overhead × 8 dispatches inflated the conversation. The Tester role spec already auto-loads the driver skill based on the DRIVER: line, so the Manager doesn't need to repeat any of it.
Fix shipped: f395200 — .claude/commands/test-plugin.md Phase 4 amended with explicit "no driver-specific addenda" rule + anti-pattern table. The dispatch prompt is the SAME shape regardless of driver; only DRIVER: differs.
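
The "same shape regardless of driver" rule can be illustrated with a fixed template in which the driver is just one field. Only the `DRIVER:` line is taken from the text above; the `CHARTER:` and `RUN_DIR:` fields, the function names, and the example values are hypothetical placeholders rather than the actual dispatch format.

```python
# Hypothetical template: only the DRIVER: line is documented in the retro.
DISPATCH_TEMPLATE = (
    "CHARTER: {charter}\n"
    "DRIVER: {driver}\n"
    "RUN_DIR: {run_dir}\n"
)


def build_dispatch_prompt(charter: str, driver: str, run_dir: str) -> str:
    """Render a dispatch prompt with no driver-specific addenda."""
    return DISPATCH_TEMPLATE.format(charter=charter, driver=driver, run_dir=run_dir)


def differing_lines(a: str, b: str) -> list[tuple[str, str]]:
    """Return the line pairs that differ between two prompts of equal shape."""
    return [(x, y) for x, y in zip(a.splitlines(), b.splitlines()) if x != y]


headed = build_dispatch_prompt("breadth-tour", "playwright-cli-headed", "/tmp/example-run")
headless = build_dispatch_prompt("breadth-tour", "playwright-cli-headless", "/tmp/example-run")

for old, new in differing_lines(headed, headless):
    print(f"{old}  ->  {new}")
```

If switching drivers changes anything other than the `DRIVER:` line, the prompt has picked up an addendum that belongs in the driver skill instead.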
The first Pilot 14 implementation had Testers writing probes/charter.spec.ts files and running npx playwright test. After the run, the user pointed at Microsoft's bundled @playwright/cli skill (microsoft/playwright-cli repo, skills/playwright-cli/SKILL.md) — an agentic shell-CLI tool designed for AI agents to drive the browser one command at a time, not write test files.
Fix shipped: f395200 replaces the custom skill with a thin Magellan bridge over the official skill. scripts/playwright-cli-init.sh becomes a deprecation pointer; scripts/playwright-cli-config.sh writes a per-session config with the same dialog-suppression flags as the MCP drivers. The Tester's workflow is now: playwright-cli -s=<charter-slug> open <url> → snapshot returns refs (e1, e2, ...) → click e5 → repeat. Same agentic loop as MCP, but over shell commands.
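
The open, snapshot, click loop can be sketched as a sequence of shell invocations. The snippet below only composes the command lines and does not run them; it uses the subcommands named above and assumes the `-s=<charter-slug>` session flag is repeated on every call, which the retro does not state explicitly. The URL and the `cli_command` helper are hypothetical.

```python
import shlex


def cli_command(session: str, *args: str) -> list[str]:
    """Build one playwright-cli invocation bound to a per-charter session."""
    # Assumption: the session flag is passed on each command, not only on `open`.
    return ["playwright-cli", f"-s={session}", *args]


slug = "breadth-tour"
url = "http://localhost:8080/wp-admin/"  # hypothetical target URL

steps = [
    cli_command(slug, "open", url),    # start the session on the target page
    cli_command(slug, "snapshot"),     # returns element refs such as e1, e2, ...
    cli_command(slug, "click", "e5"),  # act on a ref taken from the snapshot
]

for argv in steps:
    print(shlex.join(argv))
```

In a real Tester the ref passed to `click` would come from parsing the preceding snapshot output rather than being hard-coded.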
| Charter | Priority | Type | Bugs | Status |
|---|---|---|---|---|
| backup-zip-artifact-andlist | critical | andlist | 6 | complete |
| selective-export-artifact-andlist | critical | andlist | 5 | complete |
| restore-destructive-andlist | critical | andlist | 6 | complete |
| db-dump-scale-andlist | high | andlist | 3 | complete |
| manual-cron-concurrent-collision | high | cross-feature | 2 | complete |
| schedule-cookie-cluster | high | hypothesis-cluster | 4 | complete |
| delete-deactivation-cluster | medium | hypothesis-cluster | 3 | complete |
| breadth-tour | medium | breadth | 6 | complete |
| Totals | | | 35 | 8/8 complete |
10/10 recall. Both Step 8.5 (multi-surface) and Step 8.6 (concurrent-trigger) amendments fired correctly.
| Component | Model | Cost |
|---|---|---|
| Manager (during run) | Opus 4.7 | $25.42 |
| Planner Phase 1.5 + Phase 3 | Opus 4.7 | $7.22 |
| Recon Tester + 7 wave Testers | Sonnet 4.6 | $20.43 |
| Total | | $53.06 |
The Manager was on Opus this pilot (user had switched models post-Pilot 13) which amplified the cache-creation regression. With the post-Pilot-14 fixes (file-backed Tester returns + dispatch discipline) and a Sonnet Manager, the same workload should project to ~$10–12 Manager cost (vs Pilot 13's $8.10 baseline).
Playwright CLI is a real third option, not a clear winner over Chrome DevTools MCP. After three pilots:
| Aspect | Verdict |
|---|---|
| Recall | CLI wins (10/10 vs 9/10) — but most of the win is the canonical-anchor fix, not the driver itself |
| Cost (this pilot) | MCP wins (−$15) — but Manager-side overhead, fixable by f395200 |
| Wallclock | CLI wins (−4 min) — true per-process isolation |
| Reproducibility | CLI wins — the bundled tool's session-state save/load is regression-friendly |
| Operator visibility (headed) | CLI wins — operator could watch the actual flows |
| Effort to build | MCP wins — already mature; CLI driver is a new tier |
My read: Playwright CLI is better for batched deterministic charters (andlist with pre-known a1–a6 anchors, cross-feature with scripted seam probes, hypothesis-cluster with state-shape verifications). MCP stays better for breadth + recon (interactive look-think-act). The right routing is per-charter — the planner picks the driver based on charter shape; a single wave can mix drivers. skills/browser-driver/SKILL.md "When to use which driver" matrix codifies this.
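
Per-charter routing of this kind reduces to a lookup on charter type. The sketch below applies the split described above to the eight Pilot 14 charters; the driver labels "playwright-cli" and "mcp" and the function name are placeholders, and the actual matrix lives in skills/browser-driver/SKILL.md.

```python
from collections import Counter

# Charter types routed to each driver family, per the read above.
CLI_TYPES = {"andlist", "cross-feature", "hypothesis-cluster"}
MCP_TYPES = {"breadth", "recon"}


def pick_driver(charter_type: str) -> str:
    """Map a charter type to a driver family (labels are placeholders)."""
    if charter_type in CLI_TYPES:
        return "playwright-cli"
    if charter_type in MCP_TYPES:
        return "mcp"
    raise ValueError(f"no routing rule for charter type {charter_type!r}")


# (charter, type) pairs copied from the Pilot 14 charter table.
charters = [
    ("backup-zip-artifact-andlist", "andlist"),
    ("selective-export-artifact-andlist", "andlist"),
    ("restore-destructive-andlist", "andlist"),
    ("db-dump-scale-andlist", "andlist"),
    ("manual-cron-concurrent-collision", "cross-feature"),
    ("schedule-cookie-cluster", "hypothesis-cluster"),
    ("delete-deactivation-cluster", "hypothesis-cluster"),
    ("breadth-tour", "breadth"),
]

counts = Counter(pick_driver(kind) for _, kind in charters)
print(dict(counts))  # {'playwright-cli': 7, 'mcp': 1}
```

Under this rule the Pilot 14 wave would have mixed drivers: seven charters on the CLI tier and the breadth tour on MCP.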
| Pilot | Driver | Mgr model | Recall | Cost | Notes |
|---|---|---|---|---|---|
| 11 (backups) | Chrome DevTools headed | Opus | 10/10 | $102.90 | Item E first run |
| 12 (backups, Sonnet-Mgr first run) | Chrome DevTools headed | Sonnet+Opus-planner | 8/10 | $45.30 | −56% cost; 2 charter-design misses |
| 13 (backups, T1 + 8 fixes validated) | Chrome DevTools headless | Sonnet+Opus-planner | 9/10 | $38.47 | Pilot 12 misses recovered; a-anchor drift miss |
| 14 (backups, playwright-cli-headed) | Playwright CLI (headed) | Opus + Opus-planner | 10/10 | $53.06 | First clean recall; Mgr cost regression (cc1h) traced to Tester returns + dispatch addenda — fixed in f395200 |
- Canonical a-anchors recover Issue 5 without re-introducing Pilot 12's misses. 10/10 reachable under Sonnet-default architecture.
- Playwright CLI is viable as a third driver — wallclock benefit is real, recall held, reproducibility win is meaningful.
- The cc1h cache jump is diagnosable and fixable: it is not a fundamental property of the driver, just a Tester-return + dispatch-prompt shape problem. Fixed in f395200.
The retro fixes in f395200 won't be measured until Pilot 15. The pre-locked targets:
- Recall: 10/10
- Total cost: ≤ Pilot 13's $38.47 (i.e., Pilot 14 wallclock benefit + Pilot 13 cost discipline)
- Manager cc1h: back to Pilot 13 levels (~$0.50)
- Tester return shape: one-line per Tester (no multi-paragraph summaries in conversation context)
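
The pre-locked targets can be written down as a pass/fail check before Pilot 15 runs. This is a sketch only: the `PilotResult` fields are hypothetical, and the cc1h ceiling is assumed to be Pilot 13's measured $0.58, since the target above is stated only approximately. It is exercised here against the Pilot 14 figures, with placeholder return strings standing in for the multi-paragraph summaries.

```python
from dataclasses import dataclass

PILOT13_TOTAL_USD = 38.47
CC1H_CEILING_USD = 0.58  # assumption: "Pilot 13 levels" read as Pilot 13's measured value


@dataclass(frozen=True)
class PilotResult:
    recall_hits: int
    recall_total: int
    total_cost_usd: float
    manager_cc1h_usd: float
    tester_returns: list[str]


def check_targets(result: PilotResult) -> dict[str, bool]:
    """Evaluate each pre-locked target independently."""
    return {
        "recall": result.recall_hits == result.recall_total == 10,
        "total_cost": result.total_cost_usd <= PILOT13_TOTAL_USD,
        "manager_cc1h": result.manager_cc1h_usd <= CC1H_CEILING_USD,
        "one_line_returns": all("\n" not in r.strip() for r in result.tester_returns),
    }


# Pilot 14 figures from this report; the return text is a placeholder.
pilot14_result = PilotResult(
    recall_hits=10,
    recall_total=10,
    total_cost_usd=53.06,
    manager_cc1h_usd=10.43,
    tester_returns=["Summary paragraph one.\n\nSummary paragraph two."] * 8,
)

verdicts = check_targets(pilot14_result)
print(verdicts)
```

Run against Pilot 14, only the recall target passes, which matches the regression described in this report.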
Files:
- Final report: runs/2026-04-27T14-14-00_magellan-backups/final-report.md
- Token usage: runs/2026-04-27T14-14-00_magellan-backups/token-usage.json
- Manifest: runs/2026-04-27T14-14-00_magellan-backups/manifest.json
Related commits:
- da895da: canonical artifact a-anchors (Pilot 13 retro)
- f395200: Pilot 14 retro: switch to official @playwright/cli + tighten Manager↔Tester contract