Magellan Pilot 13 — magellan-backups (T1 + 8 harness fixes validated; recall recovers Issues 4 & 10 but loses Issue 5 to a-anchor drift)
Run ID: 2026-04-27T12-28-35_magellan-backups
Plugin: magellan-backups (same regression-test plugin; ISSUES.md stripped — blind greybox)
Kind: plugin
Ecosystem: core
Driver: Chrome DevTools MCP (headless — speed comparison vs Pilot 12's headed run)
Wallclock: 37 min
Total cost: $38.47 (vs Pilot 12 $45.30 — −15%)
Recall: 9/10 (recovered Issues 4 + 10 from Pilot 12; lost Issue 5 — different miss class)
This pilot validated everything that landed between Pilot 12 and Pilot 13:
- Studio port-collision lock (mkdir-based, portable)
- Phase 5 enforcement (mandatory `capture-run-tokens.mjs` + `aggregate-reports.mjs`)
- Shell-generated `completed_at` timestamp
- PQIP key shape doc on the schema (`pqip.{problems,questions,improvements,praises}`)
- AGENTS.md MCP serialization caveat
- Multi-surface artifact rule (Pilot 12 Issue 4 fix): the `a3` probe MUST run on every artifact-producing surface
- Concurrent-trigger seam rule (Pilot 12 Issue 10 fix): manual + cron writing to deterministic shared resource = cross-feature charter
- Recon reframed as scout (briefing-shaped output, not bug hunt)
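
The mkdir-based port-collision lock relies on directory creation being atomic: exactly one caller succeeds, and everyone else sees "already exists". The harness's actual lock lives in its provisioning scripts and is not reproduced here; the following is only a minimal Python sketch of the same technique. The names `dir_lock` and `studio-port.lock`, the timeout, and the poll interval are all assumptions made for illustration.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def dir_lock(path, timeout=30.0, poll=0.1):
    """Hold an exclusive lock by creating a directory at `path`.

    os.mkdir either creates the directory or raises FileExistsError,
    so only one process at a time can get past the loop.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            os.mkdir(path)
            break  # we own the lock
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {path} within {timeout}s")
            time.sleep(poll)
    try:
        yield
    finally:
        os.rmdir(path)  # release the lock


# Usage: serialize the port-sensitive part of provisioning.
with tempfile.TemporaryDirectory() as tmp:
    lock_path = os.path.join(tmp, "studio-port.lock")  # hypothetical lock name
    with dir_lock(lock_path):
        pass  # pick a free port and start the site here
```

A sketch like this does not handle stale locks left behind by a crashed holder; whether the real lock does is not stated in this report.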
T1 targets the 30-40 mechanical Manager turns observed in Pilot 12:
- `scripts/list-site-meta.sh`: bulk site URL/path emit (1 call vs N)
- `scripts/provision-charters.sh`: bulk parallel provisioning (uses new lock)
- `scripts/teardown-all-sites.sh`: bulk teardown
- `scripts/generate-charter-files.mjs` + `schemas/charter-set.schema.json`: planner writes ONE structured JSON; script renders 8 markdown files deterministically. Reduces planner-opus from 8 file writes to 1.
- Slimmer Tester dispatch prompts (~12 lines vs 60+)
- Manager-side rule: don't re-read charter files (Tester reads its own)
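
The "one structured JSON in, N markdown files out" step can be pictured as a pure function from the charter set to a filename-to-content map. The real renderer is `scripts/generate-charter-files.mjs` and the real shape is defined by `schemas/charter-set.schema.json`; neither is shown in this report, so every field name below (`charters`, `name`, `priority`, `type`, `turn_budget`, `anchors`) and the markdown layout are assumptions. This is a hedged sketch in Python, not the harness code.

```python
import json


def render_charter(charter):
    """Render one charter dict to markdown text (layout is hypothetical)."""
    lines = [
        f"# Charter: {charter['name']}",
        "",
        f"- Priority: {charter['priority']}",
        f"- Type: {charter['type']}",
        f"- Turn budget: {charter['turn_budget']}",
        "",
        "## Anchors",
        "",
    ]
    lines += [f"- {anchor}" for anchor in charter["anchors"]]
    return "\n".join(lines) + "\n"


def render_charter_set(charter_set):
    """Map a charter set to {filename: markdown}, ordered by charter name.

    Sorting by name and using no timestamps or random values keeps the
    output identical for identical input.
    """
    ordered = sorted(charter_set["charters"], key=lambda c: c["name"])
    return {f"{c['name']}.md": render_charter(c) for c in ordered}


# Example input using two charter names from this run; anchors are illustrative.
charter_set = json.loads("""
{"charters": [
  {"name": "restore-destructive-andlist", "priority": "critical",
   "type": "andlist", "turn_budget": 12, "anchors": ["a1", "a2"]},
  {"name": "backup-artifact-andlist", "priority": "critical",
   "type": "andlist", "turn_budget": 12, "anchors": ["a1", "a6"]}
]}
""")
files = render_charter_set(charter_set)
```

The design point is that the planner's only write is the JSON; everything after that is mechanical and repeatable.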
| Metric | Pilot 12 | Pilot 13 | Δ |
|---|---|---|---|
| Recall | 8/10 | 9/10 | +1 (recovered Issues 4 + 10; lost Issue 5) |
| Total cost | $45.30 | $38.47 | −15% |
| Manager cost | $11.31 | $8.10 | −$3.21 (−28%) |
| Subagent cost | $33.99 | $30.37 | −$3.62 (−11%) |
| Schema validation failures | 0 | 0 | held |
| CWD misroutes | 0 | 0 | held |
| Provisioning failures | 3/7 (port race) | 0 | fixed by lock |
| Wallclock | 44 min | 37 min | −7 min |
| Charters | 7 | 8 | +1 (multi-surface rule split off export-artifact-andlist) |
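
The cost deltas in the table can be re-derived from the two per-pilot figures. The short Python check below uses only numbers from the table; the helper name `delta` is introduced here for illustration.

```python
def delta(before, after):
    """Return (absolute change, percent change rounded to a whole percent)."""
    diff = after - before
    return diff, round(100 * diff / before)


# (Pilot 12, Pilot 13) in USD, copied from the table above.
costs = {
    "Total cost": (45.30, 38.47),
    "Manager cost": (11.31, 8.10),
    "Subagent cost": (33.99, 30.37),
}

for label, (p12, p13) in costs.items():
    diff, pct = delta(p12, p13)
    print(f"{label}: {diff:+.2f} USD ({pct:+d}%)")

# Manager + Subagent should add up to the total in each pilot.
for pilot in (0, 1):
    parts = costs["Manager cost"][pilot] + costs["Subagent cost"][pilot]
    assert abs(parts - costs["Total cost"][pilot]) < 0.005
```

The percentages come out at −15%, −28%, and −11%, matching the Δ column, and Manager plus Subagent cost equals the total in both pilots.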
| # | Planted issue | Pilot 12 | Pilot 13 | Notes |
|---|---|---|---|---|
| 1 | Progress bar 100% | ✓ | ✓ | unchanged |
| 2 | Schedule time format mismatch | ✓ | ✓ | unchanged |
| 3 | Email option key mismatch | ✓ | ✓ | unchanged |
| 4 | User export hashed passwords | ✗ MISS | ✓ FOUND | Step 8.5 multi-surface rule fired correctly — split into export-artifact-andlist charter that probed user_pass independently |
| 5 | Uploads missing from "Full Backup" | ✓ | ✗ MISS | Pilot 13 regression — see "the one new miss class" below |
| 6 | No pre-restore backup | ✓ | ✓ | unchanged |
| 7 | Backups publicly accessible | ✓ | ✓ | unchanged |
| 8 | Corrupt restore truncates DB | ✓ | ✓ | unchanged |
| 9 | Large DB memory exhaustion | ✓ | ✓ | unchanged |
| 10 | Concurrent backups corrupt zip | ✗ MISS | ✓ FOUND | Step 8.6 concurrent-trigger rule fired — emitted concurrent-trigger-cross-feature charter; Tester confirmed via parallel studio wp eval |
Both Pilot 12 misses recovered. One new miss (Issue 5).
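
The recall figures and the recovered/lost split follow directly from the per-issue table. The sketch below treats each pilot's findings as a set of planted-issue numbers; the variable names are illustrative only.

```python
PLANTED = set(range(1, 11))  # planted issues 1..10

# Issues found per pilot, read off the table above.
found_p12 = PLANTED - {4, 10}
found_p13 = PLANTED - {5}

recovered = found_p13 - found_p12  # missed in Pilot 12, found in Pilot 13
lost = found_p12 - found_p13       # found in Pilot 12, missed in Pilot 13

print(f"Pilot 12 recall: {len(found_p12)}/{len(PLANTED)}")
print(f"Pilot 13 recall: {len(found_p13)}/{len(PLANTED)}")
print(f"recovered: {sorted(recovered)}, lost: {sorted(lost)}")
```

Tracking the two set differences separately matters here: the net recall change is only +1, which would hide the fact that one previously found issue regressed.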
Pilot 12 used a6 = completeness (caught Issue 5: "uploads missing from full backup"). Pilot 13's planner used a6 = cleanup-on-deactivation instead. The Tester verified themes/* + plugins/* were present in the ZIP but never asked "is uploads/ here?". Same code path, different anchor framing → different finding.
Root cause: the artifact a-anchors had never been formally canonicalized. Each planner improvised the names.
Fix shipped after Pilot 13 (commit da895da): codify the canonical artifact a-anchors verbatim in skills/tester-mindset/SKILL.md:
| a1 | a2 | a3 | a4 | a5 | a6 |
|---|---|---|---|---|---|
| Location | Naming | Contents (leakage AND omissions, both mandatory) | Lifecycle (retention/rotation/cleanup-on-deactivation) | Default blast radius | Completeness against UI claim |
planner-opus.md charter-sizing table now points at this section with an explicit "Pilot 13 lost recall on Issue 5 because a6 was re-mapped" note. The next planner uses these names verbatim — cleanup-on-deactivation goes back into a4 where it belongs as a lifecycle question.
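
The shipped fix is documentation: the canonical names live in `skills/tester-mindset/SKILL.md`. Purely as an illustration of what "verbatim" means, the drift could also be detected mechanically by comparing a planner's anchor names against the canonical table. The function `anchor_drift` and the idea of linting charters this way are assumptions, not something the harness is stated to do.

```python
# Canonical artifact a-anchors, copied from the table above.
CANONICAL_ARTIFACT_ANCHORS = {
    "a1": "Location",
    "a2": "Naming",
    "a3": "Contents (leakage AND omissions, both mandatory)",
    "a4": "Lifecycle (retention/rotation/cleanup-on-deactivation)",
    "a5": "Default blast radius",
    "a6": "Completeness against UI claim",
}


def anchor_drift(charter_anchors):
    """Return {slot: (expected, got)} for each slot that is missing or renamed."""
    drift = {}
    for slot, expected in CANONICAL_ARTIFACT_ANCHORS.items():
        got = charter_anchors.get(slot)
        if got is None or got.strip().casefold() != expected.casefold():
            drift[slot] = (expected, got)
    return drift


# Pilot 13's planner re-mapped a6; every other slot is canonical in this example.
pilot13 = dict(CANONICAL_ARTIFACT_ANCHORS, a6="cleanup-on-deactivation")
print(anchor_drift(pilot13))
```

With the Pilot 13 mapping, only `a6` is reported, which is the slot whose re-mapping cost Issue 5.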
| Charter | Priority | Type | Bugs | Status |
|---|---|---|---|---|
| backup-artifact-andlist | critical | andlist | 6 | complete (8/12 turns) |
| restore-destructive-andlist | critical | andlist | 7 | complete (12/12 turns) |
| db-dump-scale-andlist | high | andlist | 3 | complete (10/12 turns) |
| schedule-settings-cluster | high | hypothesis-cluster | 5 | complete (7/8 turns) |
| schedule-x-artifact-crossfeature | high | cross-feature | 4 | complete (10/10 turns) |
| full-surface-breadth | high | breadth | 5 | complete (18/30 turns) |
| ui-cross-cutting-cluster | medium | hypothesis-cluster | 4 | complete (8/8 turns) |
| Totals | | | 34 | 8/8 complete (incl. new export-artifact-andlist split) |
| Component | Model | Cost |
|---|---|---|
| Manager (during run) | Sonnet 4.6 | ~$5 |
| Manager (post-run review) | Opus 4.7 | ~$3 |
| Planner Phase 1.5 + Phase 3 | Opus 4.7 | $3.90 |
| Recon Tester + 7 wave Testers | Sonnet 4.6 | $26.48 |
| Total | | $38.47 |
T1 hit its target: ~30 mechanical Manager turns saved vs Pilot 12 (no per-charter file reads, no per-site jq calls, no per-charter teardown loops, no manual aggregation).
| Pilot | Driver | Mgr model | Recall | Cost | Notes |
|---|---|---|---|---|---|
| 11 (backups) | Chrome DevTools headed | Opus | 10/10 | $102.90 | Item E first run |
| 12 (backups, Sonnet-Mgr first run) | Chrome DevTools headed | Sonnet+Opus-planner | 8/10 | $45.30 | −56% cost; 2 charter-design misses |
| 13 (backups, T1 + 8 fixes validated) | Chrome DevTools headless | Sonnet+Opus-planner | 9/10 | $38.47 | Pilot 12 misses recovered; new a-anchor drift miss |
- T1 script-driven mechanical phases work: Manager cost down 28%, total cost down 15%
- Studio port lock works: 0 provisioning failures across 8 concurrent self-provisioning Testers
- Step 8.5 multi-surface rule fires correctly: Issue 4 recovered (was Pilot 12 miss)
- Step 8.6 concurrent-trigger rule fires correctly: Issue 10 recovered (was Pilot 12 miss)
- Headless mode saved ~7 min wallclock vs headed (37 min vs Pilot 12's 44 min)
- Improvised a-anchor names can drift between pilots → recall regression. Closed by `da895da` (canonical artifact AND-list anchors).
- The Manager STILL has surface area to slim: Pilot 14 ran with playwright-cli-headed and exposed a Manager cc1h cache regression that pointed at "Tester returns multi-paragraph summaries get cached at 1-hour TTL." That's the next layer of optimization.
Files:
- Final report: `runs/2026-04-27T12-28-35_magellan-backups/final-report.md`
- Token usage: `runs/2026-04-27T12-28-35_magellan-backups/token-usage.json`
- Manifest: `runs/2026-04-27T12-28-35_magellan-backups/manifest.json`