Magellan Pilot 16 — magellan-backups (recon-only counterfactual: what does static analysis uniquely contribute?)
Run ID: 2026-04-27T18-25-45_magellan-backups
Plugin: magellan-backups (same regression-test fixture; ISSUES.md stripped — blind greybox)
Kind: plugin
Ecosystem: core
Driver: playwright-cli-headed (matched to Pilot 15 — controlled comparison)
Manager model: Opus 4.7
Tester / planner model: Sonnet 4.6 (Testers, recon) + Opus 4.7 (planner Phase 3)
Wallclock: 43 min (18:25:45Z → 19:08:18Z)
Total cost: $72.06 (vs Pilot 15 $55.59 — +30%, inflated by methodology overhead — see below)
Recall: 9/10 (Issue 9, DB memory exhaust at scale, is the unique miss)
Pilot 16 is the recon-only counterfactual to Pilot 15. Same plugin, same MISSION, same driver, same Manager/Tester model split, same conditions in every dimension except one: Phase 1.5 (static analysis) was intentionally skipped. The planner-opus subagent was forbidden from reading any source code in Phase 3. Charter generation had to rely solely on `recon.md` + WP-core knowledge.
The point: isolate the unique contribution of source-grounded static analysis to recall. If the same charters and the same bugs surface from recon-only inputs, Phase 1.5 is redundant. If specific bug categories disappear, that's what static analysis is buying.
| Metric | Pilot 15 (with Phase 1.5) | Pilot 16 (recon-only) | Δ |
|---|---|---|---|
| Recall | 9/10 | 9/10 | 0 (different miss) |
| Issue 9 (SELECT * OOM at scale) | ✅ caught | ❌ MISS | −1 |
| Issue 3 (email option-key mismatch) | ❌ missed | ✅ caught | +1 |
| Other 8 planted bugs | all caught | all caught | held |
| Total cost | $55.59 | $72.06 | +$16.47 (+30%) |
| Manager cost | $24.57 (39 msgs) | $35.72 (88 msgs) | +$11.15 |
| Subagent cost | $31.02 (13 sessions) | $36.34 (20 sessions) | +$5.32 |
| Wallclock | 34 min | 43 min | +9 min |
| Charters | 10 | 8 | −2 |
| Total problems filed | 34 | 28 | −6 |
| Cost / planted bug caught | $6.18 | $8.01 | +30% |
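The derived rows of the table can be re-checked with a few lines of arithmetic. This is a minimal sketch that uses only the totals and caught-bug counts reported above; the variable and function names are illustrative and not part of Magellan.

```python
# Totals and planted-bug catches as reported in the comparison table.
pilots = {
    "Pilot 15": {"total_usd": 55.59, "caught": 9},
    "Pilot 16": {"total_usd": 72.06, "caught": 9},
}

def cost_per_caught_bug(pilot):
    """Total spend divided by the number of planted bugs the pilot caught."""
    return pilot["total_usd"] / pilot["caught"]

p15, p16 = pilots["Pilot 15"], pilots["Pilot 16"]
delta_usd = p16["total_usd"] - p15["total_usd"]
delta_pct = 100 * delta_usd / p15["total_usd"]

print(f"Pilot 15: ${cost_per_caught_bug(p15):.2f} per caught bug")
print(f"Pilot 16: ${cost_per_caught_bug(p16):.2f} per caught bug")
print(f"Delta: +${delta_usd:.2f} (+{delta_pct:.0f}%)")
```

Because both pilots caught nine planted bugs, the cost-per-bug delta is the same percentage as the total-cost delta.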
The one structural finding: Issue 9 was the only bug Pilot 15 caught and Pilot 16 missed. It's the textbook scale-sensitive pattern — SELECT * FROM wp_posts materializing the entire posts table into memory before serializing to SQL — a bug that's invisible to runtime exploration on a fresh test site (small dataset = no OOM) and only visible to source-grounded analysis ("this code path scales O(n) with table size — flag for seed-at-scale probe").
The Issue 3 swap is noise, not signal. Pilot 16's recon Tester walked the schedule form thoroughly enough to observe the email field reload empty after save (a runtime-observable symptom). Pilot 15's planner had hypothesis-cluster charters that didn't include round-trip checks for that field. Issue 3 is reachable either way; Pilot 15's planner just didn't draw the right charter. This is charter-design noise, not evidence about static analysis.
The +$16.47 / +30% looks bad but decomposes into one-time experimental overhead vs structural penalty:
- Double recon. The first recon attempt was source-contaminated: the recon Tester read plugin source files mid-mapping, which would have leaked source-grounded knowledge into the charter generation phase and defeated the experiment. We threw it away (saved as `recon.md.contaminated-attempt-1`) and ran a second, clean recon under stricter prompt constraints. Pilot 15 ran recon once. The gap between Pilot 16's 20 subagent sessions and Pilot 15's 13 is mostly this: the extra recon attempt, validator round-trips to enforce decontamination, and redundant dispatches as the Manager second-guessed itself.
- Validator round-trip on recon output. To enforce source-decontamination we had to inspect recon's output and sometimes re-dispatch. This is pure overhead from running the pilot as a controlled experiment.
- Testers explore instead of verify. With static analysis, Phase 3 charters carry numbered hypotheses pointing at specific source-grounded probes (e.g., "save handler writes to `magellan_backups_email`, form reads from `mb_schedule_email`; verify round-trip in `admin.php?page=mb-schedule`"). The Tester opens the form, saves, observes the mismatch, and is done in 2–3 turns. Without source-grounded hypotheses, the Tester has to discover the bug by exploring the form's full lifecycle, which takes 6–10 turns. The schedule-feature-cluster Tester used 65 messages in Pilot 16 vs ~40 messages for an equivalent verification cluster in Pilot 15.
- Per-Tester message counts climbed. Pilot 16 Sonnet Testers averaged 1027 messages / 7 wave-Testers = 147 msg/Tester. Pilot 15 averaged 864 / 7 = 123 msg/Tester. The +24 msg/Tester premium × 7 Testers × Sonnet rate is the structural penalty.
- MCP serialization rule still applies. Wave wallclock is bounded by the longest Tester. In Pilot 16, exploration-mode Testers run longer, so the longest-Tester ceiling rose by ~3–5 min, which is most of the +9 min wallclock delta.
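The per-Tester averages quoted in the list above follow directly from the message totals. A small sketch, assuming the same 7 wave-Testers in both pilots as the list states; the names are illustrative.

```python
# Message totals and wave-Tester count as quoted in the list above.
WAVE_TESTERS = 7
messages = {"Pilot 15": 864, "Pilot 16": 1027}

per_tester = {name: total / WAVE_TESTERS for name, total in messages.items()}
premium = per_tester["Pilot 16"] - per_tester["Pilot 15"]
extra_messages = premium * WAVE_TESTERS

for name, avg in per_tester.items():
    print(f"{name}: {avg:.0f} msg/Tester")
print(f"Premium: +{premium:.1f} msg/Tester ({extra_messages:.0f} extra messages per wave)")
```

The +24 figure comes from subtracting the rounded averages (147 − 123); the unrounded gap is about 23.3 msg/Tester, or 163 extra messages across the wave.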
Methodology overhead is one-time noise. A clean steady-state recon-only configuration would project to ~$58–60 (Pilot 15 + structural penalty). Recon-only is slightly more expensive than Phase 1.5 + recon, not less, because the work doesn't disappear: it migrates from the planner into Tester exploration time, where it costs more in total (many Sonnet exploration turns vs a much smaller volume of Opus synthesis).
magellan-backups ships a known scale-sensitive bug in `class-mb-export.php`:

```php
// SELECT * with no batching: materializes the entire table into memory
$results = $wpdb->get_results( "SELECT * FROM {$wpdb->posts}", ARRAY_A );
foreach ( $results as $row ) {
    fwrite( $sql_file, $this->build_insert_statement( $row ) );
}
```

On a fresh WordPress install (~6 posts, ~3 pages, ~1 user), this completes in 100ms with no observable issues. On a production site with 50K posts, it OOMs PHP and the export silently truncates.
Pilot 15 caught this via Phase 1.5: static analysis flagged the get_results() pattern, the planner emitted a scale-sensitive-cluster charter with explicit "seed at production scale (10K+ posts) and re-run export" instructions, the Tester executed it, the export crashed at memory_limit, problem filed.
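The report does not describe how Phase 1.5 detects the `get_results()` pattern, so the following is only a hypothetical illustration of the kind of source-level heuristic that could raise this flag. It assumes that a `$wpdb->get_results()` call whose SQL selects every column with no `LIMIT` clause is worth a seed-at-scale probe; the function name and regexes are invented for this sketch.

```python
import re

# Assumption for illustration: match the first string argument of
# $wpdb->get_results( "..." ) or $wpdb->get_results( '...' ).
CALL = re.compile(r"""\$wpdb->get_results\(\s*(["'])(?P<sql>.*?)\1""", re.DOTALL)

def flag_unbounded_selects(php_source):
    """Return (line_number, sql) for SELECT * queries that carry no LIMIT."""
    findings = []
    for match in CALL.finditer(php_source):
        sql = match.group("sql")
        selects_all = re.search(r"\bSELECT\s+\*", sql, re.IGNORECASE)
        has_limit = re.search(r"\bLIMIT\b", sql, re.IGNORECASE)
        if selects_all and not has_limit:
            line = php_source.count("\n", 0, match.start()) + 1
            findings.append((line, sql))
    return findings

sample = '''
$results = $wpdb->get_results( "SELECT * FROM {$wpdb->posts}", ARRAY_A );
$page = $wpdb->get_results( "SELECT ID FROM {$wpdb->posts} LIMIT 100" );
'''

for line, sql in flag_unbounded_selects(sample):
    print(f"line {line}: unbounded query, candidate for a seed-at-scale charter: {sql}")
```

A regex pass like this is deliberately crude: it would miss queries built through string concatenation or `prepare()`, which is one reason a real static-analysis phase would likely need more than pattern matching.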
Pilot 16 missed this. The recon Tester observed exports complete normally (it had no reason to seed a 10K-post fixture). No source = no signal that the export code is structurally O(n) on memory. The planner generated export-artifact-andlist charters focused on artifact contents, naming, and lifecycle (the artifact AND-list anchors a1–a6). That is perfectly reasonable from recon-only inputs, but blind to the scale dimension.
This is the cleanest possible empirical evidence for what static analysis uniquely contributes to recall. Every other planted bug is reachable from runtime observation if the Tester explores deeply enough. Scale-sensitivity is not.
| Charter | Type | Priority | Bugs (P) | Q | I | Pr | Severity breakdown |
|---|---|---|---|---|---|---|---|
| backup-artifact-andlist | andlist | critical | 6 | 1 | 4 | 0 | 2 critical, 3 major, 1 minor |
| export-artifact-andlist | andlist | critical | 6 | 1 | 2 | 0 | 2 critical, 2 major, 2 minor |
| upload-restore-destructive-andlist | andlist | critical | 4 | 1 | 2 | 2 | 1 critical, 1 major, 2 minor |
| schedule-feature-cluster | hypothesis-cluster | high | 3 | 1 | 1 | 1 | 3 major |
| concurrent-trigger-seam | cross-feature | high | 1 | 1 | 1 | 0 | 1 major |
| csrf-cross-cutting | hypothesis-cluster | high | 1 | 1 | 2 | 1 | 1 major |
| breadth-tour-admin | breadth | medium | 4 | 1 | 2 | 3 | 1 major, 3 minor |
| breadth-tour-frontend-lifecycle | breadth | medium | 3 | 1 | 3 | 1 | 1 critical, 1 major, 1 minor |
| Totals | | | 28 | 8 | 17 | 8 | 6 critical, 13 major, 9 minor |
8/8 charters complete. 1 schema-validation failure during aggregation (recovered via re-validate); zero session-level failures.
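The totals row of the charter table can be reproduced from the per-charter rows. This sketch copies the counts from the table; the tuple layout and names are illustrative only.

```python
# (bugs, Q, I, Pr, critical, major, minor) per charter, copied from the table above.
charters = {
    "backup-artifact-andlist":            (6, 1, 4, 0, 2, 3, 1),
    "export-artifact-andlist":            (6, 1, 2, 0, 2, 2, 2),
    "upload-restore-destructive-andlist": (4, 1, 2, 2, 1, 1, 2),
    "schedule-feature-cluster":           (3, 1, 1, 1, 0, 3, 0),
    "concurrent-trigger-seam":            (1, 1, 1, 0, 0, 1, 0),
    "csrf-cross-cutting":                 (1, 1, 2, 1, 0, 1, 0),
    "breadth-tour-admin":                 (4, 1, 2, 3, 0, 1, 3),
    "breadth-tour-frontend-lifecycle":    (3, 1, 3, 1, 1, 1, 1),
}

# Column-wise sums give the totals row.
totals = tuple(sum(column) for column in zip(*charters.values()))

# Each charter's severity breakdown should add up to its bug count.
mismatched = [
    name
    for name, (bugs, _q, _i, _pr, critical, major, minor) in charters.items()
    if critical + major + minor != bugs
]

print("totals:", totals)
print("charters with inconsistent severity counts:", mismatched)
```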
| Metric | Tokens | $ |
|---|---|---|
| input | 218 | $0.00 |
| output | 191,578 | $2.87 |
| cc5m (5-min cache write) | 0 | $0.00 |
| cc1h (1-hour cache write) | 138,434 | $1.38 |
| cr (cache read) | 59,093,214 | $31.46 |
| Manager total | | $35.72 |
| Model | Sessions | Messages | Output tok | cc5m | cr | $ |
|---|---|---|---|---|---|---|
| Sonnet 4.6 (Testers + recon) | 19 | 1,027 | 191,677 | 1,613,963 | 78,980,910 | $32.63 |
| Opus 4.7 (planner) | 1 | 29 | 39,916 | 363,083 | 890,595 | $3.71 |
| Subagent total | 20 | 1,056 | | | | $36.34 |
| Category | $ |
|---|---|
| input | $0.00 |
| output | $8.66 |
| cc5m | $8.32 |
| cc1h | $1.38 |
| cr | $53.69 |
| Total | $72.06 |
Cache reads dominate (74% of total) — the Manager's conversation context is large (driver SKILL files, recon.md, all 8 charter briefs, validator output) and gets re-cached as the wave progresses.
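The share of spend per token category can be derived from the combined table. A minimal sketch using only the dollar figures reported above; the dictionary keys mirror the table's category labels.

```python
# Combined Manager + subagent spend by token category, from the table above.
cost_by_category = {
    "input": 0.00,
    "output": 8.66,
    "cc5m": 8.32,
    "cc1h": 1.38,
    "cr": 53.69,
}
reported_total = 72.06

# Largest categories first, each with its share of the reported total.
for category, usd in sorted(cost_by_category.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:>6}: ${usd:6.2f}  {100 * usd / reported_total:5.1f}%")

cache_read_share = 100 * cost_by_category["cr"] / reported_total
category_sum = sum(cost_by_category.values())
```

The per-category figures sum to within a cent of the reported total, which is consistent with each row having been rounded independently.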
The wave dispatched at ~18:51:48Z and the longest Tester finished at ~19:08:18Z. Wave wallclock: ~16 min (gated by the longest Tester per the MCP serialization rule).
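The wallclock figures follow from the three timestamps in this report. A short sketch, assuming all of them fall on the run date in the Run ID (2026-04-27) and treating the approximate wave-dispatch time as exact; `parse_utc` is a helper defined here for illustration.

```python
from datetime import datetime, timezone

def parse_utc(stamp):
    """Parse an HH:MM:SSZ timestamp on the run date into an aware datetime."""
    parsed = datetime.strptime(f"2026-04-27T{stamp}", "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

run_start = parse_utc("18:25:45Z")
wave_start = parse_utc("18:51:48Z")  # reported as approximate
run_end = parse_utc("19:08:18Z")

wave_minutes = (run_end - wave_start).total_seconds() / 60
total_minutes = (run_end - run_start).total_seconds() / 60
pre_wave_minutes = (wave_start - run_start).total_seconds() / 60

print(f"pre-wave phases: {pre_wave_minutes:.1f} min")
print(f"wave:            {wave_minutes:.1f} min")
print(f"total:           {total_minutes:.1f} min")
```

On these timestamps the phases before the wave (recon, planning, validation) take up more of the run than the wave itself.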
| Tester (approx by msg count, no per-charter ID exposed) | Messages | Cost | Notes |
|---|---|---|---|
| 1 | 100 | $4.10 | Likely longest charter (andlist or breadth-frontend) |
| 2 | 94 | $2.44 | |
| 3 | 86 | $2.19 | |
| 4 | 84 | $3.36 | Higher cost/msg suggests Opus turns mixed in |
| 5 | 77 | $2.76 | |
| 6 | 72 | $2.31 | |
| 7 | 71 | $1.88 | |
| 8 | 68 | $1.94 | |
| 9 | 67 | $2.29 | |
| 10 | 65 | $1.68 | |
| 11 | 64 | $2.03 | |
| 12 | 64 | $1.65 | |
| 13 | 52 | $1.74 | |
| 14 | 25 | $0.86 | |
| 15 | 18 | $2.92 | High cost / low msg = planner phase |
| 16 | 15 | $0.49 | |
| 17 | 13 | $0.49 | |
| 18 | 11 | $0.79 | |
| 19 | 9 | $0.38 | Validator roundtrip |
| 20 | 1 | $0.03 | Schema-validation re-run |
The bottom 7 sessions (≤25 msgs) are validator roundtrips, dispatch retries, and the second clean recon — the methodology-overhead population.
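The split between long sessions and the short-session population can be computed from the table. This sketch copies the message counts and costs from the rows above and applies the same 25-message cutoff; the variable names are illustrative.

```python
# (messages, cost in USD) for sessions 1-20, copied from the table above.
sessions = [
    (100, 4.10), (94, 2.44), (86, 2.19), (84, 3.36), (77, 2.76),
    (72, 2.31), (71, 1.88), (68, 1.94), (67, 2.29), (65, 1.68),
    (64, 2.03), (64, 1.65), (52, 1.74), (25, 0.86), (18, 2.92),
    (15, 0.49), (13, 0.49), (11, 0.79), (9, 0.38), (1, 0.03),
]
CUTOFF = 25  # sessions with at most this many messages count as "short"

short = [s for s in sessions if s[0] <= CUTOFF]
long_running = [s for s in sessions if s[0] > CUTOFF]

short_cost = sum(cost for _, cost in short)
long_cost = sum(cost for _, cost in long_running)

print(f"short sessions: {len(short)} costing ${short_cost:.2f}")
print(f"long sessions:  {len(long_running)} costing ${long_cost:.2f}")
print(f"all sessions:   {sum(m for m, _ in sessions)} messages")
```

Note that the short group includes session 15, which the table attributes to the planner phase, so its cost is not purely methodology overhead.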
| Pilot | Driver | Phase 1.5 | Mgr cost | Sub cost | Total | Recall | Notes |
|---|---|---|---|---|---|---|---|
| 11 | Chrome DevTools headed | ✅ | — | — | $102.90 | 10/10 | Item E first run |
| 12 | Chrome DevTools headed | ✅ | — | — | $45.30 | 8/10 | Sonnet-Mgr first run; charter-design misses |
| 13 | Chrome DevTools headless | ✅ | $8.10 | $30.37 | $38.47 | 9/10 | a-anchor drift miss |
| 14 | Playwright CLI (headed) | ✅ | $25.42 | $27.64 | $53.06 | 10/10 | First clean recall under new arch |
| 15 | Playwright CLI (headed) | ✅ | $24.57 | $31.02 | $55.59 | 9/10 | Issue 3 missed (charter-design noise) |
| 16 | Playwright CLI (headed) | ❌ | $35.72 | $36.34 | $72.06 | 9/10 | Recon-only counterfactual; Issue 9 (scale) missed; +30% cost from double-recon methodology overhead |
- Static analysis uniquely contributes scale-sensitivity catches. Issue 9 (DB memory exhaust on `SELECT *` against unbounded tables) is invisible to runtime exploration on a fresh fixture and visible only to source-grounded analysis. This is now empirically grounded, not asserted.
- Recon-only is not free. Skipping Phase 1.5 saves ~30s of Manager Opus time but loads a 3–5 min Tester-exploration penalty downstream + 1/10 recall hole on the scale dimension. Recon-only is slightly more expensive than Phase 1.5 + recon at steady state, not cheaper.
- 9/10 is reachable from recon alone on this fixture, which means most of the bugs Magellan catches are not gated by static analysis; they're gated by good recon, anchored charters, and Tester exploration discipline. Phase 1.5 is the marginal +1 issue insurance, not the engine.
- Does scale-sensitivity always need static analysis? A planner that tags any "data-export-shaped" recon observation with `seed-at-scale` could in principle catch Issue 9 from recon alone. The current planner doesn't do this. Worth a Pilot 17 to test.
- Are there other bug categories like Issue 9? Race conditions, lock-ordering bugs, and silent-failure paths might be similarly invisible to runtime exploration. Worth seeding a more exotic fixture and re-running both pilots.
- Should Phase 1.5 be conditional? If recon flags any artifact-producing or data-export feature, run static analysis. If recon flags only UI/admin-state features, skip it. Worth instrumenting.
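The conditional-Phase-1.5 idea in the last question can be sketched as a simple gate over recon output. Everything here is hypothetical: the feature tags, the dictionary shape, and the function name are invented for illustration and do not reflect Magellan's recon format.

```python
# Hypothetical tags a recon pass might attach to features; not Magellan's schema.
SCALE_SENSITIVE_TAGS = {"artifact-producing", "data-export"}

def should_run_static_analysis(recon_features):
    """Run Phase 1.5 only if some feature is artifact-producing or export-shaped.

    recon_features maps a feature name to the set of tags recon assigned to it.
    """
    return any(
        tag in SCALE_SENSITIVE_TAGS
        for tags in recon_features.values()
        for tag in tags
    )

backups_like = {
    "export": {"data-export", "artifact-producing"},
    "schedule-form": {"admin-state"},
}
ui_only = {
    "settings-page": {"ui"},
    "schedule-form": {"admin-state"},
}

print(should_run_static_analysis(backups_like))  # export feature present
print(should_run_static_analysis(ui_only))       # only UI/admin-state features
```

A gate like this is only as good as recon's tagging: a plugin whose export path recon never reaches would skip static analysis and reopen the same recall hole.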
- Run dir: `runs/2026-04-27T18-25-45_magellan-backups/`
- Mission snapshot: `runs/<id>/mission.md`
- Recon snapshots: `recon.md` (clean) + `recon.md.contaminated-attempt-1` (discarded)
- Charter set: `runs/<id>/charters/` (8 files)
- Reports: `runs/<id>/sessions/<charter>/report.json`
- Token usage: `runs/<id>/token-usage.json`
- Aggregated: `runs/<id>/final-report.md`
- Experiment note in manifest:
"Pilot 16 — recon-only counterfactual to Pilot 15. Phase 1.5 (static analysis) intentionally skipped. Planner-opus FORBIDDEN from reading any source code in Phase 3. Charter generation must rely solely on recon.md + WP-core knowledge."