Magellan Pilot 16 — magellan-backups (recon-only counterfactual: what does static analysis uniquely contribute?)

Run ID: 2026-04-27T18-25-45_magellan-backups
Plugin: magellan-backups (same regression-test fixture; ISSUES.md stripped, blind greybox)
Kind: plugin
Ecosystem: core
Driver: playwright-cli-headed (matched to Pilot 15 for a controlled comparison)
Manager model: Opus 4.7
Tester / planner model: Sonnet 4.6 (Testers, recon) + Opus 4.7 (planner Phase 3)
Wallclock: 43 min (18:25:45Z → 19:08:18Z)
Total cost: $72.06 (vs Pilot 15 $55.59; +30%, inflated by methodology overhead, see below)
Recall: 9/10 (Issue 9, DB memory exhaust at scale, is the unique miss)

The experiment

Pilot 16 is the recon-only counterfactual to Pilot 15. Same plugin, same MISSION, same driver, same Manager/Tester model split, same conditions in every dimension except one:

Phase 1.5 (static analysis) was intentionally skipped. The planner-opus subagent was forbidden from reading any source code in Phase 3. Charter generation had to rely solely on recon.md + WP-core knowledge.

The point: isolate the unique contribution of source-grounded static analysis to recall. If the same charters and the same bugs surface from recon-only inputs, Phase 1.5 is redundant. If specific bug categories disappear, that's what static analysis is buying.

TL;DR — recon-only loses exactly one bug class: scale-sensitive source patterns

Metric	Pilot 15 (with Phase 1.5)	Pilot 16 (recon-only)	Δ
Recall	9/10	9/10	0 (different miss)
Issue 9 (`SELECT *` OOM at scale)	✅ caught	❌ MISS	−1
Issue 3 (email option-key mismatch)	❌ missed	✅ caught	+1
Other 8 planted bugs	all caught	all caught	held
Total cost	$55.59	$72.06	+$16.47 (+30%)
Manager cost	$24.57 (39 msgs)	$35.72 (88 msgs)	+$11.15
Subagent cost	$31.02 (13 sessions)	$36.34 (20 sessions)	+$5.32
Wallclock	34 min	43 min	+9 min
Charters	10	8	−2
Total problems filed	34	28	−6
Cost / planted bug caught	$6.18	$8.01	+30%

The one structural finding: Issue 9 was the only bug Pilot 15 caught and Pilot 16 missed. It's the textbook scale-sensitive pattern — SELECT * FROM wp_posts materializing the entire posts table into memory before serializing to SQL — a bug that's invisible to runtime exploration on a fresh test site (small dataset = no OOM) and only visible to source-grounded analysis ("this code path scales O(n) with table size — flag for seed-at-scale probe").

The Issue 3 swap is noise, not signal. Pilot 16's recon Tester walked the schedule form thoroughly enough to observe the email field reload empty after save (a runtime-observable symptom). Pilot 15's planner had hypothesis-cluster charters that didn't include a round-trip check for that field. Issue 3 is reachable with or without static analysis; Pilot 15's planner just didn't draw the right charter. This is run-to-run variance in charter design, not evidence about static analysis.

Why the cost regressed (and most of it is methodology overhead, not steady-state cost)

The +$16.47 / +30% looks bad, but it decomposes into one-time experimental overhead and a smaller structural penalty:

Methodology overhead (~$10–12, experiment-only)

Double recon. The first recon attempt was source-contaminated: the recon Tester read plugin source files mid-mapping, which would have leaked source-grounded knowledge into charter generation and defeated the experiment. We threw it away (saved as recon.md.contaminated-attempt-1) and ran a second clean recon under stricter prompt constraints. Pilot 15 ran recon once. Most of the gap between Pilot 16's 20 subagent sessions and Pilot 15's 13 comes from this: the extra recon attempt, validator round-trips to enforce decontamination, and redundant dispatches as the Manager second-guessed itself.
Validator roundtrip on recon output. To enforce source decontamination we had to inspect recon's output and sometimes re-dispatch. This is pure overhead from running the pilot as a controlled experiment.

Structural penalty (~$3–5, would persist in any steady-state recon-only configuration)

Testers explore instead of verify. With static analysis, Phase 3 charters carry numbered hypotheses pointing at specific source-grounded probes (e.g., "save handler writes to magellan_backups_email, form reads from mb_schedule_email — verify round-trip in admin.php?page=mb-schedule"). The Tester opens the form, saves, observes mismatch — done in 2–3 turns. Without source-grounded hypotheses, the Tester has to discover the bug by exploring the form's full lifecycle, which takes 6–10 turns. Schedule-feature-cluster Tester used 65 messages in Pilot 16 vs an equivalent verification cluster at ~40 messages in Pilot 15.
Per-Tester message counts climbed. Pilot 16 Sonnet Testers averaged 1027 messages / 7 wave-Testers = 147 msg/Tester. Pilot 15 averaged 864 / 7 = 123 msg/Tester. The +24 msg/Tester premium × 7 Testers × Sonnet rate is the structural penalty.
MCP serialization rule still applies. Wave wallclock is bounded by the longest Tester. In Pilot 16, exploration-mode Testers run longer, so the longest-Tester ceiling rose by ~3–5 min, which is most of the +9 min wallclock delta.

Net

Methodology overhead is one-time noise. A clean steady-state recon-only configuration would project to ~$59–61 (Pilot 15's $55.59 plus the $3–5 structural penalty). Recon-only is slightly more expensive than Phase 1.5 + recon, not less, because the work doesn't disappear: it migrates from the planner into Tester exploration time, where it costs more in aggregate (high-volume Sonnet exploration vs a much smaller volume of Opus synthesis).
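
The headline deltas and the steady-state projection can be re-derived from the figures quoted in this report. The following is a minimal sketch; the variable names are illustrative and not part of Magellan's tooling, and the $3 to $5 structural penalty is the estimate given above, not a measured value.

```python
# Totals copied from the TL;DR table.
PILOT_15_TOTAL = 55.59
PILOT_16_TOTAL = 72.06
BUGS_CAUGHT = 9  # both pilots recalled 9 of the 10 planted bugs

# Headline delta, absolute and relative to Pilot 15.
delta = round(PILOT_16_TOTAL - PILOT_15_TOTAL, 2)
delta_pct = round(100 * delta / PILOT_15_TOTAL)

# Cost per planted bug caught.
cost_per_bug_15 = round(PILOT_15_TOTAL / BUGS_CAUGHT, 2)
cost_per_bug_16 = round(PILOT_16_TOTAL / BUGS_CAUGHT, 2)

# Steady-state projection: Pilot 15 plus the estimated structural penalty.
STRUCTURAL_PENALTY_RANGE = (3.0, 5.0)  # estimate from the text
steady_state_range = tuple(
    round(PILOT_15_TOTAL + penalty, 2) for penalty in STRUCTURAL_PENALTY_RANGE
)

print(f"delta: ${delta} (+{delta_pct}%)")
print(f"cost per bug: ${cost_per_bug_15} vs ${cost_per_bug_16}")
print(f"steady-state recon-only projection: ${steady_state_range[0]} to ${steady_state_range[1]}")
```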

Recall delta — Issue 9 in detail

magellan-backups ships a known scale-sensitive bug in class-mb-export.php:

// SELECT * with no batching — materializes entire table into memory
$results = $wpdb->get_results( "SELECT * FROM {$wpdb->posts}", ARRAY_A );
foreach ( $results as $row ) {
    fwrite( $sql_file, $this->build_insert_statement( $row ) );
}

On a fresh WordPress install (~6 posts, ~3 pages, ~1 user), this completes in 100ms with no observable issues. On a production site with 50K posts, it OOMs PHP and the export silently truncates.

Pilot 15 caught this via Phase 1.5. Static analysis flagged the get_results() pattern; the planner emitted a scale-sensitive-cluster charter with explicit "seed at production scale (10K+ posts) and re-run export" instructions; the Tester executed it; the export crashed at memory_limit; and a problem was filed.

Pilot 16 missed this. The recon Tester observed exports complete normally (it had no reason to seed a 10K-post fixture). No source = no signal that the export code is structurally O(n) on memory. The planner generated export-artifact-andlist charters focused on artifact contents, naming, and lifecycle (the artifact AND-list anchors a1–a6). That is perfectly reasonable from recon-only inputs, but blind to the scale dimension.

This is the cleanest possible empirical evidence for what static analysis uniquely contributes to recall. Every other planted bug is reachable from runtime observation if the Tester explores deeply enough. Scale-sensitivity is not.
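
To illustrate the kind of source pattern involved, the sketch below flags unbounded SELECT * calls with a regular expression. It is a hypothetical heuristic written for this note, not Magellan's Phase 1.5 implementation: the function name and the regex are assumptions, and a real analyzer would also need to handle multi-line queries, prepared statements, and paginated loops.

```python
import re

# Hypothetical heuristic: a $wpdb->get_results() call whose literal SQL
# starts with SELECT * FROM. Only single-line string literals are matched.
UNBOUNDED_SELECT = re.compile(
    r"\$wpdb->get_results\(\s*[\"'](?P<sql>SELECT\s+\*\s+FROM\b[^\"']*)[\"']",
    re.IGNORECASE,
)


def find_scale_sensitive_queries(php_source: str) -> list[tuple[int, str]]:
    """Return (line_number, sql) for SELECT * queries that have no LIMIT clause."""
    findings = []
    for lineno, line in enumerate(php_source.splitlines(), start=1):
        match = UNBOUNDED_SELECT.search(line)
        if match is None:
            continue
        sql = match.group("sql")
        # A LIMIT clause suggests batching, so only unbounded queries are flagged.
        if not re.search(r"\bLIMIT\b", sql, re.IGNORECASE):
            findings.append((lineno, sql))
    return findings


# The export loop quoted above, used as the input under test.
SNIPPET = '''$results = $wpdb->get_results( "SELECT * FROM {$wpdb->posts}", ARRAY_A );
foreach ( $results as $row ) {
    fwrite( $sql_file, $this->build_insert_statement( $row ) );
}'''

findings = find_scale_sensitive_queries(SNIPPET)
for lineno, sql in findings:
    print(f"line {lineno}: unbounded query, candidate for a seed-at-scale probe: {sql}")
```

A finding like this is only a prompt for a charter. The bug is confirmed when a Tester seeds the fixture at scale and re-runs the export.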

Per-charter PQIP

Charter	Type	Priority	Bugs (P)	Q	I	Pr	Severity breakdown
backup-artifact-andlist	andlist	critical	6	1	4	0	2 critical, 3 major, 1 minor
export-artifact-andlist	andlist	critical	6	1	2	0	2 critical, 2 major, 2 minor
upload-restore-destructive-andlist	andlist	critical	4	1	2	2	1 critical, 1 major, 2 minor
schedule-feature-cluster	hypothesis-cluster	high	3	1	1	1	3 major
concurrent-trigger-seam	cross-feature	high	1	1	1	0	1 major
csrf-cross-cutting	hypothesis-cluster	high	1	1	2	1	1 major
breadth-tour-admin	breadth	medium	4	1	2	3	1 major, 3 minor
breadth-tour-frontend-lifecycle	breadth	medium	3	1	3	1	1 critical, 1 major, 1 minor
Totals			28	8	17	8	6 critical, 13 major, 9 minor

8/8 charters complete. 1 schema-validation failure during aggregation (recovered via re-validate); zero session-level failures.
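
The Totals row can be cross-checked against the per-charter rows. The sketch below copies the rows from the table and assumes each filed problem carries exactly one severity; the data structure is illustrative and not the schema of report.json.

```python
from collections import Counter

# Rows copied from the per-charter table: (charter, P, Q, I, Pr, severities).
CHARTERS = [
    ("backup-artifact-andlist", 6, 1, 4, 0, {"critical": 2, "major": 3, "minor": 1}),
    ("export-artifact-andlist", 6, 1, 2, 0, {"critical": 2, "major": 2, "minor": 2}),
    ("upload-restore-destructive-andlist", 4, 1, 2, 2, {"critical": 1, "major": 1, "minor": 2}),
    ("schedule-feature-cluster", 3, 1, 1, 1, {"major": 3}),
    ("concurrent-trigger-seam", 1, 1, 1, 0, {"major": 1}),
    ("csrf-cross-cutting", 1, 1, 2, 1, {"major": 1}),
    ("breadth-tour-admin", 4, 1, 2, 3, {"major": 1, "minor": 3}),
    ("breadth-tour-frontend-lifecycle", 3, 1, 3, 1, {"critical": 1, "major": 1, "minor": 1}),
]

totals = Counter()
for name, p, q, i, pr, severities in CHARTERS:
    # Assumption: severity counts must add up to the charter's problem count.
    if sum(severities.values()) != p:
        raise ValueError(f"severity counts do not match P for {name}")
    totals.update({"P": p, "Q": q, "I": i, "Pr": pr})
    totals.update(severities)

print(dict(totals))
```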

Token routing — exact

Manager (Opus 4.7, 88 messages)

Metric	Tokens	$
input	218	$0.00
output	191,578	$2.87
cc5m (5-min cache write)	0	$0.00
cc1h (1-hour cache write)	138,434	$1.38
cr (cache read)	59,093,214	$31.46
Manager total		$35.72

Subagents — 20 sessions (recon ×2 + planner ×1 + 8 wave Testers + dispatch overhead)

Model	Sessions	Messages	Output tok	cc5m	cr	$
Sonnet 4.6 (Testers + recon)	19	1,027	191,677	1,613,963	78,980,910	$32.63
Opus 4.7 (planner)	1	29	39,916	363,083	890,595	$3.71
Subagent total	20	1,056				$36.34

Cost by category (whole run)

Category	$
input	$0.00
output	$8.66
cc5m	$8.32
cc1h	$1.38
cr	$53.69
Total	$72.06

Cache reads dominate (about 74.5% of total, $53.69 of $72.06). The Manager's conversation context is large (driver SKILL files, recon.md, all 8 charter briefs, validator output) and is re-read from cache on every turn as the wave progresses.
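
The category shares follow directly from the table. A minimal sketch, with dictionary keys matching the category labels used above; the one-cent gap between the summed categories and the reported total is assumed to be per-category rounding.

```python
# Whole-run cost by category, copied from the table.
COST_BY_CATEGORY = {
    "input": 0.00,
    "output": 8.66,
    "cc5m": 8.32,
    "cc1h": 1.38,
    "cr": 53.69,
}
REPORTED_TOTAL = 72.06

# Sum of the rounded category figures (may differ from the total by a cent).
summed = round(sum(COST_BY_CATEGORY.values()), 2)

# Share of the reported total, in percent with one decimal place.
share_pct = {
    category: round(100 * cost / REPORTED_TOTAL, 1)
    for category, cost in COST_BY_CATEGORY.items()
}

for category, pct in sorted(share_pct.items(), key=lambda item: -item[1]):
    print(f"{category:>6}: {pct:5.1f}%")
print(f"summed categories: ${summed} (reported total: ${REPORTED_TOTAL})")
```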

Per-Tester wave timings (rough — derived from subagent first_timestamp)

The wave dispatched at ~18:51:48Z and the longest Tester finished at ~19:08:18Z. Wave wallclock: ~16 min (gated by the longest Tester per the MCP serialization rule).
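
The wallclock figures can be recomputed from the three timestamps quoted in this report. This sketch takes the date from the Run ID and treats the approximate timestamps as exact, so the results are only as precise as those inputs.

```python
from datetime import datetime, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse an ISO-style UTC timestamp such as 2026-04-27T18:25:45Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


run_start = parse_utc("2026-04-27T18:25:45Z")   # run start, from the header
wave_start = parse_utc("2026-04-27T18:51:48Z")  # approximate wave dispatch
run_end = parse_utc("2026-04-27T19:08:18Z")     # longest Tester finished

run_minutes = (run_end - run_start).total_seconds() / 60
wave_minutes = (run_end - wave_start).total_seconds() / 60
pre_wave_minutes = (wave_start - run_start).total_seconds() / 60

print(f"run: {run_minutes:.1f} min, wave: {wave_minutes:.1f} min, "
      f"recon + planning before the wave: {pre_wave_minutes:.1f} min")
```

On these inputs the wave accounts for well under half of the run; the rest is recon, planning, and dispatch before the wave starts.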

Session (ranked by msg count; no per-charter ID exposed)	Messages	Cost	Notes
1	100	$4.10	Likely longest charter (andlist or breadth-frontend)
2	94	$2.44
3	86	$2.19
4	84	$3.36	Higher cost/msg suggests Opus turns mixed in
5	77	$2.76
6	72	$2.31
7	71	$1.88
8	68	$1.94
9	67	$2.29
10	65	$1.68
11	64	$2.03
12	64	$1.65
13	52	$1.74
14	25	$0.86
15	18	$2.92	High cost / low msg = planner phase
16	15	$0.49
17	13	$0.49
18	11	$0.79
19	9	$0.38	Validator roundtrip
20	1	$0.03	Schema-validation re-run

The bottom 7 sessions (≤25 msgs) are validator roundtrips, dispatch retries, and the second clean recon — the methodology-overhead population.

Cross-pilot arc

Pilot	Driver	Phase 1.5	Mgr cost	Sub cost	Total	Recall	Notes
11	Chrome DevTools headed	✅	—	—	$102.90	10/10	Item E first run
12	Chrome DevTools headed	✅	—	—	$45.30	8/10	Sonnet-Mgr first run; charter-design misses
13	Chrome DevTools headless	✅	$8.10	$30.37	$38.47	9/10	a-anchor drift miss
14	Playwright CLI (headed)	✅	$25.42	$27.64	$53.06	10/10	First clean recall under new arch
15	Playwright CLI (headed)	✅	$24.57	$31.02	$55.59	9/10	Issue 3 missed (charter-design noise)
16	Playwright CLI (headed)	❌	$35.72	$36.34	$72.06	9/10	Recon-only counterfactual; Issue 9 (scale) missed; +30% cost from double-recon methodology overhead

What Pilot 16 settled

Static analysis uniquely contributes scale-sensitivity catches. Issue 9 (DB memory exhaust on SELECT * against unbounded tables) is invisible to runtime exploration on a fresh fixture and visible only to source-grounded analysis. This is now empirically grounded, not asserted.
Recon-only is not free. Skipping Phase 1.5 saves ~30s of Manager Opus time but adds a 3–5 min Tester-exploration penalty downstream plus a 1/10 recall hole on the scale dimension. Recon-only is slightly more expensive than Phase 1.5 + recon at steady state, not cheaper.
9/10 is reachable from recon alone on this fixture, which means most of the bugs Magellan catches are not gated by static analysis — they're gated by good recon, anchored charters, and Tester exploration discipline. Phase 1.5 is the marginal +1 issue insurance, not the engine.

Open questions

Does scale-sensitivity always need static analysis? A planner that tags any "data-export-shaped" recon observation with seed-at-scale could in principle catch Issue 9 from recon alone. The current planner doesn't do this. Worth a Pilot 17 to test.
Are there other bug categories like Issue 9? Race conditions, lock-ordering bugs, and silent-failure paths might be similarly invisible to runtime exploration. Worth seeding a more exotic fixture and re-running both pilots.
Should Phase 1.5 be conditional? If recon flags any artifact-producing or data-export feature, run static analysis. If recon flags only UI/admin-state features, skip it. Worth instrumenting.
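
The first and third questions can be sketched as a small gating policy. Everything below is hypothetical: recon.md does not currently emit feature tags, and the tag names, the ReconFeature type, and both functions are assumptions made only to show the shape such a policy could take.

```python
from dataclasses import dataclass

# Hypothetical tag vocabulary; the current recon output defines no such tags.
STATIC_ANALYSIS_TAGS = frozenset({"data-export", "artifact-producing"})


@dataclass(frozen=True)
class ReconFeature:
    name: str
    tags: frozenset


def should_run_static_analysis(features: list[ReconFeature]) -> bool:
    """Run Phase 1.5 only if recon saw an export or artifact-producing feature."""
    return any(feature.tags & STATIC_ANALYSIS_TAGS for feature in features)


def seed_at_scale_targets(features: list[ReconFeature]) -> list[str]:
    """Features a planner could tag for a seed-at-scale probe from recon alone."""
    return [feature.name for feature in features if "data-export" in feature.tags]


# Illustrative recon output for a backup plugin.
backup_plugin = [
    ReconFeature("sql-export", frozenset({"data-export", "artifact-producing"})),
    ReconFeature("schedule-form", frozenset({"admin-state"})),
]
ui_only_plugin = [ReconFeature("settings-page", frozenset({"admin-state"}))]

print(should_run_static_analysis(backup_plugin))   # export feature present
print(should_run_static_analysis(ui_only_plugin))  # UI/admin-state only
print(seed_at_scale_targets(backup_plugin))
```

Whether a tag-driven planner would actually have recovered Issue 9 is the open question; the sketch only shows that the policy is cheap to express once recon emits tags.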

Reproducibility

Run dir: runs/2026-04-27T18-25-45_magellan-backups/
Mission snapshot: runs/<id>/mission.md
Recon snapshots: recon.md (clean) + recon.md.contaminated-attempt-1 (discarded)
Charter set: runs/<id>/charters/ (8 files)
Reports: runs/<id>/sessions/<charter>/report.json
Token usage: runs/<id>/token-usage.json
Aggregated: runs/<id>/final-report.md
Experiment note in manifest: "Pilot 16 — recon-only counterfactual to Pilot 15. Phase 1.5 (static analysis) intentionally skipped. Planner-opus FORBIDDEN from reading any source code in Phase 3. Charter generation must rely solely on recon.md + WP-core knowledge."

alopezari/pilot-16-gist.md
