@alopezari
Created April 28, 2026 05:53
Magellan Pilot 16 — magellan-backups (recon-only counterfactual to Pilot 15; Phase 1.5 skipped; 9/10 recall with Issue 9 [DB memory exhaust at scale] as the unique miss; +30% cost largely from double-recon methodology overhead)

Magellan Pilot 16 — magellan-backups (recon-only counterfactual: what does static analysis uniquely contribute?)

  • Run ID: 2026-04-27T18-25-45_magellan-backups
  • Plugin: magellan-backups (same regression-test fixture; ISSUES.md stripped for a blind greybox run)
  • Kind: plugin
  • Ecosystem: core
  • Driver: playwright-cli-headed (matched to Pilot 15 for a controlled comparison)
  • Manager model: Opus 4.7
  • Tester / planner model: Sonnet 4.6 (Testers, recon) + Opus 4.7 (planner Phase 3)
  • Wallclock: 43 min (18:25:45Z → 19:08:18Z)
  • Total cost: $72.06 (vs. Pilot 15's $55.59: +30%, inflated by methodology overhead; see below)
  • Recall: 9/10 (Issue 9, DB memory exhaust at scale, is the unique miss)

The experiment

Pilot 16 is the recon-only counterfactual to Pilot 15. Same plugin, same MISSION, same driver, same Manager/Tester model split, same conditions in every dimension except one:

Phase 1.5 (static analysis) was intentionally skipped. The planner-opus subagent was forbidden from reading any source code in Phase 3. Charter generation had to rely solely on recon.md + WP-core knowledge.

The point: isolate the unique contribution of source-grounded static analysis to recall. If the same charters and the same bugs surface from recon-only inputs, Phase 1.5 is redundant. If specific bug categories disappear, that's what static analysis is buying.

TL;DR — recon-only loses exactly one bug class: scale-sensitive source patterns

| Metric | Pilot 15 (with Phase 1.5) | Pilot 16 (recon-only) | Δ |
|---|---|---|---|
| Recall | 9/10 | 9/10 | 0 (different miss) |
| Issue 9 (SELECT * OOM at scale) | ✅ caught | ❌ MISS | −1 |
| Issue 3 (email option-key mismatch) | ❌ missed | ✅ caught | +1 |
| Other 8 planted bugs | all caught | all caught | held |
| Total cost | $55.59 | $72.06 | +$16.47 (+30%) |
| Manager cost | $24.57 (39 msgs) | $35.72 (88 msgs) | +$11.15 |
| Subagent cost | $31.02 (13 sessions) | $36.34 (20 sessions) | +$5.32 |
| Wallclock | 34 min | 43 min | +9 min |
| Charters | 10 | 8 | −2 |
| Total problems filed | 34 | 28 | −6 |
| Cost / planted bug caught | $6.18 | $8.01 | +30% |
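The derived rows of this table can be re-checked from the raw totals. The following is a minimal sketch that uses only figures quoted in the table; the dictionary layout and function names are illustrative, not part of the Magellan tooling.

```python
# Figures copied from the comparison table above.
pilot15 = {"total": 55.59, "manager": 24.57, "subagent": 31.02, "caught": 9}
pilot16 = {"total": 72.06, "manager": 35.72, "subagent": 36.34, "caught": 9}


def delta(key):
    """Absolute change from Pilot 15 to Pilot 16, rounded to cents."""
    return round(pilot16[key] - pilot15[key], 2)


def cost_per_bug(pilot):
    """Total cost divided by planted bugs caught, rounded to cents."""
    return round(pilot["total"] / pilot["caught"], 2)


total_delta = delta("total")
pct_increase = round(100 * (pilot16["total"] / pilot15["total"] - 1))

print("total / manager / subagent delta:", total_delta, delta("manager"), delta("subagent"))
print("percent increase:", pct_increase)
print("cost per planted bug:", cost_per_bug(pilot15), cost_per_bug(pilot16))
```

Because both pilots caught 9 planted bugs, the per-bug cost rises by the same percentage as the total cost.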

The one structural finding: Issue 9 was the only bug Pilot 15 caught and Pilot 16 missed. It's the textbook scale-sensitive pattern — SELECT * FROM wp_posts materializing the entire posts table into memory before serializing to SQL — a bug that's invisible to runtime exploration on a fresh test site (small dataset = no OOM) and only visible to source-grounded analysis ("this code path scales O(n) with table size — flag for seed-at-scale probe").

The Issue 3 swap is noise, not signal. Pilot 16's recon Tester walked the schedule form thoroughly enough to observe the email field reload empty after save (a runtime-observable symptom). Pilot 15's planner had hypothesis-cluster charters that didn't include round-trip checks for that field. The bug is reachable with or without static analysis; Pilot 15's planner just didn't draw the right charter. This is run-to-run variance in charter selection, not evidence about static analysis.

Why the cost regressed (and most of it is methodology overhead, not steady-state cost)

The +$16.47 / +30% looks bad but decomposes into one-time experimental overhead vs structural penalty:

Methodology overhead (~$10–12, experiment-only)

  • Double recon. The first recon attempt was source-contaminated: the recon Tester read plugin source files mid-mapping, which would have leaked source-grounded knowledge into charter generation and defeated the experiment. We threw it away (saved as recon.md.contaminated-attempt-1) and ran a second, clean recon under stricter prompt constraints. Pilot 15 ran recon once. Most of the gap between Pilot 16's 20 subagent sessions and Pilot 15's 13 comes from this: the extra recon attempt, validator round-trips to enforce decontamination, and redundant dispatches as the Manager second-guessed itself.
  • Validator roundtrip on recon output. To enforce source-decontamination we had to inspect recon's output and sometimes re-dispatch. This is pure overhead from running the pilot as a controlled experiment.

Structural penalty (~$3–5, would persist in any steady-state recon-only configuration)

  • Testers explore instead of verify. With static analysis, Phase 3 charters carry numbered hypotheses pointing at specific source-grounded probes (e.g., "save handler writes to magellan_backups_email, form reads from mb_schedule_email — verify round-trip in admin.php?page=mb-schedule"). The Tester opens the form, saves, observes mismatch — done in 2–3 turns. Without source-grounded hypotheses, the Tester has to discover the bug by exploring the form's full lifecycle, which takes 6–10 turns. Schedule-feature-cluster Tester used 65 messages in Pilot 16 vs an equivalent verification cluster at ~40 messages in Pilot 15.
  • Per-Tester message counts climbed. Pilot 16 Sonnet Testers averaged 1027 messages / 7 wave-Testers = 147 msg/Tester. Pilot 15 averaged 864 / 7 = 123 msg/Tester. The +24 msg/Tester premium × 7 Testers × Sonnet rate is the structural penalty.
  • MCP serialization rule still applies. Wave wallclock is bounded by the longest Tester. In Pilot 16, exploration-mode Testers run longer, so the longest-Tester ceiling rose by ~3–5 min, which is a large share (roughly a third to just over half) of the +9 min wallclock delta.

Net

Methodology overhead is one-time noise. A clean steady-state recon-only configuration would project to ~$59–61 (Pilot 15's $55.59 plus the ~$3–5 structural penalty). Recon-only is slightly more expensive than Phase 1.5 + recon, not less, because the work doesn't disappear: it migrates from the planner into Tester exploration, where it consumes far more tokens (high-volume Sonnet exploration vs. Opus synthesis at much smaller volume).
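The decomposition can be sanity-checked with a few lines of arithmetic. This sketch takes the two estimated ranges from this section at face value (they are estimates, not measured values) and checks that together they bracket the observed delta; the variable names are illustrative.

```python
PILOT15_TOTAL = 55.59
PILOT16_TOTAL = 72.06

# Estimated ranges (low, high) in dollars, as quoted in this section.
methodology_overhead = (10.0, 12.0)
structural_penalty = (3.0, 5.0)

observed_delta = round(PILOT16_TOTAL - PILOT15_TOTAL, 2)

# Combined range the two estimates can explain.
explained = (
    methodology_overhead[0] + structural_penalty[0],
    methodology_overhead[1] + structural_penalty[1],
)

# Steady-state projection: Pilot 15 plus only the structural penalty.
steady_state = tuple(round(PILOT15_TOTAL + p, 2) for p in structural_penalty)

print("observed delta:", observed_delta)
print("explained range:", explained)
print("steady-state projection:", steady_state)
```

The observed +$16.47 falls inside the $13–17 range the two estimates cover, and the projection lands at roughly $58.59 to $60.59.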

Recall delta — Issue 9 in detail

magellan-backups ships a known scale-sensitive bug in class-mb-export.php:

// SELECT * with no batching — materializes entire table into memory
$results = $wpdb->get_results( "SELECT * FROM {$wpdb->posts}", ARRAY_A );
foreach ( $results as $row ) {
    fwrite( $sql_file, $this->build_insert_statement( $row ) );
}

On a fresh WordPress install (~6 posts, ~3 pages, ~1 user), this completes in 100ms with no observable issues. On a production site with 50K posts, it OOMs PHP and the export silently truncates.

Pilot 15 caught this via Phase 1.5: static analysis flagged the get_results() pattern, the planner emitted a scale-sensitive-cluster charter with explicit "seed at production scale (10K+ posts) and re-run export" instructions, the Tester executed it, the export crashed at memory_limit, problem filed.

Pilot 16 missed this. The recon Tester observed exports complete normally (it had no reason to seed a 10K-post fixture). No source means no signal that the export code's memory use is structurally O(n). The planner generated export-artifact-andlist charters focused on artifact contents, naming, and lifecycle (the artifact AND-list anchors a1–a6), which is perfectly reasonable from recon-only inputs but blind to the scale dimension.

This is the cleanest empirical evidence in the pilot series so far for what static analysis uniquely contributes to recall. Every other planted bug is reachable from runtime observation if the Tester explores deeply enough. Scale-sensitivity is not, at least on a fresh fixture where nothing prompts the Tester to seed production-scale data.
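To make the memory argument concrete, here is a language-neutral analogue in Python with sqlite3. It is not the plugin's PHP and not a proposed patch; the `posts` table, the helper names, and the batch size are all assumptions chosen for illustration. It contrasts a fetch-everything export with a batched one and reports how many rows each holds in memory at its peak.

```python
import sqlite3


def make_db(n_rows):
    """Build an in-memory stand-in for a posts table with n_rows rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO posts (title) VALUES (?)",
        ((f"post {i}",) for i in range(n_rows)),
    )
    return conn


def export_unbatched(conn):
    """Analogue of the flagged pattern: every row is resident at once."""
    rows = conn.execute("SELECT * FROM posts").fetchall()
    return len(rows), len(rows)  # (rows exported, peak rows held)


def export_batched(conn, batch_size=500):
    """Stream the table in fixed-size batches; peak residency is bounded."""
    cursor = conn.execute("SELECT * FROM posts ORDER BY id")
    exported, peak = 0, 0
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        peak = max(peak, len(batch))
        exported += len(batch)  # a real exporter would write INSERTs here
    return exported, peak


small, large = make_db(10), make_db(10_000)
print("unbatched:", export_unbatched(small), export_unbatched(large))
print("batched:  ", export_batched(large))
```

On the 10-row table both strategies behave identically, which is why a Tester on a fresh fixture sees nothing wrong. Only at 10,000 rows does the unbatched peak grow with the table while the batched peak stays at the batch size.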

Per-charter PQIP

| Charter | Type | Priority | Bugs (P) | Q | I | Pr | Severity breakdown |
|---|---|---|---|---|---|---|---|
| backup-artifact-andlist | andlist | critical | 6 | 1 | 4 | 0 | 2 critical, 3 major, 1 minor |
| export-artifact-andlist | andlist | critical | 6 | 1 | 2 | 0 | 2 critical, 2 major, 2 minor |
| upload-restore-destructive-andlist | andlist | critical | 4 | 1 | 2 | 2 | 1 critical, 1 major, 2 minor |
| schedule-feature-cluster | hypothesis-cluster | high | 3 | 1 | 1 | 1 | 3 major |
| concurrent-trigger-seam | cross-feature | high | 1 | 1 | 1 | 0 | 1 major |
| csrf-cross-cutting | hypothesis-cluster | high | 1 | 1 | 2 | 1 | 1 major |
| breadth-tour-admin | breadth | medium | 4 | 1 | 2 | 3 | 1 major, 3 minor |
| breadth-tour-frontend-lifecycle | breadth | medium | 3 | 1 | 3 | 1 | 1 critical, 1 major, 1 minor |
| Totals | | | 28 | 8 | 17 | 8 | 6 critical, 13 major, 9 minor |

8/8 charters complete. 1 schema-validation failure during aggregation (recovered via re-validate); zero session-level failures.

Token routing — exact

Manager (Opus 4.7, 88 messages)

| Metric | Tokens | $ |
|---|---|---|
| input | 218 | $0.00 |
| output | 191,578 | $2.87 |
| cc5m (5-min cache write) | 0 | $0.00 |
| cc1h (1-hour cache write) | 138,434 | $1.38 |
| cr (cache read) | 59,093,214 | $31.46 |
| Manager total | | $35.72 |

Subagents — 20 sessions (recon ×2 + planner ×1 + 8 wave Testers + dispatch overhead)

| Model | Sessions | Messages | Output tok | cc5m | cr | $ |
|---|---|---|---|---|---|---|
| Sonnet 4.6 (Testers + recon) | 19 | 1,027 | 191,677 | 1,613,963 | 78,980,910 | $32.63 |
| Opus 4.7 (planner) | 1 | 29 | 39,916 | 363,083 | 890,595 | $3.71 |
| Subagent total | 20 | 1,056 | | | | $36.34 |

Cost by category (whole run)

| Category | $ |
|---|---|
| input | $0.00 |
| output | $8.66 |
| cc5m | $8.32 |
| cc1h | $1.38 |
| cr | $53.69 |
| Total | $72.06 |

Cache reads dominate (74% of total): the Manager's conversation context is large (driver SKILL files, recon.md, all 8 charter briefs, validator output) and is re-read from cache on every turn as the wave progresses.
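The category shares follow directly from the whole-run table. A small sketch, working in integer cents to avoid floating-point drift; the figures are copied from the table and the variable names are illustrative.

```python
# Whole-run cost by category, in cents, copied from the table above.
cost_cents = {"input": 0, "output": 866, "cc5m": 832, "cc1h": 138, "cr": 5369}
REPORTED_TOTAL_CENTS = 7206

summed = sum(cost_cents.values())

# Share of the reported total per category, in percent.
shares = {
    name: round(100 * cents / REPORTED_TOTAL_CENTS, 1)
    for name, cents in cost_cents.items()
}

print("sum of categories (cents):", summed)
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:>6}: {pct}%")
```

The per-category figures sum to one cent less than the reported total, which is consistent with each row having been rounded independently.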

Per-Tester wave timings (rough — derived from subagent first_timestamp)

The wave dispatched at ~18:51:48Z and the longest Tester finished at ~19:08:18Z. Wave wallclock: ~16 min (gated by the longest Tester per the MCP serialization rule).

| Session (ranked by message count; no per-charter ID exposed) | Messages | Cost | Notes |
|---|---|---|---|
| 1 | 100 | $4.10 | Likely longest charter (andlist or breadth-frontend) |
| 2 | 94 | $2.44 | |
| 3 | 86 | $2.19 | |
| 4 | 84 | $3.36 | Higher cost/msg suggests Opus turns mixed in |
| 5 | 77 | $2.76 | |
| 6 | 72 | $2.31 | |
| 7 | 71 | $1.88 | |
| 8 | 68 | $1.94 | |
| 9 | 67 | $2.29 | |
| 10 | 65 | $1.68 | |
| 11 | 64 | $2.03 | |
| 12 | 64 | $1.65 | |
| 13 | 52 | $1.74 | |
| 14 | 25 | $0.86 | |
| 15 | 18 | $2.92 | High cost / low msg = planner phase |
| 16 | 15 | $0.49 | |
| 17 | 13 | $0.49 | |
| 18 | 11 | $0.79 | |
| 19 | 9 | $0.38 | Validator roundtrip |
| 20 | 1 | $0.03 | Schema-validation re-run |

The bottom 7 sessions (≤25 msgs) are validator roundtrips, dispatch retries, and the second clean recon, i.e. the methodology-overhead population, with the exception of row 15, whose cost profile points to the planner phase.
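The per-session rows can be cross-checked against the subagent totals and split at the 25-message threshold. A minimal sketch with the rows copied from the table above; costs are in integer cents, and the threshold constant and variable names are illustrative.

```python
# (messages, cost in cents) per subagent session, copied from the table above.
sessions = [
    (100, 410), (94, 244), (86, 219), (84, 336), (77, 276),
    (72, 231), (71, 188), (68, 194), (67, 229), (65, 168),
    (64, 203), (64, 165), (52, 174), (25, 86), (18, 292),
    (15, 49), (13, 49), (11, 79), (9, 38), (1, 3),
]
THRESHOLD = 25  # sessions at or below this message count

short = [s for s in sessions if s[0] <= THRESHOLD]

total_messages = sum(m for m, _ in sessions)
total_cents = sum(c for _, c in sessions)
short_messages = sum(m for m, _ in short)
short_cents = sum(c for _, c in short)

print("sessions:", len(sessions), "messages:", total_messages, "cents:", total_cents)
print("short sessions:", len(short), "messages:", short_messages, "cents:", short_cents)
```

The message counts sum to the reported 1,056, and the per-session costs sum to within a cent of the reported $36.34. The seven short sessions account for 92 messages and about $5.96, of which $2.92 is the single high-cost row.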

Cross-pilot arc

| Pilot | Driver | Phase 1.5 | Mgr cost | Sub cost | Total | Recall | Notes |
|---|---|---|---|---|---|---|---|
| 11 | Chrome DevTools headed | | | | $102.90 | 10/10 | Item E first run |
| 12 | Chrome DevTools headed | | | | $45.30 | 8/10 | Sonnet-Mgr first run; charter-design misses |
| 13 | Chrome DevTools headless | | $8.10 | $30.37 | $38.47 | 9/10 | a-anchor drift miss |
| 14 | Playwright CLI (headed) | | $25.42 | $27.64 | $53.06 | 10/10 | First clean recall under new arch |
| 15 | Playwright CLI (headed) | | $24.57 | $31.02 | $55.59 | 9/10 | Issue 3 missed (charter-design noise) |
| 16 | Playwright CLI (headed) | | $35.72 | $36.34 | $72.06 | 9/10 | Recon-only counterfactual; Issue 9 (scale) missed; +30% cost from double-recon methodology overhead |

What Pilot 16 settled

  1. Static analysis uniquely contributes scale-sensitivity catches. Issue 9 (DB memory exhaust on SELECT * against unbounded tables) is invisible to runtime exploration on a fresh fixture and visible only to source-grounded analysis. This is now empirically grounded, not asserted.

  2. Recon-only is not free. Skipping Phase 1.5 saves ~30s of Manager Opus time but loads a 3–5 min Tester-exploration penalty downstream + 1/10 recall hole on the scale dimension. Recon-only is slightly more expensive than Phase 1.5 + recon at steady state, not cheaper.

  3. 9/10 is reachable from recon alone on this fixture, which means most of the bugs Magellan catches are not gated by static analysis — they're gated by good recon, anchored charters, and Tester exploration discipline. Phase 1.5 is the marginal +1 issue insurance, not the engine.

Open questions

  • Does scale-sensitivity always need static analysis? A planner that tags any "data-export-shaped" recon observation with seed-at-scale could in principle catch Issue 9 from recon alone. The current planner doesn't do this. Worth a Pilot 17 to test.
  • Are there other bug categories like Issue 9? Race conditions, lock-ordering bugs, and silent-failure paths might be similarly invisible to runtime exploration. Worth seeding a more exotic fixture and re-running both pilots.
  • Should Phase 1.5 be conditional? If recon flags any artifact-producing or data-export feature, run static analysis. If recon flags only UI/admin-state features, skip it. Worth instrumenting.
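The first and third questions above describe a simple gating rule, which can be sketched as follows. Everything here is hypothetical: the tag vocabulary, the `ReconObservation` type, and both function names are invented for illustration and do not describe the current Magellan planner.

```python
from dataclasses import dataclass, field

# Hypothetical feature tags a recon pass might attach to an observation.
SCALE_SENSITIVE_TAGS = {"data-export", "artifact-producing"}


@dataclass
class ReconObservation:
    feature: str
    tags: set = field(default_factory=set)


def needs_static_analysis(observations):
    """Run Phase 1.5 only if some feature looks export- or artifact-shaped."""
    return any(obs.tags & SCALE_SENSITIVE_TAGS for obs in observations)


def seed_at_scale_targets(observations):
    """Features a planner could tag for a seed-at-scale probe."""
    return [obs.feature for obs in observations if obs.tags & SCALE_SENSITIVE_TAGS]


recon = [
    ReconObservation("schedule form", {"admin-state"}),
    ReconObservation("SQL export", {"data-export", "artifact-producing"}),
    ReconObservation("settings page", {"ui"}),
]

print("run Phase 1.5:", needs_static_analysis(recon))
print("seed-at-scale targets:", seed_at_scale_targets(recon))
```

Under this rule a fixture with only UI and admin-state features would skip static analysis, while any export-shaped feature would trigger both Phase 1.5 and a scale probe. Whether recon can assign such tags reliably is exactly what a follow-up pilot would need to test.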

Reproducibility

  • Run dir: runs/2026-04-27T18-25-45_magellan-backups/
  • Mission snapshot: runs/<id>/mission.md
  • Recon snapshots: recon.md (clean) + recon.md.contaminated-attempt-1 (discarded)
  • Charter set: runs/<id>/charters/ (8 files)
  • Reports: runs/<id>/sessions/<charter>/report.json
  • Token usage: runs/<id>/token-usage.json
  • Aggregated: runs/<id>/final-report.md
  • Experiment note in manifest: "Pilot 16 — recon-only counterfactual to Pilot 15. Phase 1.5 (static analysis) intentionally skipped. Planner-opus FORBIDDEN from reading any source code in Phase 3. Charter generation must rely solely on recon.md + WP-core knowledge."