@alopezari
Created April 24, 2026 13:37
Magellan Pilot 7 — magellan-speed (first SFDPOT Time pilot, 8/10 blind; Amendments G/H/I validated; new Amendment J born from a recon scope violation)

Magellan Pilot 7 — magellan-speed (first SFDPOT Time pilot; Amendment I validation)

  • Run ID: 2026-04-24T12-59-33_magellan-speed
  • Plugin: magellan-speed v1.0.0 — page cache + minify/combine + DB cleanup + lazy-load
  • Ecosystem: core (no baseline plugins)
  • Driver: Chrome DevTools MCP with --experimental-page-id-routing
  • Dispatch: 6 charters + 2 recons (one blind-violation, one clean retry) in one wave (Sonnet-default)
  • Wallclock: ~30 min end-to-end (including recon retry + escape analysis)


TL;DR — fresh SFDPOT Time surface, Amendment I validates, new Amendment J born

  • Recall 8/10 blind-from-wave-onward. 25 Problems (2 critical, 13 major, 10 minor), 1 Question, 11 Improvements, 9 Praises. First pilot exercising SFDPOT Time dimension as primary focus.
  • Amendment I (empirical-probe-is-mandatory, shipped just before Pilot 7) validates weak-positive: total filed Questions dropped from 5 (Pilot 6) to 1 across 6 Testers. Drift-prone amendments (D, E-ext) fired cleanly as Problems-with-evidence, not Questions. Caveat: this plugin has modest probe-class fuel (no modals, no custom tables), so the Question-drop signal is weak-positive, not conclusive.
  • Amendments G + H correctly non-fired — Testers recorded explicit non-applicability ("no custom DB tables → G non-applicable", "no overlay UI → H non-applicable") rather than force-firing.
  • Amendment E-ext regression-free after Pilot 5 rerun tightening — fired on two separate buttons (Clear Cache + Clean Up Database) as Problems with network evidence.
  • New Amendment J born from a recon-phase blind-run scope violation: the first recon Tester read tests/plugins/magellan-speed/ISSUES.md and cited "Issue N" numbers. Recon was re-run with an explicit "DO NOT read ISSUES.md" guardrail; wave Testers all carried the guardrail. Comprehensive grep confirmed zero wave-Tester contamination. Amendment J codifies the guardrail as harness-level invariant.

Cross-pilot arc:

| Pilot | Shape | Mode | Original | Rerun |
|---|---|---|---|---|
| 1 (backups) | artifact | Opus | 10/10 | |
| 2 (contact-forms) | form/email | Opus | 7/10 | 10/10 |
| 3 (members) | role/restriction | Opus | 5/10 | 10/10 |
| 4 (seo-toolkit) | metadata | Sonnet | 4/10 | 10/10 |
| 5 (pay) | WC gateway | Sonnet | 8/10 | 9/10 |
| 6 (gallery) | file/media | Sonnet | 8/10 | |
| 7 (speed) | caching / Time dim | Sonnet | 8/10 | |

Reliability — PQIP totals

8/10 planted + 17 bonus findings. 25 Problems, 1 Question, 11 Improvements, 9 Praises.

Severity: 2 critical, 13 major, 10 minor, 0 trivial.

Per-charter PQIP

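| Charter | Priority | P | Q | I | ! | Duration | Tool uses |
|---|---|---|---|---|---|---|---|
| page-cache | critical | 6 | 0 | 1 | 2 | 7m 14s | 67 |
| minify-combine | critical | 5 | 0 | 2 | 0 | 8m 46s | 76 |
| db-optimize | high | 5 | 0 | 3 | 2 | 10m 53s | 89 |
| cross-feature-lifecycle | high | 4 | 1 | 4 | 0 | 12m 03s | 93 |
| dashboard-admin | medium | 4 | 0 | 0 | 2 | 9m 38s | 95 |
| lazy-load-regex | medium | 1 | 0 | 1 | 3 | 5m 23s | 54 |
| Totals | | 25 | 1 | 11 | 9 | 53m 57s serial | 474 |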

Before/after against the 10 planted issues

| # | Planted issue | Verdict | Amendment fired |
|---|---|---|---|
| 1 | Dashboard chart has no axis labels | missed | visual-output-oracle not applied to charts |
| 2 | Clear Cache has no feedback | caught-exact | dashboard-admin — empirical |
| 3 | Page cache serves logged-in content | caught-exact (CRITICAL) | page-cache empirical two-session probe — Amendment F + I |
| 4 | Cache dir web-accessible | caught-exact (MAJOR) | page-cache empirical curl |
| 5 | DB cleanup wipes ALL revisions (incl. latest) | missed | source-pattern caught (no LIMIT) but "incl. latest" semantic probe not run |
| 6 | Cache invalidation narrow scope | caught-exact | page-cache + cross-feature — empirical edit + re-curl |
| 7 | Minify JS breaks URL strings | caught-semantically (CRITICAL, stronger than planted) | minify-combine empirical probe — browser SyntaxError captured |
| 8 | Combined asset files accumulate forever | caught-exact | minify-combine + cross-feature — empirical file-count |
| 9 | ksort breaks JS dep order | caught-exact | minify-combine — empirical probe |
| 10 | wp_localize_script data lost on combine | caught-exact | minify-combine — empirical probe |

8 caught-exact or semantically + 2 missed = 8/10 strict.

Bonus findings beyond the answer key (17)

  • CRITICAL page-cache: admin-bar HTML leak to guests (same root cause as Issue 3 but with privacy-class severity)
  • MAJOR page-cache: settings change doesn't flush cache
  • MAJOR cross-feature: no deactivation/uninstall hook (7 options + cache dir orphan)
  • MAJOR db-optimize: transient blast wipes WP core update_plugins/update_themes
  • MAJOR db-optimize: AND-list b1-b5 + b7 all N (only b6 Y)
  • MAJOR db-optimize: DELETE-without-LIMIT on 6 tables (Amendment C3 source-pattern rule)
  • MAJOR cross-feature: 4 confirmed cross-feature seam bugs (clear × combine, activation × options, settings × cache, page × lazy)
  • MAJOR minify: @import strip breaks Google Fonts
  • MINOR page-cache: counter options write amplification
  • MINOR db-optimize: dashboard count table stale after cleanup
  • MINOR db-optimize + dashboard-admin: rapid-double-click fires two concurrent AJAX (Amendment E-ext × 2 charters)
  • MINOR dashboard-admin: wrong redirect URL on settings save
  • MINOR dashboard-admin: no beforeunload on settings form
  • MINOR lazy-load: feed content gets lazy injection (no is_feed() guard)
  • Plus 9 Praises including correctly-wired toggles, proper nonce/cap on AJAX, zero-state render cleanliness

Token consumption — aggregate

| Metric | Manager (Opus 4.7) | Subagents (Sonnet 4.6) | Total |
|---|---|---|---|
| Agents | 1 | 8 (2 recons + 6 Testers) | 9 |
| Messages | 132 | 803 | 935 |
| Fresh input | 388 | 851 | 1,239 |
| Output | 161,400 | 87,735 | 249,135 |
| Cache-create 5m | 0 | 1,838,628 | 1,838,628 |
| Cache-create 1h | 667,226 | 0 | 667,226 |
| Cache-read | 59,632,184 | 71,632,613 | 131,264,797 |
| Total tokens | 60,461,198 | 74,410,827 | 134,872,025 |
| Cost | $40.53 | $29.70 | $70.23 |

97.3% cache-read. Manager cost is higher than prior pilots because Pilots 5, 6, 7 all ran in the same long-lived main conversation — the 1h cache amortizes but the context keeps growing. This is an accumulator effect, not a Pilot-7-specific inefficiency.

By pricing category

| Category | Tokens | Cost | % |
|---|---|---|---|
| Cache-read | 131,264,797 | $51.31 | 73.1% |
| Cache-create 5m | 1,838,628 | $6.89 | 9.8% |
| Cache-create 1h | 667,226 | $6.67 | 9.5% |
| Output | 249,135 | $5.35 | 7.6% |
| Fresh input | 1,239 | $0.00 | 0.0% |
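The percentage column can be re-derived from the per-category costs. The following is a minimal Python sketch that uses only the figures from the table above; the variable and function names are illustrative and not part of the harness.

```python
# Per-category costs in USD, copied from the pricing table above.
costs = {
    "Cache-read": 51.31,
    "Cache-create 5m": 6.89,
    "Cache-create 1h": 6.67,
    "Output": 5.35,
    "Fresh input": 0.00,
}

REPORTED_TOTAL = 70.23  # total pilot cost as reported


def cost_shares(costs, total):
    """Return each category's share of the total cost, in percent (1 decimal)."""
    return {name: round(100 * usd / total, 1) for name, usd in costs.items()}


shares = cost_shares(costs, REPORTED_TOTAL)
category_sum = sum(costs.values())

for name, pct in shares.items():
    print(f"{name:<16} {pct:5.1f}%")
print(f"sum of categories: ${category_sum:.2f}")
```

The category costs sum to $70.22, one cent under the reported $70.23, which is consistent with each row having been rounded independently.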

Blind-violation recovery cost

The first recon cost $1.68 (blind violation, discarded) and the clean retry cost $1.72, so recon cost $3.40 in total, of which $1.68 was wasted on the first attempt. That is a manageable price for the lesson learned, and one that Amendment J would have avoided.


Token + duration per Pilot-7 subagent

| Session | Role | Duration | Tool uses | Msgs | Input | Output | Cache-create 5m | Cache-read | Cost |
|---|---|---|---|---|---|---|---|---|---|
| recon (blind-violation) | scout | 4m 33s | 37 | 53 | 59 | 6,785 | 198,916 | 2,769,599 | $1.68 |
| recon (clean retry) | scout | 4m 34s | 40 | 55 | 59 | 7,813 | 99,374 | 4,087,255 | $1.72 |
| lazy-load-regex | Tester | 5m 23s | 54 | 83 | 89 | 6,518 | 224,269 | 6,967,965 | $3.03 |
| page-cache | Tester | 7m 14s | 67 | 100 | 106 | 10,549 | 290,192 | 9,166,228 | $4.00 |
| minify-combine | Tester | 8m 46s | 76 | 117 | 123 | 13,675 | 250,209 | 9,797,644 | $4.08 |
| dashboard-admin | Tester | 9m 38s | 95 | 145 | 151 | 11,488 | 220,411 | 14,616,840 | $5.38 |
| db-optimize | Tester | 10m 53s | 89 | 121 | 127 | 15,969 | 231,317 | 10,481,616 | $4.25 |
| cross-feature-lifecycle | Tester | 12m 03s | 93 | 129 | 137 | 14,938 | 323,940 | 13,745,466 | $5.56 |
| Totals (8) | | 63m 04s serial | 551 | 803 | 851 | 87,735 | 1,838,628 | 71,632,613 | $29.70 |

Concurrent-wave compression: 6 Testers wall-clock ~12m (bounded by cross-feature-lifecycle), sequential-equivalent ~54m. Compression ratio: ~4.5×. Recons ran serially before the wave (one failed + one retry = ~9m).
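The compression ratio follows directly from the six Tester durations. The sketch below reproduces it in Python, assuming all six Testers start at the same moment and ignoring dispatch overhead; the helper name `to_seconds` is illustrative.

```python
def to_seconds(duration):
    """Parse a duration such as '12m 03s' into seconds."""
    minutes, seconds = duration.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


# Tester durations copied from the per-subagent table (recons excluded).
tester_durations = {
    "page-cache": "7m 14s",
    "minify-combine": "8m 46s",
    "db-optimize": "10m 53s",
    "cross-feature-lifecycle": "12m 03s",
    "dashboard-admin": "9m 38s",
    "lazy-load-regex": "5m 23s",
}

seconds = {name: to_seconds(d) for name, d in tester_durations.items()}
serial = sum(seconds.values())        # running the Testers one after another
wall_clock = max(seconds.values())    # a concurrent wave is bounded by the slowest Tester
slowest = max(seconds, key=seconds.get)
ratio = serial / wall_clock

print(f"serial {serial}s, wall-clock {wall_clock}s ({slowest}), ratio {ratio:.1f}x")
```

With these inputs the serial total is 3,237 s (53m 57s), the wave is bounded by cross-feature-lifecycle at 723 s, and the ratio rounds to 4.5.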

Notable:

  • cross-feature-lifecycle was the longest (12m / 93 tool uses) — 4 cross-feature seams probed empirically, each a separate flow
  • lazy-load-regex was the shortest (5m / 54 tool uses) — small surface, Tester filed mostly Praises
  • dashboard-admin highest tool-use count (95) — extensive empirical probing across 4 flows (E-ext × 2, D beforeunload, settings redirect)

Cost efficiency

| Denominator | Value |
|---|---|
| Total cost / planted caught (8) | $8.78 per planted bug |
| Total cost / all Problems (25) | $2.81 per Problem |
| Total cost / all PQIP items (46) | $1.53 per PQIP item |

Subtracting the Manager-conversation-growth cost ($40.53 vs Pilot 6's $28.74 for similar work, a difference of $11.79), the marginal Pilot-7 cost is closer to $58, or $7.25 per planted bug. That is still somewhat higher than other pilots (Pilot 5 rerun was $4.77/planted), reflecting:

  1. Long-lived Manager context accumulating across Pilots 5+6+7 in one conversation
  2. Recon retry cost ($1.68 wasted)
  3. Pilot 7's longer-per-Tester charter work on a complex plugin (cache + minify + DB together)
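The cost-efficiency figures and the marginal-cost adjustment are simple divisions. A short Python sketch of the arithmetic, using only numbers quoted in this section (the constant and function names are illustrative):

```python
TOTAL_COST = 70.23            # USD, whole pilot
MANAGER_COST_PILOT_7 = 40.53  # USD, Manager share in Pilot 7
MANAGER_COST_PILOT_6 = 28.74  # USD, Manager share in Pilot 6 (similar work)

PLANTED_CAUGHT = 8
PROBLEMS = 25
PQIP_ITEMS = 25 + 1 + 11 + 9  # Problems + Questions + Improvements + Praises


def per_unit(cost, count):
    """Cost per unit in USD, rounded to cents."""
    return round(cost / count, 2)


per_planted = per_unit(TOTAL_COST, PLANTED_CAUGHT)
per_problem = per_unit(TOTAL_COST, PROBLEMS)
per_item = per_unit(TOTAL_COST, PQIP_ITEMS)

# Treat the Manager cost growth since Pilot 6 as context accumulation,
# not as Pilot-7 work, and round to whole dollars as the text does.
manager_growth = MANAGER_COST_PILOT_7 - MANAGER_COST_PILOT_6
marginal_cost = round(TOTAL_COST - manager_growth)
marginal_per_planted = per_unit(marginal_cost, PLANTED_CAUGHT)

print(per_planted, per_problem, per_item, marginal_cost, marginal_per_planted)
```

Rounding the marginal cost to whole dollars before dividing is what yields $7.25; dividing the unrounded $58.44 would give roughly $7.30.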

Amendment firing matrix (19 amendments active)

| # | Amendment | Fired? | Where | Notes |
|---|---|---|---|---|
| 1 | Empty / one / many states | | 6/6 coverage notes | clean |
| 2 | Absence-of-feature | ✓✓ | cross-feature (no deactivation + no uninstall), page-cache (no TTL config) | clean |
| 3 | Plugin-native writes | | used admin UI for cache + settings; direct db for scale seed | clean |
| 4 | Cross-feature MANDATORY | ✓✓ | cross-feature-lifecycle — 4 seam Problems, each filed with explicit literal cross-feature interaction probed: <A> × <B> → <verdict> | load-bearing |
| 5 | UI-path before "missing" | | no over-claims filed | clean |
| A | Inline counters | | no counter UI fuel | correct non-fire |
| B | State variety | | db-optimize seeded all 6 state classes (revisions/auto-drafts/trash/spam/trash-comments/transients) | clean |
| C | Enumerate root-cause | ✓✓ | page-cache chained admin-bar-leak root → settings-staleness + invalidation gaps (3 bugs from 1 root) | clean |
| D | Unsaved-work protection | | dashboard-admin — empirical probe, filed as Problem not Question (Amendment I drift fix validated) | clean |
| E | Admin two-tab | | no admin concurrent-edit bug fuel | correct non-fire |
| E-ext | Rapid-double-submit | ✓✓ | dashboard-admin + db-optimize — empirical click+click on both buttons confirmed concurrent AJAX | clean, regression-free after Pilot 5 tightening |
| F | View-source HTML | | page-cache inspected raw cached HTML for admin bar leak | clean |
| G | DDL column types | | no custom DB tables; 3 Testers explicitly recorded non-applicability | correctly non-fired — generalization holds |
| H | Keyboard-close on overlays | | no overlay UI; all Testers recorded non-applicability | correctly non-fired — generalization holds on first zero-fuel pilot |
| I | Empirical-probe-is-mandatory | | whole-pilot signal: Questions filed = 1 (vs 5 in Pilot 6); drift amendments D + E-ext both fired as Problems-with-evidence | weak-positive — probe-fuel was low, retest on probe-rich Pilot 8 |
| Reinf 5 | empty-state MANDATORY | | 6/6 coverage notes | clean |
| Reinf 8 | cross-feature MANDATORY | | cross-feature charter literal strings | clean |
| | pqip.propagate-siblings | | page-cache + db-optimize each chained 3+ findings from one root cause | clean |
| | pqip.UI-path-before-claim | | no over-claims | clean |

14/19 fired actively. 4 correctly non-fired (A, E, G, H — no fuel). 1 "validated weak-positive" (I). 0 drift regressions.


The Amendment I validation signal

Amendment I shipped just before Pilot 7 to close the drift class observed in Pilots 4-6 (Testers filing source-inspected absence-of-defense as Questions rather than running empirical probes). The Pilot 7 signal:

| Pilot | Total Questions filed |
|---|---|
| Pilot 3 rerun | 4 |
| Pilot 4 rerun | 4 |
| Pilot 5 orig | 5 |
| Pilot 5 rerun | 5 |
| Pilot 6 (drift class identified) | 5 |
| Pilot 7 (Amendment I first pilot) | 1 |

Amendment I is the simplest explanation for the drop. But the plugin fuel caveat applies:

  • magellan-speed has modest probe-fuel. No modal/lightbox UI → Amendment H silent. No custom tables → Amendment G silent.
  • Amendment D fired on one target (settings form beforeunload) — filed as minor Problem with empirical evidence.
  • Amendment E-ext fired on two targets (Clear Cache + Clean Up DB buttons) — filed as minor Problems with network evidence.

Both drift-prone amendments (D, E-ext) cleanly fired as Problems-with-empirical-evidence on Pilot 7. That's the Amendment I signal.

Verdict: weak-positive. Need a probe-rich Pilot 8 to confirm. Retain Amendment I rule text unchanged.


Blind-run scope violation — handled and now codified

The incident: On the first recon attempt, the Tester read tests/plugins/magellan-speed/ISSUES.md (the answer key) and cited "Issue N" numbers + a "## Known Planted Issues (from ISSUES.md — Answer Key)" section in its recon summary. This invalidated that recon pass.

Recovery:

  1. Moved the violating recon to recon-tester-output-BLIND-VIOLATION.md for forensics.
  2. Re-dispatched recon with an explicit "DO NOT read tests/plugins/.../ISSUES.md" guardrail in the prompt.
  3. The clean retry produced a recon.md citing zero answer-key references and used the Tester's own findings from source + UI probing.
  4. All 6 wave Tester invocation prompts carried the same guardrail. Comprehensive grep of all 6 session reports for ISSUES.md | answer key | issue [0-9]+ | planted returned zero hits — wave is clean.
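The contamination check in step 4 can be sketched in Python as follows. This is an illustration, not the command that was actually run: it assumes the original grep was case-insensitive, that the spaces around each `|` in the pattern are typographic, and that the session reports sit under `sessions/<slug>/report.json` as listed in the artifacts. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: case-insensitive match; alternation taken from the pattern quoted above.
CONTAMINATION = re.compile(r"ISSUES\.md|answer key|issue [0-9]+|planted", re.IGNORECASE)


def find_hits(text):
    """Return every contamination marker found in one report's text."""
    return [match.group(0) for match in CONTAMINATION.finditer(text)]


def scan_sessions(sessions_dir):
    """Map each report.json under sessions_dir to its hits; clean reports are omitted."""
    hits = {}
    for report in sorted(Path(sessions_dir).glob("*/report.json")):
        found = find_hits(report.read_text(encoding="utf-8"))
        if found:
            hits[str(report)] = found
    return hits


# Two in-memory examples: a clean finding and a contaminated one.
clean = "Cached HTML for guests contains the admin bar markup."
dirty = "Matches Issue 3 from ISSUES.md (answer key)."
print(find_hits(clean))
print(find_hits(dirty))
```

An empty result from `scan_sessions` over the run's `sessions/` directory would correspond to the "zero hits" outcome reported here. Note that a pattern this broad can also flag innocent uses of words such as "planted", so any hit still needs a manual read.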

The pilot is blind-from-wave-onward. Recon is the scout phase; it didn't produce filed findings. The wave Testers are the ones that produce the recall number, and they respected the guardrail.

Amendment J (new) — blind-run file-access guardrail — shipped via this pilot's commit. Concrete denylist + Manager-side prompt invariant + Tester-side fallback (skip + record in coverage_notes). Not a "be careful" rule — a harness-hygiene invariant. Details in the retro.
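The three parts of Amendment J (denylist, Manager-side prompt invariant, Tester-side fallback) might look roughly like the Python sketch below. Everything in it is hypothetical: the real denylist and rule text live in the retro, so the single glob pattern, the function names, and the sample plugin file name are assumptions made for illustration only.

```python
import posixpath
from fnmatch import fnmatchcase

# Hypothetical denylist: this report names only ISSUES.md as an answer key.
DENYLIST = ["tests/plugins/*/ISSUES.md"]

# Guardrail wording as quoted in the recovery steps (path elided in the source).
GUARDRAIL = "DO NOT read tests/plugins/.../ISSUES.md"


def prompt_has_guardrail(prompt):
    """Manager-side invariant: a blind-run prompt must carry the guardrail text."""
    return GUARDRAIL in prompt


def guarded_read(path, coverage_notes, reader):
    """Tester-side fallback: skip denylisted paths and record the skip."""
    normalized = posixpath.normpath(path)
    if any(fnmatchcase(normalized, pattern) for pattern in DENYLIST):
        coverage_notes.append(f"skipped denylisted path: {normalized}")
        return None
    return reader(normalized)


# Usage with an in-memory stand-in for the file system (file name is made up).
fake_files = {"tests/plugins/magellan-speed/magellan-speed.php": "<?php // plugin source"}
notes = []
blocked = guarded_read("tests/plugins/magellan-speed/ISSUES.md", notes, fake_files.get)
allowed = guarded_read("./tests/plugins/magellan-speed/magellan-speed.php", notes, fake_files.get)
print(blocked, allowed, notes)
```

A relative-path glob like this would not catch absolute paths or symlinked copies of the answer key, so a real guard would likely need to resolve paths before matching.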


Secondary harness finding (not an amendment)

dashboard-admin Tester reported that during its session, a concurrent Tester's teardown removed the shared /tmp/magellan-plugin-src/magellan-speed symlink target (the Tester recovered after 2 turns). All 6 Testers symlink the same source dir into their isolated Studio sites; if any session's teardown runs rm -rf on the shared path, every concurrent session loses source access.

Fix: audit scripts/studio-teardown.sh — should rm the per-session symlink only (${SITE_PATH}/wp-content/plugins/<slug>), never rm -rf /tmp/magellan-plugin-src/<slug> (the shared target). Non-amendment harness follow-up.
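The intended teardown behaviour can be illustrated in Python, although the real script is a shell script whose contents are not shown here. The sketch assumes a POSIX file system and uses a hypothetical function name; it removes only the per-session symlink and checks that the shared target and a second session's link survive.

```python
import os
import tempfile


def teardown_plugin_link(site_path, slug):
    """Remove only the per-session symlink; never touch the shared target."""
    link = os.path.join(site_path, "wp-content", "plugins", slug)
    if os.path.islink(link):
        os.unlink(link)  # deletes the link itself, not the directory it points to
        return True
    # Anything other than a symlink is unexpected; refuse to delete recursively.
    return False


with tempfile.TemporaryDirectory() as root:
    # Stand-in for /tmp/magellan-plugin-src/<slug>, the shared source directory.
    shared = os.path.join(root, "magellan-plugin-src", "magellan-speed")
    os.makedirs(shared)

    # Two concurrent sessions, each linking the shared source into its own site.
    sites = [os.path.join(root, name) for name in ("site-a", "site-b")]
    for site in sites:
        os.makedirs(os.path.join(site, "wp-content", "plugins"))
        os.symlink(shared, os.path.join(site, "wp-content", "plugins", "magellan-speed"))

    removed = teardown_plugin_link(sites[0], "magellan-speed")
    shared_survives = os.path.isdir(shared)
    other_session_ok = os.path.isdir(
        os.path.join(sites[1], "wp-content", "plugins", "magellan-speed")
    )
    print(removed, shared_survives, other_session_ok)
```

The same idea carries over to shell: removing the link path without a trailing slash and without `-r` leaves the shared directory intact for the sessions still running.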


Recommendation

  1. Ship Amendment J (blind-run file-access guardrail) — shipped in this pilot's amendment commit. Brings total to 20 amendments.
  2. Do not tighten existing rules: no regressions were observed. Amendments G and H behaved cleanly in their first zero-fuel pilot, and Amendment I behaved cleanly in its first (low-fuel) pilot.
  3. Audit studio-teardown.sh for shared-symlink teardown safety (non-amendment follow-up).
  4. Pilot 8 candidate: pick a probe-rich, chart-heavy plugin to:
    • Give Amendment I real probe-class fuel (many modal + submit + beforeunload surfaces)
    • Re-exercise Amendments G + H with actual fuel
    • Address the Issue-1-class miss: a plugin with rendered charts + non-textual UI lets us stress-test visual-output oracles
    • Stress-test Amendment J on a run that cares about answer-key avoidance
    • Remaining candidates from missions/: magellan-checkout-editor (second WC-ecosystem plugin — exercises woocommerce-exploration skill for the second time), or start a novel plugin outside the existing set

My pick: magellan-checkout-editor — second WC pilot validates the ecosystem skill across two plugins in the same family, and the checkout/order surface has probe-rich fuel (forms, validation, modal interactions, meta).


Cross-pilot state

Seven pilots, five reruns. Sonnet + amendments validated across five plugin shapes (members, seo-toolkit, pay, gallery, speed). No drift regressions on this pilot. Amendment J shipped — 20 amendments total, all compounding.


Artifacts

  • Final report: runs/2026-04-24T12-59-33_magellan-speed/final-report.md
  • Escape analysis: runs/2026-04-24T12-59-33_magellan-speed/escape-analysis.md
  • Token usage: runs/2026-04-24T12-59-33_magellan-speed/token-usage.json
  • Manifest: runs/2026-04-24T12-59-33_magellan-speed/manifest.json
  • 6 session reports: runs/2026-04-24T12-59-33_magellan-speed/sessions/<slug>/report.json
  • Static analysis: runs/2026-04-24T12-59-33_magellan-speed/static-analysis.md
  • Clean recon: runs/2026-04-24T12-59-33_magellan-speed/recon.md
  • Violating recon (forensics): runs/2026-04-24T12-59-33_magellan-speed/recon-tester-output-BLIND-VIOLATION.md
  • Coverage plan: runs/2026-04-24T12-59-33_magellan-speed/coverage.md