Magellan Pilot 7 — magellan-speed (first SFDPOT Time pilot; Amendment I validation)

Run ID: 2026-04-24T12-59-33_magellan-speed
Plugin: magellan-speed v1.0.0 (page cache + minify/combine + DB cleanup + lazy-load)
Ecosystem: core (no baseline plugins)
Driver: Chrome DevTools MCP with --experimental-page-id-routing
Dispatch: 6 charters + 2 recons (one blind-violation, one clean retry) in one wave (Sonnet-default)
Wallclock: ~30 min end-to-end (including recon retry + escape analysis)

TL;DR — fresh SFDPOT Time surface, Amendment I validates, new Amendment J born

Recall 8/10 blind-from-wave-onward. 25 Problems (2 critical, 13 major, 10 minor), 1 Question, 11 Improvements, 9 Praises. First pilot exercising SFDPOT Time dimension as primary focus.
Amendment I (empirical-probe-is-mandatory, shipped just before Pilot 7) validates weak-positive: total filed Questions dropped from 5 (Pilot 6) to 1 across 6 Testers. Drift-prone amendments (D, E-ext) fired cleanly as Problems-with-evidence, not Questions. Caveat: this plugin has modest probe-class fuel (no modals, no custom tables), so the Question-drop signal is weak-positive, not conclusive.
Amendments G + H correctly non-fired — Testers recorded explicit non-applicability ("no custom DB tables → G non-applicable", "no overlay UI → H non-applicable") rather than force-firing.
Amendment E-ext regression-free after Pilot 5 rerun tightening — fired on two separate buttons (Clear Cache + Clean Up Database) as Problems with network evidence.
New Amendment J born from a recon-phase blind-run scope violation: the first recon Tester read tests/plugins/magellan-speed/ISSUES.md and cited "Issue N" numbers. Recon was re-run with an explicit "DO NOT read ISSUES.md" guardrail, and all wave Testers carried the same guardrail. A comprehensive grep confirmed zero wave-Tester contamination. Amendment J codifies the guardrail as a harness-level invariant.

Cross-pilot arc:

Pilot	Shape	Mode	Original	Rerun
1 (backups)	artifact	Opus	10/10	—
2 (contact-forms)	form/email	Opus	7/10	10/10
3 (members)	role/restriction	Opus	5/10	10/10
4 (seo-toolkit)	metadata	Sonnet	4/10	10/10
5 (pay)	WC gateway	Sonnet	8/10	9/10
6 (gallery)	file/media	Sonnet	8/10	—
7 (speed)	caching / Time dim	Sonnet	8/10	—

Reliability — PQIP totals

8/10 planted + 17 bonus findings. 25 Problems, 1 Question, 11 Improvements, 9 Praises.

Severity: 2 critical, 13 major, 10 minor, 0 trivial.

Per-charter PQIP

Charter	Priority	P	Q	I	!	Duration	Tool uses
page-cache	critical	6	0	1	2	7m 14s	67
minify-combine	critical	5	0	2	0	8m 46s	76
db-optimize	high	5	0	3	2	10m 53s	89
cross-feature-lifecycle	high	4	1	4	0	12m 03s	93
dashboard-admin	medium	4	0	0	2	9m 38s	95
lazy-load-regex	medium	1	0	1	3	5m 23s	54
Totals		25	1	11	9	53m 57s serial	474

Before/after against the 10 planted issues

#	Planted issue	Verdict	Amendment fired
1	Dashboard chart has no axis labels	missed	visual-output-oracle not applied to charts
2	Clear Cache has no feedback	caught-exact	dashboard-admin — empirical
3	Page cache serves logged-in content	caught-exact (CRITICAL)	page-cache empirical two-session probe — Amendment F + I
4	Cache dir web-accessible	caught-exact (MAJOR)	page-cache empirical curl
5	DB cleanup wipes ALL revisions (incl. latest)	missed	source-pattern caught (no LIMIT) but "incl. latest" semantic probe not run
6	Cache invalidation narrow scope	caught-exact	page-cache + cross-feature — empirical edit + re-curl
7	Minify JS breaks URL strings	caught-semantically (CRITICAL, stronger than planted)	minify-combine empirical probe — browser SyntaxError captured
8	Combined asset files accumulate forever	caught-exact	minify-combine + cross-feature — empirical file-count
9	ksort breaks JS dep order	caught-exact	minify-combine — empirical probe
10	wp_localize_script data lost on combine	caught-exact	minify-combine — empirical probe

8 caught (7 caught-exact, 1 caught-semantically) + 2 missed = 8/10 strict.

Bonus findings beyond the answer key (17)

CRITICAL page-cache: admin-bar HTML leak to guests (same root cause as Issue 3 but with privacy-class severity)
MAJOR page-cache: settings change doesn't flush cache
MAJOR cross-feature: no deactivation/uninstall hook (7 options + cache dir orphan)
MAJOR db-optimize: transient blast wipes WP core update_plugins/update_themes
MAJOR db-optimize: AND-list b1-b5 + b7 all N (only b6 Y)
MAJOR db-optimize: DELETE-without-LIMIT on 6 tables (Amendment C3 source-pattern rule)
MAJOR cross-feature: 4 confirmed cross-feature seam bugs (clear × combine, activation × options, settings × cache, page × lazy)
MAJOR minify: @import strip breaks Google Fonts
MINOR page-cache: counter options write amplification
MINOR db-optimize: dashboard count table stale after cleanup
MINOR db-optimize + dashboard-admin: rapid-double-click fires two concurrent AJAX (Amendment E-ext × 2 charters)
MINOR dashboard-admin: wrong redirect URL on settings save
MINOR dashboard-admin: no beforeunload on settings form
MINOR lazy-load: feed content gets lazy injection (no is_feed() guard)
Plus 9 Praises including correctly-wired toggles, proper nonce/cap on AJAX, zero-state render cleanliness

Token consumption — aggregate

	Manager (Opus 4.7)	Subagents (Sonnet 4.6)	Total
Agents	1	8 (2 recons + 6 Testers)	9
Messages	132	803	935
Fresh input	388	851	1,239
Output	161,400	87,735	249,135
Cache-create 5m	0	1,838,628	1,838,628
Cache-create 1h	667,226	0	667,226
Cache-read	59,632,184	71,632,613	131,264,797
Total tokens	60,461,198	74,410,827	134,872,025
Cost	$40.53	$29.70	$70.23

97.3% cache-read. Manager cost is higher than prior pilots because Pilots 5, 6, 7 all ran in the same long-lived main conversation — the 1h cache amortizes but the context keeps growing. This is an accumulator effect, not a Pilot-7-specific inefficiency.

By pricing category

Category	Tokens	Cost	%
Cache-read	131,264,797	$51.31	73.1%
Cache-create 5m	1,838,628	$6.89	9.8%
Cache-create 1h	667,226	$6.67	9.5%
Output	249,135	$5.35	7.6%
Fresh input	1,239	$0.00	0.0%
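The percentage column can be re-derived from the cost column. A minimal sketch, assuming the percentages are each category's cost divided by the reported $70.23 total; the function name `cost_shares` is illustrative and not part of the harness. The category costs as printed sum to $70.22, one cent below the reported total, which is consistent with per-row rounding.

```python
# Cost per pricing category, copied from the table above (USD).
costs = {
    "Cache-read": 51.31,
    "Cache-create 5m": 6.89,
    "Cache-create 1h": 6.67,
    "Output": 5.35,
    "Fresh input": 0.00,
}
REPORTED_TOTAL = 70.23  # total cost from the aggregate table


def cost_shares(costs, total):
    """Return each category's share of the total cost, in percent (1 decimal)."""
    return {name: round(100 * value / total, 1) for name, value in costs.items()}


shares = cost_shares(costs, REPORTED_TOTAL)
for name, pct in shares.items():
    print(f"{name:<16} {pct:5.1f}%")
```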

Blind-violation recovery cost

Recon cost $3.40 in total: $1.68 for the blind-violation pass (discarded) plus $1.72 for the clean retry. The $1.68 was wasted. That is a manageable price for the lesson learned, and one that an active Amendment J would have avoided.

Token + duration per Pilot-7 subagent

Session	Role	Duration	Tool uses	Msgs	Input	Output	cc5m	cr	Cost
recon (blind-violation)	scout	4m 33s	37	53	59	6,785	198,916	2,769,599	$1.68
recon (clean retry)	scout	4m 34s	40	55	59	7,813	99,374	4,087,255	$1.72
lazy-load-regex	Tester	5m 23s	54	83	89	6,518	224,269	6,967,965	$3.03
page-cache	Tester	7m 14s	67	100	106	10,549	290,192	9,166,228	$4.00
minify-combine	Tester	8m 46s	76	117	123	13,675	250,209	9,797,644	$4.08
dashboard-admin	Tester	9m 38s	95	145	151	11,488	220,411	14,616,840	$5.38
db-optimize	Tester	10m 53s	89	121	127	15,969	231,317	10,481,616	$4.25
cross-feature-lifecycle	Tester	12m 03s	93	129	137	14,938	323,940	13,745,466	$5.56
Totals (8)		63m 04s serial	551	803	851	87,735	1,838,628	71,632,613	$29.70

Concurrent-wave compression: 6 Testers wall-clock ~12m (bounded by cross-feature-lifecycle), sequential-equivalent ~54m. Compression ratio: ~4.5×. Recons ran serially before the wave (one failed + one retry = ~9m).
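The compression ratio follows from the six Tester durations in the table above. A small sketch, assuming all six Testers started together, so wall-clock time equals the longest session; `to_seconds` is a helper written for this illustration.

```python
def to_seconds(duration: str) -> int:
    """Parse a duration such as '12m 03s' into whole seconds."""
    minutes, seconds = duration.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


# Wave Tester durations, copied from the per-subagent table.
wave = {
    "lazy-load-regex": "5m 23s",
    "page-cache": "7m 14s",
    "minify-combine": "8m 46s",
    "dashboard-admin": "9m 38s",
    "db-optimize": "10m 53s",
    "cross-feature-lifecycle": "12m 03s",
}

serial_seconds = sum(to_seconds(d) for d in wave.values())  # sequential-equivalent
wall_seconds = max(to_seconds(d) for d in wave.values())    # bounded by the slowest Tester
ratio = serial_seconds / wall_seconds

print(f"serial {serial_seconds / 60:.1f}m, wall {wall_seconds / 60:.1f}m, ratio {ratio:.1f}x")
```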

Notable:

cross-feature-lifecycle was the longest (12m / 93 tool uses) — 4 cross-feature seams probed empirically, each a separate flow
lazy-load-regex was the shortest (5m / 54 tool uses) — small surface, Tester filed mostly Praises
dashboard-admin highest tool-use count (95) — extensive empirical probing across 4 flows (E-ext × 2, D beforeunload, settings redirect)

Cost efficiency

Denominator	Value
Total cost / planted caught (8)	$8.78 per planted bug
Total cost / all Problems (25)	$2.81 per Problem
Total cost / all PQIP items (46)	$1.53 per PQIP item

Pilot 7's Manager cost was $40.53 against Pilot 6's $28.74 for similar work. Subtracting that $11.79 of conversation-growth overhead puts the marginal Pilot-7 cost at about $58, or roughly $7.25 per planted bug. That is still higher than other pilots (the Pilot 5 rerun was $4.77/planted), reflecting:

Long-lived Manager context accumulating across Pilots 5+6+7 in one conversation
Recon retry cost ($1.68 wasted)
Pilot 7's longer-per-Tester charter work on a complex plugin (cache + minify + DB together)
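The cost-efficiency figures and the marginal-cost adjustment above reduce to a few divisions. A sketch using only numbers from this report; the variable names are illustrative. The report rounds the marginal cost to $58 before dividing, so its $7.25 figure is slightly below the unrounded result.

```python
TOTAL_COST = 70.23        # whole pilot, USD
MANAGER_COST_P7 = 40.53   # Manager cost in this pilot
MANAGER_COST_P6 = 28.74   # Manager cost in Pilot 6, used as the baseline
PLANTED_CAUGHT = 8
pqip = {"Problems": 25, "Questions": 1, "Improvements": 11, "Praises": 9}

# Rows of the cost-efficiency table.
per_planted = TOTAL_COST / PLANTED_CAUGHT
per_problem = TOTAL_COST / pqip["Problems"]
per_item = TOTAL_COST / sum(pqip.values())  # 46 PQIP items

# Marginal cost: remove the Manager overhead attributed to context growth.
manager_excess = MANAGER_COST_P7 - MANAGER_COST_P6
marginal_cost = TOTAL_COST - manager_excess
marginal_per_planted = marginal_cost / PLANTED_CAUGHT  # about 7.30 unrounded

print(f"per planted: ${per_planted:.2f}")
print(f"per Problem: ${per_problem:.2f}")
print(f"per PQIP item: ${per_item:.2f}")
print(f"marginal: ${marginal_cost:.2f} total")
```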

Amendment firing matrix (19 amendments active)

#	Amendment	Fired?	Where	Notes
1	Empty / one / many states	✓	6/6 coverage notes	clean
2	Absence-of-feature	✓✓	cross-feature (no deactivation + no uninstall), page-cache (no TTL config)	clean
3	Plugin-native writes	✓	used admin UI for cache + settings; direct db for scale seed	clean
4	Cross-feature MANDATORY	✓✓	cross-feature-lifecycle — 4 seam Problems, each filed with explicit literal `cross-feature interaction probed: <A> × <B> → <verdict>`	load-bearing
5	UI-path before "missing"	✓	no over-claims filed	clean
A	Inline counters	—	no counter UI fuel	correct non-fire
B	State variety	✓	db-optimize seeded all 6 state classes (revisions/auto-drafts/trash/spam/trash-comments/transients)	clean
C	Enumerate root-cause	✓✓	page-cache chained admin-bar-leak root → settings-staleness + invalidation gaps (3 bugs from 1 root)	clean
D	Unsaved-work protection	✓	dashboard-admin — empirical probe, filed as Problem not Question (Amendment I drift fix validated)	clean
E	Admin two-tab	—	no admin concurrent-edit bug fuel	correct non-fire
E-ext	Rapid-double-submit	✓✓	dashboard-admin + db-optimize: empirical click+click on both buttons confirmed concurrent AJAX	clean, regression-free after Pilot 5 tightening
F	View-source HTML	✓	page-cache inspected raw cached HTML for admin bar leak	clean
G	DDL column types	—	no custom DB tables; 3 Testers explicitly recorded non-applicability	correctly non-fired; generalization holds
H	Keyboard-close on overlays	—	no overlay UI; all Testers recorded non-applicability	correctly non-fired; generalization holds on first zero-fuel pilot
I	Empirical-probe-is-mandatory	✓	whole-pilot signal: Questions filed = 1 (vs 5 in Pilot 6); drift amendments D + E-ext both fired as Problems-with-evidence	weak-positive; probe-fuel was low, retest on probe-rich Pilot 8
Reinf 5 empty-state MANDATORY	✓	6/6 coverage notes	clean
Reinf 8 cross-feature MANDATORY	✓	cross-feature charter literal strings	clean
pqip.propagate-siblings	✓	page-cache + db-optimize each chained 3+ findings from one root cause	clean
pqip.UI-path-before-claim	✓	no over-claims	clean

15/19 fired, counting I, which validated only weak-positive. 4 correctly non-fired (A, E, G, H: no fuel). 0 drift regressions.

The Amendment I validation signal

Amendment I shipped just before Pilot 7 to close the drift class observed in Pilots 4-6 (Testers filing source-inspected absence-of-defense as Questions rather than running empirical probes). The Pilot 7 signal:

Pilot	Total Questions filed
Pilot 3 rerun	4
Pilot 4 rerun	4
Pilot 5 orig	5
Pilot 5 rerun	5
Pilot 6 (drift class identified)	5
Pilot 7 (Amendment I first pilot)	1

Amendment I is the simplest explanation for the drop. But the plugin fuel caveat applies:

magellan-speed has modest probe-fuel. No modal/lightbox UI → Amendment H silent. No custom tables → Amendment G silent.
Amendment D fired on one target (settings form beforeunload) — filed as minor Problem with empirical evidence.
Amendment E-ext fired on two targets (Clear Cache + Clean Up DB buttons) — filed as minor Problems with network evidence.

Both drift-prone amendments (D, E-ext) cleanly fired as Problems-with-empirical-evidence on Pilot 7. That's the Amendment I signal.

Verdict: weak-positive. Need a probe-rich Pilot 8 to confirm. Retain Amendment I rule text unchanged.

Blind-run scope violation — handled and now codified

The incident: On the first recon attempt, the Tester read tests/plugins/magellan-speed/ISSUES.md (the answer key) and cited "Issue N" numbers + a "## Known Planted Issues (from ISSUES.md — Answer Key)" section in its recon summary. This invalidated that recon pass.

Recovery:

Moved the violating recon to recon-tester-output-BLIND-VIOLATION.md for forensics.
Re-dispatched recon with an explicit "DO NOT read tests/plugins/.../ISSUES.md" guardrail in the prompt.
The clean retry produced a recon.md citing zero answer-key references and used the Tester's own findings from source + UI probing.
All 6 wave Tester invocation prompts carried the same guardrail. Comprehensive grep of all 6 session reports for ISSUES.md | answer key | issue [0-9]+ | planted returned zero hits — wave is clean.
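The contamination check in the last step can be reproduced as a short script. A sketch assuming the session reports live at sessions/<slug>/report.json under the run directory (as listed under Artifacts) and that the grep was case-insensitive; the function names are illustrative.

```python
import re
from pathlib import Path

# Mirrors the grep alternation quoted above; case-insensitivity is an assumption.
CONTAMINATION = re.compile(r"ISSUES\.md|answer key|issue [0-9]+|planted", re.IGNORECASE)


def find_hits(text):
    """Return (line_number, line) for every line that matches the pattern."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if CONTAMINATION.search(line)
    ]


def scan_reports(run_dir):
    """Scan every sessions/*/report.json under run_dir; return (path, line_number, line)."""
    hits = []
    for report in sorted(Path(run_dir).glob("sessions/*/report.json")):
        for number, line in find_hits(report.read_text(encoding="utf-8")):
            hits.append((str(report), number, line))
    return hits
```

An empty list from `scan_reports` corresponds to the "zero hits" result reported above. The pattern is deliberately broad, so a word such as "planted" in an unrelated sentence would also be flagged and need manual review.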

The pilot is blind-from-wave-onward. Recon is the scout phase; it didn't produce filed findings. The wave Testers are the ones that produce the recall number, and they respected the guardrail.

Amendment J (new) — blind-run file-access guardrail — shipped via this pilot's commit. Concrete denylist + Manager-side prompt invariant + Tester-side fallback (skip + record in coverage_notes). Not a "be careful" rule — a harness-hygiene invariant. Details in the retro.
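The three parts of Amendment J (denylist, Manager-side prompt invariant, Tester-side fallback) could look roughly like the following. This is a hypothetical sketch, not the shipped rule: the denylist pattern, the function names, and the guardrail check are all assumptions, and the authoritative text is in the retro.

```python
import posixpath
from fnmatch import fnmatchcase

# Hypothetical denylist: answer-key files for any plugin under tests/plugins/.
DENYLIST = ("*tests/plugins/*/ISSUES.md",)


def is_denied(path: str) -> bool:
    """True if the path matches a denylisted pattern after normalization."""
    normalized = posixpath.normpath(path.replace("\\", "/"))
    return any(fnmatchcase(normalized, pattern) for pattern in DENYLIST)


def prompt_has_guardrail(prompt: str) -> bool:
    """Manager-side invariant (assumed form): the prompt must carry the guardrail."""
    return "DO NOT read" in prompt and "ISSUES.md" in prompt


def guarded_read(path, coverage_notes, reader):
    """Tester-side fallback: skip denylisted paths and record the skip."""
    if is_denied(path):
        coverage_notes.append(f"skipped denylisted path: {path}")
        return None
    return reader(path)
```

The design point is that a denied read is skipped and logged in coverage_notes rather than raised as an error, so a Tester session continues while leaving an audit trail.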

Secondary harness finding (not an amendment)

dashboard-admin Tester reported that during its session, a concurrent Tester's teardown removed the shared /tmp/magellan-plugin-src/magellan-speed symlink target (the Tester recovered after 2 turns). All 6 Testers symlink the same source dir into their isolated Studio sites; if any session's teardown runs rm -rf on the shared path, every concurrent session loses source access.

Fix: audit scripts/studio-teardown.sh — should rm the per-session symlink only (${SITE_PATH}/wp-content/plugins/<slug>), never rm -rf /tmp/magellan-plugin-src/<slug> (the shared target). Non-amendment harness follow-up.
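The invariant behind that fix can be stated in a few lines. The real script is shell and its contents are not shown here, so this is only a Python sketch of the intended behavior, with a hypothetical function name: remove the per-session symlink, never follow it, and refuse to delete anything that is not a symlink.

```python
import os


def remove_plugin_symlink(site_path: str, slug: str) -> bool:
    """Remove only the per-session plugin symlink; leave the shared target intact.

    Returns True if a symlink was removed, False if nothing was removed.
    """
    link = os.path.join(site_path, "wp-content", "plugins", slug)
    if os.path.islink(link):
        os.unlink(link)  # deletes the link itself, not the directory it points to
        return True
    # Missing path or a real directory: do nothing rather than recurse into it.
    return False
```

In shell terms this corresponds to a plain rm on the link path without -r and without a trailing slash, since a recursive delete through the link or on the shared path is what breaks concurrent sessions.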

Recommendation

Ship Amendment J (blind-run file-access guardrail) — shipped in this pilot's amendment commit. Brings total to 20 amendments.
Do not tighten existing rules: no regressions observed. Amendments G and H behaved cleanly in their first zero-fuel pilot, and Amendment I held in its first pilot (weak-positive).
Audit studio-teardown.sh for shared-symlink teardown safety (non-amendment follow-up).
Pilot 8 candidate: pick a probe-rich, chart-heavy plugin to:
- Give Amendment I real probe-class fuel (many modal + submit + beforeunload surfaces)
- Re-exercise Amendments G + H with actual fuel
- Address the Issue-1-class miss: a plugin with rendered charts + non-textual UI lets us stress-test visual-output oracles
- Stress-test Amendment J on a run that cares about answer-key avoidance
- Remaining candidates from missions/: magellan-checkout-editor (second WC-ecosystem plugin — exercises woocommerce-exploration skill for the second time), or start a novel plugin outside the existing set

My pick: magellan-checkout-editor — second WC pilot validates the ecosystem skill across two plugins in the same family, and the checkout/order surface has probe-rich fuel (forms, validation, modal interactions, meta).

Cross-pilot state

Seven pilots, four reruns (Pilots 2-5). Sonnet + amendments validated across five plugin shapes (members, seo-toolkit, pay, gallery, speed). No drift regressions on this pilot. Amendment J shipped, bringing the total to 20 amendments, all compounding.

Artifacts

Final report: runs/2026-04-24T12-59-33_magellan-speed/final-report.md
Escape analysis: runs/2026-04-24T12-59-33_magellan-speed/escape-analysis.md
Token usage: runs/2026-04-24T12-59-33_magellan-speed/token-usage.json
Manifest: runs/2026-04-24T12-59-33_magellan-speed/manifest.json
6 session reports: runs/2026-04-24T12-59-33_magellan-speed/sessions/<slug>/report.json
Static analysis: runs/2026-04-24T12-59-33_magellan-speed/static-analysis.md
Clean recon: runs/2026-04-24T12-59-33_magellan-speed/recon.md
Violating recon (forensics): runs/2026-04-24T12-59-33_magellan-speed/recon-tester-output-BLIND-VIOLATION.md
Coverage plan: runs/2026-04-24T12-59-33_magellan-speed/coverage.md

alopezari/magellan-speed-pilot-7.md
