Magellan Pilot 5 rerun — magellan-pay (amendments G + E-ext validation)

Run ID: 2026-04-24T11-40-34_magellan-pay-rerun
Plugin: magellan-pay v1.0.0 (same plugin as Pilot 5 orig)
Purpose: Validate that the 2 amendments shipped in commit 195db81 (Amendment G = DDL column types for value semantics; Amendment E extension = rapid-double-submit on customer-facing writes) close the 2 misses observed in original Pilot 5.
Driver: Chrome DevTools MCP with --experimental-page-id-routing
Dispatch: 6 charters in one concurrent wave (Sonnet-default Testers)
Wallclock: ~28 min (cleaner than Pilot 5 orig which ran ~60 min with recon + retries)

TL;DR

Recall: 9/10 strict, 10/10 lenient (up from 8/10 in Pilot 5 orig).

Amendment G fired cleanly on 4 of 6 Testers — strongest single-amendment convergence in project history. Four independent Testers all caught Issue 5 (FLOAT for money) by reading the DDL + running the reproducer (0.1 + 0.1 + 0.1 → 0.299999...).
Amendment E extension fired partially — one Tester raised the double-submit hypothesis and source-inspected for all four defenses (no JS file, no disable-on-click, no nonce idempotency, no server-side dedup). But it filed as a Question rather than a Problem because the Tester didn't execute the empirical click(); click() reproducer. Classification drift — same pattern as Amendment C's Question-not-Problem fire on Pilot 4 rerun's Issue 4.
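The Issue 5 reproducer mentioned above can be sketched in a few lines. This is a minimal illustration, not the Testers' actual script: Python floats are 64-bit doubles, so the exact digits will differ from what a database FLOAT column stores, but the drift mechanism is the same. `Decimal` stands in here for a DECIMAL column.

```python
from decimal import Decimal

# Three charges of 0.10 accumulated as binary floating point.
float_total = 0.1 + 0.1 + 0.1

# The same charges accumulated as exact decimals (what a DECIMAL column preserves).
decimal_total = Decimal("0.10") + Decimal("0.10") + Decimal("0.10")

# Binary floats cannot represent 0.1 exactly, so the sum misses 0.3.
float_is_exact = float_total == 0.3
# Decimal arithmetic keeps the cents exact.
decimal_is_exact = decimal_total == Decimal("0.30")

print(f"float sum:   {float_total!r} (equals 0.3? {float_is_exact})")
print(f"decimal sum: {decimal_total} (equals 0.30? {decimal_is_exact})")
```

The comparison against an exact expected total is what turns a DDL observation into evidence: the float path fails the equality check, the decimal path passes it.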

The loop is self-improving: the Pilot 5 rerun exposes a new rule-text weakness (Amendment E-ext is too soft — doesn't force the empirical probe). Proposed tightening below.

Reliability — PQIP totals

9/10 planted caught-exact/bundled + 14 bonus findings. 23 Problems, 5 Questions, 7 Improvements, 10 Praises.

Severity: 8 critical, 11 major, 4 minor, 0 trivial.

Per-charter PQIP (rerun vs orig; columns are Problems / Questions / Improvements / Praises, with ! marking Praises)

Charter	Rerun P/Q/I/!	Orig P/Q/I/!	Duration
gateway-settings	6 / 1 / 1 / 1	4 / 1 / 1 / 2	19m 00s
checkout-payment-flow	5 / 2 / 2 / 2	6 / 1 / 3 / 3	26m 32s
refund-flow	6 / 1 / 1 / 1	2 / 1 / 2 / 2	22m 26s
transaction-log-scale	3 / 0 / 1 / 2	2 / 0 / 2 / 3	18m 25s
cross-feature-stock-refund	2 / 1 / 2 / 1	3 / 1 / 3 / 1	24m 53s
transaction-log-admin	1 / 0 / 0 / 3	1 / 1 / 1 / 3	16m 01s
Totals	23 / 5 / 7 / 10	18 / 5 / 12 / 14	127m 17s sequential

Rerun's Problem count went up (23 vs 18) because (a) Amendment G found Issue 5 four times over, and (b) refund-flow went deeper (6 vs 2 Problems, with three now filed as critical). Improvements went down (7 vs 12) because many observations that would have been Improvements in orig were escalated to Problems under the tighter Amendment G rule.

Before/after on the 2 previously-missed issues

Issue	Pilot 5 orig	Pilot 5 rerun	Amendment fired
5 — FLOAT for money	✗ MISSED	✓ caught-exact (×4 Testers)	Amendment G — spec-for-bug, strongest amendment convergence ever
10 — double-click duplicate orders	✗ MISSED	~ caught-bundled as Question	Amendment E extension — fired but filed as Question (classification drift)

All 8 previously-caught issues stayed caught, most escalated in severity.
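The DDL reading behind Amendment G can be approximated mechanically. The sketch below is an assumption about how such a check might look, not the rule's actual implementation: the table name, column names, and the list of money-like name hints are all hypothetical, and the plugin's real schema is not reproduced here.

```python
import re

# Hypothetical DDL for illustration only.
SAMPLE_DDL = """
CREATE TABLE wp_example_transactions (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    order_id BIGINT UNSIGNED NOT NULL,
    amount FLOAT NOT NULL,
    refunded_amount DECIMAL(12,2) NOT NULL DEFAULT 0,
    status VARCHAR(20) NOT NULL,
    PRIMARY KEY (id)
);
"""

# Column-name fragments that suggest monetary semantics (assumed; tune per plugin).
MONEY_HINTS = ("amount", "price", "total", "fee", "tax", "balance")
# Approximate numeric types that lose precision for currency values.
APPROX_TYPES = ("FLOAT", "DOUBLE", "REAL")

# Matches "<name> <type>" at the start of a line, with optional backticks around the name.
COLUMN_RE = re.compile(r"^[ \t]*`?(\w+)`?[ \t]+([A-Za-z]+)", re.MULTILINE)


def find_suspect_money_columns(ddl: str) -> list[tuple[str, str]]:
    """Return (column, type) pairs where a money-like column uses an approximate type."""
    suspects = []
    for name, col_type in COLUMN_RE.findall(ddl):
        looks_like_money = any(hint in name.lower() for hint in MONEY_HINTS)
        if looks_like_money and col_type.upper() in APPROX_TYPES:
            suspects.append((name, col_type.upper()))
    return suspects


suspects = find_suspect_money_columns(SAMPLE_DDL)
print(suspects)
```

A name-based heuristic like this only nominates candidates; the finding itself still rests on running a reproducer against the stored values.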

Token consumption — aggregate

	Manager (Opus 4.7)	Subagents (Sonnet 4.6)	Total
Agents	1	6 Testers (no recon — reused site knowledge)	7
Messages	68	693	761
Fresh input	218	729	947
Output	119,586	100,864	220,450
Cache-create 5m	0	1,387,940	1,387,940
Cache-create 1h	138,540	0	138,540
Cache-read	22,878,329	67,910,953	90,789,282
Total tokens	23,136,673	70,130,486	93,267,159
Cost	$15.82	$27.09	$42.91

97.3% cache-read. Cheaper than Pilot 5 orig ($42.91 vs $53.93 — 20% savings) because the rerun skipped recon (reused Pilot 5's recon.md) and had no retry waste.

By pricing category

Category	Tokens	Cost	%
Cache-read	90,789,282	$31.81	74.1%
Cache-create 5m	1,387,940	$5.20	12.1%
Output	220,450	$4.50	10.5%
Cache-create 1h	138,540	$1.39	3.2%
Fresh input	947	$0.00	0.0%

Token + duration per Tester (6 sessions)

Charter	Duration	Tool uses	Msgs	Input	Output	cc5m	cr	Cost
transaction-log-admin	16m 01s	65	93	99	9,768	264,906	8,705,487	$3.75
transaction-log-scale	18m 25s	64	82	88	17,644	158,196	6,833,971	$2.91
gateway-settings	19m 00s	82	117	123	17,399	280,860	10,711,092	$4.53
refund-flow	22m 26s	77	117	123	20,005	217,936	12,157,016	$4.76
cross-feature-stock-refund	24m 53s	95	135	141	17,793	239,143	13,824,515	$5.31
checkout-payment-flow	26m 32s	104	149	155	18,255	226,899	15,678,872	$5.83
Totals (6)	127m 17s serial	487	693	729	100,864	1,387,940	67,910,953	$27.09

Concurrent-wave compression: 6 Testers dispatched at the same time; total sequential-equivalent = 127m 17s, wallclock of the wave ≈ 26m 32s (bounded by the longest Tester). Compression ratio: 4.8× — notably better than Pilot 5 orig's 3.4× and Pilot 4 rerun's 3.3×. Why? Less charter-setup churn this time (all Testers had already-seen charters) and no recon Tester needed.
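The compression ratio can be recomputed from the per-Tester durations in the table above. A minimal sketch, assuming the wave's wallclock equals the longest Tester's duration as stated:

```python
import re

# Per-Tester durations as reported in the per-Tester table.
DURATIONS = {
    "transaction-log-admin": "16m 01s",
    "transaction-log-scale": "18m 25s",
    "gateway-settings": "19m 00s",
    "refund-flow": "22m 26s",
    "cross-feature-stock-refund": "24m 53s",
    "checkout-payment-flow": "26m 32s",
}


def to_seconds(text: str) -> int:
    """Convert a duration such as '26m 32s' to seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(part) for part in match.groups())
    return minutes * 60 + seconds


seconds_by_charter = {name: to_seconds(d) for name, d in DURATIONS.items()}
sequential = sum(seconds_by_charter.values())  # time if Testers ran one after another
wallclock = max(seconds_by_charter.values())   # concurrent wave is bounded by the slowest Tester
ratio = sequential / wallclock

print(f"sequential: {sequential // 60}m {sequential % 60:02d}s")
print(f"wallclock:  {wallclock // 60}m {wallclock % 60:02d}s")
print(f"compression ratio: {ratio:.1f}x")
```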

Notable per-Tester observations

checkout-payment-flow went longest (26m / 104 tool uses); this was the Tester that also did the source-inspection for double-submit defenses. It filed a Question rather than spend its risk budget on the click(); click() empirical probe.
refund-flow escalated dramatically from orig (2 → 6 Problems, three of them critical) thanks to a more thorough probe of direct-call injection paths.
transaction-log-admin was fastest (16m / 65 tool uses) — the charter surface is small; same XSS critical as orig.

Cost efficiency

Denominator	Rerun	Orig
Cost / planted caught	$4.77 (9 caught)	$6.74
Cost / Problem filed	$1.87 (23 filed)	$3.00
Total cost	$42.91	$53.93

Clean single-wave execution, no retry waste. ~20% cheaper than orig for ~28% more Problems.
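The efficiency figures above follow directly from the totals reported in this document. A small sketch of the arithmetic (the `RunSummary` class is just an illustrative container, not part of the harness):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    label: str
    total_cost: float      # USD
    planted_caught: int    # planted issues caught (strict)
    problems_filed: int

    def cost_per_planted(self) -> float:
        return self.total_cost / self.planted_caught

    def cost_per_problem(self) -> float:
        return self.total_cost / self.problems_filed


rerun = RunSummary("rerun", total_cost=42.91, planted_caught=9, problems_filed=23)
orig = RunSummary("orig", total_cost=53.93, planted_caught=8, problems_filed=18)

# Relative change of the rerun against the original pilot.
cost_savings = (orig.total_cost - rerun.total_cost) / orig.total_cost
problem_increase = rerun.problems_filed / orig.problems_filed - 1

for run in (rerun, orig):
    print(f"{run.label}: ${run.cost_per_planted():.2f} per planted issue, "
          f"${run.cost_per_problem():.2f} per Problem")
print(f"cost savings: {cost_savings:.0%}, more Problems: {problem_increase:.0%}")
```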

Amendment firing matrix on the 17 current rules

(15 prior + Amendment G + Amendment E extension)

#	Amendment	Fired?	Where	Effectiveness
1	Empty / one / many states	✓	6/6 coverage notes	clean
2	Absence-of-feature	✓	gateway-settings — uninstall hook + dead testmode + dead api_key	clean
3	Plugin-native writes	✓	plugin-native checkout, refund — direct $wpdb only for seed	clean
4	Cross-feature MANDATORY	✓✓	`cross-feature-stock-refund` — 2 criticals	load-bearing
5	UI-path before "missing"	✓	no empirically-wrong claims	clean
A	Inline counters	—	no fuel (no live-counter UI)	correct non-fire
B	State variety	✓	transaction-log-admin seeded success + failed + refunded + XSS probe	clean
C	Enumerate root-cause surface	✓	cross-feature: wc_reduce_stock_levels root → 3 findings chained	clean
D	Unsaved-work protection	✓ (minor)	gateway-settings flagged missing `beforeunload` (minor); orig had said WC core provides it — divergence	ambiguous
E	Admin two-tab	n/a	—	no admin-form concurrency bug on this plugin
E-ext	Customer rapid-submit	~ partial	checkout-payment-flow filed as Question, not Problem (source-inspected but didn't click twice)	drift; needs tightening
F	View-source HTML	✓✓	gateway-settings — API Secret value leaked in raw HTML	load-bearing
G	DDL column types	✓✓✓	4 of 6 Testers: gateway-settings, transaction-log-scale, refund-flow, checkout-payment-flow	strongest convergence in project history
	Reinf 5 (empty-state MANDATORY)	✓	6/6 coverage notes with the literal string	clean
	Reinf 8 (cross-feature MANDATORY)	✓	5/6 coverage notes with the literal string	clean
	pqip.propagate-sibling-features	✓	checkout filed Luhn + expiry + CVC as one "absent card validation" cluster	clean
	pqip.UI-path-before-claim	✓	no over-claims	clean

14 of 17 amendments fired actively; 2 correctly didn't fire (A had no live-counter UI to act on; E admin two-tab was not applicable to this plugin); 1 partial drift (E-ext).

Operational learning — rule-text tightening loop

The loop pattern: pilot → escape-analysis → ship rule → re-run → observe whether rule fires cleanly or drifts → tighten rule. Amendment E-ext now enters the tightening phase, same family as Amendment C's Question-drift on Pilot 4 rerun.

The current rule text says "must be probed" and lists reproducer options but doesn't explicitly close the "source-inspection is sufficient" loophole. The Tester did exactly what a reader looking for a loophole would do: it looked for the defenses in source, found them all absent, and filed the hypothesis as a Question because it hadn't executed the empirical probe.

Proposed rule-text tightening for Amendment E-ext:

Source-inspection alone is NOT sufficient to discharge this probe. If the source-inspection says "no JS disable-on-click file found, no idempotency key in the form HTML, no server-side dedup in the handler" — that is evidence supporting the hypothesis but not proof of the bug. You MUST execute the empirical reproducer (two click verbs back-to-back on the submit selector, or evaluate_script firing two requestSubmit() calls) and count the resulting side-effect (rows in the plugin's table, external API calls, order creations). A Question is appropriate ONLY when the empirical probe is architecturally blocked (e.g., the site is unreachable from the Tester's environment). Otherwise, file as a Problem with the empirical count as evidence.

This rule-text extension would close the drift. Same pattern should be swept across all "probe" amendments — any rule that says "probe X" should explicitly say "execute X, not just infer X from source".
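The decision the tightened rule asks for can be sketched as a small simulation. Everything here is hypothetical: `NaiveCheckout` and `DedupCheckout` are stand-ins for a server-side handler, and in a real session the two submits would come from the browser (two click verbs or two requestSubmit() calls) with the count read from the plugin's table. The point is only that classification depends on an executed probe and a counted side effect.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NaiveCheckout:
    """Stand-in for a handler with no dedup defense: every request creates an order."""
    orders: list = field(default_factory=list)

    def submit(self, cart_id: str, idempotency_key: Optional[str] = None) -> None:
        self.orders.append(cart_id)


@dataclass
class DedupCheckout:
    """Stand-in for a handler that drops requests repeating an idempotency key."""
    orders: list = field(default_factory=list)
    seen_keys: set = field(default_factory=set)

    def submit(self, cart_id: str, idempotency_key: Optional[str] = None) -> None:
        if idempotency_key is not None:
            if idempotency_key in self.seen_keys:
                return  # duplicate submit, no second order
            self.seen_keys.add(idempotency_key)
        self.orders.append(cart_id)


def probe_double_submit(handler) -> int:
    """Fire two back-to-back submits for one cart and return how many orders were created."""
    before = len(handler.orders)
    for _ in range(2):
        handler.submit("cart-1", idempotency_key="form-nonce-1")
    return len(handler.orders) - before


def classify(orders_created: Optional[int]) -> str:
    """Apply the proposed rule: a Question only when the probe could not be executed."""
    if orders_created is None:
        return "Question"    # probe architecturally blocked, no empirical count
    if orders_created > 1:
        return "Problem"     # the empirical count is the evidence
    return "No finding"      # the probe ran and the defense held


print(classify(probe_double_submit(NaiveCheckout())))  # duplicate orders observed
print(classify(probe_double_submit(DedupCheckout())))  # single order observed
print(classify(None))                                  # probe could not run
```

Under this reading, the source-inspection the Tester performed informs which handler behavior to expect, but only the counted result decides between Problem and no finding.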

Recommendation

Ship the Amendment E-ext rule-text tightening (~1 paragraph).
Optionally sweep other "probe" amendments for the same loophole — though most existing rules already name concrete probe actions, not just "inspect".
Declare Pilot 5 closed — 9/10 strict with the only remaining miss being the classification-drift soft-fire, not a coverage gap.
Proceed to Pilot 6 (magellan-gallery): file handling (MIME, path traversal, size variants) is the next uncovered shape. It also serves as a generalization test for Amendment G: does the DDL column-type rule fire on file-metadata storage columns?

Cross-pilot summary

Five pilots now, four reruns, all converged to ≥ 9/10 on amended harness:

Pilot	Shape	Model	Original	Rerun
1 (backups)	artifact-producer	Opus	10/10	—
2 (contact-forms)	form + email	Opus	7/10	10/10
3 (members)	role/restriction	Opus	5/10	10/10
4 (seo-toolkit)	metadata/rendering	Sonnet	4/10	10/10
5 (pay)	WC payment gateway	Sonnet	8/10	9/10 (10/10 lenient)

Sonnet + amended harness is validated across three plugin shapes. The loop continues to compound: each pilot's amendments stay and fire on later pilots.

Artifacts

Final report: runs/2026-04-24T11-40-34_magellan-pay-rerun/final-report.md
Escape analysis: runs/2026-04-24T11-40-34_magellan-pay-rerun/escape-analysis.md
Token usage (full detail): runs/2026-04-24T11-40-34_magellan-pay-rerun/token-usage.json
Manifest: runs/2026-04-24T11-40-34_magellan-pay-rerun/manifest.json
6 session reports: runs/2026-04-24T11-40-34_magellan-pay-rerun/sessions/<slug>/report.json
Pilot 5 orig gist (comparison): https://gist.github.com/alopezari/7dd744c19c0ad21b2de8c630513967f6
