Magellan Pilot 8 — magellan-checkout-editor (2nd WooCommerce pilot; Amendment K first real test)

Run ID: 2026-04-24T14-16-15_magellan-checkout-editor
Plugin: magellan-checkout-editor v1.0.0, a WooCommerce extension for custom checkout fields (drag+drop, 7 field types, conditional logic, validation, order-meta, email injection, JSON import/export)
Ecosystem: woocommerce (2nd pilot exercising skills/woocommerce-exploration/SKILL.md)
Driver: Chrome DevTools MCP with --experimental-page-id-routing
Dispatch: 6 charters + 1 recon in one concurrent wave (Sonnet-default, blind greybox)
Wallclock: ~26 min end-to-end

TL;DR: Amendment K cleanly validated; Amendment I conclusive test mixed; big bonus surface

Recall: 6/10 strict (5 caught-exact + 1 caught-bundled, 4 missed). Lower than Pilot 5's 8/10 on the prior WC plugin, but with 13 off-answer-key bonus Problems including a CRITICAL block-checkout incompatibility (same class-of-bug Pilot 5 caught on magellan-pay).
Amendment K (default blast radius) first real test → CLEAN FIRE. import-export Tester wrote the exact literal coverage-note "default blast radius probed: Import spares existing config? → N" and filed as major with the 2→4→6 empirical import-twice probe. Spec-for-bug.
Amendment I (empirical-probe-is-mandatory) conclusive test → mixed but holding. 4 Questions filed (up from 1 in Pilot 7). Analysis: 3 are legitimate intent-questions paired with empirical Problems; 1 is a drift regression (Tester marked b6 cap-check Y on an aggregate when one handler has it and another doesn't).
Amendment J (blind-run file-access guardrail) → CLEAN. Zero answer-key contamination across 6 Testers + recon. Two Testers explicitly recorded Amendment J: skipped <path> — answer-key denylist compliance notes.
Amendments G + H correctly non-fired — Testers recorded explicit non-applicability.
WooCommerce ecosystem skill 2nd validation: load-bearing. Drove HPOS-compat probes, email-class-coverage probes, and block-vs-classic-checkout probes that have no analog in non-WC pilots.

Cross-pilot arc:

Pilot	Shape	Mode	Original	Rerun
1 (backups)	artifact	Opus	10/10	—
2 (contact-forms)	form/email	Opus	7/10	10/10
3 (members)	role/restriction	Opus	5/10	10/10
4 (seo-toolkit)	metadata	Sonnet	4/10	10/10
5 (pay, WC #1)	WC gateway	Sonnet	8/10	9/10
6 (gallery)	file/media	Sonnet	8/10	—
7 (speed)	caching / Time	Sonnet	8/10	—
8 (checkout-editor, WC #2)	WC field editor	Sonnet	6/10	—

Reliability — PQIP totals

6/10 planted + 13 bonus findings. 23 Problems, 4 Questions, 11 Improvements, 12 Praises.

Severity: 1 critical, 9 major, 12 minor, 1 trivial.

Per-charter PQIP

Charter	Priority	Problems (P)	Questions (Q)	Improvements (I)	Praises (!)	Duration	Tool uses
fields-admin-crud	critical	3	0	1	2	9m 04s	78
frontend-render-and-validation	critical	5	0	2	2	10m 44s	82
order-meta-and-email	high	3	1	2	2	11m 00s	66
import-export	high	4	1	2	2	10m 44s	84
cross-feature-lifecycle	high	3	1	2	3	10m 46s	74
field-variants	medium	5	1	2	1	9m 21s	79
Totals		23	4	11	12	61m 38s serial	463

Before/after against the 10 planted issues

#	Planted issue	Verdict	Amendment fired
1	Date-picker class mismatch (`.mce-datepicker` vs `.mce-date-picker`)	caught-exact (×2)	field-variants + frontend-render, empirical
2	Position badges don't update after drag	missed	drag flow budgeted away by `fields-admin-crud`
3	Conditional-logic change-event binding missing	missed	`frontend-render-and-validation` probed the validation side but not the JS show/hide event binding
4	Wrong regex error message (always "is required")	caught-exact (×2)	field-variants + frontend-render — exact dead-`$msg` variable identified
5	Import `array_merge` doubles on re-import	caught-exact (×2)	Amendment K clean fire — `import-export` + `cross-feature-lifecycle`
6	Orphaned `_mce_*` postmeta when field removed	caught-bundled	`cross-feature-lifecycle` filed broader lifecycle-cleanup Problem; same root-cause class
7	Import AJAX handler lacks `current_user_can()`	missed (with anti-catch)	Amendment I drift regression — Tester marked b6=Y on aggregate; admin-post has cap but AJAX doesn't
8	Custom fields missing from Customer Completed Order email	caught-exact	`order-meta-and-email` — enumerated whitelist + empirical reproducer
9	Custom-select fields not keyboard-accessible	missed	Amendment H classification gap — coverage.md said "no overlay UI" but custom-select IS a dropdown overlay
10	HTML entity round-trip in JSON export	missed	Artifact AND-list a1-a6 covers contents/inclusion but not round-trip identity

5 caught-exact + 1 caught-bundled + 4 missed = 6/10 strict.

Bonus findings beyond the answer key (13)

CRITICAL: Block checkout invisibility — WC 10.7 default /checkout/ block has customFieldCount=0; plugin hooks only woocommerce_checkout_fields (classic API). Same class-of-bug pattern as Pilot 5 magellan-pay. Caught independently by 2 Testers.
MAJOR: Invalid admin-supplied regex causes PHP Warning + HTTP 500 on order review AJAX
MAJOR: Conditional-hidden-required fields still server-validate (checkout blocks with "is required" even though field is hidden)
MAJOR: HPOS shadow-post reliance — update_post_meta($order_id, ...) writes to wp_postmeta shadow rows; $order->get_meta('_mce_phone') returns empty
MAJOR: Cross-section field-key collision — two fields with key=phone in billing + shipping both write to _mce_phone, last-write-wins (data loss)
MAJOR: Save Fields button no disable-on-click (empirical rapid-double-submit fired 5 concurrent)
MINOR: Silent save (no success notice despite ?saved=1 in redirect)
MINOR: Duplicate field-key within same section saves silently (no uniqueness validation)
MINOR: Import stores raw HTML/script tags unsanitized (admin display escapes, but data-at-rest is tainted)
MINOR: No lifecycle hooks → option + postmeta persist after uninstall
MINOR: No beforeunload on settings form (Amendment D empirical)
MINOR: Leading comma in options CSV → blank first select option overrides placeholder
MINOR: Customer Completed Order email blocked: admin sees "Sorry, you are not allowed" when WC deactivated while plugin active
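
The cross-section field-key collision listed above can be reproduced in miniature. The following is a minimal Python sketch, not the plugin's PHP: it assumes order meta keys are built as `_mce_<key>` (as the `_mce_phone` finding suggests), and the section-qualified variant is a hypothetical alternative shown only for contrast.

```python
def flat_meta_key(section: str, field_key: str) -> str:
    """Key scheme implied by the finding: the section is ignored."""
    return f"_mce_{field_key}"


def qualified_meta_key(section: str, field_key: str) -> str:
    """Hypothetical alternative: include the section in the meta key."""
    return f"_mce_{section}_{field_key}"


def save_order_meta(submitted, key_fn):
    """Write each (section, key, value) triple into a dict, last write wins."""
    meta = {}
    for section, field_key, value in submitted:
        meta[key_fn(section, field_key)] = value
    return meta


# Illustrative values only; the phone numbers are made up.
submitted = [
    ("billing", "phone", "555-0100"),
    ("shipping", "phone", "555-0199"),
]

flat = save_order_meta(submitted, flat_meta_key)
qualified = save_order_meta(submitted, qualified_meta_key)

print(flat)       # one entry: the billing value has been overwritten
print(qualified)  # two entries: both values survive
```

A probe of this shape makes the data loss countable: two submitted values, one stored row.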

Token consumption — aggregate

	Manager (Opus 4.7)	Subagents (Sonnet 4.6)	Total
Agents	1	7 (1 recon + 6 Testers)	8
Messages	77	744	821
Fresh input	192	786	978
Output	111,222	64,357	175,579
Cache-create 5m	0	1,520,558	1,520,558
Cache-create 1h	167,062	0	167,062
Cache-read	48,610,363	70,092,652	118,703,015
Total tokens	48,888,839	72,464,353	121,353,192
Cost	$28.76	$27.70	$56.45

97.8% cache-read. Manager cost holds similar to prior pilots; cache-create 1h (Manager-level) is shrinking as the context stabilizes.

By pricing category

Category	Tokens	Cost	%
Cache-read	118,703,015	$45.33	80.3%
Cache-create 5m	1,520,558	$5.70	10.1%
Output	175,579	$3.75	6.6%
Cache-create 1h	167,062	$1.67	3.0%
Fresh input	978	$0.00	0.0%

Token + duration per subagent (7 sessions)

Session	Role	Duration	Tool uses	Msgs	Input	Output	Cache-create 5m	Cache-read	Cost
recon	scout	5m 26s	57	75	81	8,237	283,575	6,373,306	$3.10
fields-admin-crud	Tester	9m 04s	78	112	118	10,946	299,085	11,135,772	$4.63
field-variants	Tester	9m 21s	79	114	120	13,103	173,542	9,942,513	$3.83
frontend-render-and-validation	Tester	10m 44s	82	120	126	7,366	203,246	11,872,187	$4.43
import-export	Tester	10m 44s	84	126	132	6,826	207,306	12,731,470	$4.70
cross-feature-lifecycle	Tester	10m 46s	74	100	106	9,436	177,950	9,222,920	$3.58
order-meta-and-email	Tester	11m 00s	66	97	103	8,443	175,854	8,814,484	$3.43
Totals (7)		67m 05s serial	520	744	786	64,357	1,520,558	70,092,652	$27.70

Concurrent-wave compression: 6-Tester wave wallclock ~11m (bounded by longest Tester), sequential-equivalent 61m 38s. Compression ratio: 5.6× — highest of any pilot so far. Reasons: charter surfaces were well-scoped, recon was serial but fast (5m 26s), and the WooCommerce provisioning penalty amortized across 6 parallel sites.
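
The compression ratio above can be recomputed from the per-Tester durations. This is a small sketch that assumes the wave wallclock equals the longest Tester duration (the report gives it as ~11m); durations are copied from the table as displayed, so their sum can differ from the reported serial total by a second of rounding.

```python
def to_seconds(duration: str) -> int:
    """Convert a 'Xm YYs' string, as used in the tables, into seconds."""
    minutes, seconds = duration.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


# Tester durations as displayed in the per-subagent table (recon excluded).
TESTER_DURATIONS = {
    "fields-admin-crud": "9m 04s",
    "field-variants": "9m 21s",
    "frontend-render-and-validation": "10m 44s",
    "import-export": "10m 44s",
    "cross-feature-lifecycle": "10m 46s",
    "order-meta-and-email": "11m 00s",
}


def compression_ratio(durations: dict) -> float:
    """Sequential-equivalent time divided by the longest single session."""
    seconds = [to_seconds(d) for d in durations.values()]
    return sum(seconds) / max(seconds)


ratio = compression_ratio(TESTER_DURATIONS)
print(f"compression ratio: {ratio:.1f}x")
```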

Notable per-Tester observations:

Cost discipline is visibly stabilizing — 6 Testers spent $4.63 / $3.83 / $4.43 / $4.70 / $3.58 / $3.43 — a much tighter band than prior pilots where individual Tester costs ranged 2-3× within a wave.
fields-admin-crud was slowest per-token — 78 tool calls in 9m 04s; dominated by role rotation (admin → shop_manager → editor) + three-state form probes.
order-meta-and-email was the cheapest at $3.43 and fewest tool calls (66) — the Tester was efficient with the email-class enumeration probe.

Cost efficiency

Denominator	Value
Total cost / planted caught (6)	$9.41 per planted bug
Total cost / all Problems (23)	$2.45 per Problem
Total cost / all PQIP items (50)	$1.13 per PQIP item

Per-planted is up a hair vs Pilot 7 ($8.78) because recall dropped 8→6 (more misses, same cost envelope). Per-Problem is LOWEST of any pilot, reflecting the high bonus-finding yield — the harness is producing strong insight even when answer-key recall slips.
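
For reference, the three cost-efficiency figures follow directly from the totals reported earlier. A short sketch, using only numbers from this report:

```python
TOTAL_COST_USD = 56.45

# Denominators from the PQIP totals: 23 Problems, 4 Questions,
# 11 Improvements, 12 Praises; 6 planted issues caught.
DENOMINATORS = {
    "planted caught": 6,
    "Problems": 23,
    "PQIP items": 23 + 4 + 11 + 12,
}


def cost_per_unit(total_usd: float, count: int) -> float:
    """Cost per item, rounded to cents."""
    return round(total_usd / count, 2)


per_unit = {name: cost_per_unit(TOTAL_COST_USD, n) for name, n in DENOMINATORS.items()}
for name, value in per_unit.items():
    print(f"${value:.2f} per {name}")
```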

Amendment firing matrix (21 current amendments)

Amendment	Fired?	Where	Status
1 empty-state	✓	6/6 coverage notes	clean
2 absence-of-feature	✓✓	cross-feature-lifecycle (no hooks), fields-admin-crud (silent save, duplicate keys)	clean
3 plugin-native writes	✓	Testers used admin UI + UI checkout submit, not only DB seeds	clean
4 cross-feature MANDATORY	✓✓	cross-feature-lifecycle filed 4 seams with literal Reinforcement-8 strings	load-bearing
5 UI-path-before-claim	✓	no over-claims	clean
A inline counters	—	no fuel	correct non-fire
B state variety	✓	field-variants exercised all 7 types + import-export sanitize-bypass probe	clean
C enumerate root cause	✓	frontend-render-and-validation chained 3 validation bugs from one module	clean
D unsaved-work	✓	fields-admin-crud: `window.onbeforeunload` null, filed as Problem with empirical evidence	drift-free (Amendment I holds)
E admin two-tab	—	no admin-form concurrent-edit bug	correct non-fire
E-ext rapid-double-submit	✓	fields-admin-crud (5 concurrent Save) + import-export (5 concurrent Import) empirically	clean
F view-source HTML	✓	frontend-render-and-validation view-sourced condition-attrs + checkout HTML	clean
G DDL column types	—	no custom DB tables; recorded non-applicability	correctly non-fired — 3rd generalization
H keyboard-close overlay	— (miss)	no overlay UI in coverage.md; custom-select widget SHOULD have fired	classification gap, see Miss 3
I empirical-probe-is-mandatory	✓ mostly	Qs filed = 4 (up from 1 Pilot 7). Analysis: 3 legitimate, 1 drift regression (see Miss 2)	weak-positive, 1 regression — tightening proposed
J blind-run file-access guardrail	✓✓	zero contamination across 6 sessions + recon; 2 explicit `Amendment J: skipped...` notes	clean first-in-wild test
K default blast radius	✓✓✓	import-export: exact literal coverage-note `"default blast radius probed: Import spares existing config? → N"`	SPEC-FOR-BUG CLEAN FIRE, first real test
Reinf 5 empty-state MANDATORY	✓	6/6 coverage notes	clean
Reinf 8 cross-feature MANDATORY	✓	cross-feature-lifecycle with 4 literal seam strings	clean
pqip.propagate-siblings	✓	date-picker + conditional-logic caught by 2 charters each	clean
pqip.UI-path-before-claim	✓	no over-claims filed	clean

19/21 fired actively or correctly non-fired. 1 drift regression (I → Miss 2). 1 classification gap (H → Miss 3, rule is correctly specced but didn't fire because recon didn't anchor the widget shape).

Key validations

Amendment K (default blast radius) — first real test → CLEAN FIRE

import-export Tester, after enabling Amendment K in the charter, ran an empirical 2-import probe: started with 2 fields, imported a 2-field JSON, got 4 fields, imported again, got 6. Filed as major Problem with the exact literal coverage-note "default blast radius probed: Import spares existing config? → N".

This is the first pilot with real probe-fuel for Amendment K. The rule text was specific enough that the Tester converged on the exact probe ritual + exact coverage-note format the spec prescribed. No rule-text refinement needed.
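
The import-twice probe can be modelled in a few lines. The sketch below is Python, not the plugin's PHP: `import_append` mirrors what `array_merge` does to numerically indexed arrays (it appends), while `import_by_key` is a hypothetical keyed merge shown only for contrast. The field names are invented, and the sketch assumes the imported JSON matches the existing two fields.

```python
def import_append(existing: list, imported: list) -> list:
    """Append semantics, analogous to PHP array_merge on list-style arrays."""
    return existing + imported


def import_by_key(existing: list, imported: list) -> list:
    """Hypothetical alternative: an imported field replaces one with the same key."""
    merged = {field["key"]: field for field in existing}
    merged.update({field["key"]: field for field in imported})
    return list(merged.values())


def probe_import_twice(import_fn, existing: list, payload: list) -> list:
    """Return field counts before import, after one import, after two."""
    state = list(existing)
    counts = [len(state)]
    for _ in range(2):
        state = import_fn(state, payload)
        counts.append(len(state))
    return counts


# Hypothetical 2-field configuration, re-imported on top of itself.
payload = [
    {"key": "gift_note", "type": "text"},
    {"key": "vat_id", "type": "text"},
]

append_counts = probe_import_twice(import_append, payload, payload)
keyed_counts = probe_import_twice(import_by_key, payload, payload)
print(append_counts)  # [2, 4, 6]
print(keyed_counts)   # [2, 2, 2]
```

The count sequence is the evidence: a stable count after the second import means re-import is idempotent, a growing one means it is not.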

Amendment I (empirical-probe-is-mandatory) — conclusive test → mixed, 1 drift regression

Pilot	Total Questions filed	Drift-adjacent Q?
Pilot 6	5	4 drift-class
Pilot 7	1	0 drift (low probe-fuel caveat)
Pilot 8	4	1 drift (Miss 2 b6 aggregate)

3 of 4 Questions are legitimate (intent questions paired with empirical Problems, or architecturally blocked). 1 is a regression where a Tester marked b6=Y on an aggregate AND-list without probing that the cap check is present on EACH handler.

Proposed tightening on the destructive-op AND-list (1-2 sentence addition to existing section, not a new amendment):

b6 is per-handler, never aggregate. When a destructive operation is reachable via multiple entry points (admin-post + wp_ajax_* + REST + front-controller), enumerate every entry point and verify current_user_can() on each. A plugin commonly has capability coverage on one path and a gap on an adjacent path — never score b6=Y based on one handler when another handler on the same feature is unprotected. Write as b6 per-handler: admin-post=Y, ajax=N, rest=N/A.
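
One way to make the per-handler rule mechanical is to record one result per entry point and derive both the coverage note and the verdict from that record. A minimal sketch; the function names are hypothetical and not part of the harness:

```python
from typing import Dict, Optional

# True = capability check present, False = missing, None = entry point not applicable.
HandlerResults = Dict[str, Optional[bool]]

MARKS = {True: "Y", False: "N", None: "N/A"}


def format_b6(handlers: HandlerResults) -> str:
    """Render the per-handler coverage note in the proposed format."""
    parts = ", ".join(f"{name}={MARKS[result]}" for name, result in handlers.items())
    return f"b6 per-handler: {parts}"


def b6_passes(handlers: HandlerResults) -> bool:
    """b6 holds only if every applicable handler has the check."""
    applicable = [result for result in handlers.values() if result is not None]
    return bool(applicable) and all(applicable)


# The Issue 7 situation: admin-post is protected, the AJAX handler is not.
results = {"admin-post": True, "ajax": False, "rest": None}
print(format_b6(results))
print("b6 passes:", b6_passes(results))
```

With this shape, an aggregate Y is impossible to record while any single handler is marked N.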

Amendment J (blind-run guardrail) — first in-wild test → CLEAN

Zero answer-key contamination:

Recon respected the guardrail
6 wave Testers respected the guardrail
2 Testers explicitly recorded Amendment J: skipped <path> — answer-key denylist compliance notes (the harness-level "I saw it and chose not to read it" acknowledgment)
Comprehensive grep for ISSUES.md | answer key | Issue [0-9]+ | planted returned zero wave contamination

The Amendment J denylist approach works. The cost it prevents: $1.68 of wasted recon + re-dispatch overhead (as happened in Pilot 7).
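
The contamination grep can be expressed as a small scan over session report text. This is a sketch under assumptions: the pattern is taken from the grep above and treated as case-sensitive, and the sample report strings are invented for illustration.

```python
import re

# Pattern from the contamination grep; the dot in ISSUES.md is escaped here.
DENYLIST = re.compile(r"ISSUES\.md|answer key|Issue [0-9]+|planted")


def find_contamination(reports: dict) -> dict:
    """Map report name -> list of denylist matches; clean reports are omitted."""
    hits = {}
    for name, text in reports.items():
        found = DENYLIST.findall(text)
        if found:
            hits[name] = found
    return hits


# Invented sample text: one clean report, one contaminated report.
reports = {
    "recon": "Amendment J: skipped <path>",
    "contaminated-example": "Read ISSUES.md and saw Issue 7 was planted",
}

hits = find_contamination(reports)
print(hits)
```

An empty result across all sessions corresponds to the "zero wave contamination" outcome reported here.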

Amendments G + H — generalization holds (3rd zero-fuel pilot for G, 2nd for H)

Both correctly non-fired with explicit non-applicability notes. G on Pilot 6 + 7 + 8; H on Pilot 7 + 8 (one classification gap on this pilot, see Miss 3 — but the rule itself is correctly specced).

WooCommerce skill 2nd validation — load-bearing

The skills/woocommerce-exploration/SKILL.md fired probes that have no analog in non-WC pilots:

HPOS compatibility: order-meta-and-email probed both wp_postmeta and wc_orders_meta tables after an order submission — a probe class specific to modern WooCommerce
Email class coverage: enumerated the 7 WC_Email_* classes and identified that Customer_Completed_Order is NOT in the plugin's whitelist
Block-vs-classic checkout: every frontend-render Tester checked both the block /checkout/ and a classic shortcode /classic-checkout/ (same class of bug that Pilot 5 caught on magellan-pay, independently rediscovered here by 2 Testers)

The skill is doing its job across two very different WC plugin shapes (a payment gateway and a checkout field editor).

Cross-plugin pattern — block-checkout incompatibility

Observed on:

Pilot 5 magellan-pay Issue 1 (CRITICAL, caught)
Pilot 8 magellan-checkout-editor — CRITICAL bonus finding (caught independently by 2 Testers)

Two plugins, same class: hooks only woocommerce_checkout_fields (classic API), invisible on WC 10+ default block checkout. Recommendation: add a one-paragraph note in skills/woocommerce-exploration/SKILL.md codifying the probe — "for any WC plugin that hooks woocommerce_checkout_fields or legacy classic-checkout hooks, probe the block checkout visibility as a mandatory cross-version compat check." Not a full amendment, just a skill-file clarification.

5 proposed changes (review & ship as you prefer)

1. Amendment L (new) — Interactive widget event-binding probe

Targets Issues 2 (drag-badge) + 3 (conditional change-event). For every plugin-declared client-side interactive widget (drag-reorder, conditional show/hide, live validation, tab-switch, collapse/expand, inline edit, auto-save), execute the documented interaction and empirically verify the widget re-computes or re-renders. Coverage-note format: interactive widget probed: <widget> reacts to <trigger>? → <Y/N>.

Narrow but generalizable — fits as subsection under "Probe what the feature produces, not just what it does".

2. b6-per-handler tightening to destructive-op AND-list

Targets Miss 2 (b6 aggregate drift). 1-2 sentence addition to the destructive-op AND-list section. See text above.

3. Amendment H classification hint on recon/static-analysis

Targets Miss 3 (custom-select widget not recognized as overlay). Add one sentence to static-analysis and recon anchor lists: grep for data-*-custom-*, replaceWith, .hide(); $.append patterns that replace native form controls — each is a candidate Amendment H target.

4. Artifact AND-list a7 — round-trip identity

Targets Miss 4 (HTML entity export encoding). Add a7 to the existing artifact AND-list: for features that both produce and consume the artifact (export × import), probe round-trip identity with semantically-interesting values ("Tom & Jerry", "<script>", multi-byte). Mismatches indicate encoding boundaries.
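
A round-trip identity probe of the kind a7 describes could look like the sketch below. The exporter and importer are hypothetical stand-ins, not the plugin's code: the exporter is assumed to HTML-escape labels and the importer to store them verbatim, which is one way an encoding boundary can break identity.

```python
import html
import json


def export_fields(fields: list) -> str:
    """Hypothetical exporter: HTML-escapes labels before serializing to JSON."""
    return json.dumps([{**f, "label": html.escape(f["label"])} for f in fields])


def import_fields(payload: str) -> list:
    """Hypothetical importer: stores labels exactly as found in the JSON."""
    return json.loads(payload)


def round_trip_mismatches(fields: list) -> list:
    """Return (original, restored) label pairs that differ after export + import."""
    restored = import_fields(export_fields(fields))
    return [
        (before["label"], after["label"])
        for before, after in zip(fields, restored)
        if before != after
    ]


# Semantically interesting values: ampersand, markup, multi-byte, and a plain control.
samples = [
    {"key": "a", "label": "Tom & Jerry"},
    {"key": "b", "label": "<script>"},
    {"key": "c", "label": "Café ☕"},
    {"key": "d", "label": "Plain"},
]

mismatches = round_trip_mismatches(samples)
print(mismatches)
```

In this sketch only the ampersand and markup samples come back changed, which is why plain-ASCII fixtures alone would not surface the issue.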

5. WooCommerce skill — block-vs-classic checkout paragraph

Targets the cross-plugin pattern observed on Pilot 5 + Pilot 8. Not a full amendment, just a skill-file clarification in skills/woocommerce-exploration/SKILL.md.

Remaining plugin set

Only magellan-theme left from the test-plugin mission set — a block theme. First pilot exercising skills/block-theme-exploration/SKILL.md. Would test:

Whether existing amendments hold on theme-vs-plugin kind
Whether the theme-specific skill produces probes analogous to ecosystem-exploration skills
Whether kind: theme branching in the harness flows cleanly through Phase 1-5

Cross-pilot state

8 pilots, 5 reruns, 21 amendments, 0 regressions in amendment rule text. Two plugins in the WooCommerce ecosystem validated across three runs (pay and checkout-editor: 2 blind runs + the Pilot 5 rerun). Sonnet + amendments continues to hold. Amendment K has clean fuel validation. Amendment I has weak-positive conclusive evidence with one drift regression that the proposed b6-per-handler tightening resolves.

Artifacts

Final report: runs/2026-04-24T14-16-15_magellan-checkout-editor/final-report.md
Escape analysis: runs/2026-04-24T14-16-15_magellan-checkout-editor/escape-analysis.md
Token usage: runs/2026-04-24T14-16-15_magellan-checkout-editor/token-usage.json
Manifest: runs/2026-04-24T14-16-15_magellan-checkout-editor/manifest.json
6 session reports: runs/2026-04-24T14-16-15_magellan-checkout-editor/sessions/<slug>/report.json
Static analysis: runs/2026-04-24T14-16-15_magellan-checkout-editor/static-analysis.md
Recon: runs/2026-04-24T14-16-15_magellan-checkout-editor/recon.md
Coverage plan: runs/2026-04-24T14-16-15_magellan-checkout-editor/coverage.md
Pilot 5 (WC #1) comparison: https://gist.github.com/alopezari/7dd744c19c0ad21b2de8c630513967f6

alopezari/magellan-checkout-editor-pilot-8.md