Run ID: 2026-04-24T14-16-15_magellan-checkout-editor
Plugin: magellan-checkout-editor v1.0.0 — WooCommerce extension for custom checkout fields (drag+drop, 7 field types, conditional logic, validation, order-meta, email injection, JSON import/export)
Ecosystem: woocommerce (2nd pilot exercising skills/woocommerce-exploration/SKILL.md)
Driver: Chrome DevTools MCP with --experimental-page-id-routing
Dispatch: 6 charters + 1 recon in one concurrent wave (Sonnet-default, blind greybox)
Wallclock: ~26 min end-to-end
- Recall: 6/10 strict (5 caught-exact + 1 caught-bundled, 4 missed). Lower than Pilot 5's 8/10 on the prior WC plugin, but with 13 off-answer-key bonus Problems including a CRITICAL block-checkout incompatibility (same class-of-bug Pilot 5 caught on magellan-pay).
- Amendment K (default blast radius) first real test → CLEAN FIRE. `import-export` Tester wrote the exact literal coverage-note "default blast radius probed: Import spares existing config? → N" and filed as major with the 2→4→6 empirical import-twice probe. Spec-for-bug.
- Amendment I (empirical-probe-is-mandatory) conclusive test → mixed but holding. 4 Questions filed (up from 1 in Pilot 7). Analysis: 3 are legitimate intent-questions paired with empirical Problems; 1 is a drift regression (Tester marked b6 cap-check Y on an aggregate when one handler has it and another doesn't).
- Amendment J (blind-run file-access guardrail) → CLEAN. Zero answer-key contamination across 6 Testers + recon. Two Testers explicitly recorded `Amendment J: skipped <path> — answer-key denylist` compliance notes.
- Amendments G + H correctly non-fired: Testers recorded explicit non-applicability.
- WooCommerce ecosystem skill 2nd validation: load-bearing. Drove HPOS-compat probes, email-class-coverage probes, and block-vs-classic-checkout probes that have no analog in non-WC pilots.
Cross-pilot arc:
| Pilot | Shape | Mode | Original | Rerun |
|---|---|---|---|---|
| 1 (backups) | artifact | Opus | 10/10 | — |
| 2 (contact-forms) | form/email | Opus | 7/10 | 10/10 |
| 3 (members) | role/restriction | Opus | 5/10 | 10/10 |
| 4 (seo-toolkit) | metadata | Sonnet | 4/10 | 10/10 |
| 5 (pay, WC #1) | WC gateway | Sonnet | 8/10 | 9/10 |
| 6 (gallery) | file/media | Sonnet | 8/10 | — |
| 7 (speed) | caching / Time | Sonnet | 8/10 | — |
| 8 (checkout-editor, WC #2) | WC field editor | Sonnet | 6/10 | — |
6/10 planted + 13 bonus findings. 23 Problems, 4 Questions, 11 Improvements, 12 Praises.
Severity: 1 critical, 9 major, 12 minor, 1 trivial.
| Charter | Priority | P | Q | I | ! | Duration | Tool uses |
|---|---|---|---|---|---|---|---|
| fields-admin-crud | critical | 3 | 0 | 1 | 2 | 9m 04s | 78 |
| frontend-render-and-validation | critical | 5 | 0 | 2 | 2 | 10m 44s | 82 |
| order-meta-and-email | high | 3 | 1 | 2 | 2 | 11m 00s | 66 |
| import-export | high | 4 | 1 | 2 | 2 | 10m 44s | 84 |
| cross-feature-lifecycle | high | 3 | 1 | 2 | 3 | 10m 46s | 74 |
| field-variants | medium | 5 | 1 | 2 | 1 | 9m 21s | 79 |
| Totals | | 23 | 4 | 11 | 12 | 61m 38s serial | 463 |
| # | Planted issue | Verdict | Amendment fired |
|---|---|---|---|
| 1 | Date-picker class mismatch (`.mce-datepicker` vs `.mce-date-picker`) | caught-exact (×2) | field-variants + frontend-render, empirical |
| 2 | Position badges don't update after drag | missed | drag flow budgeted away by fields-admin-crud |
| 3 | Conditional-logic change-event binding missing | missed | frontend-render-and-validation probed the validation side but not the JS show/hide event binding |
| 4 | Wrong regex error message (always "is required") | caught-exact (×2) | field-variants + frontend-render — exact dead-$msg variable identified |
| 5 | Import `array_merge` doubles on re-import | caught-exact (×2) | Amendment K clean fire: import-export + cross-feature-lifecycle |
| 6 | Orphaned `_mce_*` postmeta when field removed | caught-bundled | cross-feature-lifecycle filed broader lifecycle-cleanup Problem; same root-cause class |
| 7 | Import AJAX handler lacks `current_user_can()` | missed (with anti-catch) | Amendment I drift regression: Tester marked b6=Y on aggregate; admin-post has cap but AJAX doesn't |
| 8 | Custom fields missing from Customer Completed Order email | caught-exact | order-meta-and-email — enumerated whitelist + empirical reproducer |
| 9 | Custom-select fields not keyboard-accessible | missed | Amendment H classification gap — coverage.md said "no overlay UI" but custom-select IS a dropdown overlay |
| 10 | HTML entity round-trip in JSON export | missed | Artifact AND-list a1-a6 covers contents/inclusion but not round-trip identity |
5 caught-exact + 1 caught-bundled + 4 missed = 6/10 strict.
- CRITICAL: Block checkout invisibility. WC 10.7 default `/checkout/` block has `customFieldCount=0`; plugin hooks only `woocommerce_checkout_fields` (classic API). Same class-of-bug pattern as Pilot 5 magellan-pay. Caught independently by 2 Testers.
- MAJOR: Invalid admin-supplied regex causes PHP Warning + HTTP 500 on order review AJAX
- MAJOR: Conditional-hidden-required fields still server-validate (checkout blocks with "is required" even though field is hidden)
- MAJOR: HPOS shadow-post reliance. `update_post_meta($order_id, ...)` writes to `wp_postmeta` shadow rows; `$order->get_meta('_mce_phone')` returns empty
- MAJOR: Cross-section field-key collision. Two fields with `key=phone` in billing + shipping both write to `_mce_phone`, last-write-wins (data loss)
- MAJOR: Save Fields button no disable-on-click (empirical rapid-double-submit fired 5 concurrent)
- MINOR: Silent save (no success notice despite `?saved=1` in redirect)
- MINOR: Duplicate field-key within same section saves silently (no uniqueness validation)
- MINOR: Import stores raw HTML/script tags unsanitized (admin display escapes, but data-at-rest is tainted)
- MINOR: No lifecycle hooks → option + postmeta persist after uninstall
- MINOR: No beforeunload on settings form (Amendment D empirical)
- MINOR: Leading comma in options CSV → blank first select option overrides placeholder
- MINOR: Customer Completed Order email blocked: admin sees "Sorry, you are not allowed" when WC deactivated while plugin active
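
The cross-section key collision in the list above can be illustrated with a small Python model. This is a sketch, not the plugin's PHP: the field dictionaries, `save_order_meta`, and the section-qualified key scheme are hypothetical names chosen for illustration; only the `_mce_` prefix and the `phone` key come from the findings.

```python
def flat_meta_key(field):
    # Key scheme implied by the finding: the section is ignored.
    return f"_mce_{field['key']}"


def section_meta_key(field):
    # Hypothetical alternative: qualify the key with its section.
    return f"_mce_{field['section']}_{field['key']}"


def save_order_meta(fields, posted, key_fn):
    """Write one meta entry per field; a later field overwrites an earlier one on key clash."""
    meta = {}
    for field in fields:
        meta[key_fn(field)] = posted[(field["section"], field["key"])]
    return meta


fields = [
    {"section": "billing", "key": "phone"},
    {"section": "shipping", "key": "phone"},
]
posted = {("billing", "phone"): "555-0100", ("shipping", "phone"): "555-0199"}

flat = save_order_meta(fields, posted, flat_meta_key)
qualified = save_order_meta(fields, posted, section_meta_key)
print(flat)       # one entry: the billing value was overwritten
print(qualified)  # two entries: both values survive
```

A uniqueness check on the derived meta key at save time would surface both this collision and the same-section duplicate-key finding.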
| | Manager (Opus 4.7) | Subagents (Sonnet 4.6) | Total |
|---|---|---|---|
| Agents | 1 | 7 (1 recon + 6 Testers) | 8 |
| Messages | 77 | 744 | 821 |
| Fresh input | 192 | 786 | 978 |
| Output | 111,222 | 64,357 | 175,579 |
| Cache-create 5m | 0 | 1,520,558 | 1,520,558 |
| Cache-create 1h | 167,062 | 0 | 167,062 |
| Cache-read | 48,610,363 | 70,092,652 | 118,703,015 |
| Total tokens | 48,888,839 | 72,464,353 | 121,353,192 |
| Cost | $28.76 | $27.70 | $56.45 |
97.8% cache-read. Manager cost holds similar to prior pilots; cache-create 1h (Manager-level) is shrinking as the context stabilizes.
| Category | Tokens | Cost | % |
|---|---|---|---|
| Cache-read | 118,703,015 | $45.33 | 80.3% |
| Cache-create 5m | 1,520,558 | $5.70 | 10.1% |
| Output | 175,579 | $3.75 | 6.6% |
| Cache-create 1h | 167,062 | $1.67 | 3.0% |
| Fresh input | 978 | $0.00 | 0.0% |
| Session | Role | Duration | Tool uses | Msgs | Input | Output | cc5m | cr | Cost |
|---|---|---|---|---|---|---|---|---|---|
| recon | scout | 5m 26s | 57 | 75 | 81 | 8,237 | 283,575 | 6,373,306 | $3.10 |
| fields-admin-crud | Tester | 9m 04s | 78 | 112 | 118 | 10,946 | 299,085 | 11,135,772 | $4.63 |
| field-variants | Tester | 9m 21s | 79 | 114 | 120 | 13,103 | 173,542 | 9,942,513 | $3.83 |
| frontend-render-and-validation | Tester | 10m 44s | 82 | 120 | 126 | 7,366 | 203,246 | 11,872,187 | $4.43 |
| import-export | Tester | 10m 44s | 84 | 126 | 132 | 6,826 | 207,306 | 12,731,470 | $4.70 |
| cross-feature-lifecycle | Tester | 10m 46s | 74 | 100 | 106 | 9,436 | 177,950 | 9,222,920 | $3.58 |
| order-meta-and-email | Tester | 11m 00s | 66 | 97 | 103 | 8,443 | 175,854 | 8,814,484 | $3.43 |
| Totals (7) | | 67m 05s serial | 520 | 744 | 786 | 64,357 | 1,520,558 | 70,092,652 | $27.70 |
Concurrent-wave compression: 6-Tester wave wallclock ~11m (bounded by longest Tester), sequential-equivalent 61m 38s. Compression ratio: 5.6× — highest of any pilot so far. Reasons: charter surfaces were well-scoped, recon was serial but fast (5m 26s), and the WooCommerce provisioning penalty amortized across 6 parallel sites.
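
The compression figure can be reproduced from the per-Tester durations in the table above. The sketch below assumes the wave wallclock equals the longest Tester duration and ignores dispatch overhead; the helper name `to_seconds` is illustrative.

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds


# Tester durations (minutes, seconds) as listed in the per-session table.
tester_durations = {
    "fields-admin-crud": (9, 4),
    "field-variants": (9, 21),
    "frontend-render-and-validation": (10, 44),
    "import-export": (10, 44),
    "cross-feature-lifecycle": (10, 46),
    "order-meta-and-email": (11, 0),
}

seconds = [to_seconds(m, s) for m, s in tester_durations.values()]
serial_s = sum(seconds)  # sequential-equivalent time
wave_s = max(seconds)    # concurrent wave is bounded by the longest Tester
ratio = serial_s / wave_s
print(f"serial={serial_s}s wave={wave_s}s ratio={ratio:.1f}x")
```

The summed table values come to one second more than the reported 61m 38s, presumably from sub-second rounding in the displayed durations; the ratio rounds to 5.6× either way.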
Notable per-Tester observations:
- Cost discipline is visibly stabilizing — 6 Testers spent $4.63 / $3.83 / $4.43 / $4.70 / $3.58 / $3.43 — a much tighter band than prior pilots where individual Tester costs ranged 2-3× within a wave.
- `fields-admin-crud` was slowest per-token: 78 tool calls in 9m 04s, dominated by role rotation (admin → shop_manager → editor) + three-state form probes.
- `order-meta-and-email` was the cheapest at $3.43 with the fewest tool calls (66); the Tester was efficient with the email-class enumeration probe.
| Denominator | Value |
|---|---|
| Total cost / planted caught (6) | $9.41 per planted bug |
| Total cost / all Problems (23) | $2.45 per Problem |
| Total cost / all PQIP items (50) | $1.13 per PQIP item |
Per-planted is up a hair vs Pilot 7 ($8.78) because recall dropped 8→6 (more misses, same cost envelope). Per-Problem is LOWEST of any pilot, reflecting the high bonus-finding yield — the harness is producing strong insight even when answer-key recall slips.
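
For reference, the three cost denominators are plain divisions of the run total. A minimal sketch using only figures reported above:

```python
total_cost = 56.45  # USD, Manager + Subagents

pqip_items = 23 + 4 + 11 + 12  # Problems + Questions + Improvements + Praises
denominators = {
    "planted caught": 6,
    "all Problems": 23,
    "all PQIP items": pqip_items,
}

cost_per = {name: round(total_cost / n, 2) for name, n in denominators.items()}
for name, value in cost_per.items():
    print(f"${value:.2f} per {name}")
```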
| # | Amendment | Fired? | Where | Notes |
|---|---|---|---|---|
| 1 | empty-state | ✓ | 6/6 coverage notes | clean |
| 2 | absence-of-feature | ✓✓ | cross-feature-lifecycle (no hooks), fields-admin-crud (silent save, duplicate keys) | clean |
| 3 | plugin-native writes | ✓ | Testers used admin UI + UI checkout submit, not only DB seeds | clean |
| 4 | cross-feature MANDATORY | ✓✓ | cross-feature-lifecycle filed 4 seams with literal Reinforcement-8 strings | load-bearing |
| 5 | UI-path-before-claim | ✓ | no over-claims | clean |
| A | inline counters | — | no fuel | correct non-fire |
| B | state variety | ✓ | field-variants exercised all 7 types + import-export sanitize-bypass probe | clean |
| C | enumerate root cause | ✓ | frontend-render-and-validation chained 3 validation bugs from one module | clean |
| D | unsaved-work | ✓ | fields-admin-crud: `window.onbeforeunload` null, filed as Problem with empirical evidence | drift-free (Amendment I holds) |
| E | admin two-tab | — | no admin-form concurrent-edit bug | correct non-fire |
| E-ext | rapid-double-submit | ✓ | fields-admin-crud (5 concurrent Save) + import-export (5 concurrent Import) empirically | clean |
| F | view-source HTML | ✓ | frontend-render-and-validation view-sourced condition-attrs + checkout HTML | clean |
| G | DDL column types | — | no custom DB tables; recorded non-applicability | correctly non-fired, 3rd generalization |
| H | keyboard-close overlay | — (miss) | no overlay UI in coverage.md; custom-select widget SHOULD have fired | classification gap, see Miss 3 |
| I | empirical-probe-is-mandatory | ✓ mostly | Qs filed = 4 (up from 1 Pilot 7). Analysis: 3 legitimate, 1 drift regression (see Miss 2) | weak-positive, 1 regression; tightening proposed |
| J | blind-run file-access guardrail | ✓✓ | zero contamination across 6 sessions + recon; 2 explicit `Amendment J: skipped...` notes | clean first-in-wild test |
| K | default blast radius | ✓✓✓ | import-export: exact literal coverage-note "default blast radius probed: Import spares existing config? → N" | SPEC-FOR-BUG CLEAN FIRE, first real test |
| Reinf 5 | empty-state MANDATORY | ✓ | 6/6 coverage notes | clean |
| Reinf 8 | cross-feature MANDATORY | ✓ | cross-feature-lifecycle with 4 literal seam strings | clean |
| | pqip.propagate-siblings | ✓ | date-picker + conditional-logic caught by 2 charters each | clean |
| | pqip.UI-path-before-claim | ✓ | no over-claims filed | clean |
19/21 fired actively or correctly non-fired. 1 drift regression (I → Miss 2). 1 classification gap (H → Miss 3, rule is correctly specced but didn't fire because recon didn't anchor the widget shape).
import-export Tester, after enabling Amendment K in the charter, ran an empirical 2-import probe: started with 2 fields, imported a 2-field JSON, got 4 fields, imported again, got 6. Filed as major Problem with the exact literal coverage-note "default blast radius probed: Import spares existing config? → N".
This is the first pilot with real probe-fuel for Amendment K. The rule text was specific enough that the Tester converged on the exact probe ritual + exact coverage-note format the spec prescribed. No rule-text refinement needed.
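
The import-twice probe can be modelled in a few lines of Python. This is an illustrative model, not the plugin's PHP: `import_append` stands in for an `array_merge`-style append on a numerically indexed list (the planted bug), while `import_replace` and the helper names are hypothetical.

```python
def import_append(existing, incoming):
    # Append-style merge: every import adds the payload again.
    return existing + incoming


def import_replace(existing, incoming):
    # Hypothetical alternative: the payload replaces the stored config.
    return list(incoming)


def import_twice_probe(import_fn, existing, payload):
    """Return the field count before the probe and after each of two imports."""
    state = list(existing)
    counts = [len(state)]
    for _ in range(2):
        state = import_fn(state, payload)
        counts.append(len(state))
    return counts


existing = [{"key": "phone"}, {"key": "notes"}]
payload = [{"key": "phone"}, {"key": "notes"}]

append_counts = import_twice_probe(import_append, existing, payload)
replace_counts = import_twice_probe(import_replace, existing, payload)
print(append_counts)   # growth on every import
print(replace_counts)  # stable count
```

Importing the same payload twice is what separates "merge once" from "grows on every import": a single import cannot tell the two apart.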
| Pilot | Total Questions filed | Drift-adjacent Q? |
|---|---|---|
| Pilot 6 | 5 | 4 drift-class |
| Pilot 7 | 1 | 0 drift (low probe-fuel caveat) |
| Pilot 8 | 4 | 1 drift (Miss 2 b6 aggregate) |
3 of 4 Questions are legitimate (intent questions paired with empirical Problems, or architecturally blocked). 1 is a regression where a Tester marked b6=Y on an aggregate AND-list without probing that the cap check is present on EACH handler.
Proposed tightening on the destructive-op AND-list (1-2 sentence addition to existing section, not a new amendment):
b6 is per-handler, never aggregate. When a destructive operation is reachable via multiple entry points (admin-post + `wp_ajax_*` + REST + front-controller), enumerate every entry point and verify `current_user_can()` on each. A plugin commonly has capability coverage on one path and a gap on an adjacent path: never score b6=Y based on one handler when another handler on the same feature is unprotected. Write as `b6 per-handler: admin-post=Y, ajax=N, rest=N/A`.
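
A minimal sketch of how the per-handler scoring could be recorded and rolled up, assuming each entry point is marked `True` (cap check present), `False` (absent), or `None` (entry point does not exist). The function names are hypothetical and not part of the harness.

```python
from typing import Dict, Optional


def _mark(value: Optional[bool]) -> str:
    if value is None:
        return "N/A"
    return "Y" if value else "N"


def b6_note(handlers: Dict[str, Optional[bool]]) -> str:
    """Format the coverage note with one verdict per entry point."""
    parts = ", ".join(f"{name}={_mark(value)}" for name, value in handlers.items())
    return f"b6 per-handler: {parts}"


def b6_passes(handlers: Dict[str, Optional[bool]]) -> bool:
    """b6 holds only if every existing entry point has the capability check."""
    applicable = [value for value in handlers.values() if value is not None]
    return bool(applicable) and all(applicable)


# Shape of the Pilot 8 miss: admin-post is protected, the AJAX handler is not.
handlers = {"admin-post": True, "ajax": False, "rest": None}
note = b6_note(handlers)
print(note)
print("b6 passes:", b6_passes(handlers))
```

Rolling up with `all()` over applicable handlers makes the aggregate fail as soon as one path is unprotected, which is the behaviour the tightening asks for.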
Zero answer-key contamination:
- Recon respected the guardrail
- 6 wave Testers respected the guardrail
- 2 Testers explicitly recorded `Amendment J: skipped <path> — answer-key denylist` compliance notes (the harness-level "I saw it and chose not to read it" acknowledgment)
- Comprehensive grep for `ISSUES.md | answer key | Issue [0-9]+ | planted` returned zero wave contamination
The Amendment J denylist approach works. The cost it prevents: $1.68 of wasted recon + re-dispatch overhead (as happened in Pilot 7).
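
The contamination check can be approximated in Python. This sketch assumes the spaces around the pipes in the pattern above are display formatting and escapes the dot in `ISSUES.md`; the sample transcript lines are invented for illustration, and matching is case-sensitive like default `grep`.

```python
import re

# Alternation taken from the grep above, with whitespace around "|" removed.
DENYLIST = re.compile(r"ISSUES\.md|answer key|Issue [0-9]+|planted")


def contaminated_lines(text):
    """Return (line_number, line) pairs that match any denylist term."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if DENYLIST.search(line)
    ]


clean_transcript = "\n".join([
    "Probed import twice: 2 -> 4 -> 6 fields",
    "Amendment J: skipped <path> (answer-key denylist)",
])
tainted_transcript = "\n".join([
    "Read ISSUES.md to confirm the bug list",
    "Matches Issue 7 from the key",
])

clean_hits = contaminated_lines(clean_transcript)
tainted_hits = contaminated_lines(tainted_transcript)
print(clean_hits)
print(tainted_hits)
```

The hyphenated `answer-key` in a compliance note does not match the space-separated `answer key` term, so a compliance note is not itself flagged as contamination unless the skipped path matches another term.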
Both correctly non-fired with explicit non-applicability notes. G on Pilot 6 + 7 + 8; H on Pilot 7 + 8 (one classification gap on this pilot, see Miss 3 — but the rule itself is correctly specced).
The skills/woocommerce-exploration/SKILL.md fired probes that have no analog in non-WC pilots:
- HPOS compatibility: `order-meta-and-email` probed both `wp_postmeta` and `wc_orders_meta` tables after an order submission, a probe class specific to modern WooCommerce
- Email class coverage: enumerated the 7 `WC_Email_*` classes and identified that Customer_Completed_Order is NOT in the plugin's whitelist
- Block-vs-classic checkout: every frontend-render Tester checked both the block `/checkout/` and a classic shortcode `/classic-checkout/` (same class of bug that Pilot 5 caught on magellan-pay, independently rediscovered here by 2 Testers)
The skill is doing its job across two very different WC plugin shapes (a gateway and a field editor).
Observed on:
- Pilot 5 magellan-pay Issue 1 (CRITICAL, caught)
- Pilot 8 magellan-checkout-editor — CRITICAL bonus finding (caught independently by 2 Testers)
Two plugins, same class: hooks only woocommerce_checkout_fields (classic API), invisible on WC 10+ default block checkout. Recommendation: add a one-paragraph note in skills/woocommerce-exploration/SKILL.md codifying the probe — "for any WC plugin that hooks woocommerce_checkout_fields or legacy classic-checkout hooks, probe the block checkout visibility as a mandatory cross-version compat check." Not a full amendment, just a skill-file clarification.
Targets Issues 2 (drag-badge) + 3 (conditional change-event). For every plugin-declared client-side interactive widget (drag-reorder, conditional show/hide, live validation, tab-switch, collapse/expand, inline edit, auto-save), execute the documented interaction and empirically verify the widget re-computes or re-renders. Coverage-note format: interactive widget probed: <widget> reacts to <trigger>? → <Y/N>.
Narrow but generalizable — fits as subsection under "Probe what the feature produces, not just what it does".
Targets Miss 2 (b6 aggregate drift). 1-2 sentence addition to the destructive-op AND-list section. See text above.
Targets Miss 3 (custom-select widget not recognized as overlay). Add one sentence to static-analysis and recon anchor lists: grep for data-*-custom-*, replaceWith, .hide(); $.append patterns that replace native form controls — each is a candidate Amendment H target.
Targets Miss 4 (HTML entity export encoding). Add a7 to the existing artifact AND-list: for features that both produce and consume the artifact (export × import), probe round-trip identity with semantically-interesting values ("Tom & Jerry", "<script>", multi-byte). Mismatches indicate encoding boundaries.
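
A round-trip identity probe for the proposed a7 might look like the following sketch. The exporter here is a hypothetical stand-in that HTML-escapes labels before serializing, and the importer does not reverse it; neither is the plugin's actual code.

```python
import html
import json


def export_fields(fields):
    # Hypothetical exporter: escapes labels, then serializes to JSON.
    escaped = [{**field, "label": html.escape(field["label"])} for field in fields]
    return json.dumps(escaped)


def import_fields(payload):
    # Hypothetical importer: parses JSON but never unescapes entities.
    return json.loads(payload)


def round_trip_mismatches(fields):
    """Return (original, restored) label pairs that did not survive export then import."""
    restored = import_fields(export_fields(fields))
    return [
        (before["label"], after["label"])
        for before, after in zip(fields, restored)
        if before != after
    ]


probe_values = ["Tom & Jerry", "<script>", "日本語", "Plain label"]
fields = [{"key": f"f{i}", "label": value} for i, value in enumerate(probe_values)]

mismatches = round_trip_mismatches(fields)
for original, restored in mismatches:
    print(f"{original!r} -> {restored!r}")
```

Only values containing characters with special meaning at an encoding boundary change, which is why the probe needs semantically interesting inputs and not just plain labels.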
Targets the cross-plugin pattern observed on Pilot 5 + Pilot 8. Not a full amendment, just a skill-file clarification in skills/woocommerce-exploration/SKILL.md.
Only magellan-theme left from the test-plugin mission set — a block theme. First pilot exercising skills/block-theme-exploration/SKILL.md. Would test:
- Whether existing amendments hold on theme-vs-plugin kind
- Whether the theme-specific skill produces probes analogous to ecosystem-exploration skills
- Whether `kind: theme` branching in the harness flows cleanly through Phase 1-5
8 pilots, 5 reruns, 21 amendments, 0 regressions in amendment rule text. Three WooCommerce runs validated across two plugins (pay, checkout-editor: 2 blind + Pilot 5 rerun). Sonnet + amendments continues to hold. Amendment K has clean fuel validation. Amendment I has weak-positive evidence with one drift regression that the proposed b6-per-handler tightening resolves.
- Final report: `runs/2026-04-24T14-16-15_magellan-checkout-editor/final-report.md`
- Escape analysis: `runs/2026-04-24T14-16-15_magellan-checkout-editor/escape-analysis.md`
- Token usage: `runs/2026-04-24T14-16-15_magellan-checkout-editor/token-usage.json`
- Manifest: `runs/2026-04-24T14-16-15_magellan-checkout-editor/manifest.json`
- 6 session reports: `runs/2026-04-24T14-16-15_magellan-checkout-editor/sessions/<slug>/report.json`
- Static analysis: `runs/2026-04-24T14-16-15_magellan-checkout-editor/static-analysis.md`
- Recon: `runs/2026-04-24T14-16-15_magellan-checkout-editor/recon.md`
- Coverage plan: `runs/2026-04-24T14-16-15_magellan-checkout-editor/coverage.md`
- Pilot 5 (WC #1) comparison: https://gist.github.com/alopezari/7dd744c19c0ad21b2de8c630513967f6