@alopezari
Created April 24, 2026 14:53
Magellan Pilot 8 — magellan-checkout-editor (2nd WC pilot, 6/10 blind; Amendment K clean fire + Amendment J discipline validated + one Amendment I drift regression)

Magellan Pilot 8 — magellan-checkout-editor (2nd WooCommerce pilot; Amendment K first real test)

  • Run ID: 2026-04-24T14-16-15_magellan-checkout-editor
  • Plugin: magellan-checkout-editor v1.0.0, a WooCommerce extension for custom checkout fields (drag+drop, 7 field types, conditional logic, validation, order-meta, email injection, JSON import/export)
  • Ecosystem: woocommerce (2nd pilot exercising skills/woocommerce-exploration/SKILL.md)
  • Driver: Chrome DevTools MCP with --experimental-page-id-routing
  • Dispatch: 6 charters + 1 recon in one concurrent wave (Sonnet-default, blind greybox)
  • Wallclock: ~26 min end-to-end


TL;DR: Amendment K cleanly validated; Amendment I conclusive test mixed; big bonus surface

  • Recall: 6/10 strict (5 caught-exact + 1 caught-bundled, 4 missed). Lower than Pilot 5's 8/10 on the prior WC plugin, but with 13 off-answer-key bonus Problems including a CRITICAL block-checkout incompatibility (same class-of-bug Pilot 5 caught on magellan-pay).
  • Amendment K (default blast radius) first real test → CLEAN FIRE. import-export Tester wrote the exact literal coverage-note "default blast radius probed: Import spares existing config? → N" and filed as major with the 2→4→6 empirical import-twice probe. Spec-for-bug.
  • Amendment I (empirical-probe-is-mandatory) conclusive test → mixed but holding. 4 Questions filed (up from 1 in Pilot 7). Analysis: 3 are legitimate intent-questions paired with empirical Problems; 1 is a drift regression (Tester marked b6 cap-check Y on an aggregate when one handler has it and another doesn't).
  • Amendment J (blind-run file-access guardrail) → CLEAN. Zero answer-key contamination across 6 Testers + recon. Two Testers explicitly recorded Amendment J: skipped <path> — answer-key denylist compliance notes.
  • Amendments G + H correctly non-fired — Testers recorded explicit non-applicability.
  • WooCommerce ecosystem skill 2nd validation: load-bearing. Drove HPOS-compat probes, email-class-coverage probes, and block-vs-classic-checkout probes that have no analog in non-WC pilots.

Cross-pilot arc:

| Pilot | Shape | Mode | Original | Rerun |
|---|---|---|---|---|
| 1 (backups) | artifact | Opus | 10/10 | |
| 2 (contact-forms) | form/email | Opus | 7/10 | 10/10 |
| 3 (members) | role/restriction | Opus | 5/10 | 10/10 |
| 4 (seo-toolkit) | metadata | Sonnet | 4/10 | 10/10 |
| 5 (pay, WC #1) | WC gateway | Sonnet | 8/10 | 9/10 |
| 6 (gallery) | file/media | Sonnet | 8/10 | |
| 7 (speed) | caching / Time | Sonnet | 8/10 | |
| 8 (checkout-editor, WC #2) | WC field editor | Sonnet | 6/10 | |

Reliability — PQIP totals

6/10 planted + 13 bonus findings. 23 Problems, 4 Questions, 11 Improvements, 12 Praises.

Severity: 1 critical, 9 major, 12 minor, 1 trivial.

Per-charter PQIP

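| Charter | Priority | P | Q | I | ! | Duration | Tool uses |
|---|---|---|---|---|---|---|---|
| fields-admin-crud | critical | 3 | 0 | 1 | 2 | 9m 04s | 78 |
| frontend-render-and-validation | critical | 5 | 0 | 2 | 2 | 10m 44s | 82 |
| order-meta-and-email | high | 3 | 1 | 2 | 2 | 11m 00s | 66 |
| import-export | high | 4 | 1 | 2 | 2 | 10m 44s | 84 |
| cross-feature-lifecycle | high | 3 | 1 | 2 | 3 | 10m 46s | 74 |
| field-variants | medium | 5 | 1 | 2 | 1 | 9m 21s | 79 |
| Totals | | 23 | 4 | 11 | 12 | 61m 38s serial | 463 |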

Before/after against the 10 planted issues

| # | Planted issue | Verdict | Amendment fired |
|---|---|---|---|
| 1 | Date-picker class mismatch (.mce-datepicker vs .mce-date-picker) | caught-exact (×2) | field-variants + frontend-render, empirical |
| 2 | Position badges don't update after drag | missed | drag flow budgeted away by fields-admin-crud |
| 3 | Conditional-logic change-event binding missing | missed | frontend-render-and-validation probed the validation side but not the JS show/hide event binding |
| 4 | Wrong regex error message (always "is required") | caught-exact (×2) | field-variants + frontend-render; exact dead-$msg variable identified |
| 5 | Import array_merge doubles on re-import | caught-exact (×2) | Amendment K clean fire: import-export + cross-feature-lifecycle |
| 6 | Orphaned _mce_* postmeta when field removed | caught-bundled | cross-feature-lifecycle filed broader lifecycle-cleanup Problem; same root-cause class |
| 7 | Import AJAX handler lacks current_user_can() | missed (with anti-catch) | Amendment I drift regression: Tester marked b6=Y on aggregate; admin-post has cap but AJAX doesn't |
| 8 | Custom fields missing from Customer Completed Order email | caught-exact | order-meta-and-email: enumerated whitelist + empirical reproducer |
| 9 | Custom-select fields not keyboard-accessible | missed | Amendment H classification gap: coverage.md said "no overlay UI" but custom-select IS a dropdown overlay |
| 10 | HTML entity round-trip in JSON export | missed | Artifact AND-list a1-a6 covers contents/inclusion but not round-trip identity |

5 caught-exact + 1 caught-bundled + 4 missed = 6/10 strict.

Bonus findings beyond the answer key (13)

  • CRITICAL: Block checkout invisibility — WC 10.7 default /checkout/ block has customFieldCount=0; plugin hooks only woocommerce_checkout_fields (classic API). Same class-of-bug pattern as Pilot 5 magellan-pay. Caught independently by 2 Testers.
  • MAJOR: Invalid admin-supplied regex causes PHP Warning + HTTP 500 on order review AJAX
  • MAJOR: Conditional-hidden-required fields still server-validate (checkout blocks with "is required" even though field is hidden)
  • MAJOR: HPOS shadow-post reliance — update_post_meta($order_id, ...) writes to wp_postmeta shadow rows; $order->get_meta('_mce_phone') returns empty
  • MAJOR: Cross-section field-key collision — two fields with key=phone in billing + shipping both write to _mce_phone, last-write-wins (data loss)
  • MAJOR: Save Fields button no disable-on-click (empirical rapid-double-submit fired 5 concurrent)
  • MINOR: Silent save (no success notice despite ?saved=1 in redirect)
  • MINOR: Duplicate field-key within same section saves silently (no uniqueness validation)
  • MINOR: Import stores raw HTML/script tags unsanitized (admin display escapes, but data-at-rest is tainted)
  • MINOR: No lifecycle hooks → option + postmeta persist after uninstall
  • MINOR: No beforeunload on settings form (Amendment D empirical)
  • MINOR: Leading comma in options CSV → blank first select option overrides placeholder
  • MINOR: Customer Completed Order email blocked: admin sees "Sorry, you are not allowed" when WC deactivated while plugin active
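The cross-section key collision listed among the bonus findings can be illustrated with a small model. This is a sketch under stated assumptions, not the plugin's code: `save_flat` mimics the reported behaviour (meta key derived from the field key alone), and `save_namespaced` is a hypothetical alternative that includes the section in the key. Both function names and the sample values are invented for illustration.

```python
def save_flat(fields, posted):
    """Model of the reported behaviour: the meta key ignores the section."""
    meta = {}
    for field in fields:
        value = posted[(field["section"], field["key"])]
        # Fields sharing a key write to the same slot, so the last one wins.
        meta["_mce_" + field["key"]] = value
    return meta


def save_namespaced(fields, posted):
    """Hypothetical alternative: the section is part of the meta key."""
    meta = {}
    for field in fields:
        value = posted[(field["section"], field["key"])]
        meta["_mce_{}_{}".format(field["section"], field["key"])] = value
    return meta


fields = [
    {"section": "billing", "key": "phone"},
    {"section": "shipping", "key": "phone"},
]
posted = {
    ("billing", "phone"): "555-0100",
    ("shipping", "phone"): "555-0199",
}

flat = save_flat(fields, posted)
namespaced = save_namespaced(fields, posted)
print(flat)        # one entry: the billing value was overwritten
print(namespaced)  # two entries: both values survive
```

In the flat model the billing phone number is silently lost, which matches the last-write-wins data loss the Testers reported.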

Token consumption — aggregate

| | Manager (Opus 4.7) | Subagents (Sonnet 4.6) | Total |
|---|---|---|---|
| Agents | 1 | 7 (1 recon + 6 Testers) | 8 |
| Messages | 77 | 744 | 821 |
| Fresh input | 192 | 786 | 978 |
| Output | 111,222 | 64,357 | 175,579 |
| Cache-create 5m | 0 | 1,520,558 | 1,520,558 |
| Cache-create 1h | 167,062 | 0 | 167,062 |
| Cache-read | 48,610,363 | 70,092,652 | 118,703,015 |
| Total tokens | 48,888,839 | 72,464,353 | 121,353,192 |
| Cost | $28.76 | $27.70 | $56.45 |

97.8% cache-read. Manager cost holds similar to prior pilots; cache-create 1h (Manager-level) is shrinking as the context stabilizes.

By pricing category

| Category | Tokens | Cost | % |
|---|---|---|---|
| Cache-read | 118,703,015 | $45.33 | 80.3% |
| Cache-create 5m | 1,520,558 | $5.70 | 10.1% |
| Output | 175,579 | $3.75 | 6.6% |
| Cache-create 1h | 167,062 | $1.67 | 3.0% |
| Fresh input | 978 | $0.00 | 0.0% |

Token + duration per subagent (7 sessions)

| Session | Role | Duration | Tool uses | Msgs | Input | Output | cc5m | cr | Cost |
|---|---|---|---|---|---|---|---|---|---|
| recon | scout | 5m 26s | 57 | 75 | 81 | 8,237 | 283,575 | 6,373,306 | $3.10 |
| fields-admin-crud | Tester | 9m 04s | 78 | 112 | 118 | 10,946 | 299,085 | 11,135,772 | $4.63 |
| field-variants | Tester | 9m 21s | 79 | 114 | 120 | 13,103 | 173,542 | 9,942,513 | $3.83 |
| frontend-render-and-validation | Tester | 10m 44s | 82 | 120 | 126 | 7,366 | 203,246 | 11,872,187 | $4.43 |
| import-export | Tester | 10m 44s | 84 | 126 | 132 | 6,826 | 207,306 | 12,731,470 | $4.70 |
| cross-feature-lifecycle | Tester | 10m 46s | 74 | 100 | 106 | 9,436 | 177,950 | 9,222,920 | $3.58 |
| order-meta-and-email | Tester | 11m 00s | 66 | 97 | 103 | 8,443 | 175,854 | 8,814,484 | $3.43 |
| Totals (7) | | 67m 05s serial | 520 | 744 | 786 | 64,357 | 1,520,558 | 70,092,652 | $27.70 |

Concurrent-wave compression: 6-Tester wave wallclock ~11m (bounded by longest Tester), sequential-equivalent 61m 38s. Compression ratio: 5.6× — highest of any pilot so far. Reasons: charter surfaces were well-scoped, recon was serial but fast (5m 26s), and the WooCommerce provisioning penalty amortized across 6 parallel sites.
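The compression ratio can be re-derived from the per-Tester durations in the table. The sketch below is only a sanity check; `to_seconds` and the variable names are illustrative helpers, not part of the harness. Summing the displayed durations gives 61m 39s, one second above the reported 61m 38s, presumably because the per-Tester values are rounded.

```python
def to_seconds(duration):
    """Parse a duration such as '10m 44s' into whole seconds."""
    minutes, seconds = duration.replace("m", "").replace("s", "").split()
    return int(minutes) * 60 + int(seconds)


# Durations of the 6 wave Testers, copied from the per-subagent table.
tester_durations = {
    "fields-admin-crud": "9m 04s",
    "field-variants": "9m 21s",
    "frontend-render-and-validation": "10m 44s",
    "import-export": "10m 44s",
    "cross-feature-lifecycle": "10m 46s",
    "order-meta-and-email": "11m 00s",
}

serial_seconds = sum(to_seconds(d) for d in tester_durations.values())
# A concurrent wave is bounded by its slowest Tester.
wave_seconds = max(to_seconds(d) for d in tester_durations.values())
compression_ratio = serial_seconds / wave_seconds

print(divmod(serial_seconds, 60))   # serial-equivalent minutes and seconds
print(round(compression_ratio, 1))  # ratio rounded to one decimal
```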

Notable per-Tester observations:

  • Cost discipline is visibly stabilizing — 6 Testers spent $4.63 / $3.83 / $4.43 / $4.70 / $3.58 / $3.43 — a much tighter band than prior pilots where individual Tester costs ranged 2-3× within a wave.
  • fields-admin-crud was slowest per-token — 78 tool calls in 9m 04s; dominated by role rotation (admin → shop_manager → editor) + three-state form probes.
  • order-meta-and-email was the cheapest at $3.43 and fewest tool calls (66) — the Tester was efficient with the email-class enumeration probe.

Cost efficiency

| Denominator | Value |
|---|---|
| Total cost / planted caught (6) | $9.41 per planted bug |
| Total cost / all Problems (23) | $2.45 per Problem |
| Total cost / all PQIP items (50) | $1.13 per PQIP item |

Per-planted is up a hair vs Pilot 7 ($8.78) because recall dropped 8→6 (more misses, same cost envelope). Per-Problem is LOWEST of any pilot, reflecting the high bonus-finding yield — the harness is producing strong insight even when answer-key recall slips.
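The three cost-efficiency figures follow directly from the run totals reported above. A minimal sketch of the arithmetic, with variable names chosen here for illustration:

```python
total_cost = 56.45            # USD, Manager + subagents
planted_caught = 6            # strict recall numerator
problems = 23
pqip_items = 23 + 4 + 11 + 12  # Problems + Questions + Improvements + Praises

cost_per_planted = round(total_cost / planted_caught, 2)
cost_per_problem = round(total_cost / problems, 2)
cost_per_pqip_item = round(total_cost / pqip_items, 2)

print(cost_per_planted, cost_per_problem, cost_per_pqip_item)
```

Because the cost envelope is roughly fixed per wave, the per-planted figure moves with recall while the per-Problem figure moves with the total finding count, which is why the two metrics can diverge in the same run.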


Amendment firing matrix (21 current amendments)

| # | Amendment | Fired? | Where | Notes |
|---|---|---|---|---|
| 1 | empty-state | | 6/6 coverage notes | clean |
| 2 | absence-of-feature | ✓✓ | cross-feature-lifecycle (no hooks), fields-admin-crud (silent save, duplicate keys) | clean |
| 3 | plugin-native writes | | Testers used admin UI + UI checkout submit, not only DB seeds | clean |
| 4 | cross-feature MANDATORY | ✓✓ | cross-feature-lifecycle filed 4 seams with literal Reinforcement-8 strings | load-bearing |
| 5 | UI-path-before-claim | | no over-claims | clean |
| A | inline counters | | no fuel | correct non-fire |
| B | state variety | | field-variants exercised all 7 types + import-export sanitize-bypass probe | clean |
| C | enumerate root cause | | frontend-render-and-validation chained 3 validation bugs from one module | clean |
| D | unsaved-work | | fields-admin-crud: window.onbeforeunload null, filed as Problem with empirical evidence | drift-free (Amendment I holds) |
| E | admin two-tab | | no admin-form concurrent-edit bug | correct non-fire |
| E-ext | rapid-double-submit | | fields-admin-crud (5 concurrent Save) + import-export (5 concurrent Import) | empirically clean |
| F | view-source HTML | | frontend-render-and-validation view-sourced condition-attrs + checkout HTML | clean |
| G | DDL column types | | no custom DB tables; recorded non-applicability | correctly non-fired; 3rd generalization |
| H | keyboard-close overlay | (miss) | no overlay UI in coverage.md; custom-select widget SHOULD have fired | classification gap, see Miss 3 |
| I | empirical-probe-is-mandatory | ✓ mostly | Qs filed = 4 (up from 1 Pilot 7). Analysis: 3 legitimate, 1 drift regression (see Miss 2) | weak-positive, 1 regression; tightening proposed |
| J | blind-run file-access guardrail | ✓✓ | zero contamination across 6 sessions + recon; 2 explicit Amendment J: skipped... notes | clean first-in-wild test |
| K | default blast radius | ✓✓✓ | import-export: exact literal coverage-note "default blast radius probed: Import spares existing config? → N" | SPEC-FOR-BUG CLEAN FIRE, first real test |
| Reinf 5 | empty-state MANDATORY | | 6/6 coverage notes | clean |
| Reinf 8 | cross-feature MANDATORY | | cross-feature-lifecycle with 4 literal seam strings | clean |
| | pqip.propagate-siblings | | date-picker + conditional-logic caught by 2 charters each | clean |
| | pqip.UI-path-before-claim | | no over-claims filed | clean |

19/21 fired actively or correctly non-fired. 1 drift regression (I → Miss 2). 1 classification gap (H → Miss 3, rule is correctly specced but didn't fire because recon didn't anchor the widget shape).


Key validations

Amendment K (default blast radius) — first real test → CLEAN FIRE

import-export Tester, after enabling Amendment K in the charter, ran an empirical 2-import probe: started with 2 fields, imported a 2-field JSON, got 4 fields, imported again, got 6. Filed as major Problem with the exact literal coverage-note "default blast radius probed: Import spares existing config? → N".

This is the first pilot with real probe-fuel for Amendment K. The rule text was specific enough that the Tester converged on the exact probe ritual + exact coverage-note format the spec prescribed. No rule-text refinement needed.
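The 2→4→6 growth is what an append-style merge produces. The sketch below models it in Python, assuming the plugin stores fields as a numerically indexed list (PHP's array_merge appends and renumbers such arrays rather than replacing entries). `import_append`, `import_by_key`, and the sample field keys are hypothetical; the keyed variant is one possible alternative, not the plugin's behaviour.

```python
def import_append(existing, imported):
    """Model of array_merge on numerically indexed arrays: always appends."""
    return existing + imported


def import_by_key(existing, imported):
    """Hypothetical idempotent variant: an imported field replaces a same-key field."""
    merged = {field["key"]: field for field in existing}
    merged.update({field["key"]: field for field in imported})
    return list(merged.values())


def import_twice_probe(import_fn, existing, payload):
    """Return the field count before, after one import, and after a re-import."""
    once = import_fn(existing, payload)
    twice = import_fn(once, payload)
    return [len(existing), len(once), len(twice)]


existing = [{"key": "phone"}, {"key": "company"}]
payload = [{"key": "gift_note"}, {"key": "vat_id"}]

append_counts = import_twice_probe(import_append, existing, payload)
keyed_counts = import_twice_probe(import_by_key, existing, payload)
print(append_counts)  # grows on every import
print(keyed_counts)   # stable after the first import
```

The probe shape (import the same payload twice and compare counts) is what separates the two behaviours; a single import cannot tell them apart.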

Amendment I (empirical-probe-is-mandatory) — conclusive test → mixed, 1 drift regression

| Pilot | Total Questions filed | Drift-adjacent Q? |
|---|---|---|
| Pilot 6 | 5 | 4 drift-class |
| Pilot 7 | 1 | 0 drift (low probe-fuel caveat) |
| Pilot 8 | 4 | 1 drift (Miss 2 b6 aggregate) |

3 of 4 Questions are legitimate (intent questions paired with empirical Problems, or architecturally blocked). 1 is a regression where a Tester marked b6=Y on an aggregate AND-list without probing that the cap check is present on EACH handler.

Proposed tightening on the destructive-op AND-list (1-2 sentence addition to existing section, not a new amendment):

b6 is per-handler, never aggregate. When a destructive operation is reachable via multiple entry points (admin-post + wp_ajax_* + REST + front-controller), enumerate every entry point and verify current_user_can() on each. A plugin commonly has capability coverage on one path and a gap on an adjacent path — never score b6=Y based on one handler when another handler on the same feature is unprotected. Write as b6 per-handler: admin-post=Y, ajax=N, rest=N/A.
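A small sketch of how the per-handler rule could be scored. The `score_b6` helper and its True/False/None encoding are assumptions made for illustration; only the note format comes from the proposed rule text.

```python
from typing import Dict, Optional, Tuple


def score_b6(handlers: Dict[str, Optional[bool]]) -> Tuple[str, str]:
    """Score b6 per entry point.

    handlers maps an entry point to True (capability check present),
    False (check missing), or None (entry point does not exist).
    """
    labels = {True: "Y", False: "N", None: "N/A"}
    note = "b6 per-handler: " + ", ".join(
        "{}={}".format(name, labels[state]) for name, state in handlers.items()
    )
    existing = [state for state in handlers.values() if state is not None]
    # The overall verdict is Y only if every existing entry point is protected.
    verdict = "Y" if existing and all(existing) else "N"
    return verdict, note


# The Pilot 8 situation: admin-post has the cap check, the AJAX handler does not.
verdict, note = score_b6({"admin-post": True, "ajax": False, "rest": None})
print(verdict)
print(note)
```

Scored this way, the aggregate can no longer read Y while the AJAX path is unprotected, which is the drift that produced Miss 2.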

Amendment J (blind-run guardrail) — first in-wild test → CLEAN

Zero answer-key contamination:

  • Recon respected the guardrail
  • 6 wave Testers respected the guardrail
  • 2 Testers explicitly recorded Amendment J: skipped <path> — answer-key denylist compliance notes (the harness-level "I saw it and chose not to read it" acknowledgment)
  • Comprehensive grep for ISSUES.md | answer key | Issue [0-9]+ | planted returned zero wave contamination

The Amendment J denylist approach works. The cost it prevents: $1.68 of wasted recon + re-dispatch overhead (as happened in Pilot 7).
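The contamination check can be approximated as a pattern scan over session reports. This is a sketch, not the harness's actual check: the regex mirrors the grep alternatives listed above, while the exemption for explicit "Amendment J: skipped" compliance notes and the in-memory `reports` mapping are assumptions.

```python
import re

# Same alternatives as the grep above; case-sensitive, like grep's default.
DENYLIST = re.compile(r"ISSUES\.md|answer key|Issue [0-9]+|planted")
# Assumption: explicit compliance notes are acknowledgements, not contamination.
COMPLIANCE_NOTE = re.compile(r"Amendment J: skipped")


def find_contamination(reports):
    """Return {session: [offending lines]} for lines that match the denylist."""
    hits = {}
    for session, text in reports.items():
        offending = [
            line
            for line in text.splitlines()
            if DENYLIST.search(line) and not COMPLIANCE_NOTE.search(line)
        ]
        if offending:
            hits[session] = offending
    return hits


# Invented sample reports for illustration only.
reports = {
    "import-export": "Imported the same JSON twice.\nAmendment J: skipped ISSUES.md",
    "field-variants": "Probed all 7 field types.",
    "leaky-session": "Per Issue 5 in the answer key, import doubles.",
}
contaminated = find_contamination(reports)
print(sorted(contaminated))
```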

Amendments G + H — generalization holds (3rd zero-fuel pilot for G, 2nd for H)

Both correctly non-fired with explicit non-applicability notes. G on Pilot 6 + 7 + 8; H on Pilot 7 + 8 (one classification gap on this pilot, see Miss 3 — but the rule itself is correctly specced).

WooCommerce skill 2nd validation — load-bearing

The skills/woocommerce-exploration/SKILL.md fired probes that have no analog in non-WC pilots:

  • HPOS compatibility: order-meta-and-email probed both wp_postmeta and wc_orders_meta tables after an order submission — a probe class specific to modern WooCommerce
  • Email class coverage: enumerated the 7 WC_Email_* classes and identified that Customer_Completed_Order is NOT in the plugin's whitelist
  • Block-vs-classic checkout: every frontend-render Tester checked both the block /checkout/ and a classic shortcode /classic-checkout/ (same class of bug that Pilot 5 caught on magellan-pay, independently rediscovered here by 2 Testers)

The skill is doing its job across two very different WC plugin shapes.

Cross-plugin pattern — block-checkout incompatibility

Observed on:

  • Pilot 5 magellan-pay Issue 1 (CRITICAL, caught)
  • Pilot 8 magellan-checkout-editor — CRITICAL bonus finding (caught independently by 2 Testers)

Two plugins, same class: hooks only woocommerce_checkout_fields (classic API), invisible on WC 10+ default block checkout. Recommendation: add a one-paragraph note in skills/woocommerce-exploration/SKILL.md codifying the probe — "for any WC plugin that hooks woocommerce_checkout_fields or legacy classic-checkout hooks, probe the block checkout visibility as a mandatory cross-version compat check." Not a full amendment, just a skill-file clarification.


5 proposed changes (review & ship as you prefer)

1. Amendment L (new) — Interactive widget event-binding probe

Targets Issues 2 (drag-badge) + 3 (conditional change-event). For every plugin-declared client-side interactive widget (drag-reorder, conditional show/hide, live validation, tab-switch, collapse/expand, inline edit, auto-save), execute the documented interaction and empirically verify the widget re-computes or re-renders. Coverage-note format: interactive widget probed: <widget> reacts to <trigger>? → <Y/N>.

Narrow but generalizable — fits as subsection under "Probe what the feature produces, not just what it does".

2. b6-per-handler tightening to destructive-op AND-list

Targets Miss 2 (b6 aggregate drift). 1-2 sentence addition to the destructive-op AND-list section. See text above.

3. Amendment H classification hint on recon/static-analysis

Targets Miss 3 (custom-select widget not recognized as overlay). Add one sentence to static-analysis and recon anchor lists: grep for data-*-custom-*, replaceWith, .hide(); $.append patterns that replace native form controls — each is a candidate Amendment H target.

4. Artifact AND-list a7 — round-trip identity

Targets Miss 4 (HTML entity export encoding). Add a7 to the existing artifact AND-list: for features that both produce and consume the artifact (export × import), probe round-trip identity with semantically-interesting values ("Tom & Jerry", "<script>", multi-byte). Mismatches indicate encoding boundaries.
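A round-trip identity probe could look like the sketch below. The exporter that HTML-escapes labels is a stand-in for whatever encoding step the plugin actually applies, which this report does not specify; `export_escaped`, `export_faithful`, `import_fields`, and `round_trip_mismatches` are all hypothetical names.

```python
import html
import json


def export_escaped(fields):
    """Stand-in for a lossy exporter: HTML-escapes labels before serializing."""
    return json.dumps([dict(f, label=html.escape(f["label"])) for f in fields])


def export_faithful(fields):
    """Exporter that serializes values unchanged."""
    return json.dumps(fields)


def import_fields(payload):
    """Importer that takes the payload at face value (no unescaping)."""
    return json.loads(payload)


def round_trip_mismatches(fields, export_fn, import_fn):
    """Return (original, restored) label pairs that changed across export and import."""
    restored = import_fn(export_fn(fields))
    return [
        (before["label"], after["label"])
        for before, after in zip(fields, restored)
        if before != after
    ]


# Semantically interesting values: ampersand, markup, multi-byte text.
fields = [
    {"key": "a", "label": "Tom & Jerry"},
    {"key": "b", "label": "<script>"},
    {"key": "c", "label": "café ☕"},
]

lossy = round_trip_mismatches(fields, export_escaped, import_fields)
faithful = round_trip_mismatches(fields, export_faithful, import_fields)
print(lossy)
print(faithful)
```

Plain labels survive both exporters, so the probe only has signal when the seed values contain characters an encoding boundary would touch.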

5. WooCommerce skill — block-vs-classic checkout paragraph

Targets the cross-plugin pattern observed on Pilot 5 + Pilot 8. Not a full amendment, just a skill-file clarification in skills/woocommerce-exploration/SKILL.md.


Remaining plugin set

Only magellan-theme left from the test-plugin mission set — a block theme. First pilot exercising skills/block-theme-exploration/SKILL.md. Would test:

  • Whether existing amendments hold on theme-vs-plugin kind
  • Whether the theme-specific skill produces probes analogous to ecosystem-exploration skills
  • Whether kind: theme branching in the harness flows cleanly through Phase 1-5

Cross-pilot state

8 pilots, 5 reruns, 21 amendments, 0 regressions in amendment rule text. Two plugins in the WooCommerce ecosystem validated across three runs (pay, checkout-editor: 2 blind + Pilot 5 rerun). Sonnet + amendments continues to hold. Amendment K has clean fuel validation. Amendment I has weak-positive conclusive evidence with one drift regression that the proposed b6-per-handler tightening resolves.


Artifacts

  • Final report: runs/2026-04-24T14-16-15_magellan-checkout-editor/final-report.md
  • Escape analysis: runs/2026-04-24T14-16-15_magellan-checkout-editor/escape-analysis.md
  • Token usage: runs/2026-04-24T14-16-15_magellan-checkout-editor/token-usage.json
  • Manifest: runs/2026-04-24T14-16-15_magellan-checkout-editor/manifest.json
  • 6 session reports: runs/2026-04-24T14-16-15_magellan-checkout-editor/sessions/<slug>/report.json
  • Static analysis: runs/2026-04-24T14-16-15_magellan-checkout-editor/static-analysis.md
  • Recon: runs/2026-04-24T14-16-15_magellan-checkout-editor/recon.md
  • Coverage plan: runs/2026-04-24T14-16-15_magellan-checkout-editor/coverage.md
  • Pilot 5 (WC #1) comparison: https://gist.github.com/alopezari/7dd744c19c0ad21b2de8c630513967f6