@alopezari
Created April 24, 2026 11:24

Magellan Pilot 5 — magellan-pay (first WooCommerce ecosystem pilot, 8/10 blind Sonnet — highest blind-run recall in project history)

Magellan pilot 5 — magellan-pay (first WooCommerce ecosystem pilot, blind Sonnet run)

  • Run ID: 2026-04-24T10-29-21_magellan-pay
  • Plugin: magellan-pay v1.0.0 (sandbox WooCommerce payment gateway + transaction log + refund support)
  • Ecosystem: woocommerce (first pilot exercising skills/woocommerce-exploration/SKILL.md)
  • Driver: Chrome DevTools MCP with --experimental-page-id-routing
  • Dispatch: 6 charters in one concurrent wave (Sonnet-default Testers, blind greybox, ISSUES.md stripped)
  • Wallclock (wave): ~34 min for 6 concurrent Testers; ~60 min total including recon + classifier + aggregation


TL;DR — a new kind of win

Recall 8/10 on a blind pilot against a plugin shape the harness has never seen. Highest blind-run recall in project history. No rerun needed to get here. Previously:

| Pilot | Shape | Mode | Recall (original, then rerun after amendments) |
|---|---|---|---|
| 1 (backups) | admin artifact-producer | Opus blind | 10/10 |
| 2 (contact-forms) | form + email | Opus blind | 7/10, then 10/10 |
| 3 (members) | role / restriction / CRUD | Opus blind | 5/10, then 10/10 |
| 4 (seo-toolkit) | metadata / sitemap / rendering | Sonnet blind | 4/10, then 10/10 |
| 5 (pay) | WC payment gateway | Sonnet blind | 8/10 (no rerun) |

The 15 amendments from pilots 1-4 are compounding: one new plugin shape, one new ecosystem, blind, 8/10 first try. 12 of the 15 amendments fired actively; 2 correctly didn't fire (no fuel); 1 partial-gap (Amendment E admin two-tab didn't cover the adjacent customer-submit surface — see miss analysis).


Reliability — PQIP totals

8/10 planted + 10 bonus findings. 18 Problems, 5 Questions, 12 Improvements, 14 Praises.

Severity: 3 critical, 11 major, 4 minor, 0 trivial.

Per-charter PQIP

| Charter | Priority | P | Q | I | ! (Praise) | Duration | Tool uses | Planted caught |
|---|---|---|---|---|---|---|---|---|
| gateway-settings | critical | 4 | 1 | 1 | 2 | 11m 25s | 60 | #1, #4, #6, #7 (Amendment F-driven) |
| checkout-payment-flow | critical | 6 | 1 | 3 | 3 | 19m 49s | 84 | #8, #9 (+ 4 bonus criticals/majors: block checkout, Luhn, expiry, CVC) |
| refund-flow | critical | 2 | 1 | 2 | 2 | 19m 01s | 84 | #3 |
| transaction-log-scale | high | 2 | 0 | 2 | 3 | 15m 36s | 55 | #2 + bonus missing-index |
| cross-feature-stock-refund | high | 3 | 1 | 3 | 1 | 34m 22s | 145 | #8 (double-caught) + bonus disabled-gateway-still-processes |
| transaction-log-admin | medium | 1 | 1 | 1 | 3 | 17m 35s | 73 | + bonus critical (stored XSS via status class attr) |
| Totals | | 18 | 5 | 12 | 14 | 117m 48s serial | 501 | 8/10 |

Before/after against the 10 planted issues

| # | Planted issue | Verdict | Amendment that fired |
|---|---|---|---|
| 1 | No test mode indicator on checkout title | caught-bundled | Amendment 2 (absence-of-feature, the stronger root-cause "testmode is fully dead") |
| 2 | Transaction log has no pagination | caught-exact | scale probe (c2 seed 10k rows) |
| 3 | Refund button works for wrong gateway | caught-exact | destructive-op AND-list + Amendment 4 cross-feature |
| 4 | Test mode doesn't separate test/live keys | caught-bundled | Amendment 2 (api_key/secret never consumed at all) |
| 5 | FLOAT vs DECIMAL for money | missed | no existing rule covers DDL column-type semantics |
| 6 | Empty API key saves silently in live mode | caught-bundled | Amendment 2 (subset of "keys never consumed") |
| 7 | API keys visible in page source as plain text | caught-exact | Amendment F (view-source) |
| 8 | Stock reduced before payment confirmation | caught-exact (x2) | Amendment 4 (cross-feature MANDATORY) + Amendment C (enumerate root cause) |
| 9 | Zero-total orders sent to gateway | caught-exact | Amendment C (sibling symptom of Issue 8 root cause) |
| 10 | Double-click creates duplicate orders | missed | Amendment E (two-tab concurrent) covers admin forms only, not customer-submit |

2 misses, 1 amendment family each — proposed as Amendment G (DDL types) + extension of Amendment E (rapid-double-submit). See "Next steps" below.
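The 8/10 recall figure can be re-derived from the verdict column. The sketch below assumes that both caught-exact and caught-bundled count as caught, which is what the 8/10 total implies; the dictionary is transcribed by hand from the table, not read from any harness output.

```python
from collections import Counter

# Verdicts transcribed from the planted-issue table (issue number -> verdict).
# Issue 8 was caught twice; it still counts as one planted issue.
verdicts = {
    1: "caught-bundled", 2: "caught-exact", 3: "caught-exact",
    4: "caught-bundled", 5: "missed", 6: "caught-bundled",
    7: "caught-exact", 8: "caught-exact", 9: "caught-exact",
    10: "missed",
}

tally = Counter(verdicts.values())
# Assumption: bundled and exact catches both count toward recall.
caught = sum(n for verdict, n in tally.items() if verdict.startswith("caught"))
recall = caught / len(verdicts)
missed = sorted(issue for issue, v in verdicts.items() if v == "missed")

print(f"caught={caught}/{len(verdicts)} recall={recall:.0%} missed={missed}")
```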


Token consumption — aggregate

| | Manager (Opus 4.7) | Subagents (Sonnet 4.6) | Total |
|---|---|---|---|
| Agents | 1 | 7 (1 recon + 6 Testers) | 8 |
| Messages | 126 | 777 | 903 |
| Fresh input | 374 | 821 | 1,195 |
| Output | 142,510 | 115,035 | 257,545 |
| Cache-create 5m | 0 | 1,704,015 | 1,704,015 |
| Cache-create 1h | 579,879 | 0 | 579,879 |
| Cache-read | 30,240,716 | 71,101,881 | 101,342,597 |
| Total tokens | 30,963,479 | 72,921,752 | 103,885,231 |
| Cost | $24.48 | $29.45 | $53.93 |

97.6% of all tokens are cache-read. The prompt-caching discipline (1h cache for Manager, 5m cache for per-Tester context) remains the dominant cost-saving mechanism.
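As a quick arithmetic check, the cache-read share follows from the Total column of the aggregate table. This is only a sketch over the published totals; the dictionary keys are illustrative names, not fields from token-usage.json.

```python
# Token totals from the aggregate table (Total column).
tokens = {
    "fresh_input": 1_195,
    "output": 257_545,
    "cache_create_5m": 1_704_015,
    "cache_create_1h": 579_879,
    "cache_read": 101_342_597,
}

total_tokens = sum(tokens.values())
cache_read_share = tokens["cache_read"] / total_tokens

print(f"total={total_tokens:,} cache-read share={cache_read_share:.1%}")
```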

By pricing category

| Category | Tokens | Cost | % of cost |
|---|---|---|---|
| Cache-read | 101,342,597 | $36.45 | 67.6% |
| Cache-create 5m | 1,704,015 | $6.39 | 11.9% |
| Cache-create 1h | 579,879 | $5.80 | 10.7% |
| Output | 257,545 | $5.29 | 9.8% |
| Fresh input | 1,195 | $0.00 | 0.0% |

Token + duration per subagent (7 sessions: recon + 6 Testers)

| Session | Role | Duration | Tool uses | Msgs | Input | Output | Cache-create 5m (cc5m) | Cache-read (cr) | Cost |
|---|---|---|---|---|---|---|---|---|---|
| recon | scout | 3m 16s | 36 | 51 | 57 | 6,760 | 221,734 | 3,082,232 | $1.86 |
| gateway-settings | Tester | 11m 25s | 60 | 92 | 98 | 12,063 | 293,797 | 7,974,676 | $3.68 |
| transaction-log-scale | Tester | 15m 36s | 55 | 80 | 86 | 14,036 | 192,861 | 6,472,804 | $2.88 |
| transaction-log-admin | Tester | 17m 35s | 73 | 103 | 109 | 13,213 | 204,583 | 9,484,167 | $3.81 |
| refund-flow | Tester | 19m 01s | 84 | 124 | 130 | 17,110 | 246,250 | 12,911,194 | $5.05 |
| checkout-payment-flow | Tester | 19m 49s | 84 | 126 | 132 | 18,564 | 203,789 | 11,531,464 | $4.50 |
| cross-feature-stock-refund | Tester | 34m 22s | 145 | 201 | 209 | 33,289 | 341,001 | 19,645,344 | $7.67 |
| Totals (7) | | 121m 04s serial | 537 | 777 | 821 | 115,035 | 1,704,015 | 71,101,881 | $29.45 |

Concurrent-wave compression: 6 Tester charters dispatched at roughly the same time; total sequential-equivalent = 117m 48s, wallclock of the wave = ~34m 22s (bounded by the longest Tester, cross-feature-stock-refund). Compression ratio: 3.4×.
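The compression ratio is simple to reproduce from the six Tester durations. The sketch below assumes all Testers start together, so the wave's wallclock is approximated by the longest Tester; the helper name is illustrative.

```python
# Tester durations from the per-charter table, as (minutes, seconds).
durations = {
    "gateway-settings": (11, 25),
    "checkout-payment-flow": (19, 49),
    "refund-flow": (19, 1),
    "transaction-log-scale": (15, 36),
    "cross-feature-stock-refund": (34, 22),
    "transaction-log-admin": (17, 35),
}

def to_seconds(minutes: int, seconds: int) -> int:
    return minutes * 60 + seconds

seconds = {name: to_seconds(*d) for name, d in durations.items()}

serial_seconds = sum(seconds.values())      # sequential-equivalent time
# Assumption: the wave ends when the slowest Tester finishes.
wave_seconds = max(seconds.values())
slowest = max(seconds, key=seconds.get)
compression = serial_seconds / wave_seconds

print(f"serial={serial_seconds // 60}m {serial_seconds % 60:02d}s "
      f"wave={wave_seconds // 60}m {wave_seconds % 60:02d}s ({slowest}) "
      f"compression={compression:.1f}x")
```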

Per-charter observations:

  • cross-feature-stock-refund is the outlier at 34m / 145 tool uses — three cross-feature seams required separate flows (payment × stock, refund × stock, settings × payment), each with its own setup. High tool-use count reflects the charter's ambition, not inefficiency; it caught the critical stock-leak bug (double-catch with checkout-payment-flow) plus the bonus "disabled-gateway-still-processes-payments" critical.
  • gateway-settings is the cheapest Tester at $3.68 — settings form + view-source + source-grep work, minimal browser state. Finished first in the wave.
  • Tool-call serialization is the wave's bottleneck. Chrome DevTools MCP serializes tool handlers through a server mutex, so model reasoning parallelizes while tool dispatch queues. Even with 6 Testers making 60-145 tool calls each, the queue kept all 6 productive throughout, and compression held at ~3.4×.

Cost efficiency

| Denominator | Value |
|---|---|
| Total cost / planted caught | $6.74 per planted bug (8 caught) |
| Total cost / all Problems filed | $3.00 per Problem (18 filed) |
| Total cost / total PQIP items | $1.10 per PQIP item (49 items total) |

Compared with the Pilot 4 rerun ($4.12 per planted bug productive, $5.50 per planted bug actual), Pilot 5 is slightly higher per planted bug at $6.74, because the ecosystem (WC) required more per-Tester setup work (baseline plugin, product creation, coming-soon mode, etc.) and the cross-feature-stock-refund charter ran long. The bonus-findings yield is strong: 10 bonus Problems bring the cost per Problem down to $3.00 (Pilot 4: $2.04), while the planted-only figure of $6.74 reflects the harder plugin shape and the first-of-kind ecosystem exercise.
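The three per-unit figures in the cost-efficiency table follow from the run's total cost and the PQIP counts reported earlier. A minimal sketch, with variable names chosen here for illustration:

```python
total_cost = 53.93          # Manager + subagents, in USD
planted_caught = 8
pqip = {"problems": 18, "questions": 5, "improvements": 12, "praises": 14}

per_planted = total_cost / planted_caught
per_problem = total_cost / pqip["problems"]
per_item = total_cost / sum(pqip.values())

print(f"${per_planted:.2f} per planted bug, "
      f"${per_problem:.2f} per Problem, "
      f"${per_item:.2f} per PQIP item")
```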

No wasted cost this run (no laptop-sleep retry). Clean single-wave execution.


Amendment firing matrix

12 of 15 existing amendments fired actively. 2 correctly did not fire (no fuel). 1 (Amendment E) had a partial gap.

| Amendment | Fired? | Load-bearing? | Notes |
|---|---|---|---|
| 1. Empty / one-item / full states | | | Mandatory coverage-note string in all 6 sessions |
| 2. Absence-of-feature | ✓✓ | Yes | 3 major findings in gateway-settings (testmode, api_key/secret, uninstall hook) |
| 3. Plugin-native writes over synthetic seeds | | | Used real checkout/refund paths; direct DB seed only for scale charter |
| 4. Cross-feature interaction (MANDATORY) | ✓✓✓ | Yes | Caught Issue 8 stock-before-payment, the single highest-impact bug. Surfaced 2 bonus criticals. |
| 5. UI-path before "missing" claim | | | No empirically-wrong claims filed |
| A. Inline counters | n/a | | No live-counter UI (correct non-activation) |
| B. Seed state variety | | | transaction-log-admin seeded each status + XSS probe row |
| C. Enumerate root-cause surface | ✓✓ | Yes | Tester chained 3 bugs from wc_reduce_stock_levels-before-API-call root cause |
| D. Unsaved-work protection | ✓ (negative) | | Probed; WC core provides it; no Problem filed (correct) |
| E. Two-tab concurrent (MANDATORY) | partial / gap | | Fired on admin forms; didn't cover customer-submit (Issue 10 missed; propose extension) |
| F. View-source HTML | ✓✓ | Yes | Caught Issue 7 (API Secret plaintext): DOM would have normalized; raw HTML fetch was the difference |
| Reinforce 5 empty-state | | | 6/6 sessions carry the mandated coverage-note string |
| Reinforce 8 cross-feature | | | 5/6 sessions carry the mandated string |
| pqip.propagate-sibling-features | | | Luhn → expiry → CVC all filed together; refund stock-restore-absence flagged as sibling to stock-leak |
| pqip.UI-path-before-claim | | | No over-claims filed |

First-of-kind validations

Three "first time the harness does this" things worked:

  1. woocommerce-exploration skill activation — recon identified block-checkout incompatibility, wc_reduce_stock_levels semantics, restock_items flag, HPOS considerations before Tester dispatch. Visible in Tester outputs (e.g., _order_stock_reduced meta analysis, direct process_refund call through wc_get_order).
  2. Concurrent wave on a WC-ecosystem plugin — 6 Testers each provisioning Studio + installing WooCommerce + activating magellan-pay + executing charter. Zero provision-time failures. Compression 3.4×.
  3. Amendment 4 (cross-feature MANDATORY) working as designed: the dedicated cross-feature-stock-refund charter is the forcing function that caught Issue 8, the highest-severity bug. This is not a coincidence; it is exactly what the MANDATORY reinforcement shipped in the Pilot 4 rerun was designed to produce.

Next steps — 2 amendment proposals + Pilot 6

Amendment G — inspect DDL column types for value semantics

Miss 1 (Issue 5 FLOAT for money) is a generalizable bug class — the harness's DB-writing anchor probes injection + insert-return but not column-type appropriateness.

Proposed rule (draft):

Inspect column types for value semantics, not just column existence. When a plugin's CREATE TABLE / dbDelta DDL stores money, time, identifiers, or any data with strict correctness requirements, verify the column type matches the semantic. Money → DECIMAL(p,s) never FLOAT/DOUBLE (IEEE-754 rounding corrupts cents). Timestamps → DATETIME/BIGINT for epoch, never INT (Y2038). IDs that reference external systems → check type matches the upstream (UUID vs BIGINT vs VARCHAR). Reproducer: write a known value (e.g. 19.99), read raw via $wpdb->get_var, compare bytewise.

Targets Miss 1 as a class. Ships to skills/tester-mindset/SKILL.md in the DB-writing anchor section.
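To make the bug class concrete, here is a standalone illustration of why the draft rule singles out FLOAT for money and INT for timestamps. It is not the harness probe (which would read back through $wpdb); it only assumes that a MySQL FLOAT column is a 4-byte IEEE-754 single, and mimics that storage with Python's struct module. The function name and the 10,000-row figure are illustrative.

```python
import struct
from datetime import datetime, timezone
from decimal import Decimal

def float32_round_trip(value: float) -> float:
    """Store a value in 4 bytes (IEEE-754 single) and read it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

# Money: write a known value, read it back, compare exactly.
stored_float = float32_round_trip(19.99)
stored_decimal = Decimal("19.99")           # what a DECIMAL(10,2) column keeps

print(repr(stored_float))                   # no longer exactly 19.99
print(stored_decimal)

# The per-row error compounds when amounts are summed.
float_total = sum(float32_round_trip(19.99) for _ in range(10_000))
decimal_total = stored_decimal * 10_000
print(float_total, decimal_total)

# Timestamps: a signed 32-bit INT epoch column tops out in January 2038.
last_int32_second = datetime.fromtimestamp(2**31 - 1, tz=timezone.utc)
print(last_int32_second.isoformat())
```

A Tester applying the rule would run the equivalent write-then-read against the plugin's own table rather than an in-memory round trip.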

Amendment E extension — rapid-double-submit on customer-facing write actions

Miss 2 (Issue 10 double-click duplicate orders) is an adjacent shape to the existing Amendment E's admin-form two-tab probe. Same concurrency family, different surface.

Proposed extension:

Rapid-double-submit probe on user-facing write actions. Any user-facing form that performs a database write or external side effect on submit (checkout, place order, register, subscribe, post comment, contact form) must be probed for rapid-double-click idempotency. Reproducer: programmatically click the submit button twice within 200ms (await page.click(sel); await page.click(sel);) or fire two form.requestSubmit() calls. Check for disabled-on-click state, nonce-based idempotency key, or server-side duplicate-write detection.

Targets Miss 2 as a class. Extends existing Amendment E block.
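The server-side half of this probe, duplicate-write detection, can be sketched without a browser. The model below is hypothetical throughout: OrderStore, its two handlers, and the idempotency key are stand-ins for whatever the plugin under test does on submit, and two threads stand in for the two requests a rapid double-click produces.

```python
import threading

class OrderStore:
    """Hypothetical stand-in for a checkout backend."""

    def __init__(self):
        self.orders = []
        self._seen = {}                 # idempotency key -> order id
        self._lock = threading.Lock()

    def place_order_naive(self, cart_id):
        # Every request creates an order, so a double-click creates two.
        with self._lock:
            self.orders.append(cart_id)
            return len(self.orders)

    def place_order_idempotent(self, cart_id, key):
        # A repeated key returns the original order instead of writing again.
        with self._lock:
            if key in self._seen:
                return self._seen[key]
            self.orders.append(cart_id)
            self._seen[key] = len(self.orders)
            return self._seen[key]

def double_submit(handler, *args):
    """Fire the same submit twice concurrently, like a rapid double-click."""
    results = []

    def submit():
        results.append(handler(*args))

    threads = [threading.Thread(target=submit) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

naive = OrderStore()
naive_ids = double_submit(naive.place_order_naive, "cart-42")

guarded = OrderStore()
guarded_ids = double_submit(guarded.place_order_idempotent, "cart-42", "key-abc")

print(len(naive.orders), sorted(naive_ids))    # duplicate order written
print(len(guarded.orders), guarded_ids)        # single order, same id twice
```

The probe's pass condition is the second shape: two submits, one write, and both responses pointing at the same order.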

Pilot 6 — magellan-gallery

Recommended because file-handling (upload MIME, path traversal, image-size variants, visibility) is a well-understood bug class currently absent from the harness history. Will test whether proposed Amendment G's column-type rule generalizes to file-metadata storage. Defer magellan-speed to a later pilot — SFDPOT Time deserves its own focused run.


Loop shape (unchanged)

pilot → /escape-analysis → human reviews proposals
     → commit amendments → re-run → /escape-analysis (validation pass)
     → confirm 10/10 → log to docs/harness-retrospectives.md → next pilot

For Pilot 5 specifically, the rerun is optional — 8/10 blind is already at the "converts to 10 with known gaps filled" threshold. The two proposed amendments (G + E-extension) can either ship before a rerun to validate compounding, or ship together with Pilot 6 for a cleaner attribution test.


Artifacts

  • Final report: runs/2026-04-24T10-29-21_magellan-pay/final-report.md
  • Escape analysis: runs/2026-04-24T10-29-21_magellan-pay/escape-analysis.md
  • Token usage (full detail): runs/2026-04-24T10-29-21_magellan-pay/token-usage.json
  • Manifest: runs/2026-04-24T10-29-21_magellan-pay/manifest.json
  • 6 session reports: runs/2026-04-24T10-29-21_magellan-pay/sessions/<slug>/report.json
  • Static analysis: runs/2026-04-24T10-29-21_magellan-pay/static-analysis.md
  • Recon: runs/2026-04-24T10-29-21_magellan-pay/recon.md
  • Coverage plan: runs/2026-04-24T10-29-21_magellan-pay/coverage.md