- 3 hypotheses silently skipped / deprioritized (progress-indicator empirical probe ×2, concurrent-op b7)
- 1 charter never executed (breadth-tour: status=pending, no session directory)
- 0 surfaces from recon not addressed by any charter
- 1 AND-list item scored from source inspection without empirical probe (b4 — transaction check; b3 — dry-run check; b5 — undo check)
- 1 round-trip probe gap (restore × pre-snapshot: empirical probe incomplete — only source-inspection used)
- 3 Questions filed with inadequate empirical probe documentation
- 2 forcing-function strings missing from sessions
- 0 hypotheses silently skipped (all hypotheses have verdicts)
- 1 recon-flagged surface insufficiently probed (b6 empirical probe deferred, question filed without empirical evidence)
- 1 AND-list item scored via source-only instead of empirical (b6 across both restore paths)
- 3 round-trip / compositional probes missing (export×import, Pages a3, cron-deactivation lifecycle)
- 1 Question filed only from source inspection with no empirical probe attempt (restore b6)
- 2 forcing-function strings missing (export-artifact-andlist missing scale-sensitive c2 fallback literal; concurrent-trigger-seam missing the required literal form)
- 3 hypotheses silently skipped (CT-2, CT-3, SE-4 never empirically probed)
- 6 surfaces from recon/coverage not addressed (F6 plugin lifecycle — breadth-tour skipped entirely)
- 0 AND-list items scored on aggregate when per-path was needed
- 1 round-trip probe missing (export × re-import — SE-4 deprioritized without empirical discharge)
- 2 Questions that look like Amendment I drift (b4/b7 rollback from source; SCH-5 email from source)
- Forcing-function strings missing from 3 sessions
Pass 2 reassesses the run after the supplementary breadth-tour Tester completed. Six valid session reports now exist. Pass 1 flagged 2 high-severity gaps. Pass 2 verdicts: Gap 1 is partially closed and Gap 2 is still open; the breadth-tour report also exposes new gaps.
- Gap 1 (breadth-tour unowned probes): PARTIALLY CLOSED. 5 of 8 BT hypotheses now have empirical/source evidence on file. 3 remain unprobed: BT4 (true zero-content export, admitted by Tester), BT5 (cron next-run timestamp), BT6 (deactivation cron cleanup). BT5+BT6 explicitly deprioritized for budget (turns_used 30/30).
- Gap 2 (Backup × Restore round-trip): STILL OPEN. No charter, including the supplementary breadth-tour, composed create→restore→verify-data-integrity. The marquee feature loop remains empirically unverified.
- NEW: BT3 Amendment I drift. Breadth-tour filed the upload double-submit-protection finding as confirmed-bug from source inspection
Run ID: 2026-04-28T11-46-58_magellan-checkout-editor
Plugin: magellan-checkout-editor v1.0.0 — WooCommerce extension for custom checkout fields (drag+drop, 7 field types, conditional logic, validation, order-meta, email injection, JSON import/export)
Ecosystem: woocommerce
Stack: Sonnet 4.6 Manager + Sonnet 4.6 Planner × 2 + Haiku 4.5 Testers × 5 (recon also Haiku)
Driver: playwright-cli-headless (Playwright CLI, no MCP — project default). 1 charter overrode to chrome-devtools-headless (Chrome DevTools MCP). See driver section below.
Dispatch: 5 charters in one concurrent wave (2 critical + 3 high; 2 medium pending). Playwright CLI = true parallel (separate processes).
Wallclock: ~22 min end-to-end (Phase 0–5 including recon + static analysis + charter gen + 5 concurrent Testers)
Date: 2026-04-28
Run ID: 2026-04-28T11-35-44_magellan-pay
Plugin: Magellan Pay v1.0.0 — WooCommerce sandbox payment gateway with transaction logging and refund support
Goal: Full-surface evaluation of the Sonnet Manager + Sonnet Planner + Haiku Tester stack on magellan-pay (second plugin in the Pilot 17 series). Compare recall vs Pilot 17c (same stack, same plugin, 6/10 recall) after amendments from Pilot 17c were shipped.
Stack: Sonnet 4.6 Manager · Sonnet 4.6 Planner · Haiku 4.5 Testers
Run ID: 2026-04-28T08-59-25_magellan-pay
Plugin: magellan-pay v1.0.0 (WooCommerce sandbox payment gateway, 10 planted bugs)
Driver: playwright-cli-headless (2 charters re-dispatched via chrome-devtools-headless; KI-001)
Run ID: 2026-04-28T06-31-56_magellan-backups
Date: 2026-04-28
Plugin: magellan-backups v1.0.0 (blind greybox — ISSUES.md stripped before run)
Goal: confirm cost-floor projection with all Opus off. Full model stack swap vs Pilot 15 baseline.
Magellan Pilot 16 — magellan-backups (recon-only counterfactual: what does static analysis uniquely contribute?)
Run ID: 2026-04-27T18-25-45_magellan-backups
Plugin: magellan-backups (same regression-test fixture; ISSUES.md stripped — blind greybox)
Kind: plugin
Ecosystem: core
Driver: playwright-cli-headed (matched to Pilot 15 — controlled comparison)
Manager model: Opus 4.7
Tester / planner model: Sonnet 4.6 (Testers, recon) + Opus 4.7 (planner Phase 3)
Wallclock: 43 min (18:25:45Z → 19:08:18Z)
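The 43 min wallclock can be cross-checked against the two timestamps above. A minimal sketch, assuming both timestamps are UTC and fall on the same calendar day; `elapsed_minutes` is an illustrative helper, not part of the pilot tooling:

```python
from datetime import datetime


def elapsed_minutes(start: str, end: str) -> float:
    """Minutes between two HH:MM:SS timestamps on the same day (assumption)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


# Timestamps taken from the Wallclock line, with the trailing "Z" dropped.
minutes = elapsed_minutes("18:25:45", "19:08:18")
print(f"{minutes:.2f} min")
```

The exact difference is 42 min 33 s, so the reported 43 min is the value rounded to the nearest minute.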
Magellan Pilot 14 — magellan-backups (first playwright-cli-headed pilot, 10/10 recall, but Manager cost regressed via cache-creation jump)
Run ID: 2026-04-27T14-14-00_magellan-backups
Plugin: magellan-backups (same regression-test plugin; ISSUES.md stripped — blind greybox)
Kind: plugin
Ecosystem: core
Driver: Playwright CLI (headed) — first pilot of the third browser-driver tier (custom spec-file approach; superseded by f395200 which switched to Microsoft's official @playwright/cli)
Wallclock: 33 min (best of the series)
Total cost: $53.06 (vs Pilot 13 $38.47 — +38%)
Recall: 10/10 (first clean recall under the new architecture)
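The +38% cost delta against Pilot 13 follows directly from the two totals above. A minimal check; `percent_change` is an illustrative helper, not part of the pilot tooling:

```python
def percent_change(baseline: float, current: float) -> float:
    """Relative change from baseline to current, in percent."""
    return (current - baseline) / baseline * 100.0


# Totals from the report: Pilot 13 = $38.47, Pilot 14 = $53.06.
delta = percent_change(38.47, 53.06)
print(f"{delta:+.0f}%")
```

The unrounded change is roughly 37.9%, which the report rounds to +38%.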