Skip to content

Instantly share code, notes, and snippets.

@alopezari
Created April 24, 2026 16:07
Show Gist options
  • Select an option

  • Save alopezari/03d7e5fcc5236bfdaaca101acbdff190 to your computer and use it in GitHub Desktop.

Select an option

Save alopezari/03d7e5fcc5236bfdaaca101acbdff190 to your computer and use it in GitHub Desktop.
Magellan Pilot 9 — magellan-theme (first block-theme pilot; first run of Improvement 1 + 2; ecosystem matrix now feature-complete)

Magellan Pilot 9 — magellan-theme (first block-theme pilot; first run of Improvements 1 + 2)

Run ID: 2026-04-24T15-15-49_magellan-theme Plugin: magellan-theme v1.0.0 — block theme with 7 templates + 3 parts + 5 patterns + 2 style variations + 3 registered block styles Kind: theme (first kind: theme pilot) Ecosystem: block-theme (first pilot exercising skills/block-theme-exploration/SKILL.md) Driver: Chrome DevTools MCP Dispatch: 6 charters + 1 recon concurrent wave (Sonnet-default, blind greybox) Wallclock: ~30 min end-to-end


TL;DR — feature-complete across the evaluation matrix; meta-review calibration signal

  • Recall 4/10 strict. Lower than prior pilots, but 17 bonus findings beyond the answer key including stronger-than-planted Amendment-H root-cause analysis (mobile nav overlay NEVER DISPLAYS due to is-menu-open vs has-modal-open class mismatch), Dark-variation search input invisibility (critical), and 2 pattern block-validation errors from Pilot 8's charter-editor-quality theme.
  • First run of Improvement 1 (hypotheses_status array): 4/6 sessions compliant, 2/6 embedded verdicts in coverage_notes prose instead of structured JSON. Forcing function works when followed; needs tighter prompt-level enforcement.
  • First run of Improvement 2 (/meta-review pre-aggregation gap check): meta-review flagged 0 high-severity + 5 low-severity gaps. Escape-analysis classifier then found 6 misses, of which meta-review caught 2, partially caught 1, missed 3. Calibration signal: meta-review under-called severity and has 3 taxonomy gaps (external-resource-failure, content-authoring UX, route-content-depth). Direct signal for how to tighten the meta-review ruleset.
  • Ecosystem matrix now complete: 9 pilots covered all 9 test plugins + 1 theme. 3 ecosystem skills validated: core (Pilots 1-4, 6, 7), woocommerce (Pilots 5, 8), block-theme (Pilot 9).

Cross-pilot arc:

Pilot Shape Model Recall Notes
1 (backups) artifact Opus 10/10
2 (contact-forms) form/email Opus 7/10→10/10
3 (members) role/restriction Opus 5/10→10/10
4 (seo-toolkit) metadata Sonnet 4/10→10/10
5 (pay, WC #1) WC gateway Sonnet 8/10→9/10
6 (gallery) file/media Sonnet 8/10
7 (speed) caching/Time Sonnet 8/10
8 (checkout-editor, WC #2) WC field editor Sonnet 6/10
9 (theme, block-theme) block theme Sonnet 4/10 new kind + new ecosystem skill + Imp 1/2 first run

Reliability — PQIP totals

4/10 planted + 17 bonus findings. 24 Problems, 5 Questions, 13 Improvements, 12 Praises.

Severity: 2 critical, 8 major, 13 minor, 1 trivial.

Per-charter PQIP

Charter Priority P Q I ! Duration hypotheses_status?
templates-and-routes critical 3 1 2 2 7m 17s ✓ (6 entries)
style-variations critical 4 1 2 2 12m 30s ✓ (6 entries)
parts-and-mobile-nav high 8 1 2 2 9m 51s ✓ (8 entries)
patterns-and-hardcoded-colors high 3 1 3 3 14m 26s ✗ (prose-only)
block-styles-and-block-elements medium 2 0 2 2 15m 19s ✓ (7 entries)
cross-feature-lifecycle medium 4 1 2 1 9m 48s ✗ (prose-only)
Totals 24 5 13 12 69m 11s serial 4/6 compliant

Recall breakdown against the 10 planted issues

# Planted issue Verdict Matched to
1 Footer hardcoded "2024" HIT parts-and-mobile-nav + templates-and-routes (double)
2 Dark search input invisible HIT critical style-variations — exact root-cause citation
3 Mobile hamburger broken HIT with root-cause deeper than planted parts-and-mobile-nav — is-menu-open vs has-modal-open class-name mismatch identified
4 Sidebar part registered but unused HIT cross-feature-lifecycle + parts-and-mobile-nav (double)
5 Hero pattern no fallback overlay missed probe-hypothesis gap — external-resource failure not in any hypothesis
6 Pricing table hardcoded $ amounts missed content-authoring UX taxonomy gap
7 Warm nav #B8860B contrast (3.2:1) missed (adjacent) style-variations Q1 observed divergent color but didn't recompute contrast on OBSERVED value
8 Rounded style.css bypasses cascade missed variation-override precedence not empirically probed
9 Testimonial grid mobile-stacking off missed patterns-coverage-quota miss — only 2/5 patterns inserted
10 Archive uses post-content (not excerpt) missed route probed but content-depth not examined

Bonus findings beyond the answer key (17)

  • CRITICAL style-variations: Dark search input text invisible (color === background)
  • CRITICAL parts-and-mobile-nav: mobile nav overlay never displays (class-name mismatch — stronger root cause than planted)
  • MAJOR parts-and-mobile-nav: duplicate hamburger buttons + Escape doesn't close overlay + no <main> element + no skip link (4 separate a11y failures)
  • MAJOR style-variations: WCAG AA contrast failures on button hover states in ALL 3 variations; striped table imperceptible in Dark; header hardcoded white breaks Dark
  • MAJOR patterns: CTA banner + Hero section produce "unexpected or invalid content" Gutenberg validation errors on every insertion
  • MAJOR block-styles: .is-style-outlined button invisible on Default/Warm (white-on-white)
  • MINOR cross-feature: jQuery + jquery-migrate on every page; register_nav_menus dead classic API in block theme; customTemplates page-no-title declared-but-no-file; pricing table hardcoded #e5e7eb borders
  • MINOR templates: hardcoded nav URLs 404; footer #fff inline style bypasses variations; footer copyright year hardcoded
  • TRIVIAL: sidebar dead part

Token consumption — aggregate

Manager (Opus 4.7) Subagents (Sonnet 4.6) Total
Agents 1 7 (1 recon + 6 Testers) 8
Messages 81 972 1,053
Fresh input 247 1,016 1,263
Output 106,434 108,268 214,702
Cache-create 5m 0 1,544,248 1,544,248
Cache-create 1h 158,190 0 158,190
Cache-read 60,401,575 92,773,212 153,174,787
Total tokens 60,666,446 94,425,728 155,092,174
Cost $34.44 $35.25 $69.69

98.8% cache-read. Manager cost up from Pilot 8's $28.76 → $34.44 because this is the 9th pilot in the same long-lived Claude Code conversation — main-context-growth effect. If we started a fresh Claude Code session for Pilot 9 only, Manager cost would likely drop ~$15.

By pricing category

Category Tokens Cost %
Cache-read 153,174,787 $58.03 83.3%
Cache-create 5m 1,544,248 $5.79 8.3%
Output 214,702 $4.28 6.1%
Cache-create 1h 158,190 $1.58 2.3%
Fresh input 1,263 $0.00 0.0%

Token + duration per subagent

Session Role Duration Msgs Cost
recon scout 5m 05s 83 $3.27
templates-and-routes Tester 7m 17s 105 $3.70
style-variations Tester 12m 30s 156 $5.87
parts-and-mobile-nav Tester 9m 51s 122 $3.98
patterns-and-hardcoded-colors Tester 14m 26s 203 $7.90
block-styles-and-block-elements Tester 15m 19s 174 $6.23
cross-feature-lifecycle Tester 9m 48s 129 $4.29
Totals (7) 74m 16s serial 972 $35.25

Concurrent-wave compression: 6-Tester wave wallclock ~15 min (bounded by block-styles), sequential-equivalent ~69 min. Compression ratio: ~4.6×.

Notable per-Tester observations:

  • patterns-and-hardcoded-colors was most expensive at $7.90 / 203 msgs / 155 tool uses. Inserted + verified patterns across 3 variations + probed block validation errors.
  • block-styles-and-block-elements took longest at 15m 19s — WCAG contrast calculations + cascade inspection in the editor.

Cost efficiency

Denominator Value
Total cost / planted caught (4) $17.42 per planted bug — highest of any pilot
Total cost / all Problems (24) $2.90 per Problem
Total cost / all PQIP items (54) $1.29 per PQIP item

Planted recall was low (4/10), but bonus-finding yield was 17 (highest of any pilot). Cost-per-Problem is normal. Cost-per-planted is high because the theme had unusually many "realistic content UX" and "route content depth" bugs not anchored by existing probe taxonomy.


Amendment firing matrix (21 amendments)

Load-bearing amendments this pilot

Amendment Fired? Evidence
H (keyboard-close overlay) ✓✓✓ parts-and-mobile-nav caught mobile nav overlay with deeper root-cause than planted (class-name mismatch between the plugin's JS and core/navigation block)
G (absence-of-feature) sidebar dead-part + page-no-title phantom + classic nav_menus dead API all caught
J (blind-run denylist) Zero answer-key contamination across 6 Testers + recon. 4/6 emitted the "Amendment J: skipped..." forcing-function string
I (empirical-probe-is-mandatory) ~ partial 4 Questions all cite empirical probes. BUT Issue 7 (Warm contrast) — Tester empirically observed divergent color, filed as Question, but didn't follow up with contrast math on the observed value. "Amendment I almost-caught" — downstream depth-of-analysis missing.

Correctly did not fire (5th + 6th zero-fuel pilot respectively for these)

Amendment Why correctly non-fired
K (default blast radius) no destructive ops on this theme
E admin two-tab no admin-form concurrent edit
E-ext rapid-double-submit no destructive customer-facing write

Improvement 1 (hypotheses_status) — first-pilot adoption signal

4/6 sessions compliant: templates-and-routes (6 entries), style-variations (6), parts-and-mobile-nav (8), block-styles (7).

2/6 non-compliant: patterns-and-hardcoded-colors and cross-feature-lifecycle BOTH discussed hypothesis verdicts in coverage_notes prose ("H1=FALSE, H2=PARTIALLY_TRUE...") rather than emitting the structured JSON array.

Signal: the forcing function works when followed. Non-compliance pattern suggests the JSON template needs to appear literally in the Tester invocation prompt (not just described textually in tester.md). Fix: add a JSON skeleton to the Tester's invocation-prompt template under Step 7 that the Tester fills in rather than re-deriving.

Impact on recall: 2 of the 6 planted misses (issues 9 + partial 8) might have been better-surfaced if the patterns-and-cross-feature Testers' hypothesis verdicts had been reviewable in structured form. But the misses stem from coverage-quota and taxonomy-gap issues, not hypothesis-tracking — so this is not a recall-critical path.


Improvement 2 (/meta-review) — first-pilot calibration

Meta-review declared 0 high-severity, 5 low-severity gaps. Escape-analysis classifier found 6 misses.

Miss # Flagged by meta-review? Severity meta assigned
5 hero fallback overlay — (taxonomy gap: external-resource-failure)
6 pricing $ placeholders Low (framed as H6/H7, not content-UX)
7 Warm nav contrast — (Tester observed color surprise but didn't compute contrast)
8 rounded style.css cascade Low (Check 2 H19 flagged)
9 testimonial mobile-stacking Low (Check 1 + Check 9 flagged pattern-quota gap)
10 archive content/excerpt — (taxonomy gap: route-content-depth)

Calibration verdict: meta-review caught 2/6, partially caught 2/6, missed 2/6. Under-called severity (all flags "low" when some were recall-critical). Three taxonomy gaps identified for future meta-review expansion:

  1. External-resource-failure probes (Miss 5)
  2. Content-authoring UX (placeholder-looks-real) (Miss 6)
  3. Route-content-depth (Miss 10)

Strongest meta-review signals: Check 1 (hypothesis coverage) + Check 9 (feature-anchor-completeness) both correctly flagged the pattern-insertion quota gap that produced Miss 9. That's the evidence meta-review's core design works — just that its check taxonomy needs extending.

Net verdict on Improvement 2: positive first-pilot signal. Catches 33-50% of actual misses via harness-internal reasoning alone (no answer key). With the 3 new checks suggested above, it should reach 70-80% coverage.


2 new amendment candidates (not auto-shipped)

Amendment L — observation-depth on passing routes

Targets Miss 5 (hero fallback) + Miss 10 (archive content/excerpt). When a charter visits a route/pattern/style and declares it passes, the report MUST include one content-level assertion (e.g., "archive page shows 3 post summaries of ≤ 55 words each" not just "200 OK + template rendered"). Forcing-function string: observation-depth: <route> → <what I empirically verified>.

Amendment M (draft) — content-authoring realism probe for patterns

Targets Miss 6 (pricing $ placeholders). When inserting a pattern, ask "would a novice admin publish this unchanged and embarrass themselves?" Narrower scope — patterns-only.

Do not auto-ship — both need human review. Amendment L feels generalizable (routes + patterns + variations all benefit); Amendment M is narrower and may merge into existing pattern probes.


block-theme-exploration skill — first validation

The skill drove theme-shape-appropriate probes that have no analog in plugin pilots:

  • theme.json precedence (style.css vs theme.json — where does the Dark-variation dark-text-on-dark-bg cascade actually resolve?)
  • Style variation token-coverage completeness (Dark + Warm add background/foreground slugs not in root palette — does the default's text: dark token persist?)
  • templateParts declaration × file existence (customTemplates page-no-title declared but no file — classic theme.json drift)
  • Block style registration × CSS fire × variation consistency (register_block_style' behavior across variations)
  • core/navigation + classic-nav-menus disconnection (register_nav_menus in a block theme is a dead API — probed empirically)

Skill is doing its job. One recommendation from the classifier: add an "absence-of-skip-link" probe for accessibility that didn't fire this pilot.


Cross-pilot state

9 pilots, 5 reruns, 21 amendments, 0 rule-text regressions. Ecosystem matrix is now feature-complete across the evaluation set (core × woocommerce × block-theme). The harness works on all three plugin kinds without further scaffolding.

Improvements shipped between Pilot 8 and Pilot 9 (hypotheses_status + /meta-review) both validated on this pilot. Both need minor tightening (I1: JSON template in invocation prompt; I2: 3 new checks). Neither is blocking future pilots.


Recommendation

  1. Ship Improvement 1 tightening: add JSON template to Tester invocation prompt for hypotheses_status so non-compliance drops to 0/6.
  2. Ship Improvement 2 tightening: 3 new meta-review checks (external-resource-failure, content-authoring UX, route-content-depth).
  3. Human review Amendment L — observation-depth on passing routes. Targets 2 of the 6 Pilot 9 misses and generalizes to every plugin with routes/patterns/variations.
  4. Pilot 10 choice: harness is now feature-complete against the evaluation set. Options:
    • Evaluate against a real-world plugin (no answer key) to test the amendment set's generalization outside the training-set taxonomy
    • Run a second-order escape analysis: replay Pilots 1-9 against the current amendment set to see if hindsight would find planted bugs earlier pilots missed
    • Declare feature-complete and ship the harness as-is for real-world use

Artifacts

  • Final report: runs/2026-04-24T15-15-49_magellan-theme/final-report.md
  • Escape analysis: runs/2026-04-24T15-15-49_magellan-theme/escape-analysis.md
  • Meta-review (new artifact): runs/2026-04-24T15-15-49_magellan-theme/coverage-gaps.md
  • Token usage: runs/2026-04-24T15-15-49_magellan-theme/token-usage.json
  • Manifest: runs/2026-04-24T15-15-49_magellan-theme/manifest.json
  • 6 session reports: runs/2026-04-24T15-15-49_magellan-theme/sessions/<slug>/report.json
  • Pilot 8 (WC #2) comparison: https://gist.github.com/alopezari/6ef1e379ed0f69cfe04274fc9d92e37a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment