Run ID: 2026-04-24T15-15-49_magellan-theme
Plugin: magellan-theme v1.0.0 — block theme with 7 templates + 3 parts + 5 patterns + 2 style variations + 3 registered block styles
Kind: theme (first kind: theme pilot)
Ecosystem: block-theme (first pilot exercising skills/block-theme-exploration/SKILL.md)
Driver: Chrome DevTools MCP
Dispatch: 6 charters + 1 recon concurrent wave (Sonnet-default, blind greybox)
Wallclock: ~30 min end-to-end
- Recall 4/10 strict. Lower than prior pilots, but 17 bonus findings beyond the answer key, including stronger-than-planted Amendment-H root-cause analysis (mobile nav overlay NEVER DISPLAYS due to an `is-menu-open` vs `has-modal-open` class mismatch), Dark-variation search input invisibility (critical), and 2 pattern block-validation errors from Pilot 8's charter-editor-quality theme.
- First run of Improvement 1 (`hypotheses_status` array): 4/6 sessions compliant, 2/6 embedded verdicts in coverage_notes prose instead of structured JSON. Forcing function works when followed; needs tighter prompt-level enforcement.
- First run of Improvement 2 (`/meta-review` pre-aggregation gap check): meta-review flagged 0 high-severity + 5 low-severity gaps. Escape-analysis classifier then found 6 misses, of which meta-review caught 2, partially caught 2, missed 2. Calibration signal: meta-review under-called severity and has 3 taxonomy gaps (external-resource-failure, content-authoring UX, route-content-depth). Direct signal for how to tighten the meta-review ruleset.
- Ecosystem matrix now complete: 9 pilots covered all 9 test plugins + 1 theme. 3 ecosystem skills validated: `core` (Pilots 1-4, 6, 7), `woocommerce` (Pilots 5, 8), `block-theme` (Pilot 9).
Cross-pilot arc:
| Pilot | Shape | Model | Recall | Notes |
|---|---|---|---|---|
| 1 (backups) | artifact | Opus | 10/10 | — |
| 2 (contact-forms) | form/email | Opus | 7/10→10/10 | — |
| 3 (members) | role/restriction | Opus | 5/10→10/10 | — |
| 4 (seo-toolkit) | metadata | Sonnet | 4/10→10/10 | — |
| 5 (pay, WC #1) | WC gateway | Sonnet | 8/10→9/10 | — |
| 6 (gallery) | file/media | Sonnet | 8/10 | — |
| 7 (speed) | caching/Time | Sonnet | 8/10 | — |
| 8 (checkout-editor, WC #2) | WC field editor | Sonnet | 6/10 | — |
| 9 (theme, block-theme) | block theme | Sonnet | 4/10 | new kind + new ecosystem skill + Imp 1/2 first run |
4/10 planted + 17 bonus findings. 24 Problems, 5 Questions, 13 Improvements, 12 Praises.
Severity: 2 critical, 8 major, 13 minor, 1 trivial.
| Charter | Priority | P | Q | I | ! | Duration | hypotheses_status? |
|---|---|---|---|---|---|---|---|
| templates-and-routes | critical | 3 | 1 | 2 | 2 | 7m 17s | ✓ (6 entries) |
| style-variations | critical | 4 | 1 | 2 | 2 | 12m 30s | ✓ (6 entries) |
| parts-and-mobile-nav | high | 8 | 1 | 2 | 2 | 9m 51s | ✓ (8 entries) |
| patterns-and-hardcoded-colors | high | 3 | 1 | 3 | 3 | 14m 26s | ✗ (prose-only) |
| block-styles-and-block-elements | medium | 2 | 0 | 2 | 2 | 15m 19s | ✓ (7 entries) |
| cross-feature-lifecycle | medium | 4 | 1 | 2 | 1 | 9m 48s | ✗ (prose-only) |
| Totals | | 24 | 5 | 13 | 12 | 69m 11s serial | 4/6 compliant |
| # | Planted issue | Verdict | Matched to |
|---|---|---|---|
| 1 | Footer hardcoded "2024" | HIT | parts-and-mobile-nav + templates-and-routes (double) |
| 2 | Dark search input invisible | HIT critical | style-variations — exact root-cause citation |
| 3 | Mobile hamburger broken | HIT with root-cause deeper than planted | parts-and-mobile-nav — is-menu-open vs has-modal-open class-name mismatch identified |
| 4 | Sidebar part registered but unused | HIT | cross-feature-lifecycle + parts-and-mobile-nav (double) |
| 5 | Hero pattern no fallback overlay | missed | probe-hypothesis gap — external-resource failure not in any hypothesis |
| 6 | Pricing table hardcoded $ amounts | missed | content-authoring UX taxonomy gap |
| 7 | Warm nav #B8860B contrast (3.2:1) | missed (adjacent) | style-variations Q1 observed divergent color but didn't recompute contrast on OBSERVED value |
| 8 | Rounded style.css bypasses cascade | missed | variation-override precedence not empirically probed |
| 9 | Testimonial grid mobile-stacking off | missed | patterns-coverage-quota miss — only 2/5 patterns inserted |
| 10 | Archive uses post-content (not excerpt) | missed | route probed but content-depth not examined |
- CRITICAL style-variations: Dark search input text invisible (`color===background`)
- CRITICAL parts-and-mobile-nav: mobile nav overlay never displays (class-name mismatch, a stronger root cause than planted)
- MAJOR parts-and-mobile-nav: duplicate hamburger buttons + Escape doesn't close overlay + no `<main>` element + no skip link (4 separate a11y failures)
- MAJOR style-variations: WCAG AA contrast failures on button hover states in ALL 3 variations; striped table imperceptible in Dark; header hardcoded white breaks Dark
- MAJOR patterns: CTA banner + Hero section produce "unexpected or invalid content" Gutenberg validation errors on every insertion
- MAJOR block-styles: `.is-style-outlined` button invisible on Default/Warm (white-on-white)
- MINOR cross-feature: jQuery + jquery-migrate on every page; `register_nav_menus` dead classic API in block theme; customTemplates `page-no-title` declared-but-no-file; pricing table hardcoded `#e5e7eb` borders
- MINOR templates: hardcoded nav URLs 404; footer `#fff` inline style bypasses variations; footer copyright year hardcoded
- TRIVIAL: sidebar dead part
| | Manager (Opus 4.7) | Subagents (Sonnet 4.6) | Total |
|---|---|---|---|
| Agents | 1 | 7 (1 recon + 6 Testers) | 8 |
| Messages | 81 | 972 | 1,053 |
| Fresh input | 247 | 1,016 | 1,263 |
| Output | 106,434 | 108,268 | 214,702 |
| Cache-create 5m | 0 | 1,544,248 | 1,544,248 |
| Cache-create 1h | 158,190 | 0 | 158,190 |
| Cache-read | 60,401,575 | 92,773,212 | 153,174,787 |
| Total tokens | 60,666,446 | 94,425,728 | 155,092,174 |
| Cost | $34.44 | $35.25 | $69.69 |
98.8% cache-read. Manager cost up from Pilot 8's $28.76 → $34.44 because this is the 9th pilot in the same long-lived Claude Code conversation — main-context-growth effect. If we started a fresh Claude Code session for Pilot 9 only, Manager cost would likely drop ~$15.
| Category | Tokens | Cost | % |
|---|---|---|---|
| Cache-read | 153,174,787 | $58.03 | 83.3% |
| Cache-create 5m | 1,544,248 | $5.79 | 8.3% |
| Output | 214,702 | $4.28 | 6.1% |
| Cache-create 1h | 158,190 | $1.58 | 2.3% |
| Fresh input | 1,263 | $0.00 | 0.0% |
| Session | Role | Duration | Msgs | Cost |
|---|---|---|---|---|
| recon | scout | 5m 05s | 83 | $3.27 |
| templates-and-routes | Tester | 7m 17s | 105 | $3.70 |
| style-variations | Tester | 12m 30s | 156 | $5.87 |
| parts-and-mobile-nav | Tester | 9m 51s | 122 | $3.98 |
| patterns-and-hardcoded-colors | Tester | 14m 26s | 203 | $7.90 |
| block-styles-and-block-elements | Tester | 15m 19s | 174 | $6.23 |
| cross-feature-lifecycle | Tester | 9m 48s | 129 | $4.29 |
| Totals (7) | | 74m 16s serial | 972 | $35.25 |
Concurrent-wave compression: 6-Tester wave wallclock ~15 min (bounded by block-styles), sequential-equivalent ~69 min. Compression ratio: ~4.6×.
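The compression figure can be rechecked from the per-Tester durations in the charter table. The sketch below is only an arithmetic illustration, not part of the harness; it assumes the wave's wallclock equals the slowest Tester's duration, which ignores any dispatch overhead.

```python
# Per-Tester durations copied from the charter table, as (minutes, seconds).
TESTER_DURATIONS = {
    "templates-and-routes": (7, 17),
    "style-variations": (12, 30),
    "parts-and-mobile-nav": (9, 51),
    "patterns-and-hardcoded-colors": (14, 26),
    "block-styles-and-block-elements": (15, 19),
    "cross-feature-lifecycle": (9, 48),
}


def to_seconds(duration):
    """Convert a (minutes, seconds) pair to total seconds."""
    minutes, seconds = duration
    return minutes * 60 + seconds


# Sequential-equivalent time: every Tester run back to back.
serial_seconds = sum(to_seconds(d) for d in TESTER_DURATIONS.values())

# ASSUMPTION: a concurrent wave is bounded by its slowest member.
slowest_name, slowest = max(
    TESTER_DURATIONS.items(), key=lambda item: to_seconds(item[1])
)
wave_seconds = to_seconds(slowest)

print(f"serial: {serial_seconds // 60}m {serial_seconds % 60}s")
print(f"bounded by: {slowest_name}")
print(f"ratio vs exact slowest Tester: {serial_seconds / wave_seconds:.1f}x")
print(f"ratio vs rounded 15 min wave:  {serial_seconds / (15 * 60):.1f}x")
```

The ~4.6× quoted above corresponds to the rounded 15-minute wallclock; using the slowest Tester's exact 15m 19s gives roughly 4.5×.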
Notable per-Tester observations:
- `patterns-and-hardcoded-colors` was most expensive at $7.90 / 203 msgs / 155 tool uses. Inserted + verified patterns across 3 variations + probed block validation errors.
- `block-styles-and-block-elements` took longest at 15m 19s: WCAG contrast calculations + cascade inspection in the editor.
| Denominator | Value |
|---|---|
| Total cost / planted caught (4) | $17.42 per planted bug — highest of any pilot |
| Total cost / all Problems (24) | $2.90 per Problem |
| Total cost / all PQIP items (54) | $1.29 per PQIP item |
Planted recall was low (4/10), but bonus-finding yield was 17 (highest of any pilot). Cost-per-Problem is normal. Cost-per-planted is high because the theme had unusually many "realistic content UX" and "route content depth" bugs not anchored by existing probe taxonomy.
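For reference, the three denominators in the table follow directly from the run totals. A minimal sketch of that arithmetic, using only figures reported in this document (the variable names are illustrative):

```python
# Run totals as reported above.
TOTAL_COST = 69.69          # USD, Manager + Subagents
PLANTED_CAUGHT = 4          # planted issues with a HIT verdict
COUNTS = {"Problems": 24, "Questions": 5, "Improvements": 13, "Praises": 12}

pqip_total = sum(COUNTS.values())

cost_per_planted = round(TOTAL_COST / PLANTED_CAUGHT, 2)
cost_per_problem = round(TOTAL_COST / COUNTS["Problems"], 2)
cost_per_pqip = round(TOTAL_COST / pqip_total, 2)

print(f"PQIP items: {pqip_total}")
print(f"per planted bug: ${cost_per_planted:.2f}")
print(f"per Problem:     ${cost_per_problem:.2f}")
print(f"per PQIP item:   ${cost_per_pqip:.2f}")
```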
| Amendment | Fired? | Evidence |
|---|---|---|
| H (keyboard-close overlay) | ✓✓✓ | parts-and-mobile-nav caught mobile nav overlay with deeper root-cause than planted (class-name mismatch between the theme's JS and the core/navigation block) |
| G (absence-of-feature) | ✓ | sidebar dead-part + page-no-title phantom + classic nav_menus dead API all caught |
| J (blind-run denylist) | ✓ | Zero answer-key contamination across 6 Testers + recon. 4/6 emitted the "Amendment J: skipped..." forcing-function string |
| I (empirical-probe-is-mandatory) | ~ partial | 4 Questions all cite empirical probes. BUT Issue 7 (Warm contrast) — Tester empirically observed divergent color, filed as Question, but didn't follow up with contrast math on the observed value. "Amendment I almost-caught" — downstream depth-of-analysis missing. |
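The follow-up that Amendment I's partial verdict describes (recomputing contrast on the OBSERVED color) is a small calculation. The sketch below uses the WCAG 2.x relative-luminance and contrast-ratio formulas; the white background is an assumption made for illustration only, since this report does not record what the Warm nav color was rendered against.

```python
def _linearize(value_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = value_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a #RRGGBB color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(color_a: str, color_b: str) -> float:
    """WCAG contrast ratio; the lighter color always goes in the numerator."""
    lighter, darker = sorted(
        (relative_luminance(color_a), relative_luminance(color_b)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


AA_NORMAL_TEXT = 4.5  # WCAG AA threshold for normal-size text

observed_color = "#B8860B"      # value from planted issue 7
assumed_background = "#FFFFFF"  # ASSUMPTION: actual background not stated here

ratio = contrast_ratio(observed_color, assumed_background)
print(f"{ratio:.2f}:1, passes AA normal text: {ratio >= AA_NORMAL_TEXT}")
```

Under that white-background assumption the ratio lands between 3.2 and 3.3, which is consistent with the 3.2:1 in the answer key and below the 4.5:1 AA threshold for normal text.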
| Amendment | Why correctly non-fired |
|---|---|
| K (default blast radius) | no destructive ops on this theme |
| E admin two-tab | no admin-form concurrent edit |
| E-ext rapid-double-submit | no destructive customer-facing write |
4/6 sessions compliant: templates-and-routes (6 entries), style-variations (6), parts-and-mobile-nav (8), block-styles (7).
2/6 non-compliant: patterns-and-hardcoded-colors and cross-feature-lifecycle BOTH discussed hypothesis verdicts in coverage_notes prose ("H1=FALSE, H2=PARTIALLY_TRUE...") rather than emitting the structured JSON array.
Signal: the forcing function works when followed. Non-compliance pattern suggests the JSON template needs to appear literally in the Tester invocation prompt (not just described textually in tester.md). Fix: add a JSON skeleton to the Tester's invocation-prompt template under Step 7 that the Tester fills in rather than re-deriving.
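One way to picture the compliance check is a small validator that separates a structured `hypotheses_status` array from verdicts buried in `coverage_notes` prose. This is a hypothetical sketch, not the harness's code: the entry fields (`id`, `verdict`, `evidence`) and the full verdict vocabulary are assumptions, since this report only shows `FALSE` and `PARTIALLY_TRUE` appearing in prose.

```python
import re

# ASSUMPTION: only FALSE and PARTIALLY_TRUE are attested in this report;
# TRUE and UNTESTED are added here for illustration.
ALLOWED_VERDICTS = {"TRUE", "FALSE", "PARTIALLY_TRUE", "UNTESTED"}

# Matches prose-style verdicts such as "H1=FALSE" or "H2 = PARTIALLY_TRUE".
PROSE_VERDICT = re.compile(r"\bH\d+\s*=\s*[A-Z_]+")


def check_hypotheses_status(report: dict) -> list:
    """Return a list of compliance problems; an empty list means compliant."""
    problems = []
    entries = report.get("hypotheses_status")

    if not isinstance(entries, list) or not entries:
        problems.append("hypotheses_status is missing or not a non-empty array")
        if PROSE_VERDICT.search(str(report.get("coverage_notes", ""))):
            problems.append("verdicts appear in coverage_notes prose instead")
        return problems

    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index} is not an object")
            continue
        # Hypothetical required fields for each entry.
        for field in ("id", "verdict", "evidence"):
            if not entry.get(field):
                problems.append(f"entry {index} lacks '{field}'")
        verdict = entry.get("verdict")
        if verdict and verdict not in ALLOWED_VERDICTS:
            problems.append(f"entry {index} has unknown verdict {verdict!r}")
    return problems


compliant = {
    "hypotheses_status": [
        {"id": "H1", "verdict": "FALSE", "evidence": "probed on /archive"},
    ]
}
prose_only = {"coverage_notes": "H1=FALSE, H2=PARTIALLY_TRUE after probing."}

print(check_hypotheses_status(compliant))
print(check_hypotheses_status(prose_only))
```

A check along these lines could run before aggregation so that prose-only sessions are bounced back to the Tester rather than discovered afterwards.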
Impact on recall: 2 of the 6 planted misses (issues 9 + partial 8) might have been better-surfaced if the patterns-and-cross-feature Testers' hypothesis verdicts had been reviewable in structured form. But the misses stem from coverage-quota and taxonomy-gap issues, not hypothesis-tracking — so this is not a recall-critical path.
Meta-review declared 0 high-severity, 5 low-severity gaps. Escape-analysis classifier found 6 misses.
| Miss # | Flagged by meta-review? | Severity meta assigned |
|---|---|---|
| 5 hero fallback overlay | ✗ | — (taxonomy gap: external-resource-failure) |
| 6 pricing $ placeholders | ∼ | Low (framed as H6/H7, not content-UX) |
| 7 Warm nav contrast | ∼ | — (Tester observed color surprise but didn't compute contrast) |
| 8 rounded style.css cascade | ✓ | Low (Check 2 H19 flagged) |
| 9 testimonial mobile-stacking | ✓ | Low (Check 1 + Check 9 flagged pattern-quota gap) |
| 10 archive content/excerpt | ✗ | — (taxonomy gap: route-content-depth) |
Calibration verdict: meta-review caught 2/6, partially caught 2/6, missed 2/6. Under-called severity (all flags "low" when some were recall-critical). Three taxonomy gaps identified for future meta-review expansion:
- External-resource-failure probes (Miss 5)
- Content-authoring UX (placeholder-looks-real) (Miss 6)
- Route-content-depth (Miss 10)
Strongest meta-review signals: Check 1 (hypothesis coverage) + Check 9 (feature-anchor-completeness) both correctly flagged the pattern-insertion quota gap that produced Miss 9. That's the evidence meta-review's core design works — just that its check taxonomy needs extending.
Net verdict on Improvement 2: positive first-pilot signal. Catches 33-50% of actual misses via harness-internal reasoning alone (no answer key). With the 3 new checks suggested above, it should reach 70-80% coverage.
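The 33-50% range can be reproduced from the calibration table under one plausible reading: the lower bound counts only full catches, and the upper bound gives half credit to each partial catch. The half-credit weighting is an assumption; the report does not state how the range was derived.

```python
# Meta-review outcome per miss, transcribed from the calibration table
# (✓ = caught, ∼ = partial, ✗ = missed).
OUTCOMES = {
    5: "missed",
    6: "partial",
    7: "partial",
    8: "caught",
    9: "caught",
    10: "missed",
}

total = len(OUTCOMES)
caught = sum(1 for outcome in OUTCOMES.values() if outcome == "caught")
partial = sum(1 for outcome in OUTCOMES.values() if outcome == "partial")

strict_rate = caught / total
# ASSUMPTION: a partial catch is worth half a full catch.
lenient_rate = (caught + 0.5 * partial) / total

print(f"strict:  {strict_rate:.0%}")
print(f"lenient: {lenient_rate:.0%}")
```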
Amendment L (observation-depth on passing routes). Targets Miss 5 (hero fallback) + Miss 10 (archive content/excerpt). When a charter visits a route/pattern/style and declares it passes, the report MUST include one content-level assertion (e.g., "archive page shows 3 post summaries of ≤ 55 words each", not just "200 OK + template rendered"). Forcing-function string: `observation-depth: <route> → <what I empirically verified>`.
Amendment M. Targets Miss 6 (pricing $ placeholders). When inserting a pattern, ask "would a novice admin publish this unchanged and embarrass themselves?" Narrower scope: patterns only.
Do not auto-ship — both need human review. Amendment L feels generalizable (routes + patterns + variations all benefit); Amendment M is narrower and may merge into existing pattern probes.
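If the observation-depth forcing function is adopted, its presence could be checked mechanically. The sketch below is hypothetical: it assumes the string appears one per line in the report text and that the list of routes a charter declared passing is available separately; neither detail is specified in this document.

```python
import re

# One forcing-function line per passing route, in the proposed format:
#   observation-depth: <route> → <what I empirically verified>
OBSERVATION_DEPTH = re.compile(
    r"^observation-depth:\s*(?P<route>\S.*?)\s*→\s*(?P<verified>\S.*)$",
    re.MULTILINE,
)


def missing_observation_depth(passing_routes, report_text):
    """Return the passing routes that have no observation-depth line."""
    covered = {
        match.group("route") for match in OBSERVATION_DEPTH.finditer(report_text)
    }
    return [route for route in passing_routes if route not in covered]


# Hypothetical report excerpt and route list, for illustration only.
report_text = """\
Archive route renders.
observation-depth: /archive → 3 post summaries of ≤ 55 words each
Search route renders with 200 OK.
"""
passing_routes = ["/archive", "/search"]

print(missing_observation_depth(passing_routes, report_text))
```

Here `/search` would be flagged because it was declared passing on status code alone, which is the shape of Miss 10.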
The skill drove theme-shape-appropriate probes that have no analog in plugin pilots:
- theme.json precedence (`style.css` vs `theme.json`: where does the Dark-variation dark-text-on-dark-bg cascade actually resolve?)
- Style variation token-coverage completeness (Dark + Warm add `background`/`foreground` slugs not in root palette; does the default's `text: dark` token persist?)
- templateParts declaration × file existence (customTemplates `page-no-title` declared but no file: classic theme.json drift)
- Block style registration × CSS fire × variation consistency (`register_block_style` behavior across variations)
- core/navigation + classic-nav-menus disconnection (`register_nav_menus` in a block theme is a dead API, probed empirically)
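The declaration × file-existence probe in the list above lends itself to a static check. The following is a rough sketch rather than the skill's actual probe: it relies on the standard block-theme layout, where `customTemplates` entries in `theme.json` map to `templates/<name>.html` and `templateParts` entries map to `parts/<name>.html`. The function name and the theme directory path are hypothetical.

```python
import json
from pathlib import Path


def find_declared_but_missing(theme_dir):
    """List theme.json declarations that have no matching .html file."""
    root = Path(theme_dir)
    theme_json = json.loads((root / "theme.json").read_text(encoding="utf-8"))

    # theme.json key -> folder where the matching file is expected.
    checks = (("customTemplates", "templates"), ("templateParts", "parts"))

    missing = []
    for key, folder in checks:
        for entry in theme_json.get(key, []):
            name = entry.get("name")
            if name and not (root / folder / f"{name}.html").is_file():
                missing.append(f"{key}: {name} -> {folder}/{name}.html not found")
    return missing


if __name__ == "__main__":
    # Hypothetical path; point this at an unpacked theme directory.
    for problem in find_declared_but_missing("wp-content/themes/magellan-theme"):
        print(problem)
```

This only catches declared-but-no-file drift such as `page-no-title`. A part that exists on disk but is never referenced by any template (the sidebar case) would need a separate scan of template markup.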
Skill is doing its job. One recommendation from the classifier: add an "absence-of-skip-link" probe for accessibility that didn't fire this pilot.
9 pilots, 5 reruns, 21 amendments, 0 rule-text regressions. Ecosystem matrix is now feature-complete across the evaluation set (core × woocommerce × block-theme). The harness works on all three plugin kinds without further scaffolding.
Improvements shipped between Pilot 8 and Pilot 9 (hypotheses_status + /meta-review) both validated on this pilot. Both need minor tightening (I1: JSON template in invocation prompt; I2: 3 new checks). Neither is blocking future pilots.
- Ship Improvement 1 tightening: add JSON template to Tester invocation prompt for `hypotheses_status` so non-compliance drops to 0/6.
- Ship Improvement 2 tightening: 3 new meta-review checks (external-resource-failure, content-authoring UX, route-content-depth).
- Human review Amendment L — observation-depth on passing routes. Targets 2 of the 6 Pilot 9 misses and generalizes to every plugin with routes/patterns/variations.
- Pilot 10 choice: harness is now feature-complete against the evaluation set. Options:
  - Evaluate against a real-world plugin (no answer key) to test the amendment set's generalization outside the training-set taxonomy
  - Run a second-order escape analysis: replay Pilots 1-9 against the current amendment set to see if hindsight would find planted bugs earlier pilots missed
  - Declare feature-complete and ship the harness as-is for real-world use
- Final report: `runs/2026-04-24T15-15-49_magellan-theme/final-report.md`
- Escape analysis: `runs/2026-04-24T15-15-49_magellan-theme/escape-analysis.md`
- Meta-review (new artifact): `runs/2026-04-24T15-15-49_magellan-theme/coverage-gaps.md`
- Token usage: `runs/2026-04-24T15-15-49_magellan-theme/token-usage.json`
- Manifest: `runs/2026-04-24T15-15-49_magellan-theme/manifest.json`
- 6 session reports: `runs/2026-04-24T15-15-49_magellan-theme/sessions/<slug>/report.json`
- Pilot 8 (WC #2) comparison: https://gist.github.com/alopezari/6ef1e379ed0f69cfe04274fc9d92e37a