Magellan Pilot 9 — magellan-theme (first block-theme pilot; first run of Improvements 1 + 2)

Run ID: 2026-04-24T15-15-49_magellan-theme Plugin: magellan-theme v1.0.0 — block theme with 7 templates + 3 parts + 5 patterns + 2 style variations + 3 registered block styles Kind: theme (first kind: theme pilot) Ecosystem: block-theme (first pilot exercising skills/block-theme-exploration/SKILL.md) Driver: Chrome DevTools MCP Dispatch: 6 charters + 1 recon concurrent wave (Sonnet-default, blind greybox) Wallclock: ~30 min end-to-end

TL;DR — feature-complete across the evaluation matrix; meta-review calibration signal

Recall 4/10 strict. Lower than prior pilots, but 17 bonus findings beyond the answer key including stronger-than-planted Amendment-H root-cause analysis (mobile nav overlay NEVER DISPLAYS due to is-menu-open vs has-modal-open class mismatch), Dark-variation search input invisibility (critical), and 2 pattern block-validation errors from Pilot 8's charter-editor-quality theme.
First run of Improvement 1 (hypotheses_status array): 4/6 sessions compliant, 2/6 embedded verdicts in coverage_notes prose instead of structured JSON. Forcing function works when followed; needs tighter prompt-level enforcement.
First run of Improvement 2 (/meta-review pre-aggregation gap check): meta-review flagged 0 high-severity + 5 low-severity gaps. Escape-analysis classifier then found 6 misses, of which meta-review caught 2, partially caught 1, missed 3. Calibration signal: meta-review under-called severity and has 3 taxonomy gaps (external-resource-failure, content-authoring UX, route-content-depth). Direct signal for how to tighten the meta-review ruleset.
Ecosystem matrix now complete: 9 pilots covered all 9 test plugins + 1 theme. 3 ecosystem skills validated: core (Pilots 1-4, 6, 7), woocommerce (Pilots 5, 8), block-theme (Pilot 9).

Cross-pilot arc:

Pilot	Shape	Model	Recall	Notes
1 (backups)	artifact	Opus	10/10	—
2 (contact-forms)	form/email	Opus	7/10→10/10	—
3 (members)	role/restriction	Opus	5/10→10/10	—
4 (seo-toolkit)	metadata	Sonnet	4/10→10/10	—
5 (pay, WC #1)	WC gateway	Sonnet	8/10→9/10	—
6 (gallery)	file/media	Sonnet	8/10	—
7 (speed)	caching/Time	Sonnet	8/10	—
8 (checkout-editor, WC #2)	WC field editor	Sonnet	6/10	—
9 (theme, block-theme)	block theme	Sonnet	4/10	new kind + new ecosystem skill + Imp 1/2 first run

Reliability — PQIP totals

4/10 planted + 17 bonus findings. 24 Problems, 5 Questions, 13 Improvements, 12 Praises.

Severity: 2 critical, 8 major, 13 minor, 1 trivial.

Per-charter PQIP

Charter	Priority	P	Q	I	!	Duration	`hypotheses_status`?
templates-and-routes	critical	3	1	2	2	7m 17s	✓ (6 entries)
style-variations	critical	4	1	2	2	12m 30s	✓ (6 entries)
parts-and-mobile-nav	high	8	1	2	2	9m 51s	✓ (8 entries)
patterns-and-hardcoded-colors	high	3	1	3	3	14m 26s	✗ (prose-only)
block-styles-and-block-elements	medium	2	0	2	2	15m 19s	✓ (7 entries)
cross-feature-lifecycle	medium	4	1	2	1	9m 48s	✗ (prose-only)
Totals		24	5	13	12	69m 11s serial	4/6 compliant

Recall breakdown against the 10 planted issues

#	Planted issue	Verdict	Matched to
1	Footer hardcoded "2024"	HIT	parts-and-mobile-nav + templates-and-routes (double)
2	Dark search input invisible	HIT critical	style-variations — exact root-cause citation
3	Mobile hamburger broken	HIT with root-cause deeper than planted	parts-and-mobile-nav — `is-menu-open` vs `has-modal-open` class-name mismatch identified
4	Sidebar part registered but unused	HIT	cross-feature-lifecycle + parts-and-mobile-nav (double)
5	Hero pattern no fallback overlay	missed	probe-hypothesis gap — external-resource failure not in any hypothesis
6	Pricing table hardcoded $ amounts	missed	content-authoring UX taxonomy gap
7	Warm nav #B8860B contrast (3.2:1)	missed (adjacent)	style-variations Q1 observed divergent color but didn't recompute contrast on OBSERVED value
8	Rounded style.css bypasses cascade	missed	variation-override precedence not empirically probed
9	Testimonial grid mobile-stacking off	missed	patterns-coverage-quota miss — only 2/5 patterns inserted
10	Archive uses post-content (not excerpt)	missed	route probed but content-depth not examined

Bonus findings beyond the answer key (17)

CRITICAL style-variations: Dark search input text invisible (color === background)
CRITICAL parts-and-mobile-nav: mobile nav overlay never displays (class-name mismatch — stronger root cause than planted)
MAJOR parts-and-mobile-nav: duplicate hamburger buttons + Escape doesn't close overlay + no <main> element + no skip link (4 separate a11y failures)
MAJOR style-variations: WCAG AA contrast failures on button hover states in ALL 3 variations; striped table imperceptible in Dark; header hardcoded white breaks Dark
MAJOR patterns: CTA banner + Hero section produce "unexpected or invalid content" Gutenberg validation errors on every insertion
MAJOR block-styles: .is-style-outlined button invisible on Default/Warm (white-on-white)
MINOR cross-feature: jQuery + jquery-migrate on every page; register_nav_menus dead classic API in block theme; customTemplates page-no-title declared-but-no-file; pricing table hardcoded #e5e7eb borders
MINOR templates: hardcoded nav URLs 404; footer #fff inline style bypasses variations; footer copyright year hardcoded
TRIVIAL: sidebar dead part

Token consumption — aggregate

	Manager (Opus 4.7)	Subagents (Sonnet 4.6)	Total
Agents	1	7 (1 recon + 6 Testers)	8
Messages	81	972	1,053
Fresh input	247	1,016	1,263
Output	106,434	108,268	214,702
Cache-create 5m	0	1,544,248	1,544,248
Cache-create 1h	158,190	0	158,190
Cache-read	60,401,575	92,773,212	153,174,787
Total tokens	60,666,446	94,425,728	155,092,174
Cost	$34.44	$35.25	$69.69

98.8% cache-read. Manager cost up from Pilot 8's $28.76 → $34.44 because this is the 9th pilot in the same long-lived Claude Code conversation — main-context-growth effect. If we started a fresh Claude Code session for Pilot 9 only, Manager cost would likely drop ~$15.

By pricing category

Category	Tokens	Cost	%
Cache-read	153,174,787	$58.03	83.3%
Cache-create 5m	1,544,248	$5.79	8.3%
Output	214,702	$4.28	6.1%
Cache-create 1h	158,190	$1.58	2.3%
Fresh input	1,263	$0.00	0.0%

Token + duration per subagent

Session	Role	Duration	Msgs	Cost
recon	scout	5m 05s	83	$3.27
templates-and-routes	Tester	7m 17s	105	$3.70
style-variations	Tester	12m 30s	156	$5.87
parts-and-mobile-nav	Tester	9m 51s	122	$3.98
patterns-and-hardcoded-colors	Tester	14m 26s	203	$7.90
block-styles-and-block-elements	Tester	15m 19s	174	$6.23
cross-feature-lifecycle	Tester	9m 48s	129	$4.29
Totals (7)		74m 16s serial	972	$35.25

Concurrent-wave compression: 6-Tester wave wallclock ~15 min (bounded by block-styles), sequential-equivalent ~69 min. Compression ratio: ~4.6×.

Notable per-Tester observations:

patterns-and-hardcoded-colors was most expensive at $7.90 / 203 msgs / 155 tool uses. Inserted + verified patterns across 3 variations + probed block validation errors.
block-styles-and-block-elements took longest at 15m 19s — WCAG contrast calculations + cascade inspection in the editor.

Cost efficiency

Denominator	Value
Total cost / planted caught (4)	$17.42 per planted bug — highest of any pilot
Total cost / all Problems (24)	$2.90 per Problem
Total cost / all PQIP items (54)	$1.29 per PQIP item

Planted recall was low (4/10), but bonus-finding yield was 17 (highest of any pilot). Cost-per-Problem is normal. Cost-per-planted is high because the theme had unusually many "realistic content UX" and "route content depth" bugs not anchored by existing probe taxonomy.

Amendment firing matrix (21 amendments)

Load-bearing amendments this pilot

Amendment	Fired?	Evidence
H (keyboard-close overlay)	✓✓✓	parts-and-mobile-nav caught mobile nav overlay with deeper root-cause than planted (class-name mismatch between the plugin's JS and core/navigation block)
G (absence-of-feature)	✓	sidebar dead-part + page-no-title phantom + classic nav_menus dead API all caught
J (blind-run denylist)	✓	Zero answer-key contamination across 6 Testers + recon. 4/6 emitted the "Amendment J: skipped..." forcing-function string
I (empirical-probe-is-mandatory)	~ partial	4 Questions all cite empirical probes. BUT Issue 7 (Warm contrast) — Tester empirically observed divergent color, filed as Question, but didn't follow up with contrast math on the observed value. "Amendment I almost-caught" — downstream depth-of-analysis missing.

Correctly did not fire (5th + 6th zero-fuel pilot respectively for these)

Amendment	Why correctly non-fired
K (default blast radius)	no destructive ops on this theme
E admin two-tab	no admin-form concurrent edit
E-ext rapid-double-submit	no destructive customer-facing write

Improvement 1 (`hypotheses_status`) — first-pilot adoption signal

4/6 sessions compliant: templates-and-routes (6 entries), style-variations (6), parts-and-mobile-nav (8), block-styles (7).

2/6 non-compliant: patterns-and-hardcoded-colors and cross-feature-lifecycle BOTH discussed hypothesis verdicts in coverage_notes prose ("H1=FALSE, H2=PARTIALLY_TRUE...") rather than emitting the structured JSON array.

Signal: the forcing function works when followed. Non-compliance pattern suggests the JSON template needs to appear literally in the Tester invocation prompt (not just described textually in tester.md). Fix: add a JSON skeleton to the Tester's invocation-prompt template under Step 7 that the Tester fills in rather than re-deriving.

Impact on recall: 2 of the 6 planted misses (issues 9 + partial 8) might have been better-surfaced if the patterns-and-cross-feature Testers' hypothesis verdicts had been reviewable in structured form. But the misses stem from coverage-quota and taxonomy-gap issues, not hypothesis-tracking — so this is not a recall-critical path.

Improvement 2 (`/meta-review`) — first-pilot calibration

Meta-review declared 0 high-severity, 5 low-severity gaps. Escape-analysis classifier found 6 misses.

Miss #	Flagged by meta-review?	Severity meta assigned
5 hero fallback overlay	✗	— (taxonomy gap: external-resource-failure)
6 pricing $ placeholders	∼	Low (framed as H6/H7, not content-UX)
7 Warm nav contrast	∼	— (Tester observed color surprise but didn't compute contrast)
8 rounded style.css cascade	✓	Low (Check 2 H19 flagged)
9 testimonial mobile-stacking	✓	Low (Check 1 + Check 9 flagged pattern-quota gap)
10 archive content/excerpt	✗	— (taxonomy gap: route-content-depth)

Calibration verdict: meta-review caught 2/6, partially caught 2/6, missed 2/6. Under-called severity (all flags "low" when some were recall-critical). Three taxonomy gaps identified for future meta-review expansion:

External-resource-failure probes (Miss 5)
Content-authoring UX (placeholder-looks-real) (Miss 6)
Route-content-depth (Miss 10)

Strongest meta-review signals: Check 1 (hypothesis coverage) + Check 9 (feature-anchor-completeness) both correctly flagged the pattern-insertion quota gap that produced Miss 9. That's the evidence meta-review's core design works — just that its check taxonomy needs extending.

Net verdict on Improvement 2: positive first-pilot signal. Catches 33-50% of actual misses via harness-internal reasoning alone (no answer key). With the 3 new checks suggested above, it should reach 70-80% coverage.

2 new amendment candidates (not auto-shipped)

Amendment L — observation-depth on passing routes

Targets Miss 5 (hero fallback) + Miss 10 (archive content/excerpt). When a charter visits a route/pattern/style and declares it passes, the report MUST include one content-level assertion (e.g., "archive page shows 3 post summaries of ≤ 55 words each" not just "200 OK + template rendered"). Forcing-function string: observation-depth: <route> → <what I empirically verified>.

Amendment M (draft) — content-authoring realism probe for patterns

Targets Miss 6 (pricing $ placeholders). When inserting a pattern, ask "would a novice admin publish this unchanged and embarrass themselves?" Narrower scope — patterns-only.

Do not auto-ship — both need human review. Amendment L feels generalizable (routes + patterns + variations all benefit); Amendment M is narrower and may merge into existing pattern probes.

block-theme-exploration skill — first validation

The skill drove theme-shape-appropriate probes that have no analog in plugin pilots:

theme.json precedence (style.css vs theme.json — where does the Dark-variation dark-text-on-dark-bg cascade actually resolve?)
Style variation token-coverage completeness (Dark + Warm add background/foreground slugs not in root palette — does the default's text: dark token persist?)
templateParts declaration × file existence (customTemplates page-no-title declared but no file — classic theme.json drift)
Block style registration × CSS fire × variation consistency (register_block_style' behavior across variations)
core/navigation + classic-nav-menus disconnection (register_nav_menus in a block theme is a dead API — probed empirically)

Skill is doing its job. One recommendation from the classifier: add an "absence-of-skip-link" probe for accessibility that didn't fire this pilot.

Cross-pilot state

9 pilots, 5 reruns, 21 amendments, 0 rule-text regressions. Ecosystem matrix is now feature-complete across the evaluation set (core × woocommerce × block-theme). The harness works on all three plugin kinds without further scaffolding.

Improvements shipped between Pilot 8 and Pilot 9 (hypotheses_status + /meta-review) both validated on this pilot. Both need minor tightening (I1: JSON template in invocation prompt; I2: 3 new checks). Neither is blocking future pilots.

Recommendation

Ship Improvement 1 tightening: add JSON template to Tester invocation prompt for hypotheses_status so non-compliance drops to 0/6.
Ship Improvement 2 tightening: 3 new meta-review checks (external-resource-failure, content-authoring UX, route-content-depth).
Human review Amendment L — observation-depth on passing routes. Targets 2 of the 6 Pilot 9 misses and generalizes to every plugin with routes/patterns/variations.
Pilot 10 choice: harness is now feature-complete against the evaluation set. Options:
- Evaluate against a real-world plugin (no answer key) to test the amendment set's generalization outside the training-set taxonomy
- Run a second-order escape analysis: replay Pilots 1-9 against the current amendment set to see if hindsight would find planted bugs earlier pilots missed
- Declare feature-complete and ship the harness as-is for real-world use

Artifacts

Final report: runs/2026-04-24T15-15-49_magellan-theme/final-report.md
Escape analysis: runs/2026-04-24T15-15-49_magellan-theme/escape-analysis.md
Meta-review (new artifact): runs/2026-04-24T15-15-49_magellan-theme/coverage-gaps.md
Token usage: runs/2026-04-24T15-15-49_magellan-theme/token-usage.json
Manifest: runs/2026-04-24T15-15-49_magellan-theme/manifest.json
6 session reports: runs/2026-04-24T15-15-49_magellan-theme/sessions/<slug>/report.json
Pilot 8 (WC #2) comparison: https://gist.github.com/alopezari/6ef1e379ed0f69cfe04274fc9d92e37a

alopezari/magellan-theme-pilot-9.md

Select an option

No results found

Select an option

No results found

Magellan Pilot 9 — magellan-theme (first block-theme pilot; first run of Improvements 1 + 2)

TL;DR — feature-complete across the evaluation matrix; meta-review calibration signal

Reliability — PQIP totals

Per-charter PQIP

Recall breakdown against the 10 planted issues

Bonus findings beyond the answer key (17)

Token consumption — aggregate

By pricing category

Token + duration per subagent

Cost efficiency

Amendment firing matrix (21 amendments)

Load-bearing amendments this pilot

Correctly did not fire (5th + 6th zero-fuel pilot respectively for these)

Improvement 1 (`hypotheses_status`) — first-pilot adoption signal

Improvement 2 (`/meta-review`) — first-pilot calibration

2 new amendment candidates (not auto-shipped)

Amendment L — observation-depth on passing routes

Amendment M (draft) — content-authoring realism probe for patterns

block-theme-exploration skill — first validation

Cross-pilot state

Recommendation

Artifacts

alopezari/magellan-theme-pilot-9.md

Magellan Pilot 9 — magellan-theme (first block-theme pilot; first run of Improvements 1 + 2)

TL;DR — feature-complete across the evaluation matrix; meta-review calibration signal

Reliability — PQIP totals

Per-charter PQIP

Recall breakdown against the 10 planted issues

Bonus findings beyond the answer key (17)

Token consumption — aggregate

By pricing category

Token + duration per subagent

Cost efficiency

Amendment firing matrix (21 amendments)

Load-bearing amendments this pilot

Correctly did not fire (5th + 6th zero-fuel pilot respectively for these)

Improvement 1 (hypotheses_status) — first-pilot adoption signal

Improvement 2 (/meta-review) — first-pilot calibration

2 new amendment candidates (not auto-shipped)

Amendment L — observation-depth on passing routes

Amendment M (draft) — content-authoring realism probe for patterns

block-theme-exploration skill — first validation

Cross-pilot state

Recommendation

Artifacts

Improvement 1 (`hypotheses_status`) — first-pilot adoption signal

Improvement 2 (`/meta-review`) — first-pilot calibration