Magellan pilot — magellan-seo-toolkit Pilot 4 rerun (10/10 after 6 amendments + 2 reinforcements; 4/10 → 10/10)

Magellan pilot — magellan-seo-toolkit Pilot 4 rerun (escape-analysis amendment validation)

Run ID: 2026-04-24T08-50-22_magellan-seo-toolkit-rerun
Plugin: magellan-seo-toolkit v1.0.0 (same plugin as Pilot 4 orig)
Purpose: Validate that the 6 new amendments + 2 reinforcements shipped in commit 925e2ef (proposed by /escape-analysis) close the 6 misses observed in original Pilot 4.
Driver: Chrome DevTools MCP with --experimental-page-id-routing
Dispatch: Concurrent Tester wave (6 charters, Sonnet-default)
Wallclock: 59m 22s (includes one killed wave + retry; see "Operational learning")


TL;DR

Recall 10/10 (up from 4/10 in Pilot 4 orig). +6 issues recovered, 0 regressions. All 6 new amendments fired, both reinforcements fired. Escape-analysis loop now validated on three plugins, three perfect reruns.

| Pilot | Model | Original | Amendments shipped | Rerun |
| --- | --- | --- | --- | --- |
| Pilot 2 (contact-forms) | Opus | 7/10 | 4 (visual / a11y / charset / sibling-propagation) | 10/10 |
| Pilot 3 (members) | Opus | 5/10 | 5 (empty-state / absence / plugin-native / cross-feature / UI-path) | 10/10 |
| Pilot 4 (seo-toolkit) | Sonnet | 4/10 | 6 + 2 reinforcements (counters / state-variety / enumerate / unsaved-work / two-tab / view-source) | 10/10 |

Reliability — PQIP totals

10/10 planted + 17 bonus findings. 27 Problems, 4 Questions, 9 Improvements, 10 Praises.

| Charter | Problems | Questions | Improvements | Praises | Planted caught |
| --- | --- | --- | --- | --- | --- |
| robots-txt-deadui | 2 | 0 | 0 | 1 | 1/1 (#6) |
| breadcrumbs-jsonld | 3 | 0 | 0 | 1 | 1/1 (#8) |
| meta-box-and-title-override | 7 | 1 | 3 | 2 | 2/2 (#9, #10) |
| redirects-manager | 6 | 1 | 3 | 2 | 2/2 (#5, #7) + #4 as Question |
| sitemap-and-scale | 4 | 1 | 2 | 3 | 1/1 (#3) |
| seo-score-and-dashboard | 5 | 1 | 1 | 1 | 2/2 (#1, #2 via adjacent) |
| Totals | 27 | 4 | 9 | 10 | 10/10 |

Severity mix: 2 critical, 14 major, 11 minor, 0 trivial Problems.

Before/after on the 6 previously-missed issues

| # | Planted issue | Pilot 4 orig | Pilot 4 rerun | Amendment fired |
| --- | --- | --- | --- | --- |
| 2 | Char count includes HTML entities | ✗ MISSED | caught-exact via adjacent-symptom (multi-byte strlen/mb_strlen in same counter) | A: inline counters |
| 3 | Sitemap ships password-protected posts | ✗ MISSED | caught-exact as major "privacy leak" | B: seed state variety |
| 4 | Redirects strip query strings | ✗ MISSED | caught-bundled (filed as Question; classification drift) | C: enumerate root-cause surface |
| 6 | Robots.txt no beforeunload | ✗ MISSED | caught-exact, spec-for-bug firing | D: unsaved-work |
| 9 | SEO meta last-write-wins | ✗ MISSED | caught-exact via two-tab probe | E: two-tab concurrent |
| 10 | Single-quote og:title | ✗ MISSED | caught-exact via view-source of raw HTML | F: view-source HTML |

All 4 previously-caught issues stayed caught: #1 (green badge), #5 (+2 counter), #7 (open redirect + bonus javascript: URI), #8 (invalid JSON-LD).
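
Issue #2 and the adjacent symptom that caught it both come down to what a character counter actually measures. The report mentions strlen/mb_strlen, so the plugin is presumably PHP; the sketch below uses Python only to illustrate the distinction, and `visible_length` is a hypothetical helper, not the plugin's code.

```python
import html

def visible_length(raw: str) -> int:
    """Count characters as a reader sees them: decode HTML entities first."""
    return len(html.unescape(raw))

title = "Caf\u00e9 &amp; Bar"  # renders as "Café & Bar"

raw_chars = len(title)                  # counts "&amp;" as 5 characters
raw_bytes = len(title.encode("utf-8"))  # byte count, what a byte-oriented strlen reports
shown = visible_length(title)           # what the reader sees

print(raw_chars, raw_bytes, shown)      # 14 15 10
```

A counter that reports 14 or 15 for a 10-character title would show the wrong limit warning, which is the kind of mismatch Amendment A is meant to surface.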


Token consumption — aggregate

Raw token totals for the whole run, broken down by agent role and by category. Captured by scripts/capture-run-tokens.mjs parsing Claude Code transcripts.

By agent role

| Metric | Manager (Opus 4.7) | Testers (Sonnet 4.6) | Total |
| --- | --- | --- | --- |
| Messages | 191 | 1,015 | 1,206 |
| Fresh input | 396 | 7,303 | 7,699 |
| Output | 185,259 | 126,162 | 311,421 |
| Cache-create 5m | 0 | 2,178,761 | 2,178,761 |
| Cache-create 1h | 490,046 | 0 | 490,046 |
| Cache-read | 24,202,828 | 77,660,601 | 101,863,429 |
| Total tokens | 24,878,529 | 79,972,827 | 104,851,356 |
| Cost | $21.64 | $33.38 | $55.02 |

Observations:

  • Testers push ~3.2× the Manager's tokens (79.97M vs 24.88M) across ~5.3× the messages (1,015 vs 191); exploration work dominates, as expected.
  • Manager uses cache-create 1h (long-lived per-run Manager context). Testers use cache-create 5m (short-lived per-Tester-session context). That's exactly what you want — the Manager's prompt stays stable across all charters so the 1h cache amortizes; each Tester is ephemeral so 5m is enough.
  • 97.2% of total tokens are cache-read — prompt caching is doing the heavy lifting on cost.
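
The ratios behind these observations can be recomputed from the by-role table. This is a small sketch using the reported figures; the dictionary layout and names are illustrative and do not reflect the format of scripts/capture-run-tokens.mjs.

```python
# Token counts per role, copied from the by-role table above.
manager = {"fresh_input": 396, "output": 185_259,
           "cache_create_5m": 0, "cache_create_1h": 490_046,
           "cache_read": 24_202_828}
testers = {"fresh_input": 7_303, "output": 126_162,
           "cache_create_5m": 2_178_761, "cache_create_1h": 0,
           "cache_read": 77_660_601}
messages = {"manager": 191, "testers": 1_015}

manager_total = sum(manager.values())
testers_total = sum(testers.values())

token_ratio = testers_total / manager_total
message_ratio = messages["testers"] / messages["manager"]
cache_read_share = (manager["cache_read"] + testers["cache_read"]) / (
    manager_total + testers_total)

print(f"tokens: {token_ratio:.1f}x, messages: {message_ratio:.1f}x, "
      f"cache-read share: {cache_read_share:.1%}")
```

With the reported numbers, the token ratio is about 3.2×, the message ratio about 5.3×, and the cache-read share 97.2%.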

By pricing category

| Category | Tokens | Cost | % of cost |
| --- | --- | --- | --- |
| Cache-read | 101,863,429 | $35.40 | 64.3% |
| Cache-create 5m | 2,178,761 | $8.17 | 14.9% |
| Output | 311,421 | $6.52 | 11.9% |
| Cache-create 1h | 490,046 | $4.90 | 8.9% |
| Fresh input | 7,699 | $0.02 | 0.0% |

Token + duration per Tester (6 successful sessions)

Each Tester session produces one report.json + screenshots + console-logs. Ordered by charter slug.

| Charter | Mode | Duration | Tool uses | Msgs | Input | Output | Cache-create 5m | Cache-read | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| robots-txt-deadui | foreground | 4m 21s | 46 | 68 | 74 | 5,463 | 295,405 | 5,086,283 | $2.72 |
| breadcrumbs-jsonld | foreground | 4m 52s | 47 | 72 | 6,288 | 10,320 | 139,068 | 5,005,247 | $2.20 |
| meta-box-and-title-override | background | 8m 45s | 83 | 131 | 137 | 17,546 | 251,280 | 10,979,504 | $4.50 |
| redirects-manager | background | 8m 22s | 78 | 121 | 127 | 17,275 | 153,055 | 9,841,943 | $3.79 |
| sitemap-and-scale | background | 10m 26s | 64 | 82 | 88 | 14,583 | 192,169 | 6,862,495 | $3.00 |
| seo-score-and-dashboard | background | 8m 21s | 64 | 97 | 103 | 17,262 | 165,927 | 8,291,875 | $3.37 |
| Productive totals | | 45m 07s serial | 382 | 571 | 6,817 | 82,449 | 1,196,904 | 46,067,347 | $19.58 |

Concurrent-wave compression: the 4 background Testers launched 09:40:05–09:40:21 and finished 09:48:27–09:51:36, so ~11 min of wallclock covered ~36 min of sequential work (3.3× compression). The 2 foreground Testers ran sequentially and gained nothing (9m 13s of work took 9m 13s of wallclock). Productive wallclock (successful Testers only, ignoring the killed first wave): ~20 min.
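
As a cross-check, the compression ratio can be recomputed from the reported launch and finish times and the four background durations in the table. This sketch takes wallclock as first launch to last finish, which is an assumption about how the figure was derived. With exact values the ratio comes out near 3.1×; the 3.3× above corresponds to the rounded 36 min / 11 min.

```python
from datetime import datetime, timedelta

def clock(hms: str) -> datetime:
    """Parse an HH:MM:SS wall-clock time (same day assumed)."""
    return datetime.strptime(hms, "%H:%M:%S")

# Durations of the four background Testers, from the per-Tester table.
background = [
    timedelta(minutes=8, seconds=45),   # meta-box-and-title-override
    timedelta(minutes=8, seconds=22),   # redirects-manager
    timedelta(minutes=10, seconds=26),  # sitemap-and-scale
    timedelta(minutes=8, seconds=21),   # seo-score-and-dashboard
]

wallclock = clock("09:51:36") - clock("09:40:05")  # first launch to last finish
serial = sum(background, timedelta())              # cost if run one after another
compression = serial / wallclock                   # timedelta / timedelta gives a float

print(wallclock, serial, round(compression, 1))
```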

Notable per-Tester outliers:

  • sitemap-and-scale has the longest duration (10m 26s) but fewer messages (82) — it's doing heavy bash work (seeding 10k posts, curl, xmllint) rather than LLM turns.
  • breadcrumbs-jsonld is the cheapest Tester ($2.20) despite making 47 tool calls — relatively short exploration; anchored on 3 Problems with high certainty.
  • meta-box-and-title-override is the most expensive ($4.50): it filed 7 Problems plus 6 other PQIP items (1 Question, 3 Improvements, 2 Praises), and the two-tab concurrent probe and the view-source probe both required extra turns.

Productive vs wasted cost

Total cost was $55.02. Of that, only the 6 sessions that produced report.json files are productive.

| Bucket | Cost | Note |
| --- | --- | --- |
| Manager | $21.64 | one Manager run across all activity |
| Productive Testers (6 successful) | $19.58 | |
| Wasted Testers (6 killed + 1 cancelled) | $13.82 | first wave watchdog reap + one foreground cancellation |
| Productive total | $41.22 | Manager + 6 successful Testers (what a clean run would cost) |
| Actual cost paid | $55.02 | includes all waste |

Wastage: 25% of the run — operational mistake, not harness cost. In a clean run this goes to $0.
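
The bucket arithmetic is simple enough to restate as a sketch using only the reported dollar figures. Note that the three buckets sum to $55.04, two cents above the reported total, presumably from per-bucket rounding.

```python
# Reported cost buckets in USD.
manager = 21.64
productive_testers = 19.58
wasted_testers = 13.82
actual_cost = 55.02

productive_total = round(manager + productive_testers, 2)  # cost of a clean run
wastage_share = wasted_testers / actual_cost                # fraction lost to the killed wave

print(f"productive: ${productive_total:.2f}, wastage: {wastage_share:.0%}")
```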

Cost-per-planted-bug

| Denominator | Value |
| --- | --- |
| Actual / planted caught | $5.50 per bug (10 planted caught) |
| Productive-only / planted caught | $4.12 per bug |
| Actual / all Problems filed | $2.04 per Problem (27 Problems) |
| Productive / all Problems | $1.53 per Problem |
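
The four figures are the two cost totals divided by the two finding counts. A minimal sketch, with `cost_per` as an illustrative helper name:

```python
def cost_per(total_cost: float, count: int) -> float:
    """Cost per finding in USD, rounded to cents."""
    return round(total_cost / count, 2)

ACTUAL, PRODUCTIVE = 55.02, 41.22   # USD, from the cost buckets
PLANTED, PROBLEMS = 10, 27          # planted bugs caught, Problems filed

per_bug_actual = cost_per(ACTUAL, PLANTED)
per_bug_productive = cost_per(PRODUCTIVE, PLANTED)
per_problem_actual = cost_per(ACTUAL, PROBLEMS)
per_problem_productive = cost_per(PRODUCTIVE, PROBLEMS)

print(per_bug_actual, per_bug_productive,
      per_problem_actual, per_problem_productive)  # 5.5 4.12 2.04 1.53
```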

For comparison: the Pilot 3 rerun on Opus cost ~$7 per planted bug, and the Pilot 3 Sonnet rerun ~$1.83. The Pilot 4 Sonnet rerun, at $4.12 productive, sits between the two. It is higher than the members rerun because seo-toolkit involves more rendering work (view-source, multi-tab, sitemap at scale) than members' tighter CRUD surface.


Operational learning — parent-idle watchdog reaps concurrent waves

First dispatch was a 6-agent concurrent background wave. All 6 were killed simultaneously by Claude Code's stream watchdog at 600s. Partial outputs showed each Tester was actively making amendment-driven findings (apostrophe-break og:title, two-tab concurrent edit, open redirect confirmation) when reaped.

Root cause was NOT a harness issue. The laptop slept while the user was at breakfast → no parent-session token stream → watchdog "Agent stalled: no progress for 600s" on all 6 background agents. The Testers themselves were fine; the parent Claude Code session that spawned them went silent.

Re-dispatch (with laptop kept awake) went clean. Saved to user memory (feedback_background_agents_laptop_sleep.md) for future pilots.


Amendment firing matrix

| Amendment | Intended target | Fired? | Where | Effective? |
| --- | --- | --- | --- | --- |
| A: inline counters | Issue 2 | yes | seo-score-and-dashboard (multi-byte strlen, adjacent symptom), meta-box-and-title-override | |
| B: state variety | Issue 3 | yes | sitemap-and-scale | ✓✓ spec-for-bug |
| C: enumerate root-cause | Issue 4 | ~ | redirects-manager; query strings filed as Question, not Problem (drift) | ~ partial |
| D: unsaved-work | Issue 6 | yes | robots-txt-deadui | ✓✓ spec-for-bug |
| E: two-tab concurrent | Issue 9 | yes | meta-box-and-title-override | ✓✓ spec-for-bug |
| F: view-source HTML | Issue 10 | yes | meta-box-and-title-override, breadcrumbs-jsonld | |
| Reinforce 5 (empty-state) | coverage | yes | sitemap-and-scale + 4 others via coverage-note strings | |
| Reinforce 8 (cross-feature) | coverage | yes | redirects-manager + cross-charter pattern | |

3 spec-for-bug clean fires (B, D, E), 3 clean-with-drift fires (A, C, F), both reinforcements visible in coverage notes. Amendment C needs tightening — UTM-parameter-stripping should default to Problem, not Question.
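
To make the Amendment C expectation concrete, here is a sketch of the behaviour a Tester would be checking for: a redirect that carries the incoming query string (for example UTM parameters) through to the destination. `redirect_target` is a hypothetical function for illustration, not the plugin's implementation, and merging the destination's own query first is an assumed policy.

```python
from urllib.parse import urlsplit, urlunsplit

def redirect_target(request_url: str, destination: str) -> str:
    """Build the redirect Location, preserving the request's query string."""
    req = urlsplit(request_url)
    dest = urlsplit(destination)
    # Keep the destination's own query first, then append the incoming one.
    merged = "&".join(q for q in (dest.query, req.query) if q)
    return urlunsplit((dest.scheme, dest.netloc, dest.path, merged, dest.fragment))

print(redirect_target("https://example.test/old?utm_source=news",
                      "https://example.test/new"))
# https://example.test/new?utm_source=news
```

A redirect that returns https://example.test/new here has silently dropped the campaign parameter, which is the user-visible breakage the tightened guidance would classify as a Problem.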


Sonnet validation across two plugin shapes

| Plugin shape | Original (Sonnet) | Amendments | Rerun (Sonnet) |
| --- | --- | --- | --- |
| members (role / CRUD / restriction) | 5/10 | 9 cumulative | 10/10 |
| seo-toolkit (metadata / sitemap / redirects / rendering) | 4/10 | 15 cumulative | 10/10 |

Two very different plugin taxonomies. Both 10/10 on Sonnet-default with the amended harness. Sonnet stays as default. Opus remains available via per-charter model: opus for specifically high-stakes charters.


Residual polish (not load-bearing)

From the classifier's notes:

  1. Tighten Amendment C's Question-vs-Problem guidance — "feature silently drops common parameter" (UTM on redirects, _fbclid on links) is user-visible breakage and should default to Problem, not Question.
  2. Strengthen Amendment F to mandate BOTH symptom seeding AND source inspection: the Tester view-sourced and found the single-quote pattern but didn't seed "It's a test" to confirm a live exploit. Two independent evidence streams are more robust than one.

Neither is urgent. Neither changes the 10/10 outcome.
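
Polish item 2 asks for a seeded symptom alongside source inspection. The sketch below shows why the seed "It's a test" is a useful probe. Both template functions are hypothetical stand-ins for the single-quoted og:title pattern described in the report, not the plugin's actual output code.

```python
import html

def og_title_tag(title: str) -> str:
    # Assumed vulnerable pattern: single-quoted attribute, value not escaped.
    return f"<meta property='og:title' content='{title}' />"

def og_title_tag_safe(title: str) -> str:
    # html.escape with quote=True also escapes ' and ", so the attribute stays intact.
    return f"<meta property='og:title' content='{html.escape(title, quote=True)}' />"

seed = "It's a test"
broken = og_title_tag(seed)     # attribute value ends early at the apostrophe
safe = og_title_tag_safe(seed)

print(broken)
print(safe)
```

Viewing source on the first output shows the content attribute terminating after "It", which is the second evidence stream the amendment would require.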


Loop shape

pilot → /escape-analysis → human reviews proposals
     → commit amendments → re-run → /escape-analysis (validation pass)
     → confirm 10/10 → log to docs/harness-retrospectives.md → next pilot

End-to-end automation: 5 shell commands + Agent dispatch for 7 subagents (6 Testers + 1 classifier). Human review is one paragraph of rule text per miss; ~15 minutes of reading.


Next steps

  • Pilot 5 on a further-distinct plugin shape. Candidates: magellan-pay (payment gateway — new ecosystem), magellan-gallery (media + uploads — file handling), magellan-speed (caching + perf — SFDPOT-Time surface). Pick the one most-unlike prior pilots.
  • Doc fix shipped in this same commit: .claude/commands/test-plugin.md Step 4.6 previously said "For MVP, run sequentially" — corrected to wave-dispatch guidance with enabling-commit citations and the parent-idle caveat.

Artifacts

  • Final report: runs/2026-04-24T08-50-22_magellan-seo-toolkit-rerun/final-report.md
  • Escape analysis (classifier): runs/2026-04-24T08-50-22_magellan-seo-toolkit-rerun/escape-analysis.md
  • Token usage (full detail): runs/2026-04-24T08-50-22_magellan-seo-toolkit-rerun/token-usage.json
  • Manifest: runs/2026-04-24T08-50-22_magellan-seo-toolkit-rerun/manifest.json
  • 6 session reports: runs/2026-04-24T08-50-22_magellan-seo-toolkit-rerun/sessions/<slug>/report.json
  • Pilot 4 orig gist (for comparison): https://gist.github.com/alopezari/8ae6c22a3aec5a8ac67d3a02c3eb47ef