Skip to content

Instantly share code, notes, and snippets.

@alopezari
Last active April 27, 2026 10:23
Show Gist options
  • Select an option

  • Save alopezari/3e90dcd482be890a8e7f8125eba5aa92 to your computer and use it in GitHub Desktop.

Select an option

Save alopezari/3e90dcd482be890a8e7f8125eba5aa92 to your computer and use it in GitHub Desktop.
Magellan Pilot 11 — magellan-backups (Item E first run; 10/10 clean regression-test pass; 5 process-level observations now shipped)

Magellan Pilot 11 — magellan-backups (Item E first run; 10/10 clean regression-test pass; 5 process-level observations for next harness cycle)

Magellan Pilot 11 — magellan-backups (Item E first run; clean 10/10 regression-test pass)

Run ID: 2026-04-27T08-33-36_magellan-backups Plugin: magellan-backups (8 files, 6 PHP, ~330 LoC) — backup/restore/export with cron schedule Kind: plugin Ecosystem: core Driver: Chrome DevTools MCP (headed) Source path: /tmp/magellan-plugin-src/magellan-backups/ — extracted from tests/plugins/packages/magellan-backups.zip, ISSUES.md stripped before reading (Amendment J discipline) Dispatch: 1 recon + 7 wave Testers (auto-dispatched per Item D) + 1 supplementary post-meta-review (Sonnet-default, blind greybox) Wallclock: ~36 min wave + 23 min supplementary Total cost: $102.90


TL;DR — 10/10 recall on the plugin where the harness was born

  • Recall 10/10. First clean pass on magellan-backups in run history. 8 caught-exact, 1 caught-semantically (Issue 8 — same root cause framed as "no transaction" instead of "no integrity check before destroy"), 1 caught-bundled (Issue 10 — sibling-propagation note inside an export-side Problem rather than a discrete backup-side entry).
  • First run of the Item E architecture (recon always runs; static analysis is the conditional bonus when source_path is readable). Recon ran in a background subagent while Manager did static analysis foreground; ~3 min wallclock saved vs serial. Architecture validated.
  • Pilot 10 c2 reinforcement fired cleanly on Issue 9 (the issue that motivated it). Tester used the mandated coverage-note literal verbatim and Issue 9 is now caught-exact at major severity. Chronic miss across two prior runs is now closed. The cleanest validation of the c2 amendment in project history — same plugin where the rule was originally motivated, now firing on the same issue that previously escaped.
  • 0 misses, 0 amendments proposed. This run is a regression-test pass for the magellan-backups plugin under the current amendment set. Every previously-motivated rule that targets a magellan-backups-class bug fired correctly on its motivating issue.

Cross-pilot arc:

Pilot Shape Model Recall Notes
1 (backups) artifact Opus 10/10 Pilot zero — answer-key calibration
2 (contact-forms) form/email Opus 7/10→10/10
3 (members) role/restriction Opus 5/10→10/10
4 (seo-toolkit) metadata Sonnet 4/10→10/10
5 (pay, WC #1) WC gateway Sonnet 8/10→9/10
6 (gallery) file/media Sonnet 8/10
7 (speed) caching/Time Sonnet 8/10
8 (checkout-editor, WC #2) WC field editor Sonnet 6/10
9 (theme, block-theme) block theme Sonnet 4/10 new kind + Imp 1/2 first run
11 (backups, regression) artifact Sonnet 10/10 Item E first run + c2 amendment validation

Reliability — PQIP totals

10/10 planted + 15 bonus findings. 25 Problems, 7 Questions, 12 Improvements, 11 Praises.

Severity: 8 critical, 14 major, 3 minor.

Per-charter PQIP

Charter Priority Type P Q I ! Status
backup-artifact-andlist critical andlist 4 1 3 3 complete
restore-andlist critical andlist 5 0 1 1 complete
export-andlist critical andlist 4 1 1 1 complete
public-accessibility critical cross-feature 1 1 1 1 complete
schedule-and-email high hypothesis-cluster 2 1 2 1 complete
restore-data-integrity high hypothesis-cluster failed (sandbox-killed)
backup-extras high hypothesis-cluster 2 2 2 2 complete
cross-feature-lifecycle medium cross-feature parked (not dispatched)
backup-roundtrip-and-scale (supplementary) high andlist 7 1 0 1 complete
Totals 25 7 12 11 7 complete + 1 failed + 1 parked

Recall breakdown against the 10 planted issues

# Planted issue Verdict Matched to
1 Progress bar always shows 100% caught-exact backup-artifact-andlist + backup-extras (double)
2 Schedule time format mismatch (12h↔24h) caught-exact backup-artifact-andlist + schedule-and-email
3 Notification email empty recipient (option-key typo) caught-exact 3 Testers; double wp_mail() bonus on supplementary
4 User export includes password hashes caught-exact (3 Testers) export-andlist + public-accessibility + supplementary; bonus: session_tokens, auth_key, nonce_key, secure_auth_key
5 Uploads directory missing from backup caught-exact supplementary (Amendment K default-blast-radius literal fired)
6 No pre-restore safety backup caught-exact supplementary (destructive-op AND-list b2 anchor fired)
7 Backups publicly accessible via URL caught-exact (3 Testers) backup-artifact-andlist + public-accessibility + export-andlist
8 Corrupt restore truncates DB caught-semantically supplementary — framed as "no transaction" instead of "no integrity check before destroy"; same source line + same reproducer
9 Large DB causes memory exhaustion (OOM) caught-exact supplementary — c2 source-pattern fallback literal fired verbatim
10 Concurrent backups corrupt zip file caught-bundled backup-extras — sibling-propagation note inside export-side Problem

Bonus findings beyond the answer key (15)

  • CRITICAL restore-andlist: zip-slip path traversal confirmed (caught independently after restore-data-integrity was sandbox-killed — same probe via different entry point)
  • CRITICAL restore-andlist: arbitrary SQL execution via uploaded zip's database.sql (no validation, fed straight to $wpdb->query())
  • CRITICAL supplementary: WP secret keys (auth_key, secure_auth_key, nonce_key) leaked alongside password hashes via unfiltered SELECT * FROM wp_options
  • CRITICAL export-andlist: session tokens with active expiration timestamps leaked in users export
  • MAJOR schedule-and-email: activation unconditionally schedules cron — ignores mb_schedule_enabled flag (not in static-analysis hypotheses)
  • MAJOR schedule-and-email: double wp_mail() call — one conditional (correct), one unconditional at line 74; fires to empty $to recipient when option is empty
  • MAJOR supplementary: postmeta NEVER exported via "Posts" selective export (export_table($wpdb->posts, "post_type='post'") skips wp_postmeta entirely)
  • MAJOR backup-extras: filename collision via minute-precision date('Y-m-d-Hi') + ZipArchive::OVERWRITE
  • MAJOR backup-extras: hardcoded "100%" progress at idle (independent confirmation of Issue 1)
  • MINOR: menu-path divergence (admin.php vs tools.php both route, but assets only enqueue for tools.php)

Token consumption — aggregate

Manager (Opus 4.7) Subagents (Sonnet 4.6) Total
Agents 1 10 (1 recon + 8 Testers + 1 meta-review) 11
Messages 144 960 1,104
Cost $70.81 $32.08 $102.90

By pricing category

Category Cost %
Cache-read $84.39 82.0%
Cache-create 5m $8.34 8.1%
Output $6.73 6.5%
Cache-create 1h $3.38 3.3%
Fresh input $0.05 0.0%

Higher cost than Pilot 9 ($69.69 → $102.90, +47%): this run is the 11th pilot in a long-lived Claude Code session (main-context-growth effect), and added a supplementary Tester + meta-review subagent that prior pilots didn't include. A fresh Claude Code session for Pilot 11 only would likely drop Manager cost ~$25-30.


Token + duration per subagent

Session Role Duration Msgs Cost
recon scout ~5m 54 $2.23
backup-artifact-andlist Tester ~16m 143 $4.68
restore-andlist Tester ~17m 164 $4.72
export-andlist Tester ~13m 111 $3.29
public-accessibility Tester ~10m 95 $2.45
schedule-and-email Tester ~9m 90 $2.34
restore-data-integrity Tester (killed) 92 $2.45
backup-extras Tester ~9m 95 $2.57
backup-roundtrip-and-scale (supplementary) Tester ~23m 116 $4.69

Item E architecture — first run validation

The change: when source_path is readable, recon and static analysis run in parallel rather than serial. Recon dispatches as a background subagent while the Manager reads source files foreground; both outputs feed into Phase 3 charter generation.

What happened: dispatch worked first time. Recon completed at ~5m wallclock; static analysis took ~6m foreground. Wave then dispatched at T+8m. Vs. serial dispatch (recon → static-analysis → wave), this saves ~3m per run.

No regressions: the surface-map → hot-hypothesis parity check (added in commit 887ae88) fired correctly. All 22 numbered hot hypotheses had hypothesis anchors in the surface map, no silent dropouts.

Verdict: ship.


What we learned that needs harness improvement

The escape-analysis classifier reported 0 amendments because all 10 issues were caught. But this run surfaced process-level issues that the answer-key view doesn't see. These are observations for docs/harness-retrospectives.md, not bug-class amendments.

Observation 1 — restore-data-integrity killed by Claude Code security sandbox during zip-slip probe

What happened: charter hypothesis 1 prescribed building a malicious zip at /tmp/mb-zip-slip.zip containing an entry wp-content/../../tmp/MB_PWNED.txt, then verifying whether /tmp/MB_PWNED.txt was created post-upload. Both the zip-build and the verification path cross outside the magellan workspace into /tmp/. Claude Code's security sandbox interpreted "agent writes to /tmp/" as a boundary violation and killed the agent before the report could be written. No report.json, no console-logs, empty screenshots dir.

Why it escaped the existing rules: charter authors writing path-traversal probes naturally reach for /tmp/PWNED because it's the canonical exploit demonstration. The rule the harness needs is "use a workspace-relative target path" — same bug class is confirmed by ANY landing outside the prescribed extraction directory. The malicious zip can land its escape file at wp-content/../zip-slip-target.txt (resolves to WP root, still outside wp-content, still proves the bug) without crossing the harness sandbox.

Proposed amendment (would close this miss class on any plugin):

Sandbox-aware probe path discipline. When a charter probe crosses normal workspace boundaries — path traversal, file uploads to system paths, restoring zip contents to absolute paths — the assertion target MUST be workspace-relative.

Why: Claude Code's security sandbox kills agents that write outside the workspace. The bug class (zip-slip, directory escape, unsanitized extraction) is confirmed by ANY landing outside the prescribed directory; the canonical /tmp/PWNED example is convenient but unnecessary.

How to apply:

  • For zip-slip / path-traversal: target wp-content/../<probe-marker> (resolves to WP root) or runs/<id>/sessions/<slug>/escape-target.txt — both inside the workspace, both prove the same vulnerability.
  • For file-write probes: target runs/<id>/sessions/<slug>/probes/.
  • For abstraction: any probe that asserts a write should ground that write inside the run directory.

Compensating coverage: restore-andlist independently confirmed zip-slip via a different entry point. The territory wasn't lost, but the supplementary Tester had to compensate. Net cost: ~1 sandbox-killed Tester ($2.45) + 1 supplementary dispatch ($4.69) for what should have been a single ~$3 charter.

Where to file: skills/tester-mindset/SKILL.md — new section "Probe path discipline" or add to existing "Probe-tool selection" guidance.

Observation 2 — Tester CWD confusion (writes to bootstrap CWD instead of magellan repo)

What happened: 6 of 7 wave Testers wrote their report.json files to /Users/alopezari/Downloads/files/wp-ai-tester-v4/runs/<run-id>/sessions/<slug>/report.json instead of /Users/alopezari/Projects/automattic/magellan/runs/.... This is the Claude Code shell's bootstrap CWD, not the project directory. Manual rsync was needed post-wave to align files. The supplementary Tester (which I dispatched with absolute magellan paths spelled out in the prompt) wrote correctly the first time.

Why it happens: Tester subagent prompts pass the run directory as a relative path (runs/<run-id>/...) which resolves against the subagent's bootstrap CWD, not the Manager's CWD. The shell cd workaround in scripts/studio-provision.sh doesn't propagate to subagent dispatch because subagents start a fresh shell.

Proposed amendment: Manager dispatch prompts must include the absolute path for every write target. The Tester subagent definition (.claude/agents/tester.md) should reject relative paths and resolve them against an explicit RUN_DIR env var the Manager passes.

Where to file: .claude/agents/tester.md "Where to write outputs" section — change relative-path examples to absolute and add a forcing-function statement.

Observation 3 — Schema enum drift in hypotheses_status

What happened: every Tester in the wave wrote hypotheses_status array entries with natural-language verdicts ("false", "not-a-bug", "not-confirmed", descriptive prose strings) instead of the schema enum (confirmed-bug | refuted | inconclusive | pass | na). All 6 reports failed schema validation. Patched via jq mapping after the fact. The supplementary Tester used the enum correctly because I spelled it out in its prompt.

Why it happens: the Tester prompt mentions "include a hypotheses_status array" but doesn't enumerate the allowed enum values. Sonnet 4.6 fills the field with whatever feels natural without checking the schema.

Proposed amendment: Tester subagent prompt must list the enum values verbatim, with examples. Add to .claude/agents/tester.md Output schema section: "status MUST be exactly one of: confirmed-bug, refuted, inconclusive, pass, na. Natural-language verdicts (e.g., false, not-a-bug) will fail validation and require manual repatch."

Where to file: .claude/agents/tester.md.

Observation 4 — Tester budget overrun (charter max_turns not honored)

What happened: charters specified max_turns: 8 (hypothesis-cluster) and max_turns: 12 (andlist). Testers ran 90-164 messages each (4-12× the budget). The Item C wallclock-reduction promise didn't materialize because Testers don't honor max_turns from charter frontmatter — they use the Tester subagent's default 30-turn budget regardless.

Why it happens: the Tester subagent definition reads budget.max_turns only as advisory, not enforced. There's no kill-switch.

Proposed amendment: either (a) make budget.max_turns enforced (Tester counts its own turns and aborts at the cap with a partial status), OR (b) drop the charter-level max_turns field and standardize on the subagent default. Status quo (advisory but unenforced) is the worst of both.

Where to file: .claude/agents/tester.md and the charter schema.

Observation 5 — Charter scope drift (backup-artifact-andlist did not enumerate a1-a6)

What happened: the charter's mission was a1-a6 enumeration of zip CONTENTS (open the zip, list what's inside, classify each entry against the answer-list). The Tester filed Problems from other charters' territories (public accessibility, schedule UI bugs) instead of doing the enumeration. The supplementary charter backup-roundtrip-and-scale had to re-target the missed work.

Why it happens: andlist charters are a relatively new charter type; the Tester prompt for andlist doesn't include a forcing-function reminder that "every numbered anchor must produce a Y/N verdict in your report." Testers default to filing whatever they see, regardless of scope.

Proposed amendment: andlist charter prompts must include a literal forcing function: "Your report MUST include a section listing each anchor a1-aN with a Y/N verdict + cited evidence. Off-charter Problems may be filed but do NOT substitute for anchor enumeration." Validate at report-validation time.

Where to file: skills/sbtm/SKILL.md (charter-type definitions) + .claude/agents/tester.md (andlist-specific section).


Browser permission dialog auto-dismiss — shipped

User flagged during this run: "I think the bottlenecks are some dialogs from the browser that I had to approve manually because the agents couldn't do it." Headed Chrome surfaced OS-level permission prompts (Studio HTTPS cert trust, popup blocker, default-browser, "leave site?" beforeunload, translate bar, restore-tabs, password-manager keychain) that the Tester could not dismiss programmatically.

Fix shipped in commit 72aa07a:

  • New scripts/register-mcp-drivers.sh — registers all four MCP drivers (Playwright + Chrome DevTools, headed + headless) with the dialog-suppression flags pre-applied. Same flags for both modes.
  • New "Dialog & permission handling" section in both skills/browser-driver-playwright/SKILL.md and skills/browser-driver-chrome-devtools/SKILL.md documenting what's suppressed and what still requires handle_dialog.
  • scripts/check-deps.sh now detects when registrations are missing the new flags and points to the helper.
  • AGENTS.md updated: replaces inline claude mcp add snippets with helper pointer.

Flags applied (both modes):

  • Studio HTTPS cert: --ignore-https-errors / --acceptInsecureCerts
  • Restore-tabs / saved-state: --isolated
  • Permissions: --grant-permissions clipboard-read clipboard-write notifications geolocation (Playwright) / pre-granted at launch (Chrome DevTools)
  • --no-default-browser-check, --no-first-run, --disable-popup-blocking, --disable-prompt-on-repost, --disable-default-apps, --disable-infobars, --password-store=basic, --use-mock-keychain
  • --disable-features=Translate,DownloadBubble,DownloadBubbleV2,PreloadingPagePreloadHoldback

JS-level dialogs (alert/confirm/prompt/beforeunload) still appear and still need handle_dialog — they're page-level, not browser-level.


Summary

  • 10/10 strict recall — first clean pass on magellan-backups in run history.
  • Item E architecture validated on first run.
  • 0 miss-class amendments — all prior amendments fired correctly on their motivating issues.
  • 5 process-level observations for harness-retrospectives.md: probe-path sandbox discipline, Tester CWD resolution, schema-enum drift, budget non-enforcement, andlist scope drift.
  • Browser dialog auto-dismiss shipped (commit 72aa07a).
  • Total cost: $102.90 — higher than prior pilots due to long-lived Claude Code session main-context growth + supplementary Tester + meta-review subagent additions.

This is a regression-test pass for the magellan-backups plugin under the current amendment set. No corrective action recommended on the bug-class side.

Update — all 5 process observations shipped

The 5 process-level observations identified above were addressed in the same harness-improvement cycle:

  1. Sandbox-aware probe pathsskills/tester-mindset/SKILL.md new section "Probe path discipline — keep destructive probes inside the workspace" with rule + table of right/wrong targets per probe shape (zip-slip, arbitrary write, crafted-zip build, symlink escape).
  2. Tester CWD discipline.claude/agents/tester.md Step 2a (cd to MAGELLAN_REPO first; absolute paths for Read/Write/Edit) + .claude/commands/test-plugin.md dispatch prompt (Manager substitutes absolute MAGELLAN_REPO=$(pwd) literally; both Tester + recon prompts updated).
  3. Schema enum literals.claude/agents/tester.md Step 7 with verbatim enum table for every enum field + a "common drift to avoid" list ("false", "not-a-bug", "not-confirmed", descriptive strings, "high"/"low" severities) + validation as hard pre-condition (must exit 0 before Step 8).
  4. Budget enforcement.claude/agents/tester.md Step 4 with explicit turn-counter, hard-cap stop, "do not start one more cycle past the cap" rule, and a guard against count drift (>1.5× message-count = drifted).
  5. Andlist forcing function.claude/agents/tester.md Step 1 with charter-type-specific rules — andlist anchors a1-aN are first-class hypotheses; off-charter Problems may be filed as bonus but do NOT substitute for anchor enumeration.

Plus the dialog auto-dismiss work that was already shipped (commit 72aa07a).

Validation strategy for the next pilot: re-run magellan-backups (Pilot 12) and check whether (a) Testers write to magellan repo paths first try, (b) node scripts/validate-report.mjs exits 0 on every report with no jq patches needed, (c) turns_used <= max_turns on every report, (d) andlist Tester's hypotheses_status array contains every numbered anchor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment