Magellan Pilot 11 — magellan-backups (Item E first run; 10/10 clean regression-test pass; 5 process-level observations for next harness cycle)

Run ID: 2026-04-27T08-33-36_magellan-backups
Plugin: magellan-backups (8 files, 6 PHP, ~330 LoC); backup/restore/export with cron schedule
Kind: plugin
Ecosystem: core
Driver: Chrome DevTools MCP (headed)
Source path: /tmp/magellan-plugin-src/magellan-backups/ (extracted from tests/plugins/packages/magellan-backups.zip; ISSUES.md stripped before reading, per Amendment J discipline)
Dispatch: 1 recon + 7 wave Testers (auto-dispatched per Item D) + 1 supplementary post-meta-review (Sonnet-default, blind greybox)
Wallclock: ~36 min wave + 23 min supplementary
Total cost: $102.90

TL;DR — 10/10 recall on the plugin where the harness was born

Recall 10/10. First clean pass on magellan-backups in run history. 8 caught-exact, 1 caught-semantically (Issue 8 — same root cause framed as "no transaction" instead of "no integrity check before destroy"), 1 caught-bundled (Issue 10 — sibling-propagation note inside an export-side Problem rather than a discrete backup-side entry).
First run of the Item E architecture (recon always runs; static analysis is the conditional bonus when source_path is readable). Recon ran in a background subagent while Manager did static analysis foreground; ~3 min wallclock saved vs serial. Architecture validated.
Pilot 10 c2 reinforcement fired cleanly on Issue 9 (the issue that motivated it). The Tester used the mandated coverage-note literal verbatim, and Issue 9 is now caught-exact at major severity, which closes a chronic miss across two prior runs. This is the cleanest validation of the c2 amendment in project history: the same plugin where the rule was originally motivated, now firing on the same issue that previously escaped.
0 misses, 0 amendments proposed. This run is a regression-test pass for the magellan-backups plugin under the current amendment set. Every previously-motivated rule that targets a magellan-backups-class bug fired correctly on its motivating issue.

Cross-pilot arc:

Pilot	Shape	Model	Recall	Notes
1 (backups)	artifact	Opus	10/10	Pilot zero — answer-key calibration
2 (contact-forms)	form/email	Opus	7/10→10/10	—
3 (members)	role/restriction	Opus	5/10→10/10	—
4 (seo-toolkit)	metadata	Sonnet	4/10→10/10	—
5 (pay, WC #1)	WC gateway	Sonnet	8/10→9/10	—
6 (gallery)	file/media	Sonnet	8/10	—
7 (speed)	caching/Time	Sonnet	8/10	—
8 (checkout-editor, WC #2)	WC field editor	Sonnet	6/10	—
9 (theme, block-theme)	block theme	Sonnet	4/10	new kind + Imp 1/2 first run
11 (backups, regression)	artifact	Sonnet	10/10	Item E first run + c2 amendment validation

Reliability — PQIP totals

10/10 planted + 15 bonus findings. 25 Problems, 7 Questions, 12 Improvements, 11 Praises.

Severity: 8 critical, 14 major, 3 minor.

Per-charter PQIP

Charter	Priority	Type	Problems (P)	Questions (Q)	Improvements (I)	Praises (!)	Status
backup-artifact-andlist	critical	andlist	4	1	3	3	complete
restore-andlist	critical	andlist	5	0	1	1	complete
export-andlist	critical	andlist	4	1	1	1	complete
public-accessibility	critical	cross-feature	1	1	1	1	complete
schedule-and-email	high	hypothesis-cluster	2	1	2	1	complete
restore-data-integrity	high	hypothesis-cluster	—	—	—	—	failed (sandbox-killed)
backup-extras	high	hypothesis-cluster	2	2	2	2	complete
cross-feature-lifecycle	medium	cross-feature	—	—	—	—	parked (not dispatched)
backup-roundtrip-and-scale (supplementary)	high	andlist	7	1	0	1	complete
Totals			25	7	12	11	7 complete + 1 failed + 1 parked

Recall breakdown against the 10 planted issues

#	Planted issue	Verdict	Matched to
1	Progress bar always shows 100%	caught-exact	backup-artifact-andlist + backup-extras (double)
2	Schedule time format mismatch (12h↔24h)	caught-exact	backup-artifact-andlist + schedule-and-email
3	Notification email empty recipient (option-key typo)	caught-exact	3 Testers; double `wp_mail()` bonus on supplementary
4	User export includes password hashes	caught-exact (3 Testers)	export-andlist + public-accessibility + supplementary; bonus: session_tokens, auth_key, nonce_key, secure_auth_key
5	Uploads directory missing from backup	caught-exact	supplementary (Amendment K default-blast-radius literal fired)
6	No pre-restore safety backup	caught-exact	supplementary (destructive-op AND-list b2 anchor fired)
7	Backups publicly accessible via URL	caught-exact (3 Testers)	backup-artifact-andlist + public-accessibility + export-andlist
8	Corrupt restore truncates DB	caught-semantically	supplementary — framed as "no transaction" instead of "no integrity check before destroy"; same source line + same reproducer
9	Large DB causes memory exhaustion (OOM)	caught-exact	supplementary — c2 source-pattern fallback literal fired verbatim
10	Concurrent backups corrupt zip file	caught-bundled	backup-extras — sibling-propagation note inside export-side Problem

Bonus findings beyond the answer key (15)

CRITICAL restore-andlist: zip-slip path traversal confirmed (caught independently after restore-data-integrity was sandbox-killed — same probe via different entry point)
CRITICAL restore-andlist: arbitrary SQL execution via uploaded zip's database.sql (no validation, fed straight to $wpdb->query())
CRITICAL supplementary: WP secret keys (auth_key, secure_auth_key, nonce_key) leaked alongside password hashes via unfiltered SELECT * FROM wp_options
CRITICAL export-andlist: session tokens with active expiration timestamps leaked in users export
MAJOR schedule-and-email: activation unconditionally schedules cron — ignores mb_schedule_enabled flag (not in static-analysis hypotheses)
MAJOR schedule-and-email: double wp_mail() call — one conditional (correct), one unconditional at line 74; fires to empty $to recipient when option is empty
MAJOR supplementary: postmeta NEVER exported via "Posts" selective export (export_table($wpdb->posts, "post_type='post'") skips wp_postmeta entirely)
MAJOR backup-extras: filename collision via minute-precision date('Y-m-d-Hi') + ZipArchive::OVERWRITE
MAJOR backup-extras: hardcoded "100%" progress at idle (independent confirmation of Issue 1)
MINOR: menu-path divergence (admin.php vs tools.php both route, but assets only enqueue for tools.php)

Token consumption — aggregate

	Manager (Opus 4.7)	Subagents (Sonnet 4.6)	Total
Agents	1	10 (1 recon + 8 Testers + 1 meta-review)	11
Messages	144	960	1,104
Cost	$70.81	$32.08	$102.90

By pricing category

Category	Cost	%
Cache-read	$84.39	82.0%
Cache-create 5m	$8.34	8.1%
Output	$6.73	6.5%
Cache-create 1h	$3.38	3.3%
Fresh input	$0.05	0.0%

Higher cost than Pilot 9 ($69.69 → $102.90, +48%): this run is the 11th pilot in a long-lived Claude Code session (main-context-growth effect), and it added a supplementary Tester and a meta-review subagent that prior pilots didn't include. A fresh Claude Code session for Pilot 11 only would likely drop Manager cost ~$25-30.

Token + duration per subagent

Session	Role	Duration	Msgs	Cost
recon	scout	~5m	54	$2.23
backup-artifact-andlist	Tester	~16m	143	$4.68
restore-andlist	Tester	~17m	164	$4.72
export-andlist	Tester	~13m	111	$3.29
public-accessibility	Tester	~10m	95	$2.45
schedule-and-email	Tester	~9m	90	$2.34
restore-data-integrity	Tester	(killed)	92	$2.45
backup-extras	Tester	~9m	95	$2.57
backup-roundtrip-and-scale (supplementary)	Tester	~23m	116	$4.69

Item E architecture — first run validation

The change: when source_path is readable, recon and static analysis run in parallel rather than serial. Recon dispatches as a background subagent while the Manager reads source files foreground; both outputs feed into Phase 3 charter generation.
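A minimal sketch of this dispatch shape, assuming a thread pool stands in for the background subagent. The function names (`run_recon`, `run_static_analysis`, `gather_phase3_inputs`) and the returned dictionaries are hypothetical illustrations, not the harness's actual code, and "readable" is simplified here to "a source path was supplied".

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def run_recon(plugin: str) -> dict:
    """Hypothetical stand-in for the background recon subagent."""
    time.sleep(0.05)  # simulated browsing work
    return {"surface_map": [f"{plugin}: admin page"]}


def run_static_analysis(source_path: str) -> dict:
    """Hypothetical stand-in for the Manager reading source in the foreground."""
    time.sleep(0.05)  # simulated source reading
    return {"hot_hypotheses": [f"hypothesis from {source_path}"]}


def gather_phase3_inputs(plugin: str, source_path: Optional[str]) -> dict:
    """Run recon in the background; do static analysis meanwhile if possible."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        recon_future = pool.submit(run_recon, plugin)  # recon always runs
        if source_path is not None:
            static = run_static_analysis(source_path)  # conditional bonus
        else:
            static = {"hot_hypotheses": []}
        recon = recon_future.result()  # join before charter generation
    return {**recon, **static}
```

Both outputs are joined before charter generation, so the elapsed time of this step approaches that of the slower task instead of the sum of the two.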

What happened: dispatch worked on the first attempt. Recon completed at ~5m wallclock; static analysis took ~6m in the foreground. The wave then dispatched at T+8m. Compared with serial dispatch (recon → static-analysis → wave), this saves ~3m per run.

No regressions: the surface-map → hot-hypothesis parity check (added in commit 887ae88) fired correctly. All 22 numbered hot hypotheses had hypothesis anchors in the surface map, no silent dropouts.

Verdict: ship.

What we learned that needs harness improvement

The escape-analysis classifier reported 0 amendments because all 10 issues were caught. But this run surfaced process-level issues that the answer-key view doesn't see. These are observations for docs/harness-retrospectives.md, not bug-class amendments.

Observation 1 — `restore-data-integrity` killed by Claude Code security sandbox during zip-slip probe

What happened: charter hypothesis 1 prescribed building a malicious zip at /tmp/mb-zip-slip.zip containing an entry wp-content/../../tmp/MB_PWNED.txt, then verifying whether /tmp/MB_PWNED.txt was created post-upload. Both the zip-build and the verification path cross outside the magellan workspace into /tmp/. Claude Code's security sandbox interpreted "agent writes to /tmp/" as a boundary violation and killed the agent before the report could be written. No report.json, no console-logs, empty screenshots dir.

Why it escaped the existing rules: charter authors writing path-traversal probes naturally reach for /tmp/PWNED because it's the canonical exploit demonstration. The rule the harness needs is "use a workspace-relative target path" — same bug class is confirmed by ANY landing outside the prescribed extraction directory. The malicious zip can land its escape file at wp-content/../zip-slip-target.txt (resolves to WP root, still outside wp-content, still proves the bug) without crossing the harness sandbox.

Proposed amendment (would close this miss class on any plugin):

Sandbox-aware probe path discipline. When a charter probe crosses normal workspace boundaries — path traversal, file uploads to system paths, restoring zip contents to absolute paths — the assertion target MUST be workspace-relative.

Why: Claude Code's security sandbox kills agents that write outside the workspace. The bug class (zip-slip, directory escape, unsanitized extraction) is confirmed by ANY landing outside the prescribed directory; the canonical /tmp/PWNED example is convenient but unnecessary.

How to apply:

For zip-slip / path-traversal: target wp-content/../<probe-marker> (resolves to WP root) or runs/<id>/sessions/<slug>/escape-target.txt — both inside the workspace, both prove the same vulnerability.

For file-write probes: target runs/<id>/sessions/<slug>/probes/.

For abstraction: any probe that asserts a write should ground that write inside the run directory.

Compensating coverage: restore-andlist independently confirmed zip-slip via a different entry point. The territory wasn't lost, but the supplementary Tester had to compensate. Net cost: ~1 sandbox-killed Tester ($2.45) + 1 supplementary dispatch ($4.69) for what should have been a single ~$3 charter.

Where to file: skills/tester-mindset/SKILL.md — new section "Probe path discipline" or add to existing "Probe-tool selection" guidance.
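The probe-path rule can be sketched as a pre-flight check a charter author might run before building the malicious zip. This is an illustration under stated assumptions: the helper names (`landing_path`, `is_inside`, `build_probe_zip`, `check_probe`) are hypothetical, and the extractor is assumed to join entry names onto the extraction root without sanitizing them.

```python
import os
import zipfile
from pathlib import Path


def landing_path(extract_root: Path, entry_name: str) -> Path:
    """Where a naive extractor (no path sanitizing) would write this entry."""
    return Path(os.path.normpath(extract_root / entry_name))


def is_inside(root: Path, target: Path) -> bool:
    """True if target is root itself or lies somewhere beneath it."""
    root = Path(os.path.normpath(root))
    target = Path(os.path.normpath(target))
    return target == root or root in target.parents


def build_probe_zip(zip_path: Path, entry_name: str) -> None:
    """Write a one-entry zip whose entry name carries the traversal."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr(entry_name, "zip-slip probe marker\n")


def check_probe(workspace: Path, extract_root: Path, allowed_dir: Path,
                entry_name: str) -> Path:
    """Return the landing path, or raise if the probe is unsafe or pointless."""
    target = landing_path(extract_root, entry_name)
    if not is_inside(workspace, target):
        raise ValueError(f"probe target leaves the workspace: {target}")
    if is_inside(allowed_dir, target):
        raise ValueError(f"probe target does not escape {allowed_dir}")
    return target
```

With the WordPress root as `extract_root` and `wp-content` as `allowed_dir`, the entry `wp-content/../zip-slip-target.txt` passes both checks: it escapes the prescribed directory but stays inside the workspace.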

Observation 2 — Tester CWD confusion (writes to bootstrap CWD instead of magellan repo)

What happened: 6 of 7 wave Testers wrote their report.json files to /Users/alopezari/Downloads/files/wp-ai-tester-v4/runs/<run-id>/sessions/<slug>/report.json instead of /Users/alopezari/Projects/automattic/magellan/runs/.... This is the Claude Code shell's bootstrap CWD, not the project directory. Manual rsync was needed post-wave to align files. The supplementary Tester (which I dispatched with absolute magellan paths spelled out in the prompt) wrote correctly the first time.

Why it happens: Tester subagent prompts pass the run directory as a relative path (runs/<run-id>/...) which resolves against the subagent's bootstrap CWD, not the Manager's CWD. The shell cd workaround in scripts/studio-provision.sh doesn't propagate to subagent dispatch because subagents start a fresh shell.

Proposed amendment: Manager dispatch prompts must include the absolute path for every write target. The Tester subagent definition (.claude/agents/tester.md) should refuse to resolve relative paths against its own CWD and instead resolve them against an explicit RUN_DIR env var that the Manager passes.

Where to file: .claude/agents/tester.md "Where to write outputs" section — change relative-path examples to absolute and add a forcing-function statement.
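One way to read the proposed rule, sketched as a small resolver: absolute targets pass through, relative targets are joined onto `RUN_DIR`, and nothing is ever resolved against the process CWD. The function name `resolve_write_target` and the error wording are hypothetical; only the `RUN_DIR` variable name comes from the proposal above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_write_target(target: str,
                         env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve a write target without ever consulting the current directory."""
    env = os.environ if env is None else env
    path = Path(target)
    if path.is_absolute():
        return path  # already unambiguous
    run_dir = env.get("RUN_DIR", "")
    if not run_dir or not Path(run_dir).is_absolute():
        # Fail loudly instead of silently falling back to the bootstrap CWD.
        raise ValueError(
            f"relative write target {target!r} needs an absolute RUN_DIR"
        )
    return Path(run_dir) / path
```

Failing when `RUN_DIR` is missing is a deliberate choice in this sketch: a hard error at dispatch time is cheaper than a post-wave rsync.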

Observation 3 — Schema enum drift in `hypotheses_status`

What happened: every wave Tester that produced a report wrote hypotheses_status array entries with natural-language verdicts ("false", "not-a-bug", "not-confirmed", descriptive prose strings) instead of the schema enum (confirmed-bug | refuted | inconclusive | pass | na). All 6 reports failed schema validation and were patched after the fact via a jq mapping. The supplementary Tester used the enum correctly because I spelled it out in its prompt.

Why it happens: the Tester prompt mentions "include a hypotheses_status array" but doesn't enumerate the allowed enum values. Sonnet 4.6 fills the field with whatever feels natural without checking the schema.

Proposed amendment: Tester subagent prompt must list the enum values verbatim, with examples. Add to .claude/agents/tester.md Output schema section: "status MUST be exactly one of: confirmed-bug, refuted, inconclusive, pass, na. Natural-language verdicts (e.g., false, not-a-bug) will fail validation and require manual repatch."

Where to file: .claude/agents/tester.md.
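As an illustration of the check the enum rule implies, the sketch below lists every entry whose status falls outside the five allowed values. The entry shape (a dict with a `status` key) and the function name `find_status_drift` are assumptions; the real report schema may differ.

```python
from typing import Any, Dict, List, Tuple

# The five enum values quoted in the proposed amendment.
ALLOWED_STATUSES = ("confirmed-bug", "refuted", "inconclusive", "pass", "na")


def find_status_drift(
    hypotheses_status: List[Dict[str, Any]]
) -> List[Tuple[int, Any]]:
    """Return (index, offending value) for every entry outside the enum."""
    drift = []
    for index, entry in enumerate(hypotheses_status):
        status = entry.get("status")  # assumed field name
        if not isinstance(status, str) or status not in ALLOWED_STATUSES:
            drift.append((index, status))
    return drift
```

An empty result means the array is enum-clean; anything else names the exact entries a Tester would need to correct before submitting.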

Observation 4 — Tester budget overrun (charter `max_turns` not honored)

What happened: charters specified max_turns: 8 (hypothesis-cluster) and max_turns: 12 (andlist). Testers ran 90-164 messages each (4-12× the budget). The Item C wallclock-reduction promise didn't materialize because Testers don't honor max_turns from charter frontmatter — they use the Tester subagent's default 30-turn budget regardless.

Why it happens: the Tester subagent definition reads budget.max_turns only as advisory, not enforced. There's no kill-switch.

Proposed amendment: either (a) make budget.max_turns enforced (Tester counts its own turns and aborts at the cap with a partial status), OR (b) drop the charter-level max_turns field and standardize on the subagent default. Status quo (advisory but unenforced) is the worst of both.

Where to file: .claude/agents/tester.md and the charter schema.
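Option (a) could look roughly like the following: a counter that refuses to start a turn past the cap, and a loop that returns a partial status when it is refused. The `TurnBudget` class, `run_session` function, and result fields are hypothetical names for this sketch, not the Tester's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TurnBudget:
    """Counts turns and refuses to start one past the cap."""
    max_turns: int
    turns_used: int = 0

    def start_turn(self) -> bool:
        if self.turns_used >= self.max_turns:
            return False  # hard cap reached; caller must stop
        self.turns_used += 1
        return True


def run_session(budget: TurnBudget, steps: List[Callable[[], str]]) -> dict:
    """Run steps until done or the budget is exhausted."""
    completed = []
    for step in steps:
        if not budget.start_turn():
            return {"status": "partial", "turns_used": budget.turns_used,
                    "completed": completed}
        completed.append(step())
    return {"status": "complete", "turns_used": budget.turns_used,
            "completed": completed}
```

For example, a hypothesis-cluster charter with `max_turns: 8` and ten planned steps would stop after the eighth step and report `partial`.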

Observation 5 — Charter scope drift (`backup-artifact-andlist` did not enumerate a1-a6)

What happened: the charter's mission was a1-a6 enumeration of zip CONTENTS (open the zip, list what's inside, classify each entry against the answer-list). The Tester filed Problems from other charters' territories (public accessibility, schedule UI bugs) instead of doing the enumeration. The supplementary charter backup-roundtrip-and-scale had to re-target the missed work.

Why it happens: andlist charters are a relatively new charter type; the Tester prompt for andlist doesn't include a forcing-function reminder that "every numbered anchor must produce a Y/N verdict in your report." Testers default to filing whatever they see, regardless of scope.

Proposed amendment: andlist charter prompts must include a literal forcing function: "Your report MUST include a section listing each anchor a1-aN with a Y/N verdict + cited evidence. Off-charter Problems may be filed but do NOT substitute for anchor enumeration." Validate at report-validation time.

Where to file: skills/sbtm/SKILL.md (charter-type definitions) + .claude/agents/tester.md (andlist-specific section).
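The report-validation step for andlist charters might be sketched as below. It assumes the report exposes anchor verdicts as a mapping from anchor id to a dict with `verdict` and `evidence` fields; that shape and the function name `anchor_problems` are hypothetical.

```python
from typing import Dict, List


def anchor_problems(anchor_count: int,
                    verdicts: Dict[str, Dict[str, str]]) -> List[str]:
    """List every anchor a1..aN that lacks a Y/N verdict or cited evidence."""
    problems = []
    for n in range(1, anchor_count + 1):
        anchor = f"a{n}"
        entry = verdicts.get(anchor)
        if entry is None:
            problems.append(f"{anchor}: no verdict")
        elif entry.get("verdict") not in ("Y", "N"):
            problems.append(f"{anchor}: verdict must be Y or N")
        elif not entry.get("evidence"):
            problems.append(f"{anchor}: missing evidence")
    return problems
```

Off-charter Problems do not appear in this check at all, which matches the rule that they may be filed but cannot substitute for anchor enumeration.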

Browser permission dialog auto-dismiss — shipped

User flagged during this run: "I think the bottlenecks are some dialogs from the browser that I had to approve manually because the agents couldn't do it." Headed Chrome surfaced OS-level permission prompts (Studio HTTPS cert trust, popup blocker, default-browser, "leave site?" beforeunload, translate bar, restore-tabs, password-manager keychain) that the Tester could not dismiss programmatically.

Fix shipped in commit 72aa07a:

New scripts/register-mcp-drivers.sh — registers all four MCP drivers (Playwright + Chrome DevTools, headed + headless) with the dialog-suppression flags pre-applied. Same flags for both modes.
New "Dialog & permission handling" section in both skills/browser-driver-playwright/SKILL.md and skills/browser-driver-chrome-devtools/SKILL.md documenting what's suppressed and what still requires handle_dialog.
scripts/check-deps.sh now detects when registrations are missing the new flags and points to the helper.
AGENTS.md updated: replaces inline claude mcp add snippets with helper pointer.

Flags applied (both modes):

Studio HTTPS cert: --ignore-https-errors / --acceptInsecureCerts
Restore-tabs / saved-state: --isolated
Permissions: --grant-permissions clipboard-read clipboard-write notifications geolocation (Playwright) / pre-granted at launch (Chrome DevTools)
--no-default-browser-check, --no-first-run, --disable-popup-blocking, --disable-prompt-on-repost, --disable-default-apps, --disable-infobars, --password-store=basic, --use-mock-keychain
--disable-features=Translate,DownloadBubble,DownloadBubbleV2,PreloadingPagePreloadHoldback

JS-level dialogs (alert/confirm/prompt/beforeunload) still appear and still need handle_dialog — they're page-level, not browser-level.

Summary

10/10 strict recall — first clean pass on magellan-backups in run history.
Item E architecture validated on first run.
0 miss-class amendments — all prior amendments fired correctly on their motivating issues.
5 process-level observations for harness-retrospectives.md: probe-path sandbox discipline, Tester CWD resolution, schema-enum drift, budget non-enforcement, andlist scope drift.
Browser dialog auto-dismiss shipped (commit 72aa07a).
Total cost: $102.90 — higher than prior pilots due to long-lived Claude Code session main-context growth + supplementary Tester + meta-review subagent additions.

This is a regression-test pass for the magellan-backups plugin under the current amendment set. No corrective action recommended on the bug-class side.

Update — all 5 process observations shipped

The 5 process-level observations identified above were addressed in the same harness-improvement cycle:

Sandbox-aware probe paths → skills/tester-mindset/SKILL.md new section "Probe path discipline — keep destructive probes inside the workspace" with rule + table of right/wrong targets per probe shape (zip-slip, arbitrary write, crafted-zip build, symlink escape).
Tester CWD discipline → .claude/agents/tester.md Step 2a (cd to MAGELLAN_REPO first; absolute paths for Read/Write/Edit) + .claude/commands/test-plugin.md dispatch prompt (Manager substitutes absolute MAGELLAN_REPO=$(pwd) literally; both Tester + recon prompts updated).
Schema enum literals → .claude/agents/tester.md Step 7 with verbatim enum table for every enum field + a "common drift to avoid" list ("false", "not-a-bug", "not-confirmed", descriptive strings, "high"/"low" severities) + validation as hard pre-condition (must exit 0 before Step 8).
Budget enforcement → .claude/agents/tester.md Step 4 with explicit turn-counter, hard-cap stop, "do not start one more cycle past the cap" rule, and a guard against count drift (>1.5× message-count = drifted).
Andlist forcing function → .claude/agents/tester.md Step 1 with charter-type-specific rules — andlist anchors a1-aN are first-class hypotheses; off-charter Problems may be filed as bonus but do NOT substitute for anchor enumeration.

Plus the dialog auto-dismiss work that was already shipped (commit 72aa07a).

Validation strategy for the next pilot: re-run magellan-backups (Pilot 12) and check whether (a) Testers write to magellan repo paths first try, (b) node scripts/validate-report.mjs exits 0 on every report with no jq patches needed, (c) turns_used <= max_turns on every report, (d) andlist Tester's hypotheses_status array contains every numbered anchor.
