Magellan Pilot 11 — magellan-backups (Item E first run; 10/10 clean regression-test pass; 5 process-level observations for next harness cycle)
Run ID: 2026-04-27T08-33-36_magellan-backups
Plugin: magellan-backups (8 files, 6 PHP, ~330 LoC) — backup/restore/export with cron schedule
Kind: plugin
Ecosystem: core
Driver: Chrome DevTools MCP (headed)
Source path: /tmp/magellan-plugin-src/magellan-backups/ — extracted from tests/plugins/packages/magellan-backups.zip, ISSUES.md stripped before reading (Amendment J discipline)
Dispatch: 1 recon + 7 wave Testers (auto-dispatched per Item D) + 1 supplementary post-meta-review (Sonnet-default, blind greybox)
Wallclock: ~36 min wave + 23 min supplementary
Total cost: $102.90
- Recall 10/10. First clean pass on magellan-backups in run history. 8 caught-exact, 1 caught-semantically (Issue 8 — same root cause framed as "no transaction" instead of "no integrity check before destroy"), 1 caught-bundled (Issue 10 — sibling-propagation note inside an export-side Problem rather than a discrete backup-side entry).
- First run of the Item E architecture (recon always runs; static analysis is the conditional bonus when `source_path` is readable). Recon ran in a background subagent while Manager did static analysis foreground; ~3 min wallclock saved vs serial. Architecture validated.
- Pilot 10 c2 reinforcement fired cleanly on Issue 9 (the issue that motivated it). Tester used the mandated coverage-note literal verbatim and Issue 9 is now caught-exact at major severity. Chronic miss across two prior runs is now closed. The cleanest validation of the c2 amendment in project history: same plugin where the rule was originally motivated, now firing on the same issue that previously escaped.
- 0 misses, 0 amendments proposed. This run is a regression-test pass for the magellan-backups plugin under the current amendment set. Every previously-motivated rule that targets a magellan-backups-class bug fired correctly on its motivating issue.
Cross-pilot arc:
| Pilot | Shape | Model | Recall | Notes |
|---|---|---|---|---|
| 1 (backups) | artifact | Opus | 10/10 | Pilot zero — answer-key calibration |
| 2 (contact-forms) | form/email | Opus | 7/10→10/10 | — |
| 3 (members) | role/restriction | Opus | 5/10→10/10 | — |
| 4 (seo-toolkit) | metadata | Sonnet | 4/10→10/10 | — |
| 5 (pay, WC #1) | WC gateway | Sonnet | 8/10→9/10 | — |
| 6 (gallery) | file/media | Sonnet | 8/10 | — |
| 7 (speed) | caching/Time | Sonnet | 8/10 | — |
| 8 (checkout-editor, WC #2) | WC field editor | Sonnet | 6/10 | — |
| 9 (theme, block-theme) | block theme | Sonnet | 4/10 | new kind + Imp 1/2 first run |
| 11 (backups, regression) | artifact | Sonnet | 10/10 | Item E first run + c2 amendment validation |
10/10 planted + 15 bonus findings. 25 Problems, 7 Questions, 12 Improvements, 11 Praises.
Severity: 8 critical, 14 major, 3 minor.
| Charter | Priority | Type | P | Q | I | ! | Status |
|---|---|---|---|---|---|---|---|
| backup-artifact-andlist | critical | andlist | 4 | 1 | 3 | 3 | complete |
| restore-andlist | critical | andlist | 5 | 0 | 1 | 1 | complete |
| export-andlist | critical | andlist | 4 | 1 | 1 | 1 | complete |
| public-accessibility | critical | cross-feature | 1 | 1 | 1 | 1 | complete |
| schedule-and-email | high | hypothesis-cluster | 2 | 1 | 2 | 1 | complete |
| restore-data-integrity | high | hypothesis-cluster | — | — | — | — | failed (sandbox-killed) |
| backup-extras | high | hypothesis-cluster | 2 | 2 | 2 | 2 | complete |
| cross-feature-lifecycle | medium | cross-feature | — | — | — | — | parked (not dispatched) |
| backup-roundtrip-and-scale (supplementary) | high | andlist | 7 | 1 | 0 | 1 | complete |
| Totals | | | 25 | 7 | 12 | 11 | 7 complete + 1 failed + 1 parked |
| # | Planted issue | Verdict | Matched to |
|---|---|---|---|
| 1 | Progress bar always shows 100% | caught-exact | backup-artifact-andlist + backup-extras (double) |
| 2 | Schedule time format mismatch (12h↔24h) | caught-exact | backup-artifact-andlist + schedule-and-email |
| 3 | Notification email empty recipient (option-key typo) | caught-exact | 3 Testers; double wp_mail() bonus on supplementary |
| 4 | User export includes password hashes | caught-exact (3 Testers) | export-andlist + public-accessibility + supplementary; bonus: session_tokens, auth_key, nonce_key, secure_auth_key |
| 5 | Uploads directory missing from backup | caught-exact | supplementary (Amendment K default-blast-radius literal fired) |
| 6 | No pre-restore safety backup | caught-exact | supplementary (destructive-op AND-list b2 anchor fired) |
| 7 | Backups publicly accessible via URL | caught-exact (3 Testers) | backup-artifact-andlist + public-accessibility + export-andlist |
| 8 | Corrupt restore truncates DB | caught-semantically | supplementary — framed as "no transaction" instead of "no integrity check before destroy"; same source line + same reproducer |
| 9 | Large DB causes memory exhaustion (OOM) | caught-exact | supplementary — c2 source-pattern fallback literal fired verbatim |
| 10 | Concurrent backups corrupt zip file | caught-bundled | backup-extras — sibling-propagation note inside export-side Problem |
- CRITICAL restore-andlist: zip-slip path traversal confirmed (caught independently after restore-data-integrity was sandbox-killed; same probe via different entry point)
- CRITICAL restore-andlist: arbitrary SQL execution via uploaded zip's `database.sql` (no validation, fed straight to `$wpdb->query()`)
- CRITICAL supplementary: WP secret keys (`auth_key`, `secure_auth_key`, `nonce_key`) leaked alongside password hashes via unfiltered `SELECT * FROM wp_options`
- CRITICAL export-andlist: session tokens with active expiration timestamps leaked in users export
- MAJOR schedule-and-email: activation unconditionally schedules cron, ignoring the `mb_schedule_enabled` flag (not in static-analysis hypotheses)
- MAJOR schedule-and-email: double `wp_mail()` call, one conditional (correct), one unconditional at line 74; fires to empty `$to` recipient when option is empty
- MAJOR supplementary: postmeta NEVER exported via "Posts" selective export (`export_table($wpdb->posts, "post_type='post'")` skips `wp_postmeta` entirely)
- MAJOR backup-extras: filename collision via minute-precision `date('Y-m-d-Hi')` + `ZipArchive::OVERWRITE`
- MAJOR backup-extras: hardcoded "100%" progress at idle (independent confirmation of Issue 1)
- MINOR: menu-path divergence (admin.php vs tools.php both route, but assets only enqueue for tools.php)
| | Manager (Opus 4.7) | Subagents (Sonnet 4.6) | Total |
|---|---|---|---|
| Agents | 1 | 10 (1 recon + 8 Testers + 1 meta-review) | 11 |
| Messages | 144 | 960 | 1,104 |
| Cost | $70.81 | $32.08 | $102.90 |
| Category | Cost | % |
|---|---|---|
| Cache-read | $84.39 | 82.0% |
| Cache-create 5m | $8.34 | 8.1% |
| Output | $6.73 | 6.5% |
| Cache-create 1h | $3.38 | 3.3% |
| Fresh input | $0.05 | 0.0% |
Higher cost than Pilot 9 ($69.69 → $102.90, +48%): this run is the 11th pilot in a long-lived Claude Code session (main-context-growth effect), and it added a supplementary Tester + meta-review subagent that prior pilots didn't include. Running Pilot 11 alone in a fresh Claude Code session would likely drop Manager cost by ~$25-30.
| Session | Role | Duration | Msgs | Cost |
|---|---|---|---|---|
| recon | scout | ~5m | 54 | $2.23 |
| backup-artifact-andlist | Tester | ~16m | 143 | $4.68 |
| restore-andlist | Tester | ~17m | 164 | $4.72 |
| export-andlist | Tester | ~13m | 111 | $3.29 |
| public-accessibility | Tester | ~10m | 95 | $2.45 |
| schedule-and-email | Tester | ~9m | 90 | $2.34 |
| restore-data-integrity | Tester | (killed) | 92 | $2.45 |
| backup-extras | Tester | ~9m | 95 | $2.57 |
| backup-roundtrip-and-scale (supplementary) | Tester | ~23m | 116 | $4.69 |
The change: when source_path is readable, recon and static analysis run in parallel rather than serial. Recon dispatches as a background subagent while the Manager reads source files foreground; both outputs feed into Phase 3 charter generation.
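The parallel arrangement can be sketched roughly as follows. This is a minimal illustration, not the harness's dispatch code: `run_recon`, `read_source_files`, and the returned dictionary shape are hypothetical stand-ins for the recon subagent, the Manager's static analysis, and the inputs to charter generation.

```python
import asyncio


async def run_recon() -> dict:
    """Hypothetical stand-in for the background recon subagent."""
    await asyncio.sleep(0.05)  # pretend to explore the plugin UI
    return {"surfaces": ["backup", "restore", "export"]}


def read_source_files() -> list[str]:
    """Hypothetical stand-in for the Manager's foreground static analysis."""
    return ["H1: restore truncates before validating the archive"]


async def recon_and_static_analysis(source_readable: bool) -> dict:
    # Recon always runs; start it first so it proceeds in the background.
    recon_task = asyncio.create_task(run_recon())
    hypotheses: list[str] = []
    if source_readable:
        # Static analysis is the conditional bonus. Running it in a worker
        # thread lets it overlap with recon instead of blocking the loop.
        hypotheses = await asyncio.to_thread(read_source_files)
    surface_map = await recon_task
    # Both outputs feed charter generation.
    return {"surface_map": surface_map, "hypotheses": hypotheses}


result = asyncio.run(recon_and_static_analysis(source_readable=True))
```

With this shape the wave waits for whichever of the two steps finishes last, rather than for their sum.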
What happened: dispatch worked first time. Recon completed at ~5m wallclock; static analysis took ~6m foreground. Wave then dispatched at T+8m. Vs. serial dispatch (recon → static-analysis → wave), this saves ~3m per run.
No regressions: the surface-map → hot-hypothesis parity check (added in commit 887ae88) fired correctly. All 22 numbered hot hypotheses had hypothesis anchors in the surface map, no silent dropouts.
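A parity check of this kind might look like the sketch below. The data shapes are assumptions made for illustration (hypothesis IDs as dictionary keys, and a surface map that lists the hypothesis IDs anchored to each surface); the check added in commit 887ae88 may be structured differently.

```python
def missing_anchors(
    hot_hypotheses: dict[str, str],
    surface_map: dict[str, list[str]],
) -> list[str]:
    """Return IDs of hot hypotheses that no surface in the map references."""
    anchored = {hid for anchors in surface_map.values() for hid in anchors}
    return sorted(hid for hid in hot_hypotheses if hid not in anchored)


# Hypothetical example: H3 was dropped silently from the surface map.
hypotheses = {"H1": "zip-slip on restore", "H2": "hashes in export", "H3": "OOM on dump"}
surfaces = {"restore": ["H1"], "export": ["H2"]}
dropouts = missing_anchors(hypotheses, surfaces)
```

An empty result corresponds to the "no silent dropouts" outcome reported for this run.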
Verdict: ship.
The escape-analysis classifier reported 0 amendments because all 10 issues were caught. But this run surfaced process-level issues that the answer-key view doesn't see. These are observations for docs/harness-retrospectives.md, not bug-class amendments.
What happened: charter hypothesis 1 prescribed building a malicious zip at /tmp/mb-zip-slip.zip containing an entry wp-content/../../tmp/MB_PWNED.txt, then verifying whether /tmp/MB_PWNED.txt was created post-upload. Both the zip-build and the verification path cross outside the magellan workspace into /tmp/. Claude Code's security sandbox interpreted "agent writes to /tmp/" as a boundary violation and killed the agent before the report could be written. No report.json, no console-logs, empty screenshots dir.
Why it escaped the existing rules: charter authors writing path-traversal probes naturally reach for /tmp/PWNED because it's the canonical exploit demonstration. The rule the harness needs is "use a workspace-relative target path" — same bug class is confirmed by ANY landing outside the prescribed extraction directory. The malicious zip can land its escape file at wp-content/../zip-slip-target.txt (resolves to WP root, still outside wp-content, still proves the bug) without crossing the harness sandbox.
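As an illustration of the workspace-relative variant, the probe zip can be built in memory so that neither the build nor the escape target touches /tmp/. This is a sketch under assumptions: the marker file name is the one suggested above, and how the resulting bytes are uploaded to the plugin is left out.

```python
import io
import zipfile

# Entry name from the text above: resolves to the WP root, one level above
# wp-content, so it escapes the extraction directory but stays in the workspace.
ESCAPE_ENTRY = "wp-content/../zip-slip-target.txt"


def build_zip_slip_probe() -> bytes:
    """Build an in-memory zip whose single entry climbs out of wp-content."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # writestr stores the entry name as given; it does not strip "..".
        archive.writestr(ESCAPE_ENTRY, "zip-slip probe marker\n")
    return buffer.getvalue()


probe_bytes = build_zip_slip_probe()
```

The Tester would then write `probe_bytes` somewhere under the run directory before uploading it, and check for the marker file at the WP root after the restore.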
Proposed amendment (would close this miss class on any plugin):
Sandbox-aware probe path discipline. When a charter probe crosses normal workspace boundaries (path traversal, file uploads to system paths, restoring zip contents to absolute paths), the assertion target MUST be workspace-relative.
Why: Claude Code's security sandbox kills agents that write outside the workspace. The bug class (zip-slip, directory escape, unsanitized extraction) is confirmed by ANY landing outside the prescribed directory; the canonical `/tmp/PWNED` example is convenient but unnecessary.
How to apply:
- For zip-slip / path-traversal: target `wp-content/../<probe-marker>` (resolves to WP root) or `runs/<id>/sessions/<slug>/escape-target.txt`. Both are inside the workspace, and both prove the same vulnerability.
- For file-write probes: target `runs/<id>/sessions/<slug>/probes/`.
- For abstraction: any probe that asserts a write should ground that write inside the run directory.
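A charter linter could enforce this rule with a check along the following lines. It is a sketch only: the function name and the example workspace path are hypothetical, and the check is purely lexical, so it does not follow symlinks.

```python
import os.path


def lands_inside(workspace: str, target: str) -> bool:
    """True if target, resolved against workspace, stays inside workspace.

    Lexical check only: ".." segments are collapsed, symlinks are not followed.
    """
    root = os.path.normpath(workspace)
    # os.path.join discards root when target is absolute, which is what we
    # want: an absolute /tmp/... target must be judged on its own.
    resolved = os.path.normpath(os.path.join(root, target))
    return os.path.commonpath([root, resolved]) == root


# Hypothetical workspace root for the WordPress site under test.
WP_ROOT = "/workspace/site"
ok = lands_inside(WP_ROOT, "wp-content/../zip-slip-target.txt")
rejected = lands_inside(WP_ROOT, "wp-content/../../tmp/MB_PWNED.txt")
```

The first target collapses to a file at the WP root and passes; the second climbs above the root and would be flagged before dispatch.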
Compensating coverage: restore-andlist independently confirmed zip-slip via a different entry point. The territory wasn't lost, but the supplementary Tester had to compensate. Net cost: ~1 sandbox-killed Tester ($2.45) + 1 supplementary dispatch ($4.69) for what should have been a single ~$3 charter.
Where to file: skills/tester-mindset/SKILL.md — new section "Probe path discipline" or add to existing "Probe-tool selection" guidance.
What happened: 6 of 7 wave Testers wrote their report.json files to /Users/alopezari/Downloads/files/wp-ai-tester-v4/runs/<run-id>/sessions/<slug>/report.json instead of /Users/alopezari/Projects/automattic/magellan/runs/.... This is the Claude Code shell's bootstrap CWD, not the project directory. Manual rsync was needed post-wave to align files. The supplementary Tester (which I dispatched with absolute magellan paths spelled out in the prompt) wrote correctly the first time.
Why it happens: Tester subagent prompts pass the run directory as a relative path (runs/<run-id>/...) which resolves against the subagent's bootstrap CWD, not the Manager's CWD. The shell cd workaround in scripts/studio-provision.sh doesn't propagate to subagent dispatch because subagents start a fresh shell.
Proposed amendment: Manager dispatch prompts must include the absolute path for every write target. The Tester subagent definition (.claude/agents/tester.md) should never resolve a relative path against its own CWD: it should resolve relative paths against an explicit RUN_DIR env var the Manager passes, and reject them when RUN_DIR is missing.
Where to file: .claude/agents/tester.md "Where to write outputs" section — change relative-path examples to absolute and add a forcing-function statement.
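One way to express that forcing function in code is sketched below. Only the `RUN_DIR` variable name comes from the proposal above; the helper itself and its error message are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_write_target(path: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return an absolute write target, anchoring relative paths at RUN_DIR."""
    env = os.environ if env is None else env
    if os.path.isabs(path):
        return os.path.normpath(path)
    run_dir = env.get("RUN_DIR", "")
    if not os.path.isabs(run_dir):
        # Never fall back to the process CWD: that fallback is what sent the
        # wave reports to the wrong directory.
        raise ValueError(f"relative path {path!r} needs an absolute RUN_DIR")
    return os.path.normpath(os.path.join(run_dir, path))


# Hypothetical values for illustration.
target = resolve_write_target(
    "sessions/export-andlist/report.json",
    {"RUN_DIR": "/repo/runs/example-run"},
)
```

Failing loudly when `RUN_DIR` is absent turns a silent misplacement into an immediate, visible error.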
What happened: every wave Tester that produced a report wrote hypotheses_status array entries with natural-language verdicts ("false", "not-a-bug", "not-confirmed", descriptive prose strings) instead of the schema enum (confirmed-bug | refuted | inconclusive | pass | na). All 6 reports failed schema validation (the seventh wave Tester was sandbox-killed before writing one). Patched via jq mapping after the fact. The supplementary Tester used the enum correctly because I spelled it out in its prompt.
Why it happens: the Tester prompt mentions "include a hypotheses_status array" but doesn't enumerate the allowed enum values. Sonnet 4.6 fills the field with whatever feels natural without checking the schema.
Proposed amendment: Tester subagent prompt must list the enum values verbatim, with examples. Add to .claude/agents/tester.md Output schema section: "status MUST be exactly one of: confirmed-bug, refuted, inconclusive, pass, na. Natural-language verdicts (e.g., false, not-a-bug) will fail validation and require manual repatch."
Where to file: .claude/agents/tester.md.
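A pre-submission check for this drift could be as small as the sketch below. The enum values are the ones listed above; the `id` key and the overall report shape are assumptions, and this is not the project's actual validator.

```python
ALLOWED_STATUS = ("confirmed-bug", "refuted", "inconclusive", "pass", "na")


def invalid_statuses(report: dict) -> list[tuple[str, str]]:
    """Return (hypothesis id, status) pairs whose status is not in the enum."""
    bad = []
    for entry in report.get("hypotheses_status", []):
        status = entry.get("status")
        if status not in ALLOWED_STATUS:
            bad.append((entry.get("id", "?"), str(status)))
    return bad


# Hypothetical report fragment showing the drift seen in this run.
report = {
    "hypotheses_status": [
        {"id": "h1", "status": "confirmed-bug"},
        {"id": "h2", "status": "not-a-bug"},
        {"id": "h3", "status": "false"},
    ]
}
problems = invalid_statuses(report)
```

A Tester that runs a check like this before finishing can correct its own statuses instead of leaving them for a post-hoc patch.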
What happened: charters specified max_turns: 8 (hypothesis-cluster) and max_turns: 12 (andlist). Testers ran 90-164 messages each (4-12× the budget). The Item C wallclock-reduction promise didn't materialize because Testers don't honor max_turns from charter frontmatter — they use the Tester subagent's default 30-turn budget regardless.
Why it happens: the Tester subagent definition reads budget.max_turns only as advisory, not enforced. There's no kill-switch.
Proposed amendment: either (a) make budget.max_turns enforced (Tester counts its own turns and aborts at the cap with a partial status), OR (b) drop the charter-level max_turns field and standardize on the subagent default. Status quo (advisory but unenforced) is the worst of both.
Where to file: .claude/agents/tester.md and the charter schema.
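Option (a) could take roughly the following shape. This is a sketch of the idea rather than the Tester's real control loop: the class, the function, and the `partial` status string are hypothetical, and a cycle here stands for one probe step.

```python
from dataclasses import dataclass


@dataclass
class TurnBudget:
    max_turns: int
    turns_used: int = 0

    def start_turn(self) -> bool:
        """Claim one turn; return False once the cap has been reached."""
        if self.turns_used >= self.max_turns:
            return False
        self.turns_used += 1
        return True


def run_tester(cycles: list[str], budget: TurnBudget) -> dict:
    """Run probe cycles until they are done or the budget is exhausted."""
    completed: list[str] = []
    for cycle in cycles:
        if not budget.start_turn():
            # Hard cap: stop and report what was finished.
            return {"status": "partial", "completed": completed,
                    "turns_used": budget.turns_used}
        completed.append(cycle)
    return {"status": "complete", "completed": completed,
            "turns_used": budget.turns_used}


outcome = run_tester(["probe-1", "probe-2", "probe-3"], TurnBudget(max_turns=2))
```

Recording `turns_used` in the report makes the cap checkable after the fact, whichever option is chosen.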
What happened: the charter's mission was a1-a6 enumeration of zip CONTENTS (open the zip, list what's inside, classify each entry against the answer-list). The Tester filed Problems from other charters' territories (public accessibility, schedule UI bugs) instead of doing the enumeration. The supplementary charter backup-roundtrip-and-scale had to re-target the missed work.
Why it happens: andlist charters are a relatively new charter type; the Tester prompt for andlist doesn't include a forcing-function reminder that "every numbered anchor must produce a Y/N verdict in your report." Testers default to filing whatever they see, regardless of scope.
Proposed amendment: andlist charter prompts must include a literal forcing function: "Your report MUST include a section listing each anchor a1-aN with a Y/N verdict + cited evidence. Off-charter Problems may be filed but do NOT substitute for anchor enumeration." Validate at report-validation time.
Where to file: skills/sbtm/SKILL.md (charter-type definitions) + .claude/agents/tester.md (andlist-specific section).
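The report-validation side of this amendment might be sketched as follows. The report layout (a mapping from anchor name to a `verdict` and `evidence` pair) is an assumption for illustration only.

```python
def unanswered_anchors(anchor_count: int, verdicts: dict[str, dict]) -> list[str]:
    """List anchors a1..aN that lack a Y/N verdict or cited evidence."""
    missing = []
    for n in range(1, anchor_count + 1):
        anchor = f"a{n}"
        entry = verdicts.get(anchor, {})
        if entry.get("verdict") not in ("Y", "N") or not entry.get("evidence"):
            missing.append(anchor)
    return missing


# Hypothetical report fragment: a2 has a non-enum verdict, a3 is absent.
verdicts = {
    "a1": {"verdict": "Y", "evidence": "zip listing shows database.sql"},
    "a2": {"verdict": "maybe", "evidence": "unclear"},
}
gaps = unanswered_anchors(3, verdicts)
```

Off-charter Problems do not enter this check at all, which matches the rule that they may be filed but cannot substitute for anchor enumeration.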
User flagged during this run: "I think the bottlenecks are some dialogs from the browser that I had to approve manually because the agents couldn't do it." Headed Chrome surfaced OS-level permission prompts (Studio HTTPS cert trust, popup blocker, default-browser, "leave site?" beforeunload, translate bar, restore-tabs, password-manager keychain) that the Tester could not dismiss programmatically.
Fix shipped in commit 72aa07a:
- New `scripts/register-mcp-drivers.sh`: registers all four MCP drivers (Playwright + Chrome DevTools, headed + headless) with the dialog-suppression flags pre-applied. Same flags for both modes.
- New "Dialog & permission handling" section in both `skills/browser-driver-playwright/SKILL.md` and `skills/browser-driver-chrome-devtools/SKILL.md` documenting what's suppressed and what still requires `handle_dialog`.
- `scripts/check-deps.sh` now detects when registrations are missing the new flags and points to the helper.
- `AGENTS.md` updated: replaces inline `claude mcp add` snippets with helper pointer.
Flags applied (both modes):
- Studio HTTPS cert: `--ignore-https-errors` / `--acceptInsecureCerts`
- Restore-tabs / saved-state: `--isolated`
- Permissions: `--grant-permissions clipboard-read clipboard-write notifications geolocation` (Playwright) / pre-granted at launch (Chrome DevTools)
- `--no-default-browser-check`, `--no-first-run`, `--disable-popup-blocking`, `--disable-prompt-on-repost`, `--disable-default-apps`, `--disable-infobars`, `--password-store=basic`, `--use-mock-keychain`
- `--disable-features=Translate,DownloadBubble,DownloadBubbleV2,PreloadingPagePreloadHoldback`
JS-level dialogs (alert/confirm/prompt/beforeunload) still appear and still need handle_dialog — they're page-level, not browser-level.
- 10/10 strict recall — first clean pass on magellan-backups in run history.
- Item E architecture validated on first run.
- 0 miss-class amendments — all prior amendments fired correctly on their motivating issues.
- 5 process-level observations for `harness-retrospectives.md`: probe-path sandbox discipline, Tester CWD resolution, schema-enum drift, budget non-enforcement, andlist scope drift.
- Browser dialog auto-dismiss shipped (commit `72aa07a`).
- Total cost: $102.90, higher than prior pilots due to long-lived Claude Code session main-context growth + supplementary Tester + meta-review subagent additions.
This is a regression-test pass for the magellan-backups plugin under the current amendment set. No corrective action recommended on the bug-class side.
The 5 process-level observations identified above were addressed in the same harness-improvement cycle:
- Sandbox-aware probe paths → `skills/tester-mindset/SKILL.md` new section "Probe path discipline: keep destructive probes inside the workspace" with rule + table of right/wrong targets per probe shape (zip-slip, arbitrary write, crafted-zip build, symlink escape).
- Tester CWD discipline → `.claude/agents/tester.md` Step 2a (cd to `MAGELLAN_REPO` first; absolute paths for Read/Write/Edit) + `.claude/commands/test-plugin.md` dispatch prompt (Manager substitutes absolute `MAGELLAN_REPO=$(pwd)` literally; both Tester + recon prompts updated).
- Schema enum literals → `.claude/agents/tester.md` Step 7 with verbatim enum table for every enum field + a "common drift to avoid" list (`"false"`, `"not-a-bug"`, `"not-confirmed"`, descriptive strings, `"high"`/`"low"` severities) + validation as hard pre-condition (must exit 0 before Step 8).
- Budget enforcement → `.claude/agents/tester.md` Step 4 with explicit turn-counter, hard-cap stop, "do not start one more cycle past the cap" rule, and a guard against count drift (>1.5× message-count = drifted).
- Andlist forcing function → `.claude/agents/tester.md` Step 1 with charter-type-specific rules: andlist anchors a1-aN are first-class hypotheses; off-charter Problems may be filed as bonus but do NOT substitute for anchor enumeration.
Plus the dialog auto-dismiss work that was already shipped (commit 72aa07a).
Validation strategy for the next pilot: re-run magellan-backups (Pilot 12) and check whether (a) Testers write to magellan repo paths first try, (b) node scripts/validate-report.mjs exits 0 on every report with no jq patches needed, (c) turns_used <= max_turns on every report, (d) andlist Tester's hypotheses_status array contains every numbered anchor.