Run ID: 2026-04-28T06-31-56_magellan-backups
Date: 2026-04-28
Plugin: magellan-backups v1.0.0 (blind greybox — ISSUES.md stripped before run)
Goal: confirm cost-floor projection with all Opus off. Full model stack swap vs Pilot 15 baseline.
| Role | Pilot 15 (baseline) | Pilot 17 (this run) |
|---|---|---|
| Manager | Sonnet 4.6 | Sonnet 4.6 |
| Planner | Opus 4.7 | Sonnet 4.6 |
| Testers | Sonnet 4.6 | Haiku 4.5 |
| Phase 1.5 | static-analysis on | disabled (recon-only via install_path) |
| Metric | Pilot 17 | Pilot 15 | Pilot 11 (10/10 peak) |
|---|---|---|---|
| Recall | 9/10 | 10/10 | 10/10 |
| Problems | 22 (8C / 12M / 2m) | — | 25 (8C / 14M / 3m) |
| Questions | 1 | — | 7 |
| Improvements | 23 | — | 12 |
| Praises | 10 | — | 11 |
| Total cost | ~$19.30 | ~$55.59 | $102.90 |
| Cost reduction vs P11 | −81% | −46% | baseline |
| Cost reduction vs P15 | −65% | baseline | — |
| Wallclock (wave) | ~17 min | — | ~35 min |
| Sessions | 7 (6 wave + 1 supplementary) | 6 | 8 |
| Charters | 6 | 6 | 8 |
| Driver | playwright-cli-headed (+r2: chrome-devtools-headless) | chrome-devtools-headed | chrome-devtools-headed |
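
The two cost-reduction rows follow from the total-cost row. A minimal Python check, taking the approximate totals in the table at face value (the `reduction` helper is an illustrative name, not part of the harness):

```python
# Total cost per pilot in USD, as reported in the table above.
costs = {"P11": 102.90, "P15": 55.59, "P17": 19.30}

def reduction(new: float, baseline: float) -> int:
    """Percentage cost reduction of `new` relative to `baseline`, as a whole percent."""
    return round((1 - new / baseline) * 100)

p17_vs_p11 = reduction(costs["P17"], costs["P11"])
p15_vs_p11 = reduction(costs["P15"], costs["P11"])
p17_vs_p15 = reduction(costs["P17"], costs["P15"])
print(p17_vs_p11, p15_vs_p11, p17_vs_p15)
```
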
| Agent | Model | Messages | Input | Output | Cache-5m | Cache-1h | Cache-read | Cost |
|---|---|---|---|---|---|---|---|---|
| Manager | claude-sonnet-4-6 | 136 | 9,499 | 84,379 | 0 | 1,057,563 | 12,088,625 | $11.27 |
| Planner-Sonnet | claude-sonnet-4-6 | 18 | 22 | 19,733 | 170,242 | 0 | 597,043 | $1.11 |
| Recon (tester-haiku) | claude-haiku-4-5 | 57 | 91 | 10,765 | 137,205 | 0 | 2,688,485 | $0.49 |
| Tester: restore-safety-nets (failed) | claude-haiku-4-5 | 39 | 188 | 3,456 | 80,817 | 0 | 2,048,288 | $0.32 |
| Tester: backup-artifact-andlist | claude-haiku-4-5 | 64 | 143 | 17,991 | 263,467 | 0 | 4,045,245 | $0.82 |
| Tester: export-artifact-andlist | claude-haiku-4-5 | 74 | 177 | 20,480 | 165,230 | 0 | 5,258,789 | $0.83 |
| Tester: schedule-hypothesis-cluster | claude-haiku-4-5 | 58 | 203 | 10,327 | 162,771 | 0 | 4,509,059 | $0.71 |
| Tester: concurrent-trigger-cross | claude-haiku-4-5 | 58 | 12,574 | 8,897 | 181,504 | 0 | 4,906,994 | $0.77 |
| Tester: breadth-tour | claude-haiku-4-5 | 70 | 286 | 13,216 | 190,319 | 0 | 5,766,623 | $0.88 |
| Wave total | | 574 | | | | | | $17.22 |
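
The Cost column can be cross-checked against the token columns. The sketch below assumes per-million-token rates that are not stated anywhere in this report; they are illustrative values that reproduce the rows checked, and `RATES` and `agent_cost` are hypothetical names.

```python
# Assumed USD rates per million tokens. These are NOT stated in the report;
# they are illustrative values that reproduce the rows checked below.
RATES = {
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00, "cache_5m": 3.75,
                          "cache_1h": 6.00, "cache_read": 0.30},
    "claude-haiku-4-5": {"input": 1.00, "output": 5.00, "cache_5m": 1.25,
                         "cache_1h": 2.00, "cache_read": 0.10},
}

def agent_cost(model: str, usage: dict[str, int]) -> float:
    """Cost in USD for one agent: tokens times rate, summed over token categories."""
    rates = RATES[model]
    return sum(tokens * rates[category] for category, tokens in usage.items()) / 1_000_000

# Token counts copied from the Manager, Planner-Sonnet, and Recon rows.
manager = agent_cost("claude-sonnet-4-6", {
    "input": 9_499, "output": 84_379, "cache_5m": 0,
    "cache_1h": 1_057_563, "cache_read": 12_088_625})
planner = agent_cost("claude-sonnet-4-6", {
    "input": 22, "output": 19_733, "cache_5m": 170_242,
    "cache_1h": 0, "cache_read": 597_043})
recon = agent_cost("claude-haiku-4-5", {
    "input": 91, "output": 10_765, "cache_5m": 137_205,
    "cache_1h": 0, "cache_read": 2_688_485})

print(f"{manager:.2f} {planner:.2f} {recon:.2f}")
```

Under these assumed rates the three rows round to the reported $11.27, $1.11, and $0.49. The Manager's 1h cache writes and cache reads account for most of its cost.
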
| Agent | Model | Tokens | Tool uses | Duration | Est. cost |
|---|---|---|---|---|---|
| restore-safety-nets-r2 (tester-haiku) | claude-haiku-4-5 | 108,008 | 59 | 628s | ~$0.87 |
| Meta-reviewer (general-purpose) | claude-sonnet-4-6 | 66,889 | 20 | 150s | ~$0.80 |
| Escape-analysis classifier (general-purpose) | claude-sonnet-4-6 | 62,018 | 4 | 73s | ~$0.41 |
| Supplementary total | | | | | ~$2.08 |
Grand total: ~$19.30
Cost per planted issue (9 caught): ~$2.14
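
The totals above can be recomputed from the reported figures. This is a short arithmetic check, using the wave total and the three estimated supplementary costs as given:

```python
# Figures copied from the report (USD); supplementary costs are estimates.
wave_total = 17.22
supplementary = [0.87, 0.80, 0.41]  # restore-safety-nets-r2, meta-reviewer, classifier
caught_issues = 9

supplementary_total = round(sum(supplementary), 2)
grand_total = round(wave_total + supplementary_total, 2)
cost_per_caught = round(grand_total / caught_issues, 2)
print(supplementary_total, grand_total, cost_per_caught)
```
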
| Category | Cost | % of $17.22 |
|---|---|---|
| Fresh input | $0.04 | 0.2% |
| Output | $1.99 | 11.5% |
| Cache-create (5m) | $2.12 | 12.3% |
| Cache-create (1h) | $6.35 | 36.9% |
| Cache-read | $6.73 | 39.1% |
| Charter | Type | Priority | Status | Turns | Flows | Tool uses | Tokens | Duration | Cost |
|---|---|---|---|---|---|---|---|---|---|
| backup-artifact-andlist | andlist | critical | complete | — | — | 33 | 85,914 | 220s | $0.82 |
| restore-safety-nets | andlist | critical | failed (socket EINVAL) | 0 | 0 | 19 | 62,738 | 93s | $0.32 |
| export-artifact-andlist | andlist | critical | complete | — | — | 41 | 94,711 | 379s | $0.83 |
| schedule-hypothesis-cluster | hypothesis-cluster | high | complete | — | — | 31 | 96,666 | 228s | $0.71 |
| concurrent-trigger-cross | cross-feature | high | complete | — | — | 31 | 105,565 | 175s | $0.77 |
| breadth-tour | breadth | high | complete | 8/25 | 4/12 | 37 | 108,418 | 234s | $0.88 |
| restore-safety-nets-r2 (supplementary) | andlist | critical | complete | 14/15 | 5/6 | 59 | 108,008 | 628s | ~$0.87 |
| # | Issue | Verdict | Session |
|---|---|---|---|
| 1 | Progress bar always shows 100% | caught-exact | breadth-tour |
| 2 | Schedule time format mismatch (12h/24h) | caught-exact | schedule-hypothesis-cluster |
| 3 | Notification email option-name typo | caught-exact | schedule-hypothesis-cluster |
| 4 | User export includes hashed passwords | caught-exact | backup-artifact-andlist + export-artifact-andlist |
| 5 | Uploads directory missing from backup | caught-exact | backup-artifact-andlist |
| 6 | No pre-restore backup | caught-exact | restore-safety-nets-r2 |
| 7 | Backups publicly accessible via URL | caught-exact | backup-artifact-andlist + breadth-tour |
| 8 | Corrupt restore truncates database | caught-semantically | restore-safety-nets-r2 (deeper: no transaction + naive SQL split on ;\n) |
| 9 | Large database causes memory exhaustion | missed | — (third consecutive miss — chronic c2 class) |
| 10 | Concurrent backups corrupt ZIP file | caught-exact | concurrent-trigger-cross |
Root cause: Phase 1.5 static analysis was disabled (install_path, no source_path) so the planner never tagged backup features as scale-sensitive. Without that tag, the c2 forcing function didn't fire. The backup-artifact-andlist charter focused entirely on the artifact-exposure angle and never turned to scale probes — and left no c2 coverage note.
This is the same structural miss as Pilot 10 (explicit budget deprioritization, no Problem filed) and Pilot 1 (caught but under-classified as minor), which makes it the third occurrence of the c2 class. Amendment shipped (see below).
| # | Area | Title |
|---|---|---|
| P1 | Backup storage | Backup ZIP files web-accessible without authentication |
| P2 | Backup contents | database.sql exposes user password hashes (bcrypt) |
| P3 | Backup access | Backup files downloadable via direct /wp-content URL |
| P4 | Backup naming | Concurrent backups overwrite each other (minute-precision filename + no lock) |
| P5 | Export storage | Export .sql files web-accessible without authentication |
| P6 | Export contents | Users export includes hashed passwords (user_pass column) |
| P7 | Restore (F3) | Upload & Restore fires without any confirmation dialog |
| P8 | Restore SQL | SQL import has no transaction wrapper — partial failure corrupts database |
| # | Area | Title |
|---|---|---|
| P9 | Backup contents | "Full Backup" omits wp-content/uploads/ |
| P10 | Backup lifecycle | Backup files persist after plugin deactivation |
| P11 | Backup behavior | Scheduled backup cron registered at activation without user opt-in |
| P12 | Export contents | Options export includes WordPress security keys (auth_key, etc.) |
| P13 | Export naming | Export filenames guessable at minute precision (collision risk) |
| P14 | Export contents | Posts export omits wp_postmeta (custom fields not exported) |
| P15 | Export lifecycle | Export files persist after plugin deactivation |
| P16 | Schedule | Email notification never sent — option key mismatch (write: _backups_email, read: _backup_email) |
| P17 | Schedule | Time dropdown stores 12h AM/PM but displays as 24h — selection never persists correctly |
| P18 | Restore (b2) | Restore overwrites current site without creating pre-operation snapshot |
| P19 | Restore (b7) | Upload & Restore form has no file-size validation — oversized uploads fail silently |
| P20 | Backup behavior | Default blast radius: cron registered before user opts in |
| # | Area | Title |
|---|---|---|
| P21 | UX | Progress bar hardcoded to 100% (purely decorative) |
| P22 | Restore (b3) | No dry-run or preview mode for restore operations |
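Does the naive ;\n SQL split actually break on semicolons inside string literals in practice? A title such as "Hello; World" has no newline after the semicolon and should pass through a strict ;\n split; the case to verify is a literal in which a semicolon ends a line (for example multi-line post content), which would cut the restore_from_zip() import mid-statement.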
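
A minimal sketch of the question, assuming the import splits the dump on the literal two-character sequence `;\n` as the session note describes. The plugin's real code is not reproduced here; `naive_split` is a hypothetical stand-in, and whether raw newlines ever appear inside literals depends on how the plugin writes its dump, which this report does not state. Under that assumption, a semicolon inside a literal only splits a statement when a newline follows it immediately.

```python
def naive_split(sql_dump: str) -> list[str]:
    """Split a dump into statements on ';' followed by a newline."""
    return [part.strip() for part in sql_dump.split(";\n") if part.strip()]

# Semicolon inside a literal, but not at the end of a line.
same_line = "INSERT INTO wp_posts (post_title) VALUES ('Hello; World');\n"

# Semicolon inside a literal that is immediately followed by a raw newline.
end_of_line = "INSERT INTO wp_posts (post_content) VALUES ('first line;\nsecond line');\n"

print(len(naive_split(same_line)))    # statement stays whole
print(len(naive_split(end_of_line)))  # statement is cut into fragments
```

The second case yields two fragments, neither of which is valid SQL. Without a transaction wrapper (P8), every statement executed before that point would stay applied.
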
12 problems found beyond the 10 planted — 2.2× expansion factor. Notable bonus catches:
- P4: Concurrent backup collision (complex cross-feature interaction; closely related to planted Issue 10)
- P11: Cron auto-registration at activation without opt-in
- P19: No upload file-size validation on restore form
- P12: WordPress security keys exported in Options export
- P14: wp_postmeta omitted from Posts export
c2 Reinforcement 3 — skills/tester-mindset/SKILL.md — commit ef3205b
Expanded trigger condition: the scale-sensitive c2 coverage-note literal is now mandatory on any charter that touches an artifact-producing OR scale-sensitive feature, regardless of the charter's primary angle. Previously, the trigger required a planner-generated scale-sensitive tag on the charter. In recon-only runs (no source_path), the planner never generates those tags, so the forcing function never fired.
Filing the source-pattern Problem (~1 turn cost) discharges the c2 requirement without a planner-generated tag. An artifact-producer charter that files Problems on access control and content correctness but carries no c2 coverage note is incomplete.
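
The amended rule can be read as a simple predicate. The sketch below is only an illustration of that reading: the `Charter` fields and function names are hypothetical and do not come from SKILL.md or the harness.

```python
from dataclasses import dataclass

@dataclass
class Charter:
    # All field names are hypothetical, chosen to mirror the wording of the rule.
    name: str
    touches_artifact_producer: bool = False
    touches_scale_sensitive: bool = False
    has_c2_coverage_note: bool = False
    filed_source_pattern_problem: bool = False

def c2_required(charter: Charter) -> bool:
    """The requirement applies regardless of the charter's primary angle or planner tags."""
    return charter.touches_artifact_producer or charter.touches_scale_sensitive

def is_incomplete(charter: Charter) -> bool:
    """Incomplete if c2 is required and neither a note nor a source-pattern Problem discharges it."""
    discharged = charter.has_c2_coverage_note or charter.filed_source_pattern_problem
    return c2_required(charter) and not discharged

# This run's backup-artifact-andlist charter: artifact producer, no c2 note left.
p17_charter = Charter("backup-artifact-andlist", touches_artifact_producer=True)
print(is_incomplete(p17_charter))
```

Read this way, the Pilot 17 charter would have been flagged as incomplete even though the planner produced no scale-sensitive tag.
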
Original wave used playwright-cli-headed. The restore-safety-nets charter failed at driver init:
listen EINVAL: invalid argument /var/folders/.../T/pw-.../cli/34a7a4f30301be81-restore-safety-nets.sock
macOS imposes a 104-character limit on Unix socket paths; the run directory depth pushed the session socket path over the limit. The charter ran 0 of 7 probes.
Recovery: supplementary re-dispatch with chrome-devtools-headless recovered the charter. Added 5 new Problems (P7, P8, P18, P19, P22) including two criticals.
Structural fix needed: the playwright-cli driver should be avoided on macOS for runs where the run directory path is deep, or the daemon socket path should be shortened via a config override.
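
A pre-flight check along these lines could catch the problem before dispatch. This is a sketch under assumptions: it uses the 104-byte limit cited above and reserves one byte for the trailing NUL, and `short_socket_path` is a hypothetical helper. Whether the playwright-cli driver exposes an override for the socket directory is not confirmed by this report.

```python
import os

SUN_PATH_LIMIT = 104  # bytes; the macOS limit cited in this report

def fits_unix_socket_path(path: str, limit: int = SUN_PATH_LIMIT) -> bool:
    """True if the encoded path plus a trailing NUL byte fits within `limit` bytes."""
    return len(os.fsencode(path)) + 1 <= limit

def short_socket_path(session_name: str, base_dir: str = "/tmp") -> str:
    """Hypothetical workaround: place the socket in a shallow directory."""
    return os.path.join(base_dir, f"{session_name}.sock")

# A made-up deep path for illustration only (not the real run directory).
deep_path = "/tmp/" + "nested-run-dir/" * 8 + "restore-safety-nets.sock"
shallow_path = short_socket_path("restore-safety-nets")

print(fits_unix_socket_path(deep_path), fits_unix_socket_path(shallow_path))
```

Running the check before each session starts would let the Manager switch drivers or shorten the path instead of losing a charter at driver init.
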
| Pilot | Date | Stack | Phase 1.5 | Recall | Problems | Cost | Notes |
|---|---|---|---|---|---|---|---|
| 1 | 2026-04-23 | Opus Manager + Opus Testers | off | ~8/10 | — | ~$180+ | First run; Issue 9 caught but under-classified |
| 10 | 2026-04-24 | Sonnet Manager + Sonnet Testers | on (Item B) | 9/10 | — | — | Issue 9 missed (budget deprioritization, no Problem filed) |
| 11 | 2026-04-27 | Sonnet Manager + Opus Planner + Sonnet Testers | on (Item E) | 10/10 | 25 | $102.90 | First clean pass; c2 reinforcement fired |
| 15 | — | Sonnet Manager + Opus Planner + Sonnet Testers | on | 10/10 | — | $55.59 | Baseline for Pilot 17 head-to-head |
| 17 | 2026-04-28 | Sonnet Manager + Sonnet Planner + Haiku Testers | off | 9/10 | 22 | ~$19.30 | Issue 9 chronic miss; c2 R3 shipped |
Cost-floor verdict: confirmed. Haiku+Sonnet stack achieves ~65% reduction vs Pilot 15 at 9/10 recall. Full 10/10 fidelity requires either Phase 1.5 (for planner scale-tagging) or c2 Reinforcement 3 firing (now shipped — expected to close Issue 9 on next Haiku run).
Generated by Magellan — AI-driven exploratory testing harness for WordPress plugins.
Run artifacts: runs/2026-04-28T06-31-56_magellan-backups/ (gitignored — local only)