@alopezari
Created April 28, 2026 07:21

Magellan Pilot 17 - magellan-backups cost-floor: Manager Sonnet + Planner Sonnet + Testers Haiku (9/10 recall, ~$19.30, −65% vs Pilot 15)

Magellan Pilot 17 — magellan-backups cost-floor experiment

Run ID: 2026-04-28T06-31-56_magellan-backups
Date: 2026-04-28
Plugin: magellan-backups v1.0.0 (blind greybox — ISSUES.md stripped before run)
Goal: confirm cost-floor projection with all Opus off. Full model stack swap vs Pilot 15 baseline.


Model stack

| Role | Pilot 15 (baseline) | Pilot 17 (this run) |
| --- | --- | --- |
| Manager | Sonnet 4.6 | Sonnet 4.6 |
| Planner | Opus 4.7 | Sonnet 4.6 |
| Testers | Sonnet 4.6 | Haiku 4.5 |
| Phase 1.5 static-analysis | on | disabled (recon-only via install_path) |

Top-line results

Metric Pilot 17 Pilot 15 Pilot 11 (10/10 peak)
Recall 9/10 10/10 10/10
Problems 22 (8C / 12M / 2m) 25 (8C / 14M / 3m)
Questions 1 7
Improvements 23 12
Praises 10 11
Total cost ~$19.30 ~$55.59 $102.90
Cost reduction vs P11 −81% −46% baseline
Cost reduction vs P15 −65% baseline
Wallclock (wave) ~17 min ~35 min
Sessions 7 (6 wave + 1 supplementary) 6 8
Charters 6 6 8
Driver playwright-cli-headed (+r2: chrome-devtools-headless) chrome-devtools-headed chrome-devtools-headed

Cost breakdown (original wave: $17.22)

| Agent | Model | Messages | Input | Output | Cache-5m | Cache-1h | Cache-read | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Manager | claude-sonnet-4-6 | 136 | 9,499 | 84,379 | 0 | 1,057,563 | 12,088,625 | $11.27 |
| Planner-Sonnet | claude-sonnet-4-6 | 18 | 22 | 19,733 | 170,242 | 0 | 597,043 | $1.11 |
| Recon (tester-haiku) | claude-haiku-4-5 | 57 | 91 | 10,765 | 137,205 | 0 | 2,688,485 | $0.49 |
| Tester: restore-safety-nets (failed) | claude-haiku-4-5 | 39 | 188 | 3,456 | 80,817 | 0 | 2,048,288 | $0.32 |
| Tester: backup-artifact-andlist | claude-haiku-4-5 | 64 | 143 | 17,991 | 263,467 | 0 | 4,045,245 | $0.82 |
| Tester: export-artifact-andlist | claude-haiku-4-5 | 74 | 177 | 20,480 | 165,230 | 0 | 5,258,789 | $0.83 |
| Tester: schedule-hypothesis-cluster | claude-haiku-4-5 | 58 | 203 | 10,327 | 162,771 | 0 | 4,509,059 | $0.71 |
| Tester: concurrent-trigger-cross | claude-haiku-4-5 | 58 | 12,574 | 8,897 | 181,504 | 0 | 4,906,994 | $0.77 |
| Tester: breadth-tour | claude-haiku-4-5 | 70 | 286 | 13,216 | 190,319 | 0 | 5,766,623 | $0.88 |
| Wave total | | 574 | | | | | | $17.22 |
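
The per-agent costs in this table can be reproduced from the token counts. The sketch below is a minimal illustration, not the harness's actual accounting code: the per-million-token rates are assumptions (the report does not state them), chosen because they reproduce the Manager, Planner and Recon rows to the cent. The function and dictionary names are hypothetical.

```python
# Assumed USD rates per million tokens. These are NOT stated in the report;
# they are illustrative values that happen to reproduce the table rows.
RATES = {
    "claude-sonnet-4-6": {
        "input": 3.00, "output": 15.00,
        "cache_5m": 3.75, "cache_1h": 6.00, "cache_read": 0.30,
    },
    "claude-haiku-4-5": {
        "input": 1.00, "output": 5.00,
        "cache_5m": 1.25, "cache_1h": 2.00, "cache_read": 0.10,
    },
}


def session_cost(model, usage):
    """Return the USD cost of one session from its token counts."""
    rates = RATES[model]
    return sum(usage[kind] * rates[kind] for kind in rates) / 1_000_000


# Token counts copied from the table rows.
manager = {"input": 9_499, "output": 84_379, "cache_5m": 0,
           "cache_1h": 1_057_563, "cache_read": 12_088_625}
planner = {"input": 22, "output": 19_733, "cache_5m": 170_242,
           "cache_1h": 0, "cache_read": 597_043}
recon = {"input": 91, "output": 10_765, "cache_5m": 137_205,
         "cache_1h": 0, "cache_read": 2_688_485}

print(round(session_cost("claude-sonnet-4-6", manager), 2))  # 11.27
print(round(session_cost("claude-sonnet-4-6", planner), 2))  # 1.11
print(round(session_cost("claude-haiku-4-5", recon), 2))     # 0.49
```

Under these assumed rates, most of the Manager's cost comes from cache traffic (1h cache writes and cache reads) rather than from fresh input or output tokens.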

Supplementary + post-processing (~$2.08 estimated)

| Agent | Model | Tokens | Tool uses | Duration | Est. cost |
| --- | --- | --- | --- | --- | --- |
| restore-safety-nets-r2 (tester-haiku) | claude-haiku-4-5 | 108,008 | 59 | 628s | ~$0.87 |
| Meta-reviewer (general-purpose) | claude-sonnet-4-6 | 66,889 | 20 | 150s | ~$0.80 |
| Escape-analysis classifier (general-purpose) | claude-sonnet-4-6 | 62,018 | 4 | 73s | ~$0.41 |
| Supplementary total | | | | | ~$2.08 |

Grand total: ~$19.30
Cost per planted issue (9 caught): ~$2.14
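
The headline figures follow from simple arithmetic on numbers quoted in this report. The short sketch below recomputes them as a sanity check; the variable and function names are illustrative only.

```python
# All inputs are figures quoted in this report.
wave_total = 17.22
supplementary = [0.87, 0.80, 0.41]   # r2 tester, meta-reviewer, classifier
pilot_15_cost = 55.59
pilot_11_cost = 102.90
issues_caught = 9


def pct_change(new, old):
    """Percentage change from old to new (negative means cheaper)."""
    return (new - old) / old * 100


grand_total = wave_total + sum(supplementary)
cost_per_issue = grand_total / issues_caught

print(f"{grand_total:.2f}")                           # 19.30
print(f"{cost_per_issue:.2f}")                        # 2.14
print(round(pct_change(grand_total, pilot_15_cost)))  # -65
print(round(pct_change(grand_total, pilot_11_cost)))  # -81
```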


Original wave cost by category

| Category | Cost | % of $17.22 |
| --- | --- | --- |
| Fresh input | $0.04 | 0.2% |
| Output | $1.99 | 11.5% |
| Cache-create (5m) | $2.12 | 12.3% |
| Cache-create (1h) | $6.35 | 36.9% |
| Cache-read | $6.73 | 39.1% |

Per-charter session breakdown

| Charter | Type | Priority | Status | Turns | Flows | Tool uses | Tokens | Duration | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| backup-artifact-andlist | andlist | critical | complete | | | 33 | 85,914 | 220s | $0.82 |
| restore-safety-nets | andlist | critical | failed (socket EINVAL) | 0 | 0 | 19 | 62,738 | 93s | $0.32 |
| export-artifact-andlist | andlist | critical | complete | | | 41 | 94,711 | 379s | $0.83 |
| schedule-hypothesis-cluster | hypothesis-cluster | high | complete | | | 31 | 96,666 | 228s | $0.71 |
| concurrent-trigger-cross | cross-feature | high | complete | | | 31 | 105,565 | 175s | $0.77 |
| breadth-tour | breadth | high | complete | 8/25 | 4/12 | 37 | 108,418 | 234s | $0.88 |
| restore-safety-nets-r2 (supplementary) | andlist | critical | complete | 14/15 | 5/6 | 59 | 108,008 | 628s | ~$0.87 |

Recall: 9/10

| # | Issue | Verdict | Session |
| --- | --- | --- | --- |
| 1 | Progress bar always shows 100% | caught-exact | breadth-tour |
| 2 | Schedule time format mismatch (12h/24h) | caught-exact | schedule-hypothesis-cluster |
| 3 | Notification email option-name typo | caught-exact | schedule-hypothesis-cluster |
| 4 | User export includes hashed passwords | caught-exact | backup-artifact-andlist + export-artifact-andlist |
| 5 | Uploads directory missing from backup | caught-exact | backup-artifact-andlist |
| 6 | No pre-restore backup | caught-exact | restore-safety-nets-r2 |
| 7 | Backups publicly accessible via URL | caught-exact | backup-artifact-andlist + breadth-tour |
| 8 | Corrupt restore truncates database | caught-semantically | restore-safety-nets-r2 (deeper: no transaction + naive SQL split on ;\n) |
| 9 | Large database causes memory exhaustion | missed | none (third consecutive miss; chronic c2 class) |
| 10 | Concurrent backups corrupt ZIP file | caught-exact | concurrent-trigger-cross |

The miss: Issue 9

Root cause: Phase 1.5 static analysis was disabled (install_path, no source_path) so the planner never tagged backup features as scale-sensitive. Without that tag, the c2 forcing function didn't fire. The backup-artifact-andlist charter focused entirely on the artifact-exposure angle and never turned to scale probes — and left no c2 coverage note.

Same structural miss as Pilot 10 (explicit budget-deprioritization, no Problem filed) and Pilot 1 (under-classified as minor). Third consecutive occurrence. Amendment shipped (see below).


PQIP breakdown — 22 Problems, 1 Question, 23 Improvements, 10 Praises

Critical Problems (8)

| # | Area | Title |
| --- | --- | --- |
| P1 | Backup storage | Backup ZIP files web-accessible without authentication |
| P2 | Backup contents | database.sql exposes user password hashes (bcrypt) |
| P3 | Backup access | Backup files downloadable via direct /wp-content URL |
| P4 | Backup naming | Concurrent backups overwrite each other (minute-precision filename + no lock) |
| P5 | Export storage | Export .sql files web-accessible without authentication |
| P6 | Export contents | Users export includes hashed passwords (user_pass column) |
| P7 | Restore (F3) | Upload & Restore fires without any confirmation dialog |
| P8 | Restore SQL | SQL import has no transaction wrapper; partial failure corrupts database |

Major Problems (12)

| # | Area | Title |
| --- | --- | --- |
| P9 | Backup contents | "Full Backup" omits wp-content/uploads/ |
| P10 | Backup lifecycle | Backup files persist after plugin deactivation |
| P11 | Backup behavior | Scheduled backup cron registered at activation without user opt-in |
| P12 | Export contents | Options export includes WordPress security keys (auth_key, etc.) |
| P13 | Export naming | Export filenames guessable at minute precision (collision risk) |
| P14 | Export contents | Posts export omits wp_postmeta (custom fields not exported) |
| P15 | Export lifecycle | Export files persist after plugin deactivation |
| P16 | Schedule | Email notification never sent: option key mismatch (write: _backups_email, read: _backup_email) |
| P17 | Schedule | Time dropdown stores 12h AM/PM but displays as 24h; selection never persists correctly |
| P18 | Restore (b2) | Restore overwrites current site without creating pre-operation snapshot |
| P19 | Restore (b7) | Upload & Restore form has no file-size validation; oversized uploads fail silently |
| P20 | Backup behavior | Default blast radius: cron registered before user opts in |

Minor Problems (2)

| # | Area | Title |
| --- | --- | --- |
| P21 | UX | Progress bar hardcoded to 100% (purely decorative) |
| P22 | Restore (b3) | No dry-run or preview mode for restore operations |

Question (1)

Does the naive ;\n SQL split actually break on semicolons inside string literals in practice? A post titled "Hello; World" would corrupt the restore_from_zip() import midway.
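
One way to probe this question offline is to model the splitter. The sketch below assumes the import splits the dump on the literal two-character sequence `;\n` (as the finding describes) and compares it with a simple quote-aware splitter. Both functions are hypothetical illustrations in Python, not the plugin's restore_from_zip() code, and the sample dumps are invented. Under that assumption, a title such as 'Hello; World' survives because its semicolon is followed by a space, while a literal that contains a semicolon followed by a raw newline is cut in two. Whether real dumps contain such literals depends on how the plugin escapes newlines when it writes database.sql, which this report does not establish.

```python
def naive_split(sql):
    """Split a dump on ';' + newline, ignoring quoting."""
    return [part for part in sql.split(";\n") if part.strip()]


def quote_aware_split(sql):
    """Split on ';' only when outside a single-quoted string literal."""
    statements, buf, in_quote = [], [], False
    i = 0
    while i < len(sql):
        ch = sql[i]
        if in_quote:
            buf.append(ch)
            if ch == "\\" and i + 1 < len(sql):
                buf.append(sql[i + 1])   # keep backslash-escaped character
                i += 1
            elif ch == "'":
                if i + 1 < len(sql) and sql[i + 1] == "'":
                    buf.append("'")      # doubled quote stays inside literal
                    i += 1
                else:
                    in_quote = False
        elif ch == "'":
            in_quote = True
            buf.append(ch)
        elif ch == ";":
            statement = "".join(buf).strip()
            if statement:
                statements.append(statement)
            buf = []
        else:
            buf.append(ch)
        i += 1
    tail = "".join(buf).strip()
    if tail:
        statements.append(tail)
    return statements


# Invented sample dumps, two INSERT statements each.
title_dump = (
    "INSERT INTO wp_posts (post_title) VALUES ('Hello; World');\n"
    "INSERT INTO wp_posts (post_title) VALUES ('Second');\n"
)
content_dump = (
    "INSERT INTO wp_posts (post_content) VALUES ('a { color: red;\n}');\n"
    "INSERT INTO wp_posts (post_title) VALUES ('Second');\n"
)

print(len(naive_split(title_dump)))          # 2
print(len(naive_split(content_dump)))        # 3 (first statement cut in two)
print(len(quote_aware_split(content_dump)))  # 2
```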


Bonus findings (not in ISSUES.md, filed by Testers)

12 problems found beyond the 10 planted — 2.2× expansion factor. Notable bonus catches:

  • P4: Concurrent backup collision (not planted — complex cross-feature interaction)
  • P11: Cron auto-registration at activation without opt-in
  • P19: No upload file-size validation on restore form
  • P12: WordPress security keys exported in Options export
  • P14: wp_postmeta omitted from Posts export
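
To make the P4 collision concrete: any filename derived from a timestamp truncated to the minute is identical for two backups started within the same minute. The sketch below illustrates this; the filename pattern is hypothetical, since the report only states that the name has minute precision.

```python
from datetime import datetime, timedelta


def backup_filename(started_at):
    """Build a backup name with minute precision (hypothetical pattern)."""
    return started_at.strftime("backup-%Y%m%d-%H%M.zip")


first = datetime(2026, 4, 28, 6, 31, 5)
second = first + timedelta(seconds=20)   # second trigger, same minute

print(backup_filename(first))                             # backup-20260428-0631.zip
print(backup_filename(first) == backup_filename(second))  # True
```

With no lock around the write, both runs would target the same path.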

Amendment shipped

c2 Reinforcement 3 (skills/tester-mindset/SKILL.md, commit ef3205b)

Expanded trigger condition: the scale-sensitive c2 coverage-note literal is now mandatory on any charter that touches an artifact-producing OR scale-sensitive feature, regardless of the charter's primary angle. Previously, the trigger required a planner-generated scale-sensitive tag on the charter. In recon-only runs (no source_path), the planner never generates those tags, so the forcing function never fired.

Filing the source-pattern Problem (~1 turn cost) discharges the c2 requirement without a planner-generated tag. An artifact-producer charter that files Problems on access control and content correctness but carries no c2 coverage note is incomplete.
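
The change in trigger condition can be sketched as two predicates. This is a minimal model under stated assumptions, not the contents of SKILL.md: the `Charter` fields and function names are hypothetical, and it assumes a planner-generated tag still triggers the requirement after the amendment.

```python
from dataclasses import dataclass, field


@dataclass
class Charter:
    """Hypothetical model of a charter; field names are illustrative."""
    name: str
    planner_tags: set = field(default_factory=set)
    touches_artifact_producer: bool = False
    touches_scale_sensitive: bool = False
    has_c2_coverage_note: bool = False


def c2_required_before(charter):
    """Old trigger: only a planner-generated tag forces the c2 note."""
    return "scale-sensitive" in charter.planner_tags


def c2_required_after(charter):
    """New trigger: the feature type forces the note, tag or no tag."""
    return (
        charter.touches_artifact_producer
        or charter.touches_scale_sensitive
        or "scale-sensitive" in charter.planner_tags
    )


def is_complete(charter, required):
    """A charter is incomplete if the note is required but missing."""
    return charter.has_c2_coverage_note or not required(charter)


# Recon-only run: the planner produced no tags for the backup charter.
backup = Charter("backup-artifact-andlist", touches_artifact_producer=True)

print(is_complete(backup, c2_required_before))  # True  (miss goes unnoticed)
print(is_complete(backup, c2_required_after))   # False (charter flagged)
```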


Driver incident

Original wave used playwright-cli-headed. The restore-safety-nets charter failed at driver init:

listen EINVAL: invalid argument /var/folders/.../T/pw-.../cli/34a7a4f30301be81-restore-safety-nets.sock

macOS imposes a 104-character limit on Unix socket paths; the run directory depth pushed the session socket path over the limit. The charter ran 0 of 7 probes.

Recovery: a supplementary re-dispatch with chrome-devtools-headless recovered the charter. The re-run added 5 new Problems (P7, P8, P18, P19, P22), including two criticals (P7 and P8).

Structural fix needed: the playwright-cli driver should be avoided on macOS for runs where the run directory path is deep, or the daemon socket path should be shortened via a config override.
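
A pre-flight check along these lines could catch the problem before a session is dispatched. This is a sketch, not part of the harness or of Playwright: the function name is hypothetical, and it assumes the 104-byte limit includes the terminating NUL byte, so the usable path is at most 103 bytes.

```python
import os

# Size of sun_path in sockaddr_un on macOS, as cited above.
# Assumption: the limit includes the trailing NUL byte.
MACOS_SUN_PATH_MAX = 104


def socket_path_fits(path, limit=MACOS_SUN_PATH_MAX):
    """Return True if the encoded socket path fits within the limit."""
    return len(os.fsencode(path)) < limit


# Synthetic paths for illustration; the real run path is elided in the log.
deep = "/var/folders/" + "x" * 100 + "/restore-safety-nets.sock"
short = "/tmp/pw/restore-safety-nets.sock"

print(socket_path_fits(deep))   # False
print(socket_path_fits(short))  # True
```

Running such a check against the planned socket path for each charter would let the Manager fall back to another driver before the session starts rather than after it fails.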


Cross-pilot arc — magellan-backups

| Pilot | Date | Stack | Phase 1.5 | Recall | Problems | Cost | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2026-04-23 | Opus Manager + Opus Testers | off | ~8/10 | | ~$180+ | First run; Issue 9 caught but under-classified |
| 10 | 2026-04-24 | Sonnet Manager + Sonnet Testers | on (Item B) | 9/10 | | | Issue 9 missed (budget deprioritization, no Problem filed) |
| 11 | 2026-04-27 | Sonnet Manager + Opus Planner + Sonnet Testers | on (Item E) | 10/10 | 25 | $102.90 | First clean pass; c2 reinforcement fired |
| 15 | | Sonnet Manager + Opus Planner + Sonnet Testers | on | 10/10 | | $55.59 | Baseline for Pilot 17 head-to-head |
| 17 | 2026-04-28 | Sonnet Manager + Sonnet Planner + Haiku Testers | off | 9/10 | 22 | ~$19.30 | Issue 9 chronic miss; c2 R3 shipped |

Cost-floor verdict: confirmed. Haiku+Sonnet stack achieves ~65% reduction vs Pilot 15 at 9/10 recall. Full 10/10 fidelity requires either Phase 1.5 (for planner scale-tagging) or c2 Reinforcement 3 firing (now shipped — expected to close Issue 9 on next Haiku run).


Generated by Magellan — AI-driven exploratory testing harness for WordPress plugins.
Run artifacts: runs/2026-04-28T06-31-56_magellan-backups/ (gitignored — local only)
