Magellan Pilot 17 — magellan-backups cost-floor experiment

Run ID: 2026-04-28T06-31-56_magellan-backups
Date: 2026-04-28
Plugin: magellan-backups v1.0.0 (blind greybox — ISSUES.md stripped before run)
Goal: confirm the cost-floor projection with every Opus role removed. Planner and Testers are swapped to cheaper models vs the Pilot 15 baseline; the Manager is unchanged.

Model stack

Role	Pilot 15 (baseline)	Pilot 17 (this run)
Manager	Sonnet 4.6	Sonnet 4.6
Planner	Opus 4.7	Sonnet 4.6
Testers	Sonnet 4.6	Haiku 4.5
Phase 1.5	static-analysis on	disabled (recon-only via `install_path`)

Top-line results

Metric	Pilot 17	Pilot 15	Pilot 11 (10/10 peak)
Recall	9/10	10/10	10/10
Problems	22 (8C / 12M / 2m)	—	25 (8C / 14M / 3m)
Questions	1	—	7
Improvements	23	—	12
Praises	10	—	11
Total cost	~$19.30	~$55.59	$102.90
Cost reduction vs P11	−81%	−46%	baseline
Cost reduction vs P15	−65%	baseline	—
Wallclock (wave)	~17 min	—	~35 min
Sessions	7 (6 wave + 1 supplementary)	6	8
Charters	6	6	8
Driver	playwright-cli-headed (+r2: chrome-devtools-headless)	chrome-devtools-headed	chrome-devtools-headed

Cost breakdown (original wave: $17.22)

Agent	Model	Messages	Input	Output	Cache-5m	Cache-1h	Cache-read	Cost
Manager	claude-sonnet-4-6	136	9,499	84,379	0	1,057,563	12,088,625	$11.27
Planner-Sonnet	claude-sonnet-4-6	18	22	19,733	170,242	0	597,043	$1.11
Recon (tester-haiku)	claude-haiku-4-5	57	91	10,765	137,205	0	2,688,485	$0.49
Tester: restore-safety-nets (failed)	claude-haiku-4-5	39	188	3,456	80,817	0	2,048,288	$0.32
Tester: backup-artifact-andlist	claude-haiku-4-5	64	143	17,991	263,467	0	4,045,245	$0.82
Tester: export-artifact-andlist	claude-haiku-4-5	74	177	20,480	165,230	0	5,258,789	$0.83
Tester: schedule-hypothesis-cluster	claude-haiku-4-5	58	203	10,327	162,771	0	4,509,059	$0.71
Tester: concurrent-trigger-cross	claude-haiku-4-5	58	12,574	8,897	181,504	0	4,906,994	$0.77
Tester: breadth-tour	claude-haiku-4-5	70	286	13,216	190,319	0	5,766,623	$0.88
Wave total		574						$17.22
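The per-agent costs above can be reproduced from the token columns. The sketch below is illustrative only: the per-million-token rates are assumptions (the report does not state them), chosen because they reproduce the Manager and Recon rows. The names RATES and agent_cost are hypothetical.

```python
# USD per million tokens. ASSUMED rates: not stated in the report, picked
# because they reproduce the per-agent costs in the table above.
RATES = {
    "claude-sonnet-4-6": {
        "input": 3.00, "output": 15.00,
        "cache_5m": 3.75, "cache_1h": 6.00, "cache_read": 0.30,
    },
    "claude-haiku-4-5": {
        "input": 1.00, "output": 5.00,
        "cache_5m": 1.25, "cache_1h": 2.00, "cache_read": 0.10,
    },
}


def agent_cost(model, tokens):
    """Return the USD cost of one agent from its per-category token counts."""
    rates = RATES[model]
    return sum(count * rates[category] for category, count in tokens.items()) / 1_000_000


# Token counts copied from the Manager and Recon rows.
manager = agent_cost("claude-sonnet-4-6", {
    "input": 9_499, "output": 84_379,
    "cache_5m": 0, "cache_1h": 1_057_563, "cache_read": 12_088_625,
})
recon = agent_cost("claude-haiku-4-5", {
    "input": 91, "output": 10_765,
    "cache_5m": 137_205, "cache_1h": 0, "cache_read": 2_688_485,
})

print(f"Manager ${manager:.2f}, Recon ${recon:.2f}")  # Manager $11.27, Recon $0.49
```

Under these assumed rates, the Manager's 1h cache writes and cache reads account for most of its $11.27, which is consistent with the category split reported for the wave.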

Supplementary + post-processing (~$2.08 estimated)

Agent	Model	Tokens	Tool uses	Duration	Est. cost
restore-safety-nets-r2 (tester-haiku)	claude-haiku-4-5	108,008	59	628s	~$0.87
Meta-reviewer (general-purpose)	claude-sonnet-4-6	66,889	20	150s	~$0.80
Escape-analysis classifier (general-purpose)	claude-sonnet-4-6	62,018	4	73s	~$0.41
Supplementary total					~$2.08

Grand total: ~$19.30
Cost per planted issue (9 caught): ~$2.14
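The headline figures follow from simple arithmetic on the totals reported above. A minimal check, using only numbers from this report (the helper name reduction_pct is illustrative):

```python
wave_cost = 17.22            # original wave total
supplementary_cost = 2.08    # supplementary + post-processing estimate
grand_total = wave_cost + supplementary_cost
cost_per_issue = grand_total / 9  # 9 planted issues caught


def reduction_pct(new_cost, baseline_cost):
    """Percentage cost reduction of new_cost relative to baseline_cost."""
    return (baseline_cost - new_cost) / baseline_cost * 100


print(f"Grand total: ${grand_total:.2f}")                          # $19.30
print(f"Cost per caught issue: ${cost_per_issue:.2f}")             # $2.14
print(f"P17 vs P15: -{reduction_pct(grand_total, 55.59):.0f}%")    # -65%
print(f"P17 vs P11: -{reduction_pct(grand_total, 102.90):.0f}%")   # -81%
print(f"P15 vs P11: -{reduction_pct(55.59, 102.90):.0f}%")         # -46%
```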

Original wave cost by category

Category	Cost	% of $17.22
Fresh input	$0.04	0.2%
Output	$1.99	11.5%
Cache-create (5m)	$2.12	12.3%
Cache-create (1h)	$6.35	36.9%
Cache-read	$6.73	39.1%

Per-charter session breakdown

Charter	Type	Priority	Status	Turns	Flows	Tool uses	Tokens	Duration	Cost
backup-artifact-andlist	andlist	critical	complete	—	—	33	85,914	220s	$0.82
restore-safety-nets	andlist	critical	failed (socket EINVAL)	0	0	19	62,738	93s	$0.32
export-artifact-andlist	andlist	critical	complete	—	—	41	94,711	379s	$0.83
schedule-hypothesis-cluster	hypothesis-cluster	high	complete	—	—	31	96,666	228s	$0.71
concurrent-trigger-cross	cross-feature	high	complete	—	—	31	105,565	175s	$0.77
breadth-tour	breadth	high	complete	8/25	4/12	37	108,418	234s	$0.88
restore-safety-nets-r2 (supplementary)	andlist	critical	complete	14/15	5/6	59	108,008	628s	~$0.87

Recall: 9/10

#	Issue	Verdict	Session
1	Progress bar always shows 100%	caught-exact	breadth-tour
2	Schedule time format mismatch (12h/24h)	caught-exact	schedule-hypothesis-cluster
3	Notification email option-name typo	caught-exact	schedule-hypothesis-cluster
4	User export includes hashed passwords	caught-exact	backup-artifact-andlist + export-artifact-andlist
5	Uploads directory missing from backup	caught-exact	backup-artifact-andlist
6	No pre-restore backup	caught-exact	restore-safety-nets-r2
7	Backups publicly accessible via URL	caught-exact	backup-artifact-andlist + breadth-tour
8	Corrupt restore truncates database	caught-semantically	restore-safety-nets-r2 (deeper: no transaction + naive SQL split on `;\n`)
9	Large database causes memory exhaustion	missed	— (third occurrence across pilots, not consecutive; chronic c2 class)
10	Concurrent backups corrupt ZIP file	caught-exact	concurrent-trigger-cross

The miss: Issue 9

Root cause: Phase 1.5 static analysis was disabled (the run supplied install_path but no source_path), so the planner never tagged backup features as scale-sensitive. Without that tag, the c2 forcing function did not fire. The backup-artifact-andlist charter focused entirely on the artifact-exposure angle, never ran scale probes, and left no c2 coverage note.

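This is the same structural gap as in Pilot 10 (explicit budget deprioritization, no Problem filed) and Pilot 1 (caught but under-classified as minor). It is the third occurrence across the magellan-backups pilots, though not consecutive: Pilots 11 and 15 both caught Issue 9. An amendment has shipped (see "Amendment shipped" below).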

PQIP breakdown — 22 Problems, 1 Question, 23 Improvements, 10 Praises

Critical Problems (8)

#	Area	Title
P1	Backup storage	Backup ZIP files web-accessible without authentication
P2	Backup contents	database.sql exposes user password hashes (bcrypt)
P3	Backup access	Backup files downloadable via direct /wp-content URL
P4	Backup naming	Concurrent backups overwrite each other (minute-precision filename + no lock)
P5	Export storage	Export .sql files web-accessible without authentication
P6	Export contents	Users export includes hashed passwords (user_pass column)
P7	Restore (F3)	Upload & Restore fires without any confirmation dialog
P8	Restore SQL	SQL import has no transaction wrapper — partial failure corrupts database
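P4 attributes the collision to minute-precision filenames with no lock. A minimal sketch of why that combination collides follows. The filename pattern and the function name backup_filename are hypothetical; the plugin's actual naming scheme is not shown in this report.

```python
from datetime import datetime


def backup_filename(started_at):
    """Build a backup name with minute precision (hypothetical pattern)."""
    return started_at.strftime("backup-%Y-%m-%d-%H%M.zip")


# Two backups triggered 35 seconds apart, inside the same minute.
first = datetime(2026, 4, 28, 6, 31, 5)
second = datetime(2026, 4, 28, 6, 31, 40)

# Both resolve to the same path, so without a lock the second write
# targets the file the first is still producing.
print(backup_filename(first) == backup_filename(second))  # True
```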

Major Problems (12)

#	Area	Title
P9	Backup contents	"Full Backup" omits wp-content/uploads/
P10	Backup lifecycle	Backup files persist after plugin deactivation
P11	Backup behavior	Scheduled backup cron registered at activation without user opt-in
P12	Export contents	Options export includes WordPress security keys (auth_key, etc.)
P13	Export naming	Export filenames guessable at minute precision (collision risk)
P14	Export contents	Posts export omits wp_postmeta (custom fields not exported)
P15	Export lifecycle	Export files persist after plugin deactivation
P16	Schedule	Email notification never sent — option key mismatch (write: `_backups_email`, read: `_backup_email`)
P17	Schedule	Time dropdown stores 12h AM/PM but displays as 24h — selection never persists correctly
P18	Restore (b2)	Restore overwrites current site without creating pre-operation snapshot
P19	Restore (b7)	Upload & Restore form has no file-size validation — oversized uploads fail silently
P20	Backup behavior	Default blast radius: cron registered before user opts in

Minor Problems (2)

#	Area	Title
P21	UX	Progress bar hardcoded to 100% (purely decorative)
P22	Restore (b3)	No dry-run or preview mode for restore operations

Question (1)

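Does the naive ;\n SQL split actually break on semicolons inside string literals in practice? A post titled "Hello; World" contains no ;\n sequence and would pass through a strict split, but a string literal with a semicolon at the end of a line (for example, multi-line post content) would cut the restore_from_zip() import mid-statement.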
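The question can be narrowed with a small experiment. This is a sketch under assumptions: it mimics a split on the exact two-character sequence ";" plus newline in Python rather than in the plugin's own code, and the function name naive_split and both sample statements are illustrative. Under a strict ;\n split, an inline semicolon such as the one in "Hello; World" is not a delimiter; a semicolon followed by a raw newline inside a literal is.

```python
def naive_split(dump):
    """Split a SQL dump on the literal sequence ';' + newline, dropping blanks."""
    return [stmt for stmt in dump.split(";\n") if stmt.strip()]


# Semicolon inside a literal, but not at the end of a line.
inline = "INSERT INTO wp_posts (post_title) VALUES ('Hello; World');\n"

# Semicolon immediately followed by a raw newline inside a literal.
multiline = "INSERT INTO wp_posts (post_content) VALUES ('line one;\nline two');\n"

print(len(naive_split(inline)))     # 1: statement survives intact
print(len(naive_split(multiline)))  # 2: statement is cut in half
```

Whether the second case occurs in practice depends on whether the dump writer emits raw newlines inside string literals or escapes them, which this report does not establish.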

Bonus findings (not in ISSUES.md, filed by Testers)

12 problems found beyond the 10 planted — 2.2× expansion factor. Notable bonus catches:

P4: Concurrent backup collision (not planted — complex cross-feature interaction)
P11: Cron auto-registration at activation without opt-in
P19: No upload file-size validation on restore form
P12: WordPress security keys exported in Options export
P14: wp_postmeta omitted from Posts export

Amendment shipped

c2 Reinforcement 3 — skills/tester-mindset/SKILL.md — commit ef3205b

Expanded trigger condition: the scale-sensitive c2 coverage-note literal is now mandatory on any charter that touches an artifact-producing OR scale-sensitive feature, regardless of the charter's primary angle. Previously, the trigger required a planner-generated scale-sensitive tag on the charter. In recon-only runs (no source_path), the planner never generates those tags, so the forcing function never fired.

Filing the source-pattern Problem (~1 turn cost) discharges the c2 requirement without a planner-generated tag. An artifact-producer charter that files Problems on access control and content correctness but carries no c2 coverage note is incomplete.
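The change in trigger condition can be sketched as two predicates. The Charter fields and function names below are hypothetical and are not taken from SKILL.md; they only illustrate the before/after logic described above.

```python
from dataclasses import dataclass


@dataclass
class Charter:
    name: str
    planner_scale_tag: bool          # planner-generated scale-sensitive tag
    touches_artifact_producer: bool  # charter exercises an artifact-producing feature
    touches_scale_sensitive: bool    # charter exercises a scale-sensitive feature


def c2_note_required_before(charter):
    """Old rule: the note is required only when the planner tagged the charter."""
    return charter.planner_scale_tag


def c2_note_required_after(charter):
    """Reinforcement 3: required regardless of planner tags or primary angle."""
    return charter.touches_artifact_producer or charter.touches_scale_sensitive


# Recon-only run: the planner produced no tag for this charter.
charter = Charter(
    name="backup-artifact-andlist",
    planner_scale_tag=False,
    touches_artifact_producer=True,
    touches_scale_sensitive=False,
)

print(c2_note_required_before(charter))  # False: forcing function never fires
print(c2_note_required_after(charter))   # True: coverage note is now mandatory
```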

Driver incident

Original wave used playwright-cli-headed. The restore-safety-nets charter failed at driver init:

listen EINVAL: invalid argument /var/folders/.../T/pw-.../cli/34a7a4f30301be81-restore-safety-nets.sock

macOS limits Unix socket paths to 104 bytes (the size of sockaddr_un.sun_path); the depth of the run directory pushed the session socket path over that limit. The charter ran 0 of 7 probes.
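A pre-flight check could catch this before a charter is dispatched. The sketch below is an assumption about how such a guard might look, not part of the harness: socket_path_fits is a hypothetical helper, both paths are made up (the real path in the error above is elided), and one byte is conservatively reserved for the terminator.

```python
MACOS_SUN_PATH_BYTES = 104  # sizeof(sockaddr_un.sun_path) on macOS


def socket_path_fits(path, limit=MACOS_SUN_PATH_BYTES):
    """Return True if the encoded path plus a trailing NUL fits within limit bytes."""
    return len(path.encode("utf-8")) < limit


# Hypothetical paths for illustration only.
short_path = "/tmp/pw/restore-safety-nets.sock"
deep_path = "/tmp/" + "nested-run-dir/" * 7 + "restore-safety-nets.sock"

print(socket_path_fits(short_path))  # True
print(socket_path_fits(deep_path))   # False: 134 bytes, over the limit
```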

Recovery: a supplementary re-dispatch with chrome-devtools-headless recovered the charter and added 5 new Problems (P7, P8, P18, P19, P22), including two criticals (P7 and P8).

Structural fix needed: on macOS, either avoid the playwright-cli driver for runs with a deep run directory path, or shorten the daemon socket path via a config override.

Cross-pilot arc — magellan-backups

Pilot	Date	Stack	Phase 1.5	Recall	Problems	Cost	Notes
1	2026-04-23	Opus Manager + Opus Testers	off	~8/10	—	~$180+	First run; Issue 9 caught but under-classified
10	2026-04-24	Sonnet Manager + Sonnet Testers	on (Item B)	9/10	—	—	Issue 9 missed (budget deprioritization, no Problem filed)
11	2026-04-27	Sonnet Manager + Opus Planner + Sonnet Testers	on (Item E)	10/10	25	$102.90	First clean pass; c2 reinforcement fired
15	—	Sonnet Manager + Opus Planner + Sonnet Testers	on	10/10	—	$55.59	Baseline for Pilot 17 head-to-head
17	2026-04-28	Sonnet Manager + Sonnet Planner + Haiku Testers	off	9/10	22	~$19.30	Issue 9 chronic miss; c2 R3 shipped

Cost-floor verdict: confirmed. The Sonnet + Haiku stack achieves a ~65% cost reduction vs Pilot 15 at 9/10 recall. Full 10/10 recall requires either Phase 1.5 (for planner scale-tagging) or c2 Reinforcement 3 firing (now shipped; expected to close Issue 9 on the next Haiku run).

Generated by Magellan — AI-driven exploratory testing harness for WordPress plugins.
Run artifacts: runs/2026-04-28T06-31-56_magellan-backups/ (gitignored — local only)

alopezari/pilot-17-magellan-backups.md
