alopezari/coverage-gaps.md

Created April 29, 2026 14:11

Magellan Pilot 18c — magellan-backups 1.0.0 | Sonnet Manager + Sonnet Planner + Haiku Testers | playwright-cli-headed | 9/10 recall | $18.59

coverage-gaps.md

Coverage gaps — magellan-backups 2026-04-29T13-31-55_magellan-backups

Summary

3 hypotheses silently skipped (CT-2, CT-3, SE-4 never empirically probed)
6 surfaces from recon/coverage not addressed (F6 plugin lifecycle — breadth-tour skipped entirely)
0 AND-list items scored on aggregate when per-path was needed
1 round-trip probe missing (export × re-import — SE-4 deprioritized without empirical discharge)
2 Questions that look like Amendment I drift (b4/b7 rollback from source; SCH-5 email from source)
Forcing-function strings missing from 3 sessions

Gaps by check

Check 1: Hypothesis coverage

backup-artifact-andlist — all 8 hypotheses (a1–a6, a3-selective-export, a-progress) recorded in hypotheses_status with probed verdicts. Scale-sensitive c2 fallback correctly filed in coverage_notes. No silent skips.

restore-destructive-andlist — b1–b7 and b-default-scope all recorded. b4 and b7 verdicts lean on source inspection (see Check 6 below). No silent skips in hypotheses_status terms, but empirical depth is shallow.

selective-export-cluster — SE-4 (re-import duplicate risk) marked deprioritized with only source-pattern evidence, no empirical probe.

[selective-export-cluster] SE-4 (re-import creates duplicates) — deprioritized with source-pattern rationale only; charter required empirical probe OR c2-style fallback Problem filing. Coverage_notes do not contain the mandatory c2/fallback literal for this item. Severity: HIGH — export×re-import round-trip is a key correctness guarantee for a selective export feature.

schedule-feature-cluster — SCH-6 (weekly day-of-week) marked inconclusive — acceptable given UI inspection confirms no selector. All other hypotheses probed.

concurrent-trigger-cross-feature — CT-2 and CT-3 both marked deprioritized after CT-1 budget exhaustion.

[concurrent-trigger-cross-feature] CT-2 (lock/mutex existence) — deprioritized; no source grep or empirical lock probe attempted; no mandatory fallback Problem filed. Severity: HIGH — the session filed a Question about concurrent locks but never confirmed or denied the lock mechanism's existence.
[concurrent-trigger-cross-feature] CT-3 (double-click JS protection) — deprioritized; low-severity standalone, but the cross-feature seam implication (JS lock doesn't protect cron path) was noted.

Check 2: Static-analysis hypothesis coverage

Check 2a (hypothesis coverage): static-analysis.md does not exist (source_path not set, Phase 1.5 skipped). Per invocation instructions, Check 2 surface-map parity is skipped. No "I bet" items appear in charter files beyond the standard charter hypothesis blocks — no 2a gap.

Check 2b: N/A (no static-analysis.md).

Check 3: Recon-flagged surface coverage

Recon surfaces:

S1 (Delete has no confirmation) — probed and refuted by restore-destructive-andlist (b1). Correctly handled.
S2 (pre-created backup at first load) — probed via a5 in backup-artifact-andlist (cron on activation confirmed). Covered.
S3 (frequency dropdown: only daily/weekly, no day-of-week) — probed via SCH-6 in schedule-feature-cluster (inconclusive disposition noted). Covered.
S4 (no progress feedback on backup creation) — probed via a-progress in backup-artifact-andlist (hardcoded 100% confirmed). Also a breadth-tour BT-F1-3 obligation — but breadth-tour was skipped.
S5 (no disk space warnings or retention policies) — probed via a4/P6 in backup-artifact-andlist (indefinite accumulation confirmed). breadth-tour BT-F5-3 also targeted this but was skipped.

Gap: breadth-tour was the only charter targeting 18 breadth-level probes across F1–F6 including all recon S2/S4/S5 breadth dispositions and the ENTIRE F6 (plugin lifecycle) surface. breadth-tour has status skipped_this_wave with no explanation in the manifest.

[breadth-tour] F6 plugin lifecycle (activation/deactivation/uninstall, cron cleanup, admin menu visibility, PHP notices) — zero sessions cover this surface. BT-F6-1, BT-F6-2, BT-F6-3 never executed. Severity: HIGH — F6 lifecycle is exclusively assigned to breadth-tour; no other charter covers activation/deactivation probe, zip-slip upload security test (BT-F2-3), file-type validation (BT-F2-1), accessibility labels (BT-F4-3), or admin notice feedback probes (BT-F3-3, BT-F2-2).
[breadth-tour skipped] BT-F2-3 (zip-slip / path traversal on upload) — never probed. No charter other than breadth-tour covers this. Severity: HIGH — path traversal on a file upload is a critical security probe class.
[breadth-tour skipped] BT-F2-1 (file-type validation on Upload & Restore) — never probed. Non-ZIP upload acceptance is untested. Severity: MEDIUM.
[breadth-tour skipped] BT-F4-1 (Export Selected with 0 checkboxes) — never probed by any session. Severity: MEDIUM.

Check 4: AND-list aggregate vs per-handler

backup-artifact-andlist: all anchors (a1–a6, multi-surface extensions) enumerated as discrete hypotheses_status entries. No aggregate scoring detected.

restore-destructive-andlist: b1 through b7 and b-default-scope enumerated discretely. b1 correctly split into two sub-verdicts (Delete and Restore). Per-handler scoring applied correctly.

No AND-list aggregate-scoring gaps detected. Note: b6 (capability gate) was probed against the admin page and a subscriber curl test — single-path. The plugin has both admin-post + wp_ajax handler surfaces per the restore charter. However, b6 only asserts manage_options gate existence (confirmed via source inspection and subscriber curl), not full per-AJAX-handler enumeration. Low-severity given source-verified gate.

Check 5: Round-trip / compositional probes

Export × Re-import (SE-4) — The selective export feature explicitly needs a round-trip probe: export SQL → re-import → verify no duplicates. SE-4 was deprioritized without empirical execution. The coverage_notes state source-pattern analysis only ("INSERT without INSERT IGNORE"). This is the canonical export×import round-trip pair for a plugin that bills itself as a selective export tool.

Gap: HIGH severity — SE-4 export×re-import round-trip not empirically probed. Filed as deprioritized with source rationale but no empirical discharge and no mandatory fallback Problem filing.
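The shape of the missing round-trip probe can be sketched in a few lines. This is a minimal illustration only: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the `posts` table, its columns, and the single-row export statement are hypothetical, not the plugin's actual export output. It shows the two outcomes a re-import of plain `INSERT` statements can have, depending on whether a unique key exists.

```python
import sqlite3

def reimport_outcome(with_primary_key):
    """Run one export-style INSERT twice and report (count_before, count_after, error)."""
    conn = sqlite3.connect(":memory:")
    id_col = "id INTEGER PRIMARY KEY" if with_primary_key else "id INTEGER"
    conn.execute(f"CREATE TABLE posts ({id_col}, title TEXT)")  # hypothetical schema
    export_sql = "INSERT INTO posts (id, title) VALUES (1, 'Hello')"  # hypothetical export line
    conn.execute(export_sql)  # the content that already exists on the site
    before = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    try:
        conn.execute(export_sql)  # re-import of the same export
        error = None
    except sqlite3.IntegrityError as exc:
        error = type(exc).__name__
    after = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    return before, after, error

print(reimport_outcome(True))   # (1, 1, 'IntegrityError'): the key rejects the second insert
print(reimport_outcome(False))  # (1, 2, None): no key, so the row is duplicated
```

Either result (a rejected import or a doubled row count) is only observable by running the round trip, which is why the before/after count comparison is the core of the probe.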

Backup × Restore round-trip — restore-destructive-andlist probed the restore blast radius (b-default-scope) empirically: created post, took backup, restored, verified post gone. Round-trip identity semantically probed. The expected coverage-note literal is "default blast radius probed: Restore → post-backup content destroyed? → Y"; the actual coverage_notes say "b-default-scope blast radius fully validated." This partially satisfies the round-trip requirement.

Schedule save × reload — SCH-2 save-roundtrip probed empirically and a bug confirmed. Round-trip coverage adequate for this pair.

Check 6: Empirical-probe-is-mandatory (Amendment I)

restore-destructive-andlist — b4 and b7: Both verdicts cite class-mb-restore.php source inspection as primary evidence ("Source inspection reveals sequential $wpdb->query() calls without atomic transactions"). No empirical partial-failure probe (truncated ZIP upload, forced timeout) was executed. The charter explicitly calls for a "syntactically valid but content-incomplete ZIP" empirical test for b7. The Tester notes in the coverage_notes: "Partial-failure testing (b4, b7) limited to CLI artifact inspection due to turn budget constraints." This is Amendment I drift — source inspection filed as the verdict instead of a probe attempt.

[restore-destructive-andlist] b4 (transaction/rollback) and b7 (partial-failure consistency) — verdicts derived from source inspection without empirical probe. Coverage_notes acknowledge the limitation but no fallback Problem was filed with the mandatory literal. Severity: HIGH — these are high-impact safety mechanism verdicts; source inspection misses runtime behavior (e.g., WP's $wpdb wrapper could have its own rollback semantics).

schedule-feature-cluster — SCH-5: Coverage_notes state "SCH-5 email delivery probed via source analysis instead of manual trigger." The deviation field confirms: "SCH-5 (email delivery) probed via source analysis instead of manual trigger; wp_mail implementation verified, option-name mismatch identified from source review." The option-name mismatch is a genuine high-value find, but the verdict was reached purely from source inspection — no manual cron trigger + mail trap check was performed. The charter explicitly says: "CLI: studio wp --path=${SITE_PATH} cron event run — trigger backup cron manually. Check mail log or studio mail trap for sent email."

[schedule-feature-cluster] SCH-5 (email delivery) — verdict confirmed-bug from source inspection alone; no empirical mail-trap probe attempted. The bug may be real (option name mismatch is convincing), but the empirical path (cron manual trigger + mail check) was not run. Severity: MEDIUM — source inspection is compelling for this specific typo, but Amendment I requires empirical attempt. Filing as medium because the source evidence is high-confidence.

Check 7: Amendment H classification

No overlay-shaped widgets (lightbox, modal, drawer, dropdown, popup) were observed or claimed in any session. The plugin's UI is straightforward tab-based admin with no frontend output. No Amendment H classification miss identified.

Check 8: Must-cover flows

Mission.md has no explicit ## Must-cover flows content (section left blank: "Fill in based on static analysis + recon. Leave blank to let the Manager infer from the surface."). No must-cover flow violations possible. Check 8: N/A.

Check 9: Feature anchor completeness

Coverage matrix flags these anchor types for this plugin:

F1: artifact-producing, DB-writing, scale-sensitive, destructive-operation → backup-artifact-andlist covered a1–a6 + multi-surface; scale-sensitive c2 filed as fallback (acceptable). Probe quota met.
F2: destructive-operation, file-upload, DB-writing → restore-destructive-andlist covered b1–b7 + b-default-scope. File-upload ZIP-slip NOT covered (breadth-tour skipped). Gap: BT-F2-3 zip-slip is unprobed.
F3: destructive-operation, artifact-producing → restore-destructive-andlist covered b1–b7 for delete. Coverage adequate.
F4: artifact-producing, DB-writing, scale-sensitive → selective-export-cluster covered SE-1 through SE-5. SE-4 deprioritized (see Check 5). c2 scale-sensitive fallback noted in coverage_notes. Probe quota marginally met (4/5 probed empirically).
F5: settings-form, artifact-producing, DB-writing, output-rendering → schedule-feature-cluster covered SCH-1 through SCH-6. 5/6 empirically probed. SCH-5 partial Amendment I drift.
F6: DB-writing → zero probes (breadth-tour skipped). Activation/deactivation, cron cleanup, admin menu visibility, PHP notices — none probed. Severity: HIGH.

Check 10: Coverage-note forcing-function strings

Required strings and their presence:

| Session | Required literal | Present? |
| --- | --- | --- |
| backup-artifact-andlist | `default blast radius probed: ...` | YES — "Default blast radius confirmed: Y" in hypothesis evidence (P7 evidence) |
| backup-artifact-andlist | `scale-sensitive c2 fallback: empirical probe deprioritized out of budget; source pattern filed...` | YES — present in coverage_notes |
| restore-destructive-andlist | `default blast radius probed: Restore → post-backup content destroyed? → [Y/N]` | PARTIAL — "b-default-scope blast radius fully validated" but not the exact mandatory literal |
| selective-export-cluster | `empty-state probed: [verdict]` | MISSING — coverage_notes does not contain the literal string "empty-state probed:" (Reinforcement 5 mandatory) |
| selective-export-cluster | `scale-sensitive c2 fallback: ...` | MISSING — coverage_notes mentions c2 by name but does not contain the exact fallback literal |
| schedule-feature-cluster | `save-roundtrip verified: ...` | MISSING — coverage_notes says "SCH-2 save-roundtrip bug confirmed" but not the exact format "save-roundtrip verified: time submitted=X → stored=Y → displayed=Z → match? [yes |
| concurrent-trigger-cross-feature | `cross-feature interaction probed: manual backup × cron backup → [Y/N: shared-resource collision]` | YES — present verbatim in coverage_notes |

Gaps flagged (low severity — underlying probes ran but literals missing):

[selective-export-cluster] missing empty-state probed: literal despite SE-5 being probed (graceful empty state confirmed). Severity: LOW.
[selective-export-cluster] missing exact c2 fallback literal. Severity: LOW.
[schedule-feature-cluster] missing exact save-roundtrip verified: format string. Severity: LOW.
[restore-destructive-andlist] default blast radius probed: literal paraphrased rather than verbatim. Severity: LOW.
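The verbatim-literal requirement behind these four gaps lends itself to a mechanical check. The sketch below is an assumption-laden illustration, not the aggregator's actual logic: it treats `coverage_notes` as a plain string, and the `REQUIRED_PREFIXES` mapping and `missing_literals` helper are hypothetical names. The prefixes themselves are copied from the table above.

```python
# Required literal prefixes per session, copied from the Check 10 table.
# The mapping structure and the helper name are hypothetical.
REQUIRED_PREFIXES = {
    "restore-destructive-andlist": ["default blast radius probed:"],
    "selective-export-cluster": ["empty-state probed:", "scale-sensitive c2 fallback:"],
    "schedule-feature-cluster": ["save-roundtrip verified:"],
}

def missing_literals(session, coverage_notes):
    """Return the required prefixes that do not appear verbatim in coverage_notes."""
    required = REQUIRED_PREFIXES.get(session, [])
    return [prefix for prefix in required if prefix not in coverage_notes]

# A paraphrase does not satisfy the check, even though the probe ran.
notes = "SCH-2 save-roundtrip bug confirmed"
print(missing_literals("schedule-feature-cluster", notes))  # ['save-roundtrip verified:']
```

A substring test is deliberately strict here: the point of a forcing-function literal is that a paraphrase such as "b-default-scope blast radius fully validated" is reported as missing.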

Check 11: External-resource-failure probe coverage

Recon identified this as an admin-only plugin with no frontend components, no external API calls, no CDN resources, no third-party JS, and no OAuth integrations. No external URLs detected in session reports or recon.md. No external-resource-failure probes required. Check 11: N/A.

Check 12: Content-authoring UX probe coverage

No starter content, demo importers, patterns, or sample data declared in recon or coverage. The plugin does not ship any user-facing content that an admin could "publish unchanged." Check 12: N/A.

Check 13: Route-content-depth probe coverage

This plugin has no frontend routes, templates, or rendered patterns; it is admin-only. Session reports assert content-level verdicts on artifacts (ZIP contents, SQL column inspection, cron event lists) rather than status-level verdicts only, so there are no route-content-depth violations for the artifact probes that were executed. The skipped breadth-tour would have included lifecycle probes with CLI verification; those are missing entirely (see Checks 3 and 9). Check 13: No additional gaps beyond the breadth-tour skip.

Recommendation

Gaps that should block the pilot (HIGH severity)

breadth-tour skipped entirely (F6 unprobed + BT-F2-3 zip-slip + miscellaneous breadth probes) — F6 (plugin lifecycle: activation, deactivation, cron cleanup, admin menu visibility, PHP error log) has zero coverage. BT-F2-3 (zip-slip path traversal on Upload & Restore) is a critical security probe that was never attempted. No other charter covers these surfaces.
- Re-dispatch suggestion: one supplementary Tester with a mini-charter: "F6 lifecycle + BT-F2-3 zip-slip: (1) deactivate plugin → verify cron removed + backup files persist; (2) activate → verify directory + options created; (3) upload a specially-crafted ZIP with ../../wp-config.php entry and verify no path traversal; (4) access admin.php?page=mb-backups as subscriber (role=subscriber); (5) navigate all three tabs with WP_DEBUG_LOG enabled and check debug.log." max_turns: 8.
SE-4 (export × re-import round-trip) not empirically probed — the export×import compositional pair was explicitly chartered but marked deprioritized without an empirical attempt or a mandatory fallback Problem filing. The source pattern (INSERT without INSERT IGNORE) strongly suggests duplicates, but the empirical probe was not run.
- Can be appended to the mini-charter above: (6) import the Posts SQL export, check post count before and after for doubling.
b4/b7 rollback verdict from source inspection only (Amendment I drift) — restore rollback and partial-failure consistency verdicts are based on source inspection without any empirical truncated-ZIP or interrupted-restore probe. These are high-impact safety mechanism verdicts.
- Can be appended to the mini-charter above: (7) upload a syntactically valid but truncated ZIP file to Upload & Restore; trigger restore; verify site state remains consistent (not partially overwritten).
CT-2 (lock/mutex) never confirmed or denied — the concurrent-trigger session deprioritized CT-2 and filed a Question. Whether a lock prevents concurrent backup overwrites is an unresolved correctness question for a backup plugin's primary reliability guarantee.
- Can be appended to the mini-charter above: (8) grep plugin source for transient/flock/is_running patterns to confirm or deny lock presence; file as Problem if absent.
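For step (3) of the mini-charter, a test archive with a traversal entry can be built with Python's standard `zipfile` module. The following is a minimal sketch under stated assumptions: the `backup/database.sql` entry name and the marker contents are hypothetical, and `unsafe_members` is an illustrative helper showing the kind of check a safe extractor would apply, not the plugin's code.

```python
import io
import posixpath
import zipfile

def build_traversal_zip():
    """Build an in-memory ZIP containing one benign entry and one traversal entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("backup/database.sql", "-- placeholder\n")       # hypothetical benign entry
        zf.writestr("../../wp-config.php", "<?php // probe marker\n")  # traversal entry from the charter
    return buf.getvalue()

def unsafe_members(zip_bytes):
    """Return entry names that would resolve outside the extraction root."""
    bad = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            normalized = posixpath.normpath(name)
            if normalized == ".." or normalized.startswith("../") or posixpath.isabs(normalized):
                bad.append(name)
    return bad

print(unsafe_members(build_traversal_zip()))  # ['../../wp-config.php']
```

The probe verdict then depends on what the plugin does with such an archive: rejecting it, or extracting only entries that stay inside the target directory, would both count as no path traversal.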

Gaps that are acceptable-with-rationale (LOW severity, budget-driven)

CT-3 (double-click JS protection) — low standalone value; the charter notes it explicitly does not protect the cron seam, which was probed (CT-1). Acceptable to leave as deprioritized.
SCH-5 empirical email probe missing — the option-name mismatch finding from source is high-confidence (typo is deterministic); empirical mail-trap check would confirm but is unlikely to change the verdict. Medium-severity Amendment I drift but finding is strong.
Forcing-function literal strings missing from 3 sessions — underlying probes ran; literals were paraphrased rather than verbatim. Acceptable for this run; note for future amendment tightening.
Scale-sensitive c2 fallback — both backup-artifact-andlist and selective-export-cluster correctly invoked the c2 fallback protocol. Acceptable given Haiku budget.

4 high-severity gaps, 5 low-severity gaps

escape-analysis.md

Escape analysis — magellan-backups 2026-04-29T13-31-55_magellan-backups

Run ID: 2026-04-29T13-31-55_magellan-backups
Plugin: magellan-backups 1.0.0
Stack: Sonnet 4.6 Manager + Sonnet 4.6 Planner + Haiku 4.5 Testers, playwright-cli-headed driver, no Phase 1.5 static analysis

Recall against answer key: 9/10 planted issues caught

Per-issue verdicts

| # | Issue | Verdict | Matched to / why missed |
| --- | --- | --- | --- |
| 1 | Progress bar hardcoded to 100% | caught-exact | `backup-artifact-andlist` → "Progress bar element hardcoded to 100% width; shows no dynamic progress feedback" (P8, confidence 1.0) |
| 2 | Schedule time 24h/12h format mismatch | caught-exact | `schedule-feature-cluster` → "Time selector save-roundtrip bug: 24-hour input stored as 12-hour format, re-renders incorrectly on reload" (P1, confidence 1.0) |
| 3 | Notification email option-name mismatch | caught-exact | `schedule-feature-cluster` → "Notification email option name mismatch prevents email delivery entirely" (P4 critical, confidence 1.0) |
| 4 | User export includes hashed passwords | caught-exact | `selective-export-cluster` → "Users export includes password hashes in plaintext SQL — credential leakage vulnerability" (P1 critical, confidence 1.0); also independently caught in `backup-artifact-andlist` (P3 major, full-backup DB dump angle) |
| 5 | Uploads directory missing from backup | caught-exact | `backup-artifact-andlist` → "Backup ZIP omits wp-content/uploads/ despite 'Full Backup' label claim" (P3 major, confidence 1.0) |
| 6 | No pre-restore backup | caught-exact | `restore-destructive-andlist` → "Restore operation does not create pre-operation backup snapshot (b2)" (P1 major, confidence 0.9) |
| 7 | Backups publicly accessible via URL | caught-exact | `backup-artifact-andlist` → "Backup directory (wp-content/magellan-backups/) is web-accessible without authentication" (P1 critical, confidence 1.0) |
| 8 | Corrupt restore truncates tables (DROP before full import; no transaction) | caught-semantically | `restore-destructive-andlist` → "Restore has no rollback mechanism for partial failures (b4, b7)" covers the same root cause (no transaction wrapping, sequential `$wpdb->query()` calls, DB partially overwritten on failure) but frames it as a generic partial-failure risk rather than the specific DROP-TABLE-before-recreate mechanism. The `supplementary-gaps` truncated-ZIP probe ran and found the ZipArchive layer rejects truncated ZIPs before extraction — a different code path than the planted issue's SQL-level truncation. The planted issue's exact mechanism (DROP TABLE fires, CREATE TABLE never completes because SQL is truncated mid-file) was not independently demonstrated empirically, but the structural filing covers the same real-world harm. |
| 9 | Large database causes memory exhaustion | missed | No Problem filed in any session. Coverage note in `backup-artifact-andlist` contains the mandatory c2 fallback literal but stops there — the source-pattern Problem entry that Reinforcement 3 requires was never written. `selective-export-cluster` also wrote coverage-note acknowledgment without filing. See miss analysis below. |
| 10 | Concurrent backups corrupt zip file | caught-exact | Two Problems jointly cover the issue: `backup-artifact-andlist` → "Filename collision risk: minute-precision naming without random discriminator" (P2 major) + `supplementary-gaps` → "No concurrency locking prevents simultaneous backup writes corruption" (P2 major). `concurrent-trigger-cross-feature` filed a Question for the empirical concurrent-trigger verdict, which is consistent with the source-pattern Problems filed in the other sessions. |

Miss analysis

Miss 1: Issue 9 — Large database causes memory exhaustion (`$wpdb->get_results("SELECT * FROM table")`)

Root cause class: Classification drift (Reinforcement 3 partial-fire variant)

Why it escaped:

Reinforcement 3 (the c2 coverage-note literal requirement) fired on both backup-artifact-andlist and selective-export-cluster — the Testers wrote the mandatory literal into coverage_notes. However, the literal was treated as a terminal action rather than a gate to a required Problem entry. The rule in skills/tester-mindset/SKILL.md (Probe scale where it's cheap to do so → Source-pattern rule) is unambiguous: "still file as a Problem (usually minor or major) with rationale … Do NOT downgrade to Question or Improvement because local runtime coped … Static-analysis identification of unbounded iteration is itself sufficient evidence for a real bug class." The Tester wrote the literal; it did not write the Problem.

The gap is a classification drift between two adjacent behaviors that look similar from the inside: (a) "I wrote the c2 literal, the enforcement step is done" and (b) "I wrote the c2 literal AND filed the source-pattern Problem, the enforcement step is done." The literal is the forcing function for the coverage note. It is not a substitute for the Problem. These were conflated.

This is the 4th consecutive run (or near-miss) on Issue 9 across pilots:

Pilot 1: under-classified (Mis-filed)
Pilot 10: charter never included c2 probe
Pilot 17: forcing-function dropout (no Phase 1.5 → no c2 probe at all)
Pilot 18 (this run): c2 literal written, Problem not filed

Each prior amendment addressed a different failure mode in the same class. The current rule does not make explicit enough that the coverage-note literal and the Problem filing are two separate, non-substitutable obligations.
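The source pattern at the heart of Issue 9 is an unbounded read: the whole result set is materialized in memory at once. As a language-neutral illustration (the plugin itself is PHP and uses `$wpdb`), the sketch below shows the bounded alternative in Python with the built-in `sqlite3` module. The `posts` table, the row count, and the `dump_rows_batched` helper are hypothetical and chosen only to make the contrast visible.

```python
import sqlite3

def dump_rows_batched(conn, table, batch_size=500):
    """Yield rows in fixed-size batches so peak memory is bounded by batch_size."""
    cursor = conn.execute(f"SELECT * FROM {table}")  # table name is trusted in this sketch
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")  # hypothetical table
conn.executemany("INSERT INTO posts (title) VALUES (?)", [(f"post {i}",) for i in range(1200)])

sizes = [len(batch) for batch in dump_rows_batched(conn, "posts")]
print(sizes)  # [500, 500, 200]
```

An unbounded equivalent would call `fetchall()` and hold all 1200 rows at once; on a small test site both variants succeed, which is why the rule treats the source pattern itself as sufficient evidence for a Problem.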

Proposed amendment:

File: skills/tester-mindset/SKILL.md
Section: "Probe scale where it's cheap to do so" → sub-section "Coverage-note literal — MANDATORY when this fallback fires" (the existing sub-section, tighten the enforcement statement)
Rule text (ready to paste — replaces the last paragraph of the "Tightening — runtime artifact size is NOT an exemption" block):

**Two-part obligation — BOTH required, NEITHER substitutes for the other**:

When the c2 fallback fires (charter touched a `scale-sensitive` or `artifact-producing` feature and the empirical probe was budget-deferred), you have TWO obligations that must be discharged independently:

1. **Write the c2 literal** in `coverage_notes` (existing rule — this is the map that tells the aggregator a scale-sensitive surface was touched).
2. **File a Problem** citing the source pattern — `class-mb-backup.php:52 — $wpdb->get_results("SELECT * FROM table") without LIMIT loads entire result set into PHP memory; fails at production scale` — at major severity. One PQIP entry, ~1 turn cost. This is a Problem, NOT a Question, NOT an Improvement.

Writing the literal WITHOUT filing the Problem is the exact behavior the enforcement step is designed to prevent. The literal's purpose is to signal "I saw the pattern and handled it." Handling it means: filed as Problem. The literal written with no Problem filed is equivalent to not having discharged the rule at all — the miss class is identical from the answer-key perspective.

**Enforcement check**: if your session's `coverage_notes` contains a c2 literal but your `pqip.problems` array does not contain a Problem citing the scale-sensitive pattern, your session is INCOMPLETE. Write the Problem before ending the session.

Generalization check: "This rule would catch any SELECT * / get_posts(-1) / full-directory-scan unbounded-iteration miss on any WordPress plugin that ships export, backup, report, or bulk-data features — contact form submission exporters, WooCommerce order exporters, membership CSV downloaders, sitemap generators. The two-part obligation applies whenever the Tester writes a c2 literal without filing a Problem."
Cross-pilot pattern: reinforcement of existing amendment — 4th consecutive pilot with a variant of this miss. Prior amendments (Reinforcement 3, source-pattern rule, artifact-producer expansion) each closed one variant; this closes the "literal-written, Problem-omitted" variant specifically.
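The enforcement check in the proposed rule text can be expressed as a small predicate. This is a sketch under assumptions: the report does not specify the session schema, so the `title` and `evidence` keys on each Problem, the marker strings used to recognize a scale-sensitive citation, and the function name are all hypothetical.

```python
C2_LITERAL_PREFIX = "scale-sensitive c2 fallback:"
# Hypothetical markers for recognizing a Problem that cites the unbounded-read pattern.
SCALE_PATTERN_MARKERS = ("get_results", "SELECT *", "without LIMIT")

def c2_obligation_status(coverage_notes, problems):
    """Classify a session against the two-part c2 obligation.

    Returns "not-applicable" when no c2 literal was written, "complete" when the
    literal is accompanied by a matching Problem, and "incomplete" otherwise.
    """
    if C2_LITERAL_PREFIX not in coverage_notes:
        return "not-applicable"
    for problem in problems:
        text = problem.get("title", "") + " " + problem.get("evidence", "")
        if any(marker in text for marker in SCALE_PATTERN_MARKERS):
            return "complete"
    return "incomplete"

# The Pilot 18 shape: literal written, no Problem filed.
notes = "scale-sensitive c2 fallback: empirical probe deprioritized out of budget"
print(c2_obligation_status(notes, []))  # incomplete
```

Keeping the literal check and the Problem check as separate conditions mirrors the rule's intent that neither obligation substitutes for the other.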

Observation: Issue 8 (b4/b7 partial-failure) — near-miss framing note

Issue 8 is verdicted caught-semantically rather than caught-exact. The distinction matters for future amendment design:

The planted issue specifies a precise mechanism: SQL import executes DROP TABLE before the full table body is imported; a truncated SQL file causes tables to be dropped but not recreated. The filed Problem (b4/b7 in restore-destructive-andlist) correctly identifies the no-transaction / no-rollback structural gap and correctly predicts the harm, but it is framed as "partial failure on any failure path" rather than "DROP TABLE fires ahead of CREATE TABLE on every table, so truncation at ANY point mid-file destroys schema state." The precise mechanism was not described in the filing.

The supplementary-gaps charter ran probe 7 (truncated ZIP) and found the ZipArchive layer rejects the truncated file before any extraction occurs — which is actually the CORRECT behavior for a truncated ZIP container. The planted bug is about a truncated SQL FILE inside a valid ZIP (the ZIP is intact; the SQL inside is cut short). These are different code paths and the supplementary charter's probe path did not reach the SQL-level truncation scenario.

No new amendment is proposed for this near-miss — the b4/b7 filing is semantically correct and the issue IS caught. The distinction between "caught-semantically" and a probe gap is noted here as context for future planted-issue design: a DROP-TABLE-before-CREATE mechanism is best surfaced by a probe that uses a syntactically valid ZIP containing a truncated SQL body (not a truncated ZIP itself). Planting it as a distinct probe shape in the restore-destructive-andlist charter's b7 anchor would improve precision without requiring a new amendment.
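The probe shape described above, an intact ZIP container wrapping a truncated SQL body, can be produced with Python's standard `zipfile` module. The sketch below is illustrative only: the `database.sql` entry name, the simplified `wp_posts` schema, and the cut position are assumptions, and the plugin's real archive layout may differ.

```python
import io
import zipfile

# Simplified, hypothetical dump: DROP precedes CREATE, as in the planted mechanism.
FULL_SQL = (
    "DROP TABLE IF EXISTS wp_posts;\n"
    "CREATE TABLE wp_posts (ID bigint, post_title text);\n"
    "INSERT INTO wp_posts VALUES (1, 'Hello');\n"
)

def build_truncated_sql_zip(cut_at):
    """Return bytes of a well-formed ZIP whose SQL entry is cut off at cut_at characters."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("database.sql", FULL_SQL[:cut_at])  # hypothetical entry name
    return buf.getvalue()

# Cut after the DROP statement, partway into CREATE TABLE.
cut = FULL_SQL.index("CREATE TABLE") + len("CREATE TABLE wp_po")
payload = build_truncated_sql_zip(cut)

with zipfile.ZipFile(io.BytesIO(payload)) as zf:
    print(zf.testzip())  # None: the container itself passes its integrity check
    print(zf.read("database.sql").decode().splitlines()[-1])  # CREATE TABLE wp_po
```

Because the container is valid, a container-level rejection like the one the supplementary probe observed would not trigger, and the restore would proceed to the SQL stage where the DROP-before-CREATE ordering matters.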

Summary

Recall: 9/10 (90%) planted issues caught
1 miss: Issue 9 (large database memory exhaustion — $wpdb->get_results without LIMIT)
Miss class: Classification drift — specifically the "literal-written, Problem-omitted" variant of the c2 enforcement gap
1 new amendment proposed: Two-part obligation clarification for the c2 coverage-note + Problem-filing enforcement (appended to existing "Probe scale" section in skills/tester-mindset/SKILL.md)
0 new miss classes observed: All observed miss patterns map to the canonical "Classification drift" class
Cross-pilot reinforcements noted:
- Issue 9 / c2 scale miss: 4th consecutive pilot. Prior amendments — Reinforcement 3 (literal mandatory), source-pattern rule (filing mandatory regardless of local runtime), artifact-producer expansion (applies to non-dedicated scale charters) — each closed one sub-variant. The newly proposed amendment closes the "literal written without accompanying Problem" sub-variant.
Stack note: Haiku 4.5 Testers achieved 9/10 recall on a 10-bug answer key. The miss is a rule-following gap (a two-part obligation was read as one-part), not a detection gap — the Tester identified the pattern and acknowledged it in coverage notes. The quality ceiling for Haiku on this plugin is currently at rule-compliance fidelity, not pattern-recognition.

final-report.md

Testing Report — magellan-backups

Run ID: 2026-04-29T13-31-55_magellan-backups
Generated: 2026-04-29T13:58:19.126Z
Plugin version: 1.0.0
Sessions processed: 6
Sessions with errors: 1

Executive summary

| Category | Count |
| --- | --- |
| Problems | 20 |
| Questions | 4 |
| Improvements | 21 |
| Praises | 13 |

Problem severity breakdown

| Severity | Count |
| --- | --- |
| critical | 4 |
| major | 16 |
| minor | 0 |
| trivial | 0 |

Severity heatmap by area

| Area | Critical | Major | Risk score |
| --- | --- | --- | --- |
| Backup Restore & Delete — safety mechanisms | 0 | 3 | 9 |
| Backup artifact location and access control | 1 | 0 | 4 |
| Schedule Configuration — email notification delivery | 1 | 0 | 4 |
| Selective Export — Users data dump | 1 | 0 | 4 |
| restore-function-security | 1 | 0 | 4 |
| Backup artifact naming | 0 | 1 | 3 |
| Backup artifact completeness | 0 | 1 | 3 |
| Backup artifact security (sensitive data exposure) | 0 | 1 | 3 |
| Selective export artifact security | 0 | 1 | 3 |
| Backup artifact lifecycle | 0 | 1 | 3 |
| Backup artifact default blast radius | 0 | 1 | 3 |
| Backup creation UI feedback | 0 | 1 | 3 |
| Backup Restore & Delete — blast radius | 0 | 1 | 3 |
| Schedule Configuration — time selector | 0 | 1 | 3 |
| Schedule Configuration — notification email field | 0 | 1 | 3 |
| Schedule Configuration — form field state gating | 0 | 1 | 3 |
| Selective Export — Options data dump | 0 | 1 | 3 |
| backup-concurrency | 0 | 1 | 3 |
| Backup artifact security | 0 | 0 | 0 |
| Backup security | 0 | 0 | 0 |
| Backup lifecycle | 0 | 0 | 0 |
| Backup default blast radius | 0 | 0 | 0 |
| Backup completeness | 0 | 0 | 0 |
| Backup creation UI | 0 | 0 | 0 |
| Selective export security | 0 | 0 | 0 |
| UI/UX navigation | 0 | 0 | 0 |
| Backup operation reliability | 0 | 0 | 0 |
| concurrent-trigger-seam (manual backup × cron) | 0 | 0 | 0 |
| backup file integrity | 0 | 0 | 0 |
| Schedule Configuration — weekly frequency | 0 | 0 | 0 |
| Schedule Configuration — form UX | 0 | 0 | 0 |
| Schedule Configuration — cron event lifecycle | 0 | 0 | 0 |
| Selective Export — data redaction policy | 0 | 0 | 0 |
| Selective Export — SQL statement format | 0 | 0 | 0 |
| Selective Export — multi-type export | 0 | 0 | 0 |
| Selective Export — empty-state handling | 0 | 0 | 0 |
| Selective Export — individual export types | 0 | 0 | 0 |
| selective-export-import | 0 | 0 | 0 |
| restore-security | 0 | 0 | 0 |
| upload-validation | 0 | 0 | 0 |
| plugin-lifecycle | 0 | 0 | 0 |
| restore-resilience | 0 | 0 | 0 |

Risk score = 4·critical + 3·major + 2·minor + 1·trivial
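The risk-score formula is a weighted sum over severity counts and can be reproduced directly. The snippet below is a small sketch; the `WEIGHTS` mapping and the `risk_score` function name are illustrative, while the weights and the two sample rows come from the report.

```python
# Weights from the report's formula: 4·critical + 3·major + 2·minor + 1·trivial.
WEIGHTS = {"critical": 4, "major": 3, "minor": 2, "trivial": 1}

def risk_score(counts):
    """Weighted sum of Problem counts per severity; absent severities count as 0."""
    return sum(weight * counts.get(severity, 0) for severity, weight in WEIGHTS.items())

# Two rows from the heatmap.
print(risk_score({"critical": 0, "major": 3}))  # 9 (Backup Restore & Delete, safety mechanisms)
print(risk_score({"critical": 1, "major": 0}))  # 4 (Backup artifact location and access control)
```

Because a single critical scores 4 and three majors score 9, an area with several major Problems outranks an area with one critical Problem under this weighting.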

Needs human review (confidence < 0.7)

None.

Questions raised

[concurrent-trigger-seam (manual backup × cron)] When manual 'Create Full Backup' button and cron event 'mb_scheduled_backup' fire within the same calendar minute (both targeting backup-YYYY-MM-DD-HHmm.zip filename), does the plugin implement a lock, mutex, or collision-detection mechanism to prevent race condition overwrites?
- Why it matters: If two concurrent writes to the same file occur without synchronization, the resulting backup archive could be truncated, corrupt, or overwritten mid-operation. This would violate the backup plugin's primary reliability guarantee — that backups are valid, intact, and recoverable.
[Backup Restore & Delete — safety mechanisms] Recon S1 claimed Delete has no confirmation dialog. Does empirical probe confirm or refute this?
- Why it matters: Recon is a map, not a hunter. If recon's initial claim about Delete was wrong, subsequent AND-list verdicts may also need re-evaluation. Clarifying this tells us whether recon's other observations (e.g., on Restore UI) are reliable.
[Schedule Configuration — weekly frequency] Does the weekly cron schedule fire on a user-configurable day of the week, or on a hardcoded day?
- Why it matters: Recon noted no day-of-week selector UI. If weekly backups fire on Monday only (regardless of user preference), the feature is less useful and may not match user expectations.
[selective-export-import] Does export × import round-trip handle duplicate primary keys gracefully?
- Why it matters: Source shows export generates plain INSERT statements without DELETE. If the same posts are re-imported, MySQL would reject on duplicate key. Understanding whether the plugin silently skips failed INSERTs or fails catastrophically affects data integrity.
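
The duplicate-key question can be illustrated without the plugin. The sketch below uses SQLite as a stand-in for MySQL, so the syntax differs (`INSERT OR IGNORE` here, `INSERT IGNORE` in MySQL), and the `posts` table is a hypothetical two-column example. It shows the failure class only; how the plugin's importer reacts to the error remains the open question.

```python
import sqlite3

def reimport(statement, rows):
    """Run the same import twice on a fresh table; return (row_count, error)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
    error = None
    for _ in range(2):  # first import, then the re-import
        try:
            conn.executemany(statement, rows)
        except sqlite3.IntegrityError as exc:
            error = str(exc)  # duplicate primary key rejected
    count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    conn.close()
    return count, error

rows = [(1, "Hello"), (2, "World")]
plain = reimport("INSERT INTO posts (id, title) VALUES (?, ?)", rows)
ignore = reimport("INSERT OR IGNORE INTO posts (id, title) VALUES (?, ?)", rows)
print(plain)   # two rows kept, plus an integrity error from the second pass
print(ignore)  # two rows kept, no error
```

In this stand-in the plain re-import raises rather than duplicating rows, because the primary key is part of each statement. An export that omitted the key column would behave differently.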

Suggested improvements

[Backup artifact security] Add .htaccess protection to wp-content/magellan-backups/ directory (effort: low) (impact: high)
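- Rationale: Deny direct web access to the directory with an .htaccess rule ('Require all denied' on Apache 2.4, 'Deny from all' on Apache 2.2) so that backups cannot be downloaded without authentication. Note that .htaccess only takes effect on Apache.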
[Backup artifact naming] Use second-precision or random suffix in backup filenames to prevent collision (effort: low) (impact: medium)
- Rationale: Change naming pattern from backup-YYYY-MM-DD-HHmm.zip to backup-YYYY-MM-DD-HHmmss-RANDOM.zip to avoid filename collision when multiple backups trigger in same minute
[Backup security] Exclude wp_users password hashes from backup SQL dumps (effort: medium) (impact: medium)
- Rationale: Modify mysqldump or SQL export to redact user_pass column or omit table entirely to prevent password hash exposure in backups
[Backup lifecycle] Implement retention policy: keep last N backups, auto-delete older files (effort: medium) (impact: high)
- Rationale: Add configurable retention setting (e.g., keep_last_n_backups = 5) and delete older files on backup creation or cron to prevent unbounded disk usage
[Backup default blast radius] Do NOT register backup cron on activation; wait for explicit user enable (effort: low) (impact: medium)
- Rationale: Move cron registration to settings save handler; only activate when 'Enable scheduled backups' checkbox is checked to prevent unexpected disk usage
[Backup completeness] Include wp-content/uploads/ in 'Full Backup' ZIP (or rename to 'Partial Backup') (effort: medium) (impact: high)
- Rationale: Either add uploads directory to full backup or change UI label to accurately reflect contents; uploads is the most user-critical data
[Backup creation UI] Implement real-time progress feedback for backup creation (effort: medium) (impact: medium)
- Rationale: Use AJAX to update progress bar width dynamically from 0% to 100% as backup progresses; show item counts and current operation for operator feedback
[Selective export security] Exclude or redact sensitive wp_usermeta (session tokens) from selective exports (effort: medium) (impact: medium)
- Rationale: When exporting Users, exclude wp_usermeta entries for session_tokens and redact IP/user-agent data to prevent session hijacking attacks
[Backup Restore & Delete — safety mechanisms] Add pre-restore automatic snapshot (effort: medium) (impact: high)
- Rationale: Before executing restore, automatically create a backup of the current site state. This provides a 1-click rollback path if the admin realizes they restored the wrong backup or the backup is corrupt.
[Backup Restore & Delete — safety mechanisms] Add preview/diff view for Restore (effort: high) (impact: medium)
- Rationale: Show a summary before restore: backup date, post count, total file size, last modified timestamp. Let admin see what they're about to overwrite.
[Backup Restore & Delete — safety mechanisms] Wrap Restore in a transaction (effort: medium) (impact: high)
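- Rationale: Wrap the SQL import in a database transaction (START TRANSACTION ... COMMIT). If file restoration fails, roll back the database changes to avoid an inconsistent state. Note that MySQL commits implicitly on DDL statements such as DROP TABLE and CREATE TABLE, so a dump that contains them cannot be rolled back this way.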
[Backup Restore & Delete — safety mechanisms] Add checkpointing for large restores (effort: high) (impact: medium)
- Rationale: Track restore progress (e.g., 'restored 500 of 1000 files'). If interrupted, admin can resume rather than retry from scratch.
[Schedule Configuration — form UX] Standardize time storage format. Either: (a) store and display times in 24-hour format (00:00–23:00), or (b) convert between formats cleanly without mismatch. (effort: low) (impact: high)
- Rationale: Current mismatch (24h dropdown vs 12h storage) breaks roundtrip. Standardization would fix P1 and improve admin confidence that their settings persisted.
[Schedule Configuration — form UX] Add form validation: require notification email field when scheduling is enabled. Show client-side validation error ('Email is required') on empty field. (effort: low) (impact: medium)
- Rationale: Prevents accidental misconfiguration and makes the requirement explicit. Would address P2.
[Schedule Configuration — form UX] Disable dependent form fields (Frequency, Time, Email) when Enable toggle is OFF. Gray them out visually and prevent user input. (effort: low) (impact: medium)
- Rationale: Makes state intent clear and prevents partially-valid configurations. Addresses P3.
[Schedule Configuration — weekly frequency] Add day-of-week selector UI when frequency is set to Weekly (effort: medium) (impact: medium)
- Rationale: Current UI offers no way for admin to choose which day of the week backups run. Feature is less useful without this option.
[Selective Export — data redaction policy] Implement a whitelist of allowed options and columns for each export type. The Users export should omit the user_pass column or replace hash values with a placeholder. The Options export should exclude authentication keys (auth_key, auth_salt, logged_in_key, logged_in_salt, nonce_key, nonce_salt) and other sensitive options (for example siteurl and admin_email, where the site treats them as private). (effort: medium) (impact: high)
- Rationale: Selective exports are designed to share subsets of site data. Exporting password hashes and cryptographic keys defeats the purpose and introduces security risks. A curated export that omits sensitive columns is more useful and safer for legitimate use cases (migrations, testing, shared backups).
[Selective Export — SQL statement format] Modify the INSERT statements to use INSERT IGNORE or ON DUPLICATE KEY UPDATE to make re-imports idempotent. With the current format (INSERT INTO ... VALUES), importing the same export twice either fails on duplicate keys or, where no unique key applies, creates duplicate rows. (effort: low) (impact: medium)
- Rationale: Users expect exports to be re-importable without side effects. This was not probed empirically; from the source pattern, a second import would either abort on duplicate keys or insert duplicate rows, and either outcome undermines data integrity.
[restore-security] Use basename() or safe path joining for ZIP extraction (effort: low) (impact: high)
- Rationale: Replace the strpos filter with a path canonicalization check that rejects any entry resolving outside the extraction root. basename() also blocks traversal, but it keeps only the filename and so flattens the archive's directory structure. The current filter is vulnerable to patterns like 'wp-content/../../../etc/passwd'.
[backup-concurrency] Add transient-based locking to backup operations (effort: low) (impact: high)
- Rationale: Wrap backup write operations with WP transient locks (set_transient, get_transient) to serialize concurrent backup attempts and prevent file corruption when manual backup and cron backup execute simultaneously.
[selective-export-import] Use INSERT ... ON DUPLICATE KEY UPDATE or truncate before import (effort: low) (impact: medium)
- Rationale: Either generate INSERT ... ON DUPLICATE KEY UPDATE statements in the export, or DELETE matching records before INSERT during re-import to ensure idempotent round-trip behavior.
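
The path canonicalization idea from the restore-security suggestion can be sketched as follows. This is an illustration of the check in Python rather than the plugin's PHP; `is_safe_entry` is a hypothetical helper name, and the extraction root is a temporary directory created only for the demonstration.

```python
import os
import tempfile

def is_safe_entry(extract_root, entry_name):
    """Return True only if entry_name resolves to a path inside extract_root."""
    root = os.path.realpath(extract_root)
    # Joining then resolving collapses '..' segments and follows symlinks.
    target = os.path.realpath(os.path.join(root, entry_name))
    try:
        return os.path.commonpath([root, target]) == root
    except ValueError:
        # Raised on Windows when the paths are on different drives: treat as unsafe.
        return False

with tempfile.TemporaryDirectory() as root:
    print(is_safe_entry(root, "wp-content/plugins/example.php"))   # True
    print(is_safe_entry(root, "wp-content/../../../etc/passwd"))   # False
    print(is_safe_entry(root, "/etc/passwd"))                      # False
```

Checking the resolved path, rather than searching the entry name for a substring, also covers absolute entry names and keeps legitimate nested paths intact.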

What works well (praises)

[UI/UX navigation] Clear tab-based UI navigation between Backup & Restore, Selective Export, and Schedule
- Why: Tabs are well-organized and easy to navigate; users can quickly find the feature they need
[Backup operation reliability] Successful backup operation and file listing in Existing Backups table with download/delete/restore actions
- Why: Backup creation completes reliably and new entries appear in the table with clear action buttons
[backup file integrity] Backup archive produced by concurrent-firing test passed full integrity verification (unzip -t: no corruption, no truncation detected)
- Why: Even under race-condition probing, the resulting file was valid and complete. This is consistent with collision handling or mutual exclusion, but also with the second backup never having started, so reliability is confirmed only for the observed scenario.
[Backup Restore & Delete — safety mechanisms] Delete operation has proper nonce protection
- Why: Delete links include a _wpnonce parameter, which prevents CSRF attacks. Malicious sites cannot trick an admin into deleting backups.
[Backup Restore & Delete — safety mechanisms] Restore operation is properly capability-gated
- Why: Both Delete and Restore check current_user_can('manage_options'), preventing subscribers and lower-privilege users from accessing these destructive operations.
[Backup Restore & Delete — safety mechanisms] Restore confirmation dialog is clear and informative
- Why: Dialog includes both backup filename and warning: 'This will overwrite your current site.' Informs admin of the operation's destructive nature.
[Schedule Configuration — cron event lifecycle] Cron event properly unregistered when scheduling is disabled
- Why: When an admin disables scheduled backups (unchecks the Enable toggle), the mb_scheduled_backup cron event is correctly unscheduled. No ghost cron event persists after disable. Verified via CLI: the cron event list shows mb_scheduled_backup present when enabled, absent when disabled.
[Selective Export — multi-type export] Multi-checkbox export correctly generates a single SQL file containing all selected content types in separate table sections.
- Why: Form submission and AJAX handling work as expected. Users can select Posts, Pages, Users, and Options simultaneously and receive a complete export without loss of content or mismatching of types.
[Selective Export — empty-state handling] Exporting a content type with zero matching records generates a valid SQL file with a '-- (empty)' comment.
- Why: The behavior is correct and informative instead of returning a PHP error or silently failing, which provides clear feedback to the user about the state of the export.
[Selective Export — individual export types] Posts, Pages, Users, and Options exports individually generate correct SQL files with all matching records.
- Why: The export_table() function correctly filters posts by post_type and includes all relevant rows, ensuring data completeness for each export type.
[upload-validation] File type validation is robust
- Why: Plugin correctly rejects non-ZIP files and truncated ZIPs via ZipArchive::open() error handling. Fake ZIP files (text renamed to .zip) are properly rejected with appropriate error code (error 19 for non-ZIP, 35 for truncated).
[plugin-lifecycle] Plugin lifecycle (activation/deactivation) is clean
- Why: Cron events are properly registered on activation and cleaned up on deactivation. Backup files are preserved across deactivation/reactivation cycles, which is correct behavior for preserving user data.
[restore-resilience] Corrupted ZIP file handling prevents partial restoration
- Why: Truncated or damaged ZIP files are correctly rejected before any extraction occurs, preventing partial overwrite of site state. ZipArchive error handling is appropriate and prevents cascading corruption.

Coverage gaps

Session	Status	Turns	Flows	Notes
`backup-artifact-andlist`	complete	12/12	7/7	All 8 AND-list anchors (a1–a6, a3-selective-export, a-progress) probed. Full backup creation flow executed via browser, artifact contents inspected via CLI unzip. Selective Users export tested and SQL inspected. Access control tested via curl. Progress bar element examined inline. Scale-sensitive c2 fallback: empirical probe deprioritized out of budget; source pattern identified at includes/class-mb-backup.php — ZIP archive creation uses native PHP zip extension without streaming, risking OOM on large datasets.
`concurrent-trigger-cross-feature`	complete	10/10	2/3	Cross-feature interaction probed: manual backup × cron backup → concurrent trigger within same minute produces SINGLE file output, suggesting either (a) collision detection prevents dual writes, (b) locking mechanism present, or (c) click did not initiate second backup. File integrity verified (unzip -t passed). SFDPOT Operations dimension probed via concurrent timing test. Turn budget exhausted after primary CT-1 probe execution.
`schedule-feature-cluster`	complete	7/8	5/6	All mandatory Step 8.9 probes executed (SCH-3: empty-required-fields, SCH-4: toggle-state-leak). SCH-1 cron disable verified. SCH-2 save-roundtrip bug confirmed (24h dropdown vs 12h storage mismatch). SCH-5 email never sends due to option-name bug (reads from magellan_backup_email, saves to magellan_backups_email). SCH-6 recon disposition: no day-of-week selector present; cron frequency stored and fires correctly (daily/weekly).
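
The single-file result in the concurrent-trigger session is what the minute-precision naming pattern would produce on its own, regardless of any locking. A minimal sketch, assuming the backup-YYYY-MM-DD-HHmm.zip pattern reported above; the timestamps are hypothetical and the random-suffix format is an assumption, not the plugin's code.

```python
from datetime import datetime
import secrets

def minute_name(ts):
    """Minute-precision pattern from the report: backup-YYYY-MM-DD-HHmm.zip."""
    return ts.strftime("backup-%Y-%m-%d-%H%M.zip")

def unique_name(ts, suffix=None):
    """Second precision plus a random suffix (suffix format is an assumption)."""
    suffix = suffix or secrets.token_hex(4)
    return ts.strftime("backup-%Y-%m-%d-%H%M%S") + f"-{suffix}.zip"

manual = datetime(2026, 4, 29, 13, 40, 5)   # hypothetical manual click
cron = datetime(2026, 4, 29, 13, 40, 42)    # hypothetical cron firing, same minute

print(minute_name(manual) == minute_name(cron))  # True: both map to one filename
print(unique_name(manual) == unique_name(cron))  # False: names no longer collide
```

Distinct names remove the overwrite risk but not the question of whether two backups should run at once, which is what the locking suggestion addresses.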

Invalid / failed session reports

`recon`

No report.json produced

Token usage & cost

Computed from Claude Code transcripts at ~/.claude/projects/<proj-hash>/. Rates from config/pricing.json. Window: 2026-04-29T13:31:55Z → 2026-04-29T13:58:18Z (with ±10min buffer for dispatch drift).

Estimated total cost for this run: $18.59

Category	Cost	% of total
Fresh input	$0.06	0.3%
Output	$2.23	12.0%
Cache-create (5m)	$3.85	20.7%
Cache-create (1h)	$3.01	16.2%
Cache-read	$9.45	50.8%
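
The percentage column can be recomputed from the category costs. The sketch below uses only the figures in the table. The rounded categories sum to $18.60, one cent above the reported $18.59, which is presumably a rounding effect in the per-category figures.

```python
# Category costs in USD, copied from the table above.
costs = {
    "Fresh input": 0.06,
    "Output": 2.23,
    "Cache-create (5m)": 3.85,
    "Cache-create (1h)": 3.01,
    "Cache-read": 9.45,
}

def shares(category_costs):
    """Return (total, {category: percent of total, rounded to one decimal})."""
    total = sum(category_costs.values())
    return total, {name: round(100 * cost / total, 1) for name, cost in category_costs.items()}

total, percent = shares(costs)
print(round(total, 2))
for name, value in percent.items():
    print(f"{name}: {value}%")
```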

Manager (main conversation)

Total: $9.10

Model	Messages	Input	Output	Cache-5m	Cache-1h	Cache-read	Cost
`claude-sonnet-4-6`	99	152	58,863	0	466,806	12,620,553	$7.47
`claude-opus-4-7`	11	21	15,271	0	20,858	2,076,586	$1.63

Subagents (9 invocations)

Total: $9.49

Model	Messages	Input	Output	Cache-5m	Cache-1h	Cache-read	Cost
`claude-haiku-4-5-20251001`	605	37,629	105,911	1,196,247	0	42,842,252	$6.35
`claude-sonnet-4-6`	41	6,990	28,726	628,519	0	1,125,608	$3.15

Per-subagent breakdown (9 sessions)

Agent ID	Type	Models	Cost
`a200403bc17823152`	tester	claude-haiku-4-5-20251001	$0.71
`a2cc4dafae35fb557`	general-purpose	claude-sonnet-4-6	$1.38
`a55bf0bcce07aa0ab`	tester	claude-haiku-4-5-20251001	$0.66
`a5da4e18115aee151`	tester	claude-haiku-4-5-20251001	$0.65
`a5f56e757a8e45177`	planner-sonnet	claude-sonnet-4-6	$1.77
`a6c2bc0c8d753d4d8`	tester	claude-haiku-4-5-20251001	$1.24
`a897625f739ad7ff0`	tester	claude-haiku-4-5-20251001	$0.75
`a925ef76f8c69489e`	tester	claude-haiku-4-5-20251001	$0.87
`ad8a4031dd4a189d7`	tester	claude-haiku-4-5-20251001	$1.48

Recommended next steps

Triage Backup Restore & Delete — safety mechanisms first — highest risk score (9)
Address 4 critical problems before release
Follow up on 3 sessions with incomplete coverage
Investigate 1 session that failed to produce a valid report
