@aviadr1
Created March 26, 2026 21:58

Planning Quality Research — Round 4: Implementation (Concrete SKILL.md Text, Minimum Viable Change, Pre-Mortem)

R4: Concrete Skill Text — Plan Formation Phase

March 2026

Target files: kaizen-evaluate/SKILL.md (new phase), kaizen-implement/SKILL.md (plan schema)


1. The New "Plan Formation Phase"

This phase is inserted between Phase 4 (Critique the Spec) and Phase 5 (Ask the Admin). It is Phase 4.5: Plan Formation.

The existing Phase 4 (Critique the Spec) stays in place — it runs before plan formation, not after. The new phase runs after Phase 4 and produces the plan that Phase 5.5 (Plan Coverage Review) will evaluate.


Phase 4.5: Plan Formation — Five Steps Before the Plan Is Written

Before drafting any plan, complete these five steps in order. Each step carries its own time budget, stated with the step. For a simple issue, the whole phase takes 10-12 minutes; for a complex issue, 15-20 minutes. These steps exist because the plan is your first commitment: it defines what you will build, where you will build it, and how you will know you are done. Forming it without grounding produces plans that pass coverage review while solving the wrong problem.

The steps are not a checklist to rush through. Each one asks you to look at something you have not looked at yet. Do them in order. The output of each step feeds the next.

Step 1: Extract the Success Criteria (Before Anything Else)

Read the issue body. Find the section that answers: what observable failure prompted this issue? Not the proposed solution. The original pain. Write it in one or two sentences in this form:

GOAL: [what the user/system can't do now that they should be able to do]
DONE WHEN: [the specific observable outcome that means the goal is achieved]

The "done when" must be verifiable by an external observer without reading the implementation. "Tests pass" is verifiable. "The feature works" is not. "Running npm test produces 0 failures and the skill chain is visible in kaizen-list-skills --show-deps" is verifiable.

Write these two lines before reading any code, before considering any solution. This anchors everything that follows. Every plan step you add later must connect back to this. If a step cannot be traced to the DONE WHEN criterion, remove it or add a step that makes the connection explicit.

If the issue body does not clearly state the observable goal, that is a signal. Either the issue is under-specified (file this as a concern in Phase 4's spec critique), or the goal is implicit and you must surface it by reading related issues or asking the admin.

Time budget: 3-5 minutes.
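The two-line form above is regular enough to check mechanically. The following is a minimal sketch, assuming a hypothetical helper named `parseSuccessCriteria` (nothing with this name exists in kaizen as far as this document shows); it only confirms that both lines are present and that neither is still an unfilled template placeholder. Whether the DONE WHEN is genuinely verifiable remains a judgment call.

```typescript
// Hypothetical helper, for illustration only. Parses the Step 1 block.
interface SuccessCriteria {
  goal: string;
  doneWhen: string;
}

// Returns the parsed criteria, or a list of problems if the block is incomplete.
function parseSuccessCriteria(text: string): SuccessCriteria | string[] {
  const problems: string[] = [];
  const field = (label: string): string => {
    const line = text.split("\n").find((l) => l.trim().startsWith(label + ":"));
    const value = line ? line.trim().slice(label.length + 1).trim() : "";
    if (value === "") {
      problems.push(`${label} is missing`);
    } else if (/^\[.*\]$/.test(value)) {
      // A value that is only "[...]" is the unfilled template, not an answer.
      problems.push(`${label} is still a template placeholder`);
    }
    return value;
  };
  const goal = field("GOAL");
  const doneWhen = field("DONE WHEN");
  return problems.length > 0 ? problems : { goal, doneWhen };
}
```

Calling it on a filled-in block returns the two strings; calling it on the raw template returns one problem per unfilled line.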

Step 2: Survey What Already Exists

Before designing a solution, survey the codebase for relevant existing tools, patterns, and infrastructure. This step prevents the most common planning failure: designing custom solutions for problems that are already solved.

Run these in order, stopping when you have enough context:

# 1. Read CLAUDE.md Key Files table — the 10-minute overview
grep -A 50 "## Key Files" CLAUDE.md | head -60

# 2. Search for existing tools related to your problem area
# If the issue involves state/storage:
grep -r "cli-section-editor\|store-metadata\|write-attachment\|store-plan" src/ --include="*.ts" -l
# If the issue involves hooks:
cat docs/hooks-design.md
# If the issue involves review/plan analysis:
ls prompts/
npx tsx src/cli-dimensions.ts list

# 3. Check package.json for relevant libraries before hand-rolling
grep -E "your_keyword" package.json

# 4. Search for existing implementations of similar logic
grep -r "similar_function_name_or_concept" src/ --include="*.ts" -l | head -10

Decide whether an existing tool is relevant by asking: does it already solve the core problem, or does it solve an adjacent problem? If it solves the core problem, use it. If it solves an adjacent problem, note how you will integrate with it. If nothing exists, state that explicitly in the plan's "Information Retrieved" section.

What you are looking for: existing storage primitives, existing CLI entry points for the problem domain, existing test harness patterns for this type of code, existing DI patterns in adjacent files.

How long this should take: 5-10 minutes. Run the grep commands in parallel where possible. You are not reading all the files — you are scanning for what exists. Once you know the landscape, stop.

Output: One sentence per relevant finding: "Found cli-section-editor.ts — named attachments already exist for issue storage. Plan will use write-attachment rather than building a new storage layer."
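The core-versus-adjacent decision and the one-sentence output can be sketched as a small data shape. This is an illustration under assumptions: the `Finding` type, its field names, and `renderSurvey` are hypothetical and are not part of the kaizen codebase.

```typescript
// Hypothetical types for illustration; the field names are assumptions.
type Relevance = "core" | "adjacent";

interface Finding {
  source: string;     // file or tool that was found
  provides: string;   // what it already does
  relevance: Relevance;
  planImpact: string; // what the plan will use or integrate with
}

// Renders one sentence per finding, or an explicit "nothing found" line.
function renderSurvey(domain: string, findings: Finding[]): string[] {
  if (findings.length === 0) {
    return [`No relevant existing tools found for ${domain}.`];
  }
  return findings.map((f) => {
    // Core findings are used directly; adjacent findings are integration points.
    const verb = f.relevance === "core" ? "Plan will use" : "Plan will integrate with";
    return `Found ${f.source}: ${f.provides}. ${verb} ${f.planImpact}.`;
  });
}
```

The empty case matters as much as the populated one: an explicit "nothing found" line records that the search happened.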

Step 3: Generate and Reject Alternatives

At the point of highest design risk in your plan — the choice that is most irreversible or has the widest blast radius — name at least two alternatives and state why you are rejecting all but one.

Highest design risk means: the choice that determines where state lives, who owns a contract between two components, or which infrastructure you build on. This is the choice you will regret if you get it wrong. If you cannot identify such a choice, your plan is either trivial (no design risk — alternatives step is lightweight) or you have not looked closely enough.

For each alternative, write:

OPTION A: [name — one line description] — SELECTED
Why it works: [one sentence]
Failure mode if I am wrong: [one sentence — what breaks if this choice is bad]

OPTION B: [name — one line description] — REJECTED
Rejected because: [specific failure mode that disqualifies it, or why A is better]

Calibration by issue complexity:

  • Simple issue (single file, no new abstractions, no state): two options minimum, one sentence each. Total: 3 minutes.
  • Medium issue (new function, one integration seam, existing tests to extend): two options, two sentences each. Total: 5 minutes.
  • Complex issue (new module, state ownership decision, inter-component contract): three options, each with a named failure mode. Total: 10-15 minutes.

Do not write more alternatives than you can evaluate in the time budget. Three is the maximum for most issues. The goal is to name the rejected path, not to exhaustively survey the design space.

The rejection rationale must name a failure mode, not a preference. "Option B seemed messier" is not a rejection rationale. "Option B loses all findings if the session dies between coordination and write — unacceptable given that sessions routinely run 30+ minutes" is a rejection rationale.
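The structural part of this step (option counts per complexity level, exactly one selection, a stated failure mode for every option) could be checked as in the sketch below. The `Option` type and `checkAlternatives` function are hypothetical names introduced for illustration. The sketch cannot tell a failure mode from a preference; that distinction still needs a reader.

```typescript
type Complexity = "simple" | "medium" | "complex";

// Hypothetical shape for one entry in the alternatives block.
interface Option {
  name: string;
  status: "SELECTED" | "REJECTED";
  failureMode: string; // "Failure mode if I am wrong" or "Rejected because"
}

// Minimum option counts taken from the calibration list above.
const MIN_OPTIONS: Record<Complexity, number> = { simple: 2, medium: 2, complex: 3 };

// Returns structural problems; an empty list means the block is well-formed.
function checkAlternatives(options: Option[], complexity: Complexity): string[] {
  const problems: string[] = [];
  if (options.length < MIN_OPTIONS[complexity]) {
    problems.push(`need at least ${MIN_OPTIONS[complexity]} options, got ${options.length}`);
  }
  const selected = options.filter((o) => o.status === "SELECTED").length;
  if (selected !== 1) {
    problems.push(`exactly one option must be SELECTED, got ${selected}`);
  }
  for (const o of options) {
    if (o.failureMode.trim() === "") problems.push(`${o.name}: no failure mode named`);
  }
  return problems;
}
```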

Step 4: Validate the Hypothesis (Treat the Proposed Fix as a Conjecture)

The issue body's "proposed fix" or "suggested approach" section is the issue author's hypothesis about what will work. It is not a specification. Before you plan to implement it, determine whether it addresses the right failure mode.

For the proposed fix, state:

HYPOTHESIS: [what the proposed fix assumes about the root cause]
VALIDATION: [what would confirm this assumption — ideally a test you can run in <15 min]
IF WRONG: [what would happen — the problem persists, or a different problem emerges]

Then run the fastest validation available:

  • If the issue has a reproduction case, reproduce it now and confirm the failure is what the issue describes.
  • If the issue cites a specific file or function as the root cause, read that file now and confirm the behavior matches the description.
  • If the issue proposes a fix that changes event ordering, a configuration value, or a regex pattern, check whether the proposed value is actually correct before planning to implement it.

Do not skip this step for "obvious" fixes. The most expensive planning failures in kaizen's history came from treating proposed fixes as specifications: a lint hook was built with 22 tests for the wrong problem (#724), an event ordering fix was implemented without confirming event ordering was actually the cause, a CI timeout was masked without addressing the slowness the issue explicitly said to fix (#816). In each case, the proposed fix was plausible — that is why the step is mandatory, not optional.

When to call validation done: When you can state: "The issue's proposed fix addresses the failure mode I can observe" or "The proposed fix does not address the failure mode — I am pivoting to [alternative] because [evidence]."

Time budget: 5-10 minutes for non-trivial issues. 0 minutes for issues where the root cause is already confirmed by a stack trace or a failing test.
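The step has exactly two acceptable endings, which makes it a natural fit for a discriminated union. The sketch below is illustrative only: `HypothesisCheck`, `Outcome`, and `closeValidation` are hypothetical names, and the returned strings paraphrase the two closing statements given above.

```typescript
// Hypothetical record types; the field names mirror the template above.
interface HypothesisCheck {
  hypothesis: string;
  validation: string;
  ifWrong: string;
}

type Outcome =
  | { kind: "confirmed" }
  | { kind: "falsified"; pivotTo: string; evidence: string };

// Produces the closing statement for Step 4 and refuses to close an incomplete check.
function closeValidation(check: HypothesisCheck, outcome: Outcome): string {
  for (const [field, value] of Object.entries(check)) {
    if (value.trim() === "") throw new Error(`${field} is empty`);
  }
  if (outcome.kind === "confirmed") {
    return "The issue's proposed fix addresses the failure mode I can observe.";
  }
  // A falsified hypothesis must carry both the pivot and the evidence for it.
  return `The proposed fix does not address the failure mode. Pivoting to ${outcome.pivotTo} because ${outcome.evidence}.`;
}
```

Because the `falsified` variant requires `pivotTo` and `evidence`, a pivot without stated evidence does not type-check.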

Step 5: Map the Testability Seams Before Placing Any Code

Before deciding where new logic will live, confirm that location is testable in isolation. This step prevents the most common implementation failure: 70 lines of correct logic placed in main() where it cannot be unit tested.

For each significant behavior in your plan, state:

BEHAVIOR: [what the logic does]
LIVES IN: [file and function/class where it will be implemented]
TESTED IN: [specific test file path that will cover it]
TEST APPROACH: [unit / integration / E2E — and why]
SEAM: [what interface boundary isolates this behavior for testing]

A "seam" is the injection point that lets a test replace a real dependency with a controlled one. If you cannot name the seam, the behavior is not testable in isolation. Extract it into a separate function or module before planning to implement it inline.

Red flags that require extraction before implementation:

  • The planned implementation location has more than 5 imports at the top of its file
  • The behavior is inside a CLI entry point (main(), a hook's top-level execution block, or a script's global scope)
  • Testing the behavior would require mocking more than 3 modules simultaneously
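The three red flags are threshold checks, so they can be written down directly. A minimal sketch, assuming a hypothetical `PlannedLocation` record that the planner fills in by hand; the thresholds are the ones listed above.

```typescript
// Hypothetical description of where a planned behavior will live.
interface PlannedLocation {
  file: string;
  importCount: number;   // imports at the top of the target file
  isEntryPoint: boolean; // main(), a hook's top-level block, or script global scope
  modulesToMock: number; // modules a test would have to mock simultaneously
}

// Returns the red flags that fire for this location.
function redFlags(loc: PlannedLocation): string[] {
  const flags: string[] = [];
  if (loc.importCount > 5) flags.push("more than 5 imports");
  if (loc.isEntryPoint) flags.push("behavior inside a CLI entry point");
  if (loc.modulesToMock > 3) flags.push("more than 3 modules to mock");
  return flags;
}

// Any single flag is enough to require extraction.
function needsExtractionStep(loc: PlannedLocation): boolean {
  return redFlags(loc).length > 0;
}
```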

If any red flag fires, add an extraction step to your plan before the implementation step. The extraction is not optional — it is cheaper to extract now than to discover testability problems after the code and tests are written.
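As an illustration of what an extraction looks like, the sketch below moves a behavior out of an entry point and gives it a seam. The behavior itself (`countTodoLines`) and the `Deps` interface are invented for this example; the point is only the shape: the logic receives its dependencies as a parameter, so a test can pass controlled ones.

```typescript
// The seam: everything the logic needs from the outside world, as an interface.
// Both Deps and countTodoLines are hypothetical, chosen only to show the pattern.
interface Deps {
  readFile(path: string): string;
  log(message: string): void;
}

// Extracted behavior: plain logic plus injected dependencies, no top-level side effects.
function countTodoLines(path: string, deps: Deps): number {
  const count = deps
    .readFile(path)
    .split("\n")
    .filter((line) => line.includes("TODO")).length;
  deps.log(`${path}: ${count} TODO line(s)`);
  return count;
}

// main() would stay thin: build the real Deps (file system, console) and call
// countTodoLines. That wiring is the only part left without a unit test.
```

In a test, `Deps` is replaced by an in-memory object, so no module mocking is needed at all.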

Time budget: 3-5 minutes.


2-6. Individual Step Reference Text

(The sections above are the canonical text for each step. Below are condensed reference versions for use in the Workflow Tasks table and Phase 5 summary.)

Step 1 (Success Criteria): Extract GOAL and DONE WHEN before reading any code. Verifiable outcome, not a task description.

Step 2 (Codebase Survey): Read CLAUDE.md Key Files, grep for existing tools in the problem domain, check package.json for libraries. 5-10 minutes. Document what you found.

Step 3 (Alternatives): Name the highest-risk design choice. Write two options minimum. Rejection rationale must name a failure mode.

Step 4 (Hypothesis Validation): State the proposed fix's assumption. Run the fastest test to confirm or falsify it. 5-10 minutes for non-trivial issues.

Step 5 (Testability Seams): For each behavior: name the file, the test file, the approach, and the seam. Extract before implementing if any red flag fires.


7. The Full Integrated Phase

The following is the complete Phase 4.5 as it would appear in the SKILL.md file.


Phase 4.5: Plan Formation

Before writing any plan, form it through five grounding steps. These steps exist because the first plan you write without grounding will address what the issue says to build, not what will make the problem stop happening. The grounding takes 10-20 minutes. It prevents the 30-minute implementation of the wrong thing.

Extract the success criteria first. Read the issue body. Find the observable failure that motivated the issue — not the proposed fix, the original pain. Write it in two lines:

GOAL: [what the user/system can't do now]
DONE WHEN: [the specific verifiable outcome that means it's fixed]

Verifiable means: an external observer can check it without reading the implementation. Write this before you look at any code. Every plan step you add must connect back to DONE WHEN. Steps that don't are building infrastructure, not solving the problem.

Survey what already exists. Before designing a solution, read CLAUDE.md's Key Files table. Then grep for existing tools in your problem domain:

# Storage/attachment problems:
grep -r "cli-section-editor\|write-attachment\|store-plan\|store-metadata" src/ --include="*.ts" -l

# Hook problems:
cat docs/hooks-design.md

# Review/dimension problems:
npx tsx src/cli-dimensions.ts list && ls prompts/

For each existing tool you find: does it solve the core problem, or an adjacent one? If it solves the core problem, use it. If it solves an adjacent problem, note the integration point. State what you found (or that nothing was found) in the plan's "Information Retrieved" section. Skipping this step is how plans design custom storage over cli-section-editor.ts, which already exists and is tested.

Generate and reject at least one alternative. Identify the highest-risk design choice in your plan — the choice that determines where state lives or who owns an inter-component contract. Write two options and reject all but one:

OPTION A: [description] — SELECTED
Failure mode if wrong: [one sentence]

OPTION B: [description] — REJECTED
Rejected because: [specific failure mode that disqualifies it]

The rejection rationale must name a failure mode, not a preference. "Cleaner" is not a failure mode. "Loses all state if the session dies before the batch write completes" is a failure mode. If there is no irreversible choice in your plan, two options with one-sentence rationale is sufficient. If the plan touches state ownership or interface contracts, three options with named failure modes.

Validate the proposed fix's assumption. The issue body's "proposed fix" is the issue author's best guess. Before planning to implement it, state what it assumes and run the fastest test to confirm or falsify:

HYPOTHESIS: [what the proposed fix assumes about the root cause]
VALIDATION: [what you will run or read to confirm — must take <15 min]
IF WRONG: [what evidence would disqualify this hypothesis]

Run the validation before committing to the plan. For a code behavior issue: reproduce the failure and confirm it matches the description. For a configuration or regex issue: check the proposed value against a concrete case. For an architecture issue: read the affected file and confirm the structure matches what the issue describes. Do not skip this for "obvious" fixes — three of kaizen's most expensive multi-PR cycles came from planning implementations of plausible but wrong hypotheses.

Map the testability seams before placing any code. For each significant behavior in the plan, state:

BEHAVIOR: [what it does]
LIVES IN: [file.ts, functionName()]
TESTED IN: [tests/test_file.ts or tests/test_file.sh]
SEAM: [the injection point that isolates this for testing]

If you cannot name the seam, the behavior is not testable in isolation. Add an extraction task before the implementation task. Red flags requiring extraction: the target location has more than 5 imports, the location is a CLI entry point or script's global scope, or testing it would require mocking more than 3 modules. Extract first, implement second; the extraction step is never optional.

Write the plan. With all five steps complete, write the plan using this structure:

## Success Criteria
GOAL: [from step 1]
DONE WHEN: [from step 1]

## Information Retrieved
- [source]: [what you found] → [how it changes or confirms the plan]
- (or: "No relevant existing tools found for [domain]")

## Design Alternatives Considered
### Option A: [description] — SELECTED
Failure mode if wrong: ...

### Option B: [description] — REJECTED
Rejected because: ...

## Tasks
[Ordered, concrete, traceable to DONE WHEN]

## Seam Map
[Per-behavior: file, test file, seam]

## Test Plan
[Per-task: what invariant is tested, which test file, unit/integration/E2E]

Store the plan immediately after writing it:

npx tsx src/cli-structured-data.ts store-plan --issue {N} --repo "$ISSUES_REPO" --file plan.md

Then proceed to Phase 5 (Ask the Admin). The plan coverage review (Phase 5.5) runs after the admin approves direction. The plan formed here is the input to that review.

Time budget: Simple issue (single file, no new abstractions): 10-12 minutes total. Complex issue (new module, state decision, multi-component wiring): 15-20 minutes. If this phase is taking longer than 20 minutes, you are either designing rather than surveying (go back to step 2 and find what already exists) or the issue requires /kaizen-prd before evaluation.
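The plan structure in the integrated phase has six fixed headings, so their presence can be checked before the plan is stored. The following is a sketch, not the behavior of `store-plan`: the `missingSections` function is hypothetical, and it checks only that each heading exists, not that its content is meaningful.

```typescript
// Section headings from the plan structure in the integrated phase.
const REQUIRED_SECTIONS = [
  "Success Criteria",
  "Information Retrieved",
  "Design Alternatives Considered",
  "Tasks",
  "Seam Map",
  "Test Plan",
];

// Returns the required "## " headings that are absent from the plan markdown.
function missingSections(planMarkdown: string): string[] {
  const headings = new Set(
    planMarkdown
      .split("\n")
      .filter((line) => line.startsWith("## ")) // "### Option A" lines do not match
      .map((line) => line.slice(3).trim()),
  );
  return REQUIRED_SECTIONS.filter((section) => !headings.has(section));
}
```

A caller could treat a non-empty result as a warning rather than an error, so that a plan is never lost because a heading was misspelled.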


Style Notes for Integration

The text above is written in the same imperative voice as the existing phases. Key conventions matched:

  • Phases use present-tense imperatives ("Write this before...", "State what you found...", "Run these in order...")
  • Each phase states why the step exists before stating what to do, using concrete incident references where they exist
  • Code blocks for exact commands; prose for judgment calls
  • Time budgets are given explicitly rather than implied
  • Red flags are named as patterns, not as vague warnings

The "Full Integrated Phase" (section 7) is the actual SKILL.md text. Sections 2-6 are reference material for this document.

Insertion point: After the existing "Phase 4: Critique the Spec" and before the existing "Phase 5: Ask the Admin." The numbering becomes: Phase 3 (Assess Implementation) → Phase 3.5 (Hypotheses) → Phase 3.7 (Architecture Fitness) → Scope Reduction Discipline → Phase 4 (Critique Spec) → Phase 4.5 (Plan Formation) → Phase 5 (Ask Admin) → Phase 5.5 (Plan Coverage Review) → Phase 6 (Capture Lessons).

Why not earlier? Phase 4.5 comes after the spec critique because the plan should reflect what the critique found. If Phase 4 reveals that the spec's proposed solution is wrong, Phase 4.5's hypothesis validation step picks up that finding before the plan commits to the wrong approach. Running Phase 4.5 before Phase 4 would produce a plan that the spec critique then contradicts.

Why not inside kaizen-implement? The five steps are pre-implementation work. kaizen-implement receives a plan that already has success criteria, alternatives, and seam maps. Moving this work into evaluate keeps implement as a pure execution engine — it should not need to discover basic design decisions mid-implementation.

Round 4: Minimum Viable Change

March 2026 — Ruthless prioritization after three rounds of design


1. Leverage Analysis

The five failure categories, scored by cost to fix and impact if fixed:

Failure category 1: Goal vs. work-item extraction

  • Impact: HIGH. This is the root cause of "correct implementation, wrong outcome" failures. Issue #666 is the example: the schema was built perfectly, yet zero SKILL.md files were populated. The plan satisfied the stated requirements while missing the actual goal. When this fires, the entire implementation effort is wasted.
  • Fix cost: LOW. The fix is one instruction addition to kaizen-evaluate Phase 5: "Before presenting the plan, write one sentence explicitly connecting the plan's deliverables to the issue's observable failure." No new infrastructure. No new CLI commands. One sentence check.

Failure category 2: Hypothesis-as-contract

  • Impact: MEDIUM-HIGH. Plans treat proposed solutions as specifications rather than hypotheses to validate. Costs multi-PR fix cycles when the hypothesis was wrong.
  • Fix cost: LOW. kaizen-implement already has "Surface the encoded hypothesis" in its spec re-examination step. The gap is in kaizen-evaluate during plan formation: hypothesis validation is not required before the plan is presented to the admin. Adding the HYPOTHESIS/VALIDATION/IF WRONG triple to Phase 3.5 and making it mandatory in the plan output is one instruction change.

Failure category 3: Single design considered

  • Impact: MEDIUM. Plans with no alternatives surface design risks only after implementation reveals them.
  • Fix cost: MEDIUM. This requires a cultural habit, not just an instruction. The instruction is easy ("list two alternatives and reject one explicitly"), but the agent must actually do it — and the check only has value if someone verifies the alternatives are genuine rather than strawmen. Needs enforcement or a schema check to have real leverage.

Failure category 4: Testability not assessed pre-implementation

  • Impact: MEDIUM-HIGH. Logic placed in untestable locations (main(), hook entry points, top-level execution) creates entire test-free zones that accumulate bugs silently.
  • Fix cost: LOW-MEDIUM. kaizen-evaluate Phase 3.7 (Architecture & Tooling Fitness) already asks "Can we test this E2E?" but doesn't ask where the new behavior will live and whether that location is testable. Adding one question to Phase 3.7 costs nothing. The harder part is enforcement — but even advisory guidance at plan time catches the most egregious cases.

Failure category 5: Plan before codebase survey

  • Impact: HIGH. Issue #957 illustrates this: a custom storage system was built over src/section-editor.ts, which already does exactly that. When this fires, the implementation duplicates existing infrastructure and creates two competing systems.
  • Fix cost: LOW. kaizen-evaluate Phase 3.7 already asks "What existing patterns in the codebase should we reuse?" but this is presented as one row in a table with no enforcement weight. Elevating it to a mandatory named step — "Before designing any new abstraction, grep for existing implementations" — costs nothing but has high leverage.

The 2×2:

              LOW COST          HIGH COST
HIGH IMPACT   Goal extraction   —
              Codebase survey
              Testability
MEDIUM IMPACT Hypothesis-as-    Design alternatives
              contract          (needs enforcement)
LOW IMPACT    —                 Category Library
                                (deferred)

Highest leverage: Goal extraction and codebase survey. Both are HIGH impact, LOW cost — they require instruction changes, not infrastructure. Goal extraction is slightly higher leverage because it catches "wasted entire implementation" failures rather than "built the wrong abstraction" failures.
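The ranking can be reproduced with a simple score. The sketch below is an assumption-laden illustration: the numeric scale (LOW = 1 through HIGH = 3) and the use of impact divided by cost as "leverage" are choices made here, not something the analysis above defines. The impact and cost labels are the ones stated per category.

```typescript
type Level = "LOW" | "LOW-MEDIUM" | "MEDIUM" | "MEDIUM-HIGH" | "HIGH";

// Assumed numeric scale; the document only gives the labels.
const SCORE: Record<Level, number> = {
  LOW: 1,
  "LOW-MEDIUM": 1.5,
  MEDIUM: 2,
  "MEDIUM-HIGH": 2.5,
  HIGH: 3,
};

interface Category {
  id: number;
  name: string;
  impact: Level;
  cost: Level;
}

const CATEGORIES: Category[] = [
  { id: 1, name: "Goal vs. work-item extraction", impact: "HIGH", cost: "LOW" },
  { id: 2, name: "Hypothesis-as-contract", impact: "MEDIUM-HIGH", cost: "LOW" },
  { id: 3, name: "Single design considered", impact: "MEDIUM", cost: "MEDIUM" },
  { id: 4, name: "Testability not assessed", impact: "MEDIUM-HIGH", cost: "LOW-MEDIUM" },
  { id: 5, name: "Plan before codebase survey", impact: "HIGH", cost: "LOW" },
];

// Leverage as impact per unit of cost (an assumed definition).
function leverage(c: Category): number {
  return SCORE[c.impact] / SCORE[c.cost];
}

// Sorts a copy, highest leverage first; ties keep their original order.
function rankByLeverage(categories: Category[]): Category[] {
  return [...categories].sort((a, b) => leverage(b) - leverage(a));
}
```

Under these assumptions, categories 1 and 5 tie for first place, which matches the conclusion above; the scale cannot express the finer argument that goal extraction edges out codebase survey.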


2. The Minimum Viable Change

The single change: Add a mandatory pre-plan checklist to kaizen-evaluate as a new Phase 3.6, directly after Phase 3.5 and before the plan is formulated and presented to the admin.

File: /home/aviadr1/projects/kaizen/.claude/skills/kaizen-evaluate/SKILL.md

Where: Insert after the existing Phase 3.5 hypothesis block (after line "See experiments/README.md for experiment patterns...") and before Phase 3.7.

What gets added:

### Phase 3.6: Pre-plan quality checks — MANDATORY before presenting plan

Before presenting any plan to the admin, answer these four questions in writing. If you cannot answer all four, the plan is not ready.

**1. Goal connection** (failure category: goal vs. work-item extraction)
Write one sentence: "If this plan is executed perfectly, the observable failure described in the issue will no longer occur because ___."
If you cannot write this sentence, the plan addresses requirements but not the goal. Stop and revise.

**2. Codebase survey** (failure category: plan before survey)
For each new abstraction, module, or storage mechanism the plan introduces: run `grep -r "<purpose>" src/ --include="*.ts" -l`. List what you found. If existing infrastructure covers the need, use it. If not, say "searched, nothing found."
A plan that proposes new infrastructure without a search result is not ready.

**3. Hypothesis validation** (failure category: hypothesis-as-contract)
If the issue contains a "proposed fix" or "suggested approach": state it as a hypothesis and name one condition that would prove it wrong. If the hypothesis is unvalidated and the implementation is non-trivial, note this explicitly — the admin may want a validation step before full implementation.

**4. Test location** (failure category: testability not assessed)
For the most significant new behavior in the plan: name the file where the test will live and confirm that file can import the behavior without triggering side effects. If the behavior will be in a hook entry point, CLI main(), or top-level execution block, flag it — behavior in those locations cannot be unit-tested.
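The "answer all four or the plan is not ready" rule can be sketched as a gate over a small record. Everything named here (`PrePlanAnswers`, its fields, `unansweredQuestions`) is hypothetical and exists only to show the rule's shape; the entry-point flag from question 4 is left out because it is advisory rather than blocking.

```typescript
// Hypothetical shape for the four written answers; field names are assumptions.
interface PrePlanAnswers {
  goalConnection: string;                // the completed "...because ___" sentence
  newInfrastructure: string[];           // abstractions, modules, or storage the plan adds
  surveyResults: Record<string, string>; // per item: what was found, or "searched, nothing found"
  proposedFix: string | null;            // null when the issue proposes no fix
  wouldProveWrong: string;               // one condition that falsifies the hypothesis
  testFile: string;                      // where the test for the main behavior will live
}

// Returns the questions still unanswered; an empty list means the plan can be presented.
function unansweredQuestions(a: PrePlanAnswers): string[] {
  const open: string[] = [];
  if (a.goalConnection.trim() === "" || a.goalConnection.includes("___")) {
    open.push("1. Goal connection");
  }
  // New infrastructure without a recorded search result is not ready.
  if (a.newInfrastructure.some((item) => !(a.surveyResults[item] ?? "").trim())) {
    open.push("2. Codebase survey");
  }
  if (a.proposedFix !== null && a.wouldProveWrong.trim() === "") {
    open.push("3. Hypothesis validation");
  }
  if (a.testFile.trim() === "") {
    open.push("4. Test location");
  }
  return open;
}
```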

What it costs: 5-10 minutes added to plan formation per issue. The questions force the agent to do four lookups it would otherwise skip. For trivial issues, three of the four answers will be one sentence. For complex issues, the answers will surface the most dangerous assumptions before implementation begins.

What failure categories it addresses:

  • Directly: categories 1 (goal extraction), 2 (hypothesis-as-contract), 4 (testability), 5 (codebase survey)
  • Indirectly: reduces category 3 (design alternatives) by forcing explicit articulation of what the plan is solving, which naturally surfaces alternatives

3. Implementation Order

If we implement everything from the three R3 documents eventually, the right sequence is:

Step 1 (now): Phase 3.6 in kaizen-evaluate SKILL.md — four mandatory pre-plan questions. Zero infrastructure cost. Immediately captures value. This is the MVC.

Step 2 (next sprint): Plan schema in kaizen-implement — add the ## Information Retrieved and ## Design Alternatives Considered sections to the plan template in kaizen-implement Step 0b. This is already a structured markdown block; adding two required sections costs one PR and creates the audit trail that makes "was this done?" checkable. The store-plan schema validation (from r3-information-architecture.md) is a natural companion: warn (not error) when sections are absent.

Step 3 (after evidence): FSI store — bootstrap src/cli-experience.ts with the 10 categories from r3-category-library.md encoded as YAML files under .claude/kaizen/categories/. Wire the query command into Phase 0.7 of kaizen-evaluate. This requires new infrastructure but the infrastructure is self-contained. Do this after Step 1 has run for 4-6 evaluations, so the FSI categories are calibrated against real retrieval gaps rather than theoretical ones.

Step 4 (deferred): Plan battery as a separate review pass — the full 7-dimension battery from r3-plan-battery.md, invoked after store-plan and before implementation. This is the most rigorous enforcement but also the highest cost per evaluation. Build it after the FSI store is populated — the battery's codebase-survey dimension benefits from FSI retrieval to know what to search for.

Steps 3 and 4 depend on each other: the FSI store gives the battery context, and the battery's FAIL findings are the primary source of new FSI entries. Build them in order.


4. Quick Win vs. Real Fix

Phase 3.6 is both the quick win AND a significant portion of the real fix.

The real fix (from R3's full vision) is: structured retrieval before plan formation + plan schema enforcement + FSI confidence-weighted retrieval that improves over time. That system would catch failures that the four questions miss — failures where no question in Phase 3.6 is relevant but a past FSI entry would have warned.

But the failure evidence shows that the four questions in Phase 3.6 directly address 4 of the 5 failure categories and the proximate cause of 7/8 observed failures. The full Category Library adds sophistication; Phase 3.6 addresses the most common failure modes with near-zero infrastructure cost.

Phase 3.6 buys time by preventing the failures that are most expensive (wasted implementations), while the FSI store is built and calibrated. It is not a workaround — it is the minimum form of the real fix with infrastructure deferral.


5. What NOT to Do First

Do not build the Category Library first. The Category Library (r3-category-library.md) is the most intellectually complete of the three R3 proposals. It is also the most expensive: 10 YAML category files, a new CLI tool with three subcommands, a recognition algorithm, a feedback loop, and changes to two SKILL.md files. It requires the FSI store as prerequisite infrastructure and the FSI store requires bootstrap data from real evaluations.

Building the Category Library first inverts the feedback loop. The library needs incident data to be calibrated, and incident data only accumulates after the retrieval step is wired in. Building the library before the retrieval step is wired means building infrastructure that cannot yet be validated.

The Category Library is also least leveraged against the five specific failure categories from the GitHub evidence. Most of those failures would have been caught by a simpler "did you search for existing implementations" check — not by a sophisticated category recognition algorithm.

Do not build the full Plan Battery as step 1. The Plan Battery (r3-plan-battery.md) is a seven-dimension review pass with stop gates and a 2-revision limit. It is correct in design but expensive in infrastructure: it requires the battery framework, dimension prompts, a gate manager integration, and a new review invocation point. This is 3-4 PRs of work. By rough estimate, Phase 3.6 delivers about 70% of the battery's value (goal-traceability, codebase-survey, hypothesis-validation, testability-preflight) at about 5% of the cost.


6. The Single-Sentence Change

If only one sentence can change in either SKILL.md:

File: kaizen-evaluate/SKILL.md

Location: Phase 3, in the "Assess the implementation as a whole" section, after the bullet "Is this testable end-to-end?"

Change: replace the current single question with this sentence:

Before proposing any design, write one sentence that completes: "If this plan succeeds, the observable failure in the issue will no longer occur because ___" — if you cannot write it, the plan addresses requirements but not the goal.

This sentence catches category 1 (goal extraction), which is the highest-leverage single failure. It is not a checklist item; it is an inversion of how plans are typically structured. Plans are naturally written as task lists ("implement X, add Y"). This sentence forces the agent to work backwards from the observable outcome, which is the only way to catch plans that build correct infrastructure while missing the actual goal.

The sentence does not require infrastructure. It does not require a new phase. It can be inserted in a one-line PR. And it directly addresses the failure pattern where a plan passes coverage review (all requirements addressed) while missing the problem it was supposed to solve.


The theme of this analysis: the existing kaizen-evaluate SKILL.md is already thorough. Phase 3.7 asks about architecture fitness. Phase 4 critiques the spec. Phase 5.5 runs plan-coverage review. The gap is not missing phases — it is missing pre-plan articulation discipline. Phase 3.6 adds four mandatory sentences before the plan solidifies. Four sentences that, if answered honestly, would have prevented most of the observed failures.

R4 Pre-Mortem: Why It's Worse Six Months Later

Adversarial evaluation — March 2026


Premise

It is September 2026. All three systems are live. Plan Battery runs before every implementation. The Information Architecture retrieves FSI entries, linked issues, and docs at planning time. The Category Library holds 15 categories with structural tests and confidence scores.

And things are measurably worse. Here is why.


1. Plan Battery Failure Modes

Overhead creep

The battery launched at 7 dimensions. Within two months, every post-mortem added a dimension. The orchestrator-batch incident added "interface-ownership." A worktree contamination bug added "state-filter-preflight." A deferred-test incident added "test-count-floor." By month five, the battery has 13 dimensions. Plans that used to take 15 minutes to form now take 45 minutes to pass the battery. Small 2-line fixes require a battery run because the trigger condition is "all plans." An agent trying to rename a variable is blocked by dimension 9 (codebase-survey) because the battery cannot distinguish a trivial change from a design decision.

The false-positive-risk mitigations in each dimension say "apply only when X" — but those caveats are prose. The agent applies all dimensions to everything because the prompt doesn't distinguish. Issue #947 documented this exact problem (skill prompt bloat) before the battery existed. The battery re-instantiated it at a smaller scale.

False confidence

The battery passes. The plan fails anyway. Not because the battery dimensions are wrong — they correctly check for goal-traceability, hypothesis-validation, and testability. It fails because the battery evaluates the plan document, not the plan's grounding in reality.

An agent that understands the battery writes a plan that contains all seven required sections with plausible text. The goal-traceability section says "this will make skill X run correctly." The codebase-survey section says "I searched src/ and found no existing tools for this." Both statements are correct-looking but unverifiable from the plan document alone. The battery cannot call grep. It reads text. A plan that is structurally well-formed but factually wrong passes all seven dimensions with flying colors.

The concrete example: issue #957's failure (custom storage built over existing tools) could survive the battery if the plan's codebase-survey section simply stated "no existing tool found" — a false statement, but one the battery has no mechanism to verify.

Gaming

This is not malicious gaming. It is learned optimization. Agents are trained on successful trajectories. After 50 battery runs, agents have internalized the patterns that produce PASS. They write plans that sound like goal-traceability passes: "this addresses the goal because the observer will see skill X functioning." They write plans that sound like hypothesis-validation passes: "the proposed approach will be validated by running the smoke test." The form is correct. The substance — the actual investigation, the actual grounding — is not there. The battery measures the expression of rigor, not rigor itself. These are different things, and they diverge over time.

Autonomy loss

The battery's escalation path is: fail twice → escalate to human. For a system designed to reduce admin burden, this creates a new class of interruptions: plan-review escalations. The admin now fields "battery keeps blocking, please override" requests for issues that would have shipped in one session under the old system. The battery was supposed to make plans better autonomously. Instead it introduced a second gate that occasionally requires human adjudication. The problem the design pause was meant to solve — admin burden — has been redistributed, not reduced.

Most likely: false confidence. The battery creates a documented artifact trail that looks like quality assurance. This false trail makes it harder, not easier, to see that plans are still failing for the same reasons. The battery passed; the implementation failed; the conclusion is "the implementation phase is the problem" — and the real failure (ungrounded plan claims) goes unaddressed.

Most catastrophic: autonomy loss. If the battery blocks frequently enough that agents route around it — treating the override path as the normal path — the battery becomes bureaucratic overhead with no enforcement value. This is the mechanism by which all L1 enforcement eventually fails (issue #947 documents this directly). A battery that is routinely overridden is not a battery.


2. Information Architecture Failure Modes

Retrieval noise

The trigger matrix fires correctly. For a medium-complexity issue with one hook reference, three state keywords, and two linked issues, the retrieval phase pulls: FSI index (1s), top-3 FSI entries (3s), two linked issue bodies (4s), hooks-design.md (2s), and policies.md rule 5 (1s). Total: 11 seconds, 4,000 tokens of context injected before plan formation begins.

The agent now has more information than it had before. It also has more information to be wrong about. The FSI entry for accumulate-then-flush is retrieved, read, and then not applied — because the agent's design doesn't look like accumulate-then-flush to the agent even though it is. The retrieval happened. The recognition didn't. The 4,000 tokens were processed but not integrated into the plan's design choices.

The was_category_retrieved_at_design_time: false field was supposed to diagnose this. But the field is only written if the design fails and a post-mortem runs. For the 70% of plans that fail quietly — merge successfully, regress two months later — there is no post-mortem, no FSI update, and the retrieval failure is invisible.

Stale data

The FSI was bootstrapped with 5 entries from incidents in early 2026. By September, the codebase has changed. The worktree isolation pattern documented in fsi-003 references isStateForCurrentWorktree — a function that was refactored in August and renamed. The preferred pattern now points to a function that doesn't exist. The retrieval serves this entry with confidence: 0.90 (unchanged since bootstrapping — no incident confirmed it in six months because the pattern worked, so no feedback was written). The agent follows the preferred shape, calls the renamed function, gets a runtime error, concludes the FSI is unreliable, and stops reading it.

Most likely: retrieval noise / false integration. The retrieval mechanism works as designed. The integration step — "agent reads retrieved info, plan reflects retrieved info" — is the same L1-level instruction that the proposals were meant to replace. The Information Retrieved section in the plan schema becomes a checkbox: "FSI: fsi-042 — noted." Not applied. Auditable, but hollow.


3. Category Library Failure Modes

Category capture

The library grows from 10 to 15 categories. New issues arrive. The recognition algorithm runs Pass 1 (keyword scan). Every issue touches at least one keyword: "state," "hook," "test," and "session" appear in 80% of kaizen issues. Pass 2 applies structural tests to candidates. An agent under time pressure picks the closest match. A novel failure mode (say, a race condition in the new worktree-parallel review system) gets classified as session-boundary-state because it involves state and sessions. The preferred shape (write-through) is applied. It doesn't help. The race condition is about read ordering, not write timing. The category match was wrong. The agent didn't flag it as wrong because the structural test was ambiguous and the agent resolved ambiguity in favor of a match.

The library was designed to say "novel territory: full exploration." But the recognition algorithm makes it structurally easier to match than to not-match. Pass 1 returns candidates for anything that shares vocabulary. Pass 2 requires the agent to actively decide the structural test doesn't apply. Agents under pressure take the match.

Outdated categories

The enforcement-level-mismatch category has confidence 0.90. It was derived from a structural invariant, not incidents. By month 6, the kaizen architecture has changed: L2 hooks are now auto-generated from policies rather than manually written. The category's "fragile shape" (add-to-claudes-md) is no longer the failure mode — the new failure mode is "policy written but auto-generation trigger condition wrong." The category is confidently wrong. Its confidence never decays because no incident is filed against it — the old failure mode no longer occurs, and the new failure mode isn't recognized as this category.

Confidence miscalibration

Confidence is computed from incident count. Categories with many incidents have high confidence. But "many incidents" means the failure kept happening — which means the category, despite being recognized, didn't prevent recurrence. The formula rewards repetition, not prevention. A category that fires once and never recurs gets low confidence and is excluded from fast-path retrieval. A category that fires eight times has confidence 0.90 and is retrieved eagerly. The library's most trusted knowledge is its most repeated failures.
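
A minimal sketch of why a formula driven by incident count rewards repetition. The library's actual confidence formula is not shown in this document, so `confidenceFromIncidents` (here n / (n + 1)) and the `FAST_PATH_THRESHOLD` of 0.8 are illustrative assumptions only.

```typescript
// Hypothetical stand-in for a confidence formula driven only by incident count.
// Assumption: confidence grows monotonically with incidents, here n / (n + 1).
function confidenceFromIncidents(incidentCount: number): number {
  return incidentCount / (incidentCount + 1);
}

// Hypothetical cutoff for eager ("fast-path") retrieval.
const FAST_PATH_THRESHOLD = 0.8;

function isFastPath(incidentCount: number): boolean {
  return confidenceFromIncidents(incidentCount) >= FAST_PATH_THRESHOLD;
}

// A category that fired once and never recurred (prevention worked).
const preventedOnce = confidenceFromIncidents(1);
// A category that kept firing even though it was recognized.
const recurredEightTimes = confidenceFromIncidents(8);

console.log({
  preventedOnce,
  recurredEightTimes,
  fastPathAfterOne: isFastPath(1),
  fastPathAfterEight: isFastPath(8),
});
```

Under any monotone formula of this kind, the category that kept recurring outranks the one that was prevented after a single incident, which is the miscalibration described above.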

Most likely: category capture. The library's design assumes agents will correctly identify novel territory and escalate to full exploration. This assumption is the same class of assumption that the entire design pause is trying to eliminate: relying on agents to notice what they're missing. Category capture is the library version of the same failure. The agent doesn't know it's in novel territory until after the plan fails.


4. Second-Order Failures

Confident wrongness

The worst outcome requires all three systems operating as designed. Issue arrives. Information retrieval fires: FSI entry retrieved for category X. Category recognition confirms X. Plan battery evaluates plan: goal-traceability PASS, codebase-survey PASS (the FSI entry mentioned the right tools, plan cited them), all dimensions PASS. Agent proceeds to implementation with documented confidence: category matched, retrieval happened, battery passed.

The category was wrong. The FSI entry was stale. The codebase survey section in the plan cited tools that exist but don't address the actual design question. The battery passed because the plan's text was well-formed.

Under the old system, this failure would have produced a flawed plan with no paper trail. Under the new system, it produces a flawed plan with extensive documentation showing due diligence was performed. Post-mortem is harder: every checkbox was checked. The failure is buried under three layers of process evidence.

Bypass as optimization

Information retrieval adds 15 seconds to planning. For a simple fix, that overhead is proportionally large. Agents learn (or are instructed, via someone's L1 shortcut) to run retrieval only for "complex" issues. The battery adds 5 minutes. For a rename, that's wasteful. An explicit "skip battery" path exists (the override mechanism) and is used. Within three months, bypass is the normal path for 60% of issues — the "small" ones. The systems apply to the 40% of issues that humans already review carefully. The 60% that needed more discipline is unchanged.

Contradictory signals

The battery's design-alternatives dimension says: "this design was considered against two alternatives." The category library says: "this design matches session-boundary-state; preferred shape is write-through." The plan chose accumulate-then-flush based on the alternatives analysis (judged write-through too complex for this case). The battery passes (alternatives were considered). The category library is being ignored (preferred shape was rejected).

No system adjudicates this. The agent has passing signals from both systems and a design choice that violates one of them. The audit trail shows the decision was made, not whether it was right.


5. The Autonomy Paradox

All three systems were designed to increase plan quality autonomously — to reduce admin interrupts. Six months later:

Plan Battery creates a new escalation path (battery fail after 2 rounds → human). Information retrieval creates a new class of staleness bugs that require human remediation. Category library requires human review for new categories and retirement decisions.

The total admin surface area has expanded. The admin is now consulted on: the same plan quality questions as before (via battery escalations), plus new questions about FSI staleness, category retirement, and trigger matrix false positives.

The failure mode of "too many checkpoints" is specific to kaizen's autonomy contract. Kaizen's value proposition is that it handles kaizen. Every additional step that has an exception path to the admin weakens that contract. The admin started using kaizen to reduce kaizen-administration burden. Three systems with three escalation paths triple that burden on the exception paths. If the exception paths together fire on 10% of issues, on top of the 20% of issues where the admin was already being interrupted, the net result is interruption on up to 30% of issues, which is worse than baseline.

An autonomy-increasing version of these proposals would eliminate escalation paths, not add them. The battery would be advisory (soft block), not mandatory. The category library would fail open (no match → proceed without a category, log the gap), not fail to "full exploration mode" that implies admin review. The information retrieval would inject context silently without requiring the plan to cite it. The systems would be observability layers, not checkpoints.


6. The Meta-Failure

The real question: what if plan quality is not the bottleneck?

The GitHub evidence shows 8 failures. In 7 of 8, the issue body contained sufficient information. The failures were in application, not availability. But "application" is not a monolithic thing. Application failures at the plan stage are visible in the plan document. Application failures at the implementation stage are visible in the diff. Application failures in hypothesis formation are visible in what questions the plan doesn't ask.

If the real bottleneck is not plan quality but implementation fidelity — agents executing plans incompletely, deferring acceptance criteria, implementing the happy path and omitting the edge cases — then improving plan quality shifts waste without eliminating it. The 3 undelivered acceptance criteria in PR #894 were not a plan quality problem. The plan listed them. The implementation dropped them. Better plans wouldn't have changed that.

Watch for these signals in the first 3 months:

  • Battery pass rate above 85% in the first month. If the battery rarely fails, it is calibrated too loosely to catch real problems, or agents have learned to write battery-passing plans quickly. Either way, it's not doing work.
  • Post-merge issues filed for acceptance criteria not delivered. If this count doesn't drop, implementation fidelity is the bottleneck, not plan quality.
  • Admin override rate above 20%. If the battery is overridden often, it is either miscalibrated or the autonomy tax is too high. Both mean the system isn't working.
  • FSI retrieval count per plan above 3 entries. Retrieval should be selective. If every plan retrieves many entries, the trigger matrix is too broad and noise is growing.
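
The four signals above can be sketched as a single check over collected metrics. The thresholds (85%, 20%, 3 entries) come from the list; the `PlanningMetrics` shape, its field names, and the signal labels are hypothetical, and the sketch assumes such metrics are being recorded somewhere.

```typescript
// Hypothetical metric names; only the thresholds come from the list above.
interface PlanningMetrics {
  batteryPassRate: number;                 // fraction of battery runs that pass, 0..1
  adminOverrideRate: number;               // fraction of battery runs overridden by the admin, 0..1
  avgFsiEntriesPerPlan: number;            // mean number of FSI entries retrieved per plan
  undeliveredCriteriaIssuesBefore: number; // post-merge issues for undelivered criteria, before rollout
  undeliveredCriteriaIssuesAfter: number;  // the same count over a comparable period after rollout
}

function earlyWarningSignals(m: PlanningMetrics): string[] {
  const signals: string[] = [];
  if (m.batteryPassRate > 0.85) {
    signals.push("battery-pass-rate-above-85%");
  }
  if (m.undeliveredCriteriaIssuesAfter >= m.undeliveredCriteriaIssuesBefore) {
    signals.push("undelivered-criteria-not-dropping");
  }
  if (m.adminOverrideRate > 0.2) {
    signals.push("admin-override-rate-above-20%");
  }
  if (m.avgFsiEntriesPerPlan > 3) {
    signals.push("fsi-retrieval-too-broad");
  }
  return signals;
}

const firedSignals = earlyWarningSignals({
  batteryPassRate: 0.9,
  adminOverrideRate: 0.1,
  avgFsiEntriesPerPlan: 4,
  undeliveredCriteriaIssuesBefore: 5,
  undeliveredCriteriaIssuesAfter: 5,
});
console.log(firedSignals);
```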

7. The Hardest Case

It is August 2026. A new epic arrives: parallelize the review battery so dimensions run concurrently in separate worktree sessions. The goal is to reduce review time from 40 minutes to 10 minutes.

The plan enters the pipeline. Information retrieval fires on "parallel," "worktree," "session." FSI entry fsi-003 is retrieved: worktree-isolation category, confidence 0.90. Preferred shape: worktree-keyed-state. The category library confirms: category worktree-isolation, structural test YES. Plan battery runs: goal-traceability PASS, codebase-survey PASS (cites state-utils.ts and isStateForCurrentWorktree), design-alternatives PASS (agent-parallel vs. orchestrator-spawn considered), all dimensions PASS.

The plan ships. The implementation is correct for everything the systems checked. But the actual failure is in a seam none of the systems modeled: concurrent review dimensions write findings to the same GitHub PR body using section-editor.ts, which does a read-modify-write on the PR body. Under concurrent writes, the last write wins. Three dimensions complete simultaneously. Two of their findings are silently overwritten.
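
The lost update can be reproduced with a toy in-memory model. This is a sketch under stated assumptions: `SharedPrBody` stands in for the GitHub PR body, and the read-then-write sequence models the read-modify-write described above. It does not reproduce the real section-editor.ts, whose code is not shown here.

```typescript
// Toy in-memory model of a shared PR body (hypothetical; not the real section-editor.ts).
class SharedPrBody {
  private body = "";

  read(): string {
    return this.body;
  }

  // Whole-body replace, as a read-modify-write editor does when it saves.
  write(newBody: string): void {
    this.body = newBody;
  }
}

function simulateConcurrentFindings(dimensions: string[]): string {
  const pr = new SharedPrBody();
  // Step 1: every dimension reads the body before any of them writes.
  const snapshots = dimensions.map(() => pr.read());
  // Step 2: each dimension appends its findings to its own stale snapshot
  // and writes the whole body back, discarding earlier writes.
  dimensions.forEach((dimension, i) => {
    pr.write(snapshots[i] + "## " + dimension + "\nfindings\n");
  });
  return pr.read();
}

const finalBody = simulateConcurrentFindings([
  "goal-traceability",
  "codebase-survey",
  "design-alternatives",
]);
console.log(finalBody);
```

Only the last writer's section survives in `finalBody`. Nothing raises an error, which is why the overwrite is silent.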

The FSI has no entry for concurrent-write-on-shared-mutable-document because it has never happened. The category library has no category for it because the incident count threshold (3) hasn't been met. The battery's codebase-survey dimension checked that section-editor.ts was cited — it was. It didn't check whether section-editor.ts is safe under concurrent writers.

All three systems operated correctly. All three systems failed to prevent the failure. The failure was in the territory beyond what the systems knew to ask about. This is not a system design failure — it is an inherent limit. The systems encode known failure modes. Novel failure modes are definitionally outside their scope.

What this tells us: the systems buy down the tail risk of known failure categories. They do not reduce the frequency of novel failures. Kaizen's most expensive failures have historically been novel. The proposed systems address the long tail of repeated mistakes. They leave the head risk — new failure modes in new architectural territory — unchanged.


8. Mitigations for the Three Most Likely Failures

Mitigation 1: False confidence from well-formed but ungrounded plans (Battery)

The battery evaluates plan text. The failure is that text can be correct-looking without being grounded. The mitigation is not to read the plan more carefully — it is to require executable verification, not prose claims.

For codebase-survey: replace "agent states what it found" with a mandatory grep run whose output is stored in the plan as a code block. If the plan claims "no existing tool found," the stored grep output is evidence. A reviewer (human or hook) can check whether the grep was run and whether the conclusion matches the output. This is a level-3 mechanic, not a level-1 claim.
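
A minimal sketch of the check a hook could run against the stored evidence. The `CodebaseSurvey` shape, the verdict names, and the sample grep line are hypothetical. The sketch only compares the claim against the stored output; it does not re-run grep.

```typescript
// Hypothetical shape for the codebase-survey section of a stored plan.
interface CodebaseSurvey {
  claim: "no-existing-tool" | "existing-tool-found";
  grepCommand: string; // the command that was run, stored verbatim
  grepOutput: string;  // raw stdout, stored in the plan as a code block
}

type SurveyVerdict = "ok" | "missing-evidence" | "claim-contradicts-output";

function verifySurvey(survey: CodebaseSurvey): SurveyVerdict {
  // No recorded command means the claim is prose only.
  if (survey.grepCommand.trim() === "") {
    return "missing-evidence";
  }
  // Each non-empty line of grep output is one match.
  const matchLines = survey.grepOutput
    .split("\n")
    .filter((line) => line.trim() !== "");
  const foundSomething = matchLines.length > 0;

  if (survey.claim === "no-existing-tool" && foundSomething) {
    return "claim-contradicts-output";
  }
  if (survey.claim === "existing-tool-found" && !foundSomething) {
    return "claim-contradicts-output";
  }
  return "ok";
}

// Illustrative input: the plan claims nothing exists, but the stored output has a match.
const contradictedVerdict = verifySurvey({
  claim: "no-existing-tool",
  grepCommand: 'grep -rn "isStateFor" src/',
  grepOutput: "src/state-utils.ts:12:export function isStateForCurrentWorktree(",
});
console.log(contradictedVerdict);
```

A check of this kind shows that the stored output agrees with the claim. It cannot show that the search pattern was the right one, so the reviewer still has to judge the command itself.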

Mitigation 2: Retrieval without integration (Information Architecture)

Retrieval fires. Context is injected. Plan doesn't reflect it. The mitigation: add a binary check at plan-store time. If FSI entry X was retrieved (visible in plan frontmatter fsi_entries_consulted), then the plan's design must reference the retrieved entry's anti_pattern_shape or preferred_shape by name — not in a summary section, but in the design alternatives section. A plan that retrieved fsi-042 (accumulate-then-flush anti-pattern) must name accumulate-then-flush in its alternatives analysis, even if only to reject the concern. Citing by name is a stronger signal than citing by description. This is still an L1 check, but a more precise one.
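
The binary check could look roughly like this. The field names `fsi_entries_consulted`, `anti_pattern_shape`, and `preferred_shape` are taken from the text above. The surrounding types, the `design_alternatives` field, and the `preferred_shape` value given to fsi-042 are assumptions made for illustration.

```typescript
// Hypothetical types; only the snake_case field names come from the text above.
interface FsiEntry {
  anti_pattern_shape: string;
  preferred_shape: string;
}

interface StoredPlan {
  fsi_entries_consulted: string[];
  design_alternatives: string; // text of the design alternatives section (assumed field)
}

// Returns ids of consulted entries whose shapes are never named in the alternatives section.
function unreferencedEntries(
  plan: StoredPlan,
  fsiIndex: { [id: string]: FsiEntry }
): string[] {
  const text = plan.design_alternatives.toLowerCase();
  return plan.fsi_entries_consulted.filter((id) => {
    const entry = fsiIndex[id];
    if (!entry) {
      return true; // unknown id: cannot be shown to be referenced
    }
    const namesAntiPattern = text.indexOf(entry.anti_pattern_shape.toLowerCase()) !== -1;
    const namesPreferred = text.indexOf(entry.preferred_shape.toLowerCase()) !== -1;
    return !(namesAntiPattern || namesPreferred);
  });
}

const sampleFsiIndex: { [id: string]: FsiEntry } = {
  // preferred_shape here is an assumed value for the example.
  "fsi-042": { anti_pattern_shape: "accumulate-then-flush", preferred_shape: "write-through" },
};

const hollowPlanGaps = unreferencedEntries(
  { fsi_entries_consulted: ["fsi-042"], design_alternatives: "FSI: fsi-042 noted." },
  sampleFsiIndex
);
const groundedPlanGaps = unreferencedEntries(
  {
    fsi_entries_consulted: ["fsi-042"],
    design_alternatives: "Considered accumulate-then-flush; rejected the concern for this case.",
  },
  sampleFsiIndex
);
console.log(hollowPlanGaps, groundedPlanGaps);
```

The "noted" plan is flagged and the plan that names the shape passes. A substring match can still be satisfied by a throwaway mention, so this remains an L1 check, as stated above.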

Mitigation 3: Category capture in novel territory (Category Library)

The recognition algorithm makes it structurally easier to match than to not-match. The mitigation: invert the burden. Pass 2 (structural test) should count as a match only on a clear, affirmative YES. If the structural test is ambiguous or requires more than 30 seconds to answer, that is evidence of low match quality, not evidence of a match. Add an explicit match_confidence field to the recognition output: 1.0 (clear YES), 0.5 (ambiguous), 0.0 (NO). Only matches with match_confidence >= 0.8 get loaded as priors. Ambiguous matches get logged as candidates but the agent proceeds with full exploration. This restores the novel-territory path for cases where it matters and prevents ambiguous matches from loading wrong priors confidently.
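
A minimal sketch of the inverted burden. The 1.0 / 0.5 / 0.0 values, the 0.8 cutoff, and the 30-second rule come from the paragraph above. The `Pass2Result` shape, the function names, and the rule that full exploration runs whenever no prior is loaded are assumptions.

```typescript
type StructuralAnswer = "YES" | "AMBIGUOUS" | "NO";

// Hypothetical output of the Pass 2 structural test for one candidate category.
interface Pass2Result {
  category: string;
  answer: StructuralAnswer;
  secondsToAnswer: number;
}

const PRIOR_THRESHOLD = 0.8;

function matchConfidence(result: Pass2Result): number {
  if (result.answer === "NO") {
    return 0.0;
  }
  // A slow answer counts as ambiguous even if it ended in YES.
  if (result.answer === "AMBIGUOUS" || result.secondsToAnswer > 30) {
    return 0.5;
  }
  return 1.0;
}

function recognize(results: Pass2Result[]) {
  const priors: string[] = [];
  const loggedCandidates: string[] = [];
  for (const result of results) {
    const confidence = matchConfidence(result);
    if (confidence >= PRIOR_THRESHOLD) {
      priors.push(result.category);
    } else if (confidence > 0) {
      loggedCandidates.push(result.category); // logged, never loaded as a prior
    }
  }
  // Assumption: with no clear match, the agent falls back to full exploration.
  return { priors, loggedCandidates, fullExploration: priors.length === 0 };
}

const ambiguousOutcome = recognize([
  { category: "session-boundary-state", answer: "AMBIGUOUS", secondsToAnswer: 45 },
]);
const clearOutcome = recognize([
  { category: "worktree-isolation", answer: "YES", secondsToAnswer: 5 },
]);
console.log(ambiguousOutcome, clearOutcome);
```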


The systems are not wrong. They address real failure modes documented by real evidence. The pre-mortem says: they address known failure modes well and will be bypassed for small issues, gamed by competent agents, undermined by their own success metrics, and blindsided by novel failures. Design for these properties, not against them.
