March 2026
Target files: kaizen-evaluate/SKILL.md (new phase), kaizen-implement/SKILL.md (plan schema)
This phase inserts between Phase 3.7 (Architecture & Tooling Fitness) and Phase 5 (Ask the Admin). It is Phase 4.5: Plan Formation.
The existing Phase 4 (Critique the Spec) stays in place — it runs before plan formation, not after. The new phase runs after Phase 4 and produces the plan that Phase 5.5 (Plan Coverage Review) will evaluate.
Before drafting any plan, complete these five steps in order. Most steps take 3-10 minutes. For a simple issue, the whole phase takes 10-12 minutes; for a complex issue, 15-20 minutes. These steps exist because the plan is your first commitment: it defines what you will build, where you will build it, and how you will know you are done. Forming it without grounding produces plans that pass coverage review while solving the wrong problem.
The steps are not a checklist to rush through. Each one asks you to look at something you have not looked at yet. Do them in order. The output of each step feeds the next.
Read the issue body. Find the section that answers: what observable failure prompted this issue? Not the proposed solution. The original pain. Write it in one or two sentences in this form:
GOAL: [what the user/system can't do now that they should be able to do]
DONE WHEN: [the specific observable outcome that means the goal is achieved]
The "done when" must be verifiable by an external observer without reading the implementation. "Tests pass" is verifiable. "The feature works" is not. "Running npm test produces 0 failures and the skill chain is visible in kaizen-list-skills --show-deps" is verifiable.
Write these two lines before reading any code, before considering any solution. This anchors everything that follows. Every plan step you add later must connect back to this. If a step cannot be traced to the DONE WHEN criterion, remove it or add a step that makes the connection explicit.
If the issue body does not clearly state the observable goal, that is a signal. Either the issue is under-specified (file this as a concern in Phase 4's spec critique), or the goal is implicit and you must surface it by reading related issues or asking the admin.
Time budget: 3-5 minutes.
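The traceability rule above can be sketched as a small check. This is a minimal illustration, not part of kaizen: the `SuccessCriteria` and `PlanStep` shapes, the `untracedSteps` helper, and the sample GOAL wording are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the source does not define these types.
interface SuccessCriteria {
  goal: string;
  doneWhen: string;
}

interface PlanStep {
  id: string;
  description: string;
  // One-line statement of how the step contributes to DONE WHEN ("" if none yet).
  tracesToDoneWhen: string;
}

// Return the steps that state no connection to the DONE WHEN criterion.
function untracedSteps(steps: PlanStep[]): PlanStep[] {
  return steps.filter((step) => step.tracesToDoneWhen.trim() === "");
}

// Illustrative data only; the GOAL wording here is invented for the example.
const criteria: SuccessCriteria = {
  goal: "Maintainers cannot see how skills depend on each other",
  doneWhen:
    "npm test produces 0 failures and the skill chain is visible in kaizen-list-skills --show-deps",
};

const draftSteps: PlanStep[] = [
  { id: "S1", description: "Add --show-deps flag", tracesToDoneWhen: "makes the skill chain visible" },
  { id: "S2", description: "Cover the flag with tests", tracesToDoneWhen: "keeps npm test at 0 failures" },
  { id: "S3", description: "Refactor the logger", tracesToDoneWhen: "" },
];

console.log(`DONE WHEN: ${criteria.doneWhen}`);
for (const step of untracedSteps(draftSteps)) {
  console.log(`Remove or justify ${step.id}: ${step.description}`);
}
```

The check only confirms that a link has been written down. Whether the link is real remains a judgment call for the planner.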
Before designing a solution, survey the codebase for relevant existing tools, patterns, and infrastructure. This step prevents the most common planning failure: designing custom solutions for problems that are already solved.
Run these in order, stopping when you have enough context:
# 1. Read CLAUDE.md Key Files table — the 10-minute overview
grep -A 50 "## Key Files" CLAUDE.md | head -60
# 2. Search for existing tools related to your problem area
# If the issue involves state/storage:
grep -r "cli-section-editor\|store-metadata\|write-attachment\|store-plan" src/ --include="*.ts" -l
# If the issue involves hooks:
cat docs/hooks-design.md
# If the issue involves review/plan analysis:
ls prompts/
npx tsx src/cli-dimensions.ts list
# 3. Check package.json for relevant libraries before hand-rolling
grep -E "your_keyword" package.json
# 4. Search for existing implementations of similar logic
grep -r "similar_function_name_or_concept" src/ --include="*.ts" -l | head -10

Decide whether an existing tool is relevant by asking: does it already solve the core problem, or does it solve an adjacent problem? If it solves the core problem, use it. If it solves an adjacent problem, note how you will integrate with it. If nothing exists, state that explicitly in the plan's "Information Retrieved" section.
What you are looking for: existing storage primitives, existing CLI entry points for the problem domain, existing test harness patterns for this type of code, existing DI patterns in adjacent files.
How long this should take: 5-10 minutes. Run the grep commands in parallel where possible. You are not reading all the files — you are scanning for what exists. Once you know the landscape, stop.
Output: One sentence per relevant finding: "Found cli-section-editor.ts — named attachments already exist for issue storage. Plan will use write-attachment rather than building a new storage layer."
At the point of highest design risk in your plan — the choice that is most irreversible or has the widest blast radius — name at least two alternatives and state why you are rejecting all but one.
Highest design risk means: the choice that determines where state lives, who owns a contract between two components, or which infrastructure you build on. This is the choice you will regret if you get it wrong. If you cannot identify such a choice, your plan is either trivial (no design risk — alternatives step is lightweight) or you have not looked closely enough.
For each alternative, write:
OPTION A: [name — one line description] — SELECTED
Why it works: [one sentence]
Failure mode if I am wrong: [one sentence — what breaks if this choice is bad]
OPTION B: [name — one line description] — REJECTED
Rejected because: [specific failure mode that disqualifies it, or why A is better]
Calibration by issue complexity:
- Simple issue (single file, no new abstractions, no state): two options minimum, one sentence each. Total: 3 minutes.
- Medium issue (new function, one integration seam, existing tests to extend): two options, two sentences each. Total: 5 minutes.
- Complex issue (new module, state ownership decision, inter-component contract): three options, each with a named failure mode. Total: 10-15 minutes.
Do not write more alternatives than you can evaluate in the time budget. Three is the maximum for most issues. The goal is to name the rejected path, not to exhaustively survey the design space.
The rejection rationale must name a failure mode, not a preference. "Option B seemed messier" is not a rejection rationale. "Option B loses all findings if the session dies between coordination and write — unacceptable given that sessions routinely run 30+ minutes" is a rejection rationale.
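The structural parts of this step (option counts per complexity level, exactly one SELECTED option, a rationale on every option) can be sketched as a check. The `DesignOption` shape and `checkAlternatives` function are hypothetical names for this example, and the sample option names are invented. Code cannot tell a failure mode from a preference, so that part stays with the reviewer.

```typescript
type Complexity = "simple" | "medium" | "complex";

// Hypothetical record for one alternative; not a kaizen type.
interface DesignOption {
  name: string;
  selected: boolean;
  // SELECTED: failure mode if wrong. REJECTED: the disqualifying failure mode.
  rationale: string;
}

// Minimum option counts taken from the calibration list above.
const MIN_OPTIONS: Record<Complexity, number> = { simple: 2, medium: 2, complex: 3 };

// Return a list of structural problems; an empty list means the shape is acceptable.
function checkAlternatives(options: DesignOption[], complexity: Complexity): string[] {
  const problems: string[] = [];
  const minimum = MIN_OPTIONS[complexity];
  if (options.length < minimum) {
    problems.push(`need at least ${minimum} options, found ${options.length}`);
  }
  const selectedCount = options.filter((o) => o.selected).length;
  if (selectedCount !== 1) {
    problems.push(`exactly one option must be SELECTED, found ${selectedCount}`);
  }
  for (const option of options) {
    if (option.rationale.trim() === "") {
      problems.push(`${option.name}: missing rationale`);
    }
  }
  return problems;
}

// Illustrative options; names and rationales are invented for the example.
const sampleOptions: DesignOption[] = [
  { name: "Write findings per step", selected: true, rationale: "More writes; slower on large sessions" },
  { name: "Batch write at end", selected: false, rationale: "Loses all findings if the session dies before the write" },
];

console.log(checkAlternatives(sampleOptions, "medium"));
console.log(checkAlternatives(sampleOptions, "complex"));
```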
The issue body's "proposed fix" or "suggested approach" section is the issue author's hypothesis about what will work. It is not a specification. Before you plan to implement it, determine whether it addresses the right failure mode.
For the proposed fix, state:
HYPOTHESIS: [what the proposed fix assumes about the root cause]
VALIDATION: [what would confirm this assumption — ideally a test you can run in <15 min]
IF WRONG: [what would happen — the problem persists, or a different problem emerges]
Then run the fastest validation available:
- If the issue has a reproduction case, reproduce it now and confirm the failure is what the issue describes.
- If the issue cites a specific file or function as the root cause, read that file now and confirm the behavior matches the description.
- If the issue proposes a fix that changes event ordering, a configuration value, or a regex pattern, check whether the proposed value is actually correct before planning to implement it.
Do not skip this step for "obvious" fixes. The most expensive planning failures in kaizen's history came from treating proposed fixes as specifications: a lint hook was built with 22 tests for the wrong problem (#724), an event ordering fix was implemented without confirming event ordering was actually the cause, a CI timeout was masked without addressing the slowness the issue explicitly said to fix (#816). In each case, the proposed fix was plausible — that is why the step is mandatory, not optional.
When to call validation done: When you can state: "The issue's proposed fix addresses the failure mode I can observe" or "The proposed fix does not address the failure mode — I am pivoting to [alternative] because [evidence]."
Time budget: 5-10 minutes for non-trivial issues. 0 minutes for issues where the root cause is confirmed from a stack trace or a passing test.
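As a rough sketch, the three-line record and the two permitted exit statements could be modeled like this. The `HypothesisCheck` and `Outcome` types and the `verdict` function are assumptions for illustration; the source prescribes only the wording of the statements.

```typescript
// Hypothetical record mirroring the HYPOTHESIS / VALIDATION / IF WRONG template.
interface HypothesisCheck {
  hypothesis: string;
  validation: string;
  ifWrong: string;
}

// Validation ends in one of two states: confirmed, or a pivot backed by evidence.
type Outcome =
  | { confirmed: true }
  | { confirmed: false; alternative: string; evidence: string };

// Produce the closing statement that marks validation as done.
function verdict(outcome: Outcome): string {
  if (outcome.confirmed) {
    return "The issue's proposed fix addresses the failure mode I can observe.";
  }
  return (
    "The proposed fix does not address the failure mode. " +
    `I am pivoting to ${outcome.alternative} because ${outcome.evidence}.`
  );
}

// Illustrative values only; they do not describe a real kaizen issue.
const check: HypothesisCheck = {
  hypothesis: "The timeout is caused by a slow test, not by the CI runner",
  validation: "Run the suite locally and compare per-test durations",
  ifWrong: "Raising the timeout hides the slowness and the problem persists",
};

console.log(check.hypothesis);
console.log(verdict({ confirmed: true }));
console.log(verdict({ confirmed: false, alternative: "profiling the runner", evidence: "local runs finish quickly" }));
```

Modeling the pivot as a separate variant forces an alternative and evidence to be supplied whenever the hypothesis is not confirmed.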
Before deciding where new logic will live, confirm that location is testable in isolation. This step prevents the most common implementation failure: 70 lines of correct logic placed in main() where it cannot be unit tested.
For each significant behavior in your plan, state:
BEHAVIOR: [what the logic does]
LIVES IN: [file and function/class where it will be implemented]
TESTED IN: [specific test file path that will cover it]
TEST APPROACH: [unit / integration / E2E — and why]
SEAM: [what interface boundary isolates this behavior for testing]
A "seam" is the injection point that lets a test replace a real dependency with a controlled one. If you cannot name the seam, the behavior is not testable in isolation. Extract it into a separate function or module before planning to implement it inline.
Red flags that require extraction before implementation:
- The planned implementation location has more than 5 imports at the top of its file
- The behavior is inside a CLI entry point (main(), a hook's top-level execution block, or a script's global scope)
- Testing the behavior would require mocking more than 3 modules simultaneously
If any red flag fires, add an extraction step to your plan before the implementation step. The extraction is not optional — it is cheaper to extract now than to discover testability problems after the code and tests are written.
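The three red flags can be expressed as a simple predicate over facts gathered about the planned location. This is a sketch under assumptions: `PlannedLocation` and `redFlags` are hypothetical names, and the counts are supplied by the planner rather than measured automatically.

```typescript
// Hypothetical description of where a behavior is planned to live.
interface PlannedLocation {
  importCount: number;    // imports at the top of the target file
  isEntryPoint: boolean;  // main(), a hook's top-level block, or script global scope
  modulesToMock: number;  // modules a test would need to mock simultaneously
}

// Return the red flags that fire; any non-empty result calls for an extraction step.
function redFlags(location: PlannedLocation): string[] {
  const flags: string[] = [];
  if (location.importCount > 5) flags.push("more than 5 imports");
  if (location.isEntryPoint) flags.push("inside a CLI entry point");
  if (location.modulesToMock > 3) flags.push("requires mocking more than 3 modules");
  return flags;
}

console.log(redFlags({ importCount: 8, isEntryPoint: true, modulesToMock: 2 }));
console.log(redFlags({ importCount: 3, isEntryPoint: false, modulesToMock: 1 }));
```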
Time budget: 3-5 minutes.
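A minimal sketch of an extraction that creates a seam, assuming a behavior that would otherwise sit inside main(). The `FileReader` interface, `countTodos`, and the fake reader are hypothetical names chosen for this example and do not refer to kaizen code.

```typescript
// Hypothetical dependency: anything that can return the text of a file.
interface FileReader {
  read(path: string): string;
}

// The behavior under test. It receives its dependency as a parameter,
// and that parameter is the seam.
function countTodos(reader: FileReader, path: string): number {
  return reader
    .read(path)
    .split("\n")
    .filter((line) => line.includes("TODO")).length;
}

// Thin entry point: wiring and output only, no logic that needs a unit test.
function main(reader: FileReader, path: string): void {
  console.log(`${path}: ${countTodos(reader, path)} TODO(s)`);
}

// In a test, a controlled reader replaces the real one at the seam.
const fakeReader: FileReader = {
  read: () => "TODO: one\nfine\nTODO: two",
};

main(fakeReader, "example.ts"); // prints "example.ts: 2 TODO(s)"
```

In the seam map, this behavior would be recorded with LIVES IN pointing at `countTodos` and SEAM naming the `reader` parameter. The entry point itself needs no unit test because it contains no decisions.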
(The sections above are the canonical text for each step. Below are condensed reference versions for use in the Workflow Tasks table and Phase 5 summary.)
Step 1 (Success Criteria): Extract GOAL and DONE WHEN before reading any code. Verifiable outcome, not a task description.
Step 2 (Codebase Survey): Read CLAUDE.md Key Files, grep for existing tools in the problem domain, check package.json for libraries. 5-10 minutes. Document what you found.
Step 3 (Alternatives): Name the highest-risk design choice. Write two options minimum. Rejection rationale must name a failure mode.
Step 4 (Hypothesis Validation): State the proposed fix's assumption. Run the fastest test to confirm or falsify it. 5-10 minutes for non-trivial issues.
Step 5 (Testability Seams): For each behavior: name the file, the test file, the approach, and the seam. Extract before implementing if any red flag fires.
The following is the complete Phase 4.5 as it would appear in the SKILL.md file.
Before writing any plan, form it through five grounding steps. These steps exist because the first plan you write without grounding will address what the issue says to build, not what will make the problem stop happening. The grounding takes 10-20 minutes. It prevents the 30-minute implementation of the wrong thing.
Extract the success criteria first. Read the issue body. Find the observable failure that motivated the issue — not the proposed fix, the original pain. Write it in two lines:
GOAL: [what the user/system can't do now]
DONE WHEN: [the specific verifiable outcome that means it's fixed]
Verifiable means: an external observer can check it without reading the implementation. Write this before you look at any code. Every plan step you add must connect back to DONE WHEN. Steps that don't are building infrastructure, not solving the problem.
Survey what already exists. Before designing a solution, read CLAUDE.md's Key Files table. Then grep for existing tools in your problem domain:
# Storage/attachment problems:
grep -r "cli-section-editor\|write-attachment\|store-plan\|store-metadata" src/ --include="*.ts" -l
# Hook problems:
cat docs/hooks-design.md
# Review/dimension problems:
npx tsx src/cli-dimensions.ts list && ls prompts/

For each existing tool you find: does it solve the core problem, or an adjacent one? If it solves the core problem, use it. If it solves an adjacent problem, note the integration point. State what you found (or that nothing was found) in the plan's "Information Retrieved" section. Skipping this step is how plans design custom storage over cli-section-editor.ts, which already exists and is tested.
Generate and reject at least one alternative. Identify the highest-risk design choice in your plan — the choice that determines where state lives or who owns an inter-component contract. Write two options and reject all but one:
OPTION A: [description] — SELECTED
Failure mode if wrong: [one sentence]
OPTION B: [description] — REJECTED
Rejected because: [specific failure mode that disqualifies it]
The rejection rationale must name a failure mode, not a preference. "Cleaner" is not a failure mode. "Loses all state if the session dies before the batch write completes" is a failure mode. If there is no irreversible choice in your plan, two options with a one-sentence rationale each are sufficient. If the plan touches state ownership or interface contracts, write three options with named failure modes.
Validate the proposed fix's assumption. The issue body's "proposed fix" is the issue author's best guess. Before planning to implement it, state what it assumes and run the fastest test to confirm or falsify:
HYPOTHESIS: [what the proposed fix assumes about the root cause]
VALIDATION: [what you will run or read to confirm — must take <15 min]
IF WRONG: [what evidence would disqualify this hypothesis]
Run the validation before committing to the plan. For a code behavior issue: reproduce the failure and confirm it matches the description. For a configuration or regex issue: check the proposed value against a concrete case. For an architecture issue: read the affected file and confirm the structure matches what the issue describes. Do not skip this for "obvious" fixes — three of kaizen's most expensive multi-PR cycles came from planning implementations of plausible but wrong hypotheses.
Map the testability seams before placing any code. For each significant behavior in the plan, state:
BEHAVIOR: [what it does]
LIVES IN: [file.ts, functionName()]
TESTED IN: [tests/test_file.ts or tests/test_file.sh]
SEAM: [the injection point that isolates this for testing]
If you cannot name the seam, the behavior is not testable in isolation. Add an extraction task before the implementation task. Red flags requiring extraction: the target location has more than 5 imports, the location is a CLI entry point or script's global scope, or testing it would require mocking more than 3 modules. Extract first, implement second. This step is never optional.
Write the plan. With all five steps complete, write the plan using this structure:
## Success Criteria
GOAL: [from step 1]
DONE WHEN: [from step 1]
## Information Retrieved
- [source]: [what you found] — [how it changes or confirms the plan]
- (or: "No relevant existing tools found for [domain]")
## Design Alternatives Considered
### Option A: [description] — SELECTED
Failure mode if wrong: ...
### Option B: [description] — REJECTED
Rejected because: ...
## Tasks
[Ordered, concrete, traceable to DONE WHEN]
## Seam Map
[Per-behavior: file, test file, seam]
## Test Plan
[Per-task: what invariant is tested, which test file, unit/integration/E2E]

Store the plan immediately after writing it:
npx tsx src/cli-structured-data.ts store-plan --issue {N} --repo "$ISSUES_REPO" --file plan.md

Then proceed to Phase 5 (Ask the Admin). The plan coverage review (Phase 5.5) runs after the admin approves direction. The plan formed here is the input to that review.
Time budget: Simple issue (single file, no new abstractions): 10-12 minutes total. Complex issue (new module, state decision, multi-component wiring): 15-20 minutes. If this phase is taking longer than 20 minutes, you are either designing rather than surveying (go back to step 2 and find what already exists) or the issue requires /kaizen-prd before evaluation.
The text above is written in the same imperative voice as the existing phases. Key conventions matched:
- Phases use present-tense imperatives ("Write this before...", "State what you found...", "Run these in order...")
- Each phase states why the step exists before stating what to do, using concrete incident references where they exist
- Code blocks for exact commands; prose for judgment calls
- Time budgets are given explicitly rather than implied
- Red flags are named as patterns, not as vague warnings
The "Full Integrated Phase" (section 7) is the actual SKILL.md text. Sections 2-6 are reference material for this document.
Insertion point: After the existing "Phase 4: Critique the Spec" and before the existing "Phase 5: Ask the Admin." The "Scope Reduction Discipline" gate (which ends Phase 3) still precedes Phase 4. The numbering becomes: Phase 3 (Assess Implementation) → Phase 3.5 (Hypotheses) → Phase 3.7 (Architecture Fitness) → Scope Reduction Discipline → Phase 4 (Critique Spec) → Phase 4.5 (Plan Formation) → Phase 5 (Ask Admin) → Phase 5.5 (Plan Coverage Review) → Phase 6 (Capture Lessons).
Why not earlier? Phase 4.5 comes after the spec critique because the plan should reflect what the critique found. If Phase 4 reveals the spec's proposed solution is wrong, Phase 4.5's hypothesis validation step will catch this before the plan commits to the wrong approach. Running Phase 4.5 before Phase 4 would produce a plan that the spec critique then contradicts.
Why not inside kaizen-implement? The five steps are pre-implementation work. kaizen-implement receives a plan that already has success criteria, alternatives, and seam maps. Moving this work into evaluate keeps implement as a pure execution engine — it should not need to discover basic design decisions mid-implementation.
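One way to picture the handoff is a check that an incoming plan carries every section of the plan template before execution begins. This is a sketch only: the source does not say that kaizen-implement performs such a check, and `REQUIRED_SECTIONS` and `missingSections` are hypothetical names. The section titles are taken from the plan structure in Phase 4.5.

```typescript
// Section headings from the Phase 4.5 plan template.
const REQUIRED_SECTIONS: string[] = [
  "Success Criteria",
  "Information Retrieved",
  "Design Alternatives Considered",
  "Tasks",
  "Seam Map",
  "Test Plan",
];

// Return the required sections that do not appear as "## " headings in the plan.
function missingSections(planMarkdown: string): string[] {
  const headings = new Set(
    planMarkdown
      .split("\n")
      .filter((line) => line.startsWith("## "))
      .map((line) => line.slice(3).trim()),
  );
  return REQUIRED_SECTIONS.filter((name) => !headings.has(name));
}

// Illustrative, incomplete plan used only to exercise the check.
const draftPlan = [
  "## Success Criteria",
  "GOAL: ...",
  "DONE WHEN: ...",
  "## Tasks",
  "1. ...",
].join("\n");

console.log(missingSections(draftPlan));
```

A plan with missing sections would go back to evaluation rather than having those decisions discovered mid-implementation.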