@aviadr1
Created March 26, 2026 21:59
Planning Quality Research — Live Plan Trials: Issues #960 and #962 planned using Phase 4.5 process

Issue #960: [L2] Issue clustering and category detection — recognize when individual issues are symptoms of a systemic gap requiring meta-level prevention

Pre-Planning Research

Collision Detection

  • Open PRs matching "cluster category detect": none found.
  • Closed issues matching "cluster category detect": #463 (Workflow task expectations, closed 2026-03-22) and #504 (Batch auto-dent harness, closed 2026-03-23) — neither overlaps with clustering.
  • No active work in flight on this feature area.

Incident Gathering

  • git log --grep="cluster" surfaced commit d108c55: "fix: hook observability cluster — null-input trace, Bash gate message, import cleanup (#976)". The word "cluster" here means a grouping of related hook fixes in one PR, not issue-clustering detection. No prior work on the feature #960 proposes.
  • git log --grep="category" surfaced multiple category-adjacent commits but none related to issue-clustering detection.
  • Closed issues search for "category cluster pattern": three results — #948, #959 (both recently closed meta-issues about hypothesis validation and skill-chain verification discipline), and #504 (batch harness). None address issue-clustering detection.
  • Conclusion: #960 is greenfield. No prior attempt to build this, no collision risk.

Related Issues Read

  • #957: "Planning phase doesn't survey existing tools." Directly related — #960 is partly a meta-detector for the class of failure #957 addresses (agents rolling their own storage instead of reusing tools).
  • #940: "Auto-Dent Intelligence Debt" — cross-batch learning for auto-dent. Related in that batch-level issue grouping (Level 3 in #960's spec) depends on cross-batch data that #940 addresses. However #940 is about structured batch data persistence, not about cluster detection per se.

Existing Infrastructure Relevant to #960

  • kaizen-gaps/SKILL.md Phase 1.5 (kaizen #207, closed): already has clustering step — agents write root-cause hypotheses per open issue, group them, name clusters, and output a table. This is MANUAL clustering by the gaps agent on demand. It works but is not triggered at filing time, not triggered at evaluate time, and not automated.
  • kaizen-reflect/SKILL.md Steps 2b and 2c: already has category-naming before dispositioning (group impediments by shared root cause, then ask "new pattern or recurrence?"). This fires per-reflection. However: the scope is within a single reflection session — it only sees the current session's impediments, not historical issues across sessions.
  • kaizen-file-issue/SKILL.md Step 2 (Search for duplicates): already does gh issue list --search "<keywords>" for open and closed issues before filing. This is exact-issue deduplication, NOT cluster detection. It prevents filing a second report for the same bug. It does NOT detect "you're filing the 4th symptom of the same systemic gap."
  • kaizen-evaluate/SKILL.md Phase 0 (Collision detection): checks whether THIS specific issue is already being worked. No check for "has this class of problem recurred before?"
  • src/issue-backend.ts: pluggable issue backend (GitHub CLI wrapper for list/create/edit/comment). Could support a clustering query but has no clustering logic today.
  • src/structured-data.ts + src/section-editor.ts: storage primitives for plan/review/metadata attachments on issues and PRs. Plan would use write-attachment to store cluster signals if needed.
  • src/analysis/pr-pattern-checks.ts: detects multi-PR fix cycles (FM2) for PR-level clustering. This is structural pattern detection at the PR level. The same pattern applied to issues would serve #960's goals but it doesn't exist at the issue level today.
  • explore-gaps.md (prompt): tells the agent to look for "clusters that suggest an unnamed problem dimension" in the oldest 20 issues — but this is an LLM instruction, not a structured step, and produces no structured output.
  • reflect-batch.md (prompt): produces REFLECTION_INSIGHT: markers but no clustering step or meta-issue rate metric.
  • No cli-experience.ts, no failure-signatures/ directory — the Tier 2 FSI from the grand synthesis has not been built.

What Does NOT Exist That Would Be Needed

  1. At-filing-time cluster signal: kaizen-file-issue has a dedup search but not a cluster-severity threshold ("2+ related issues → surface cluster signal").
  2. At-evaluation-time recurrence escalation: kaizen-evaluate collision detection checks for THIS issue being fixed, not for THIS CLASS of problem having recurred.
  3. Auto-dent clustering step: explore-gaps.md and reflect-batch.md have no structured clustering step before filing; no meta-issue rate metric.
  4. Category-level test question: no prompt or step in any skill asks "what single test would prevent this whole class from recurring?"

DONE WHEN

GOAL: When a cluster of individual issues reveals a systemic gap, the kaizen system surfaces that signal automatically — at the moment a new symptom is filed, at the moment an issue is evaluated, and during auto-dent batch runs — so that agents and the admin see the cluster signal and can respond with a meta-level prevention rather than another individual fix.

DONE WHEN:

  1. Running /kaizen-file-issue on a symptom that matches 2+ existing issues (open or closed) produces a visible cluster signal before the issue is filed, and the agent names the category and asks whether to file a meta-issue instead of a symptom.
  2. Running /kaizen-evaluate on an issue that matches a prior closed issue by domain/label produces a visible recurrence escalation flag ("this class of problem has recurred — consider L2 enforcement rather than L1 fix").
  3. Running claude -p with the explore-gaps.md prompt on a repo with 5 thematically similar open issues produces at most 2 filed issues (1 meta-issue + at most 1 individual if it doesn't cluster), not 5 individual issues.
  4. Running claude -p with the reflect-batch.md prompt on a completed batch produces a meta_issue_rate metric (fraction of filed issues that were collapsed into meta-issues) in the structured insight markers.
  5. An E2E test using kaizen-test-fixture with 5 pre-seeded thematically related issues verifies outcome #3 without requiring real-world API calls to produce unpredictable results.

An external observer verifies this by: running the three skill invocations on pre-seeded fixture data and reading the output. No implementation knowledge required.


Information Retrieved (Codebase Survey)

  • kaizen-file-issue/SKILL.md Step 2: gh issue search before filing — deduplication exists, cluster-threshold detection does not. Plan will EXTEND Step 2 with threshold logic, not replace it.
  • kaizen-reflect/SKILL.md Steps 2b/2c: category-grouping within a session — partial overlap with Level 1. Plan will add a cross-session recurrence check that queries historical issues, not just current session impediments.
  • kaizen-gaps/SKILL.md Phase 1.5: manual clustering step, already in use. Proves the mental model is understood. Plan will make the same step available at filing time and in auto-dent prompts, not just in the gaps skill.
  • kaizen-evaluate/SKILL.md Phase 0: exact-issue collision detection exists. Plan will add recurrence-class detection as a new Phase 0.5 step.
  • src/issue-backend.ts: listIssues({ search, state, labels }) — can be called with keyword search. Plan will use this as the backend for the cluster query, not a new gh call pattern.
  • src/analysis/pr-pattern-checks.ts: FM2 PR-cluster detection. Plan will mirror this pattern for issues (detect by label/keyword frequency) rather than building from scratch.
  • src/section-editor.ts write-attachment: for storing cluster metadata on the meta-issue. Plan will use this rather than new storage.
  • explore-gaps.md, reflect-batch.md: prompt files with no structured clustering step. Plan will add clustering step directly to these prompts.
  • cli-experience.ts (Tier 2 FSI from grand synthesis): does NOT exist. Plan will NOT depend on it.
  • No existing semantic/embedding similarity infrastructure in the codebase. Plan must not assume LLM-call infrastructure beyond what gh provides.

Design Alternatives Considered

Design Question A: How does clustering happen?

OPTION A1: Manual — /kaizen-gaps already has Phase 1.5 for this. No new code.
REJECTED because: The gap is at filing time and evaluation time, not during gap analysis. Manual clustering is only triggered when the admin runs /kaizen-gaps. It does not surface the signal at the moment a new symptom arrives.
Failure mode: the cluster is only named after 5+ symptoms have accumulated and a human decides to run a gap analysis — exactly the failure #960 describes.

OPTION A2: Threshold-based (keyword + label + time window) — SELECTED
Why it works: gh issue list --search "<keywords>" --label "<area>" --state all is a 1-second CLI call. With a threshold of 2+ matches, this fires at filing time and evaluation time without requiring LLM infrastructure. It mirrors the existing dedup search in kaizen-file-issue Step 2 — the implementor extends a pattern that already exists and is understood.
Failure mode if wrong: keyword search has both false positives (unrelated issues with the same term) and false negatives (related issues with different vocabulary). This is acceptable because the output is advisory ("these look related"), not blocking. The agent uses judgment on the cluster signal.

OPTION A3: Semantic similarity (LLM-based grouping of issue bodies)
REJECTED because: Requires calling an LLM inside a hook or skill step, adding latency (5-15 seconds per call), cost, and flakiness to a step that currently costs nothing.
Failure mode: a filing step that calls an LLM can time out, produce inconsistent results, or fail in batch contexts where cost is already tracked.

OPTION A4: Taxonomy-driven (issues audit flags category gaps)
REJECTED because: This is what /kaizen-audit-issues already does, and it is batch/manual. It doesn't solve the "signal at filing time" problem.
Failure mode: the category gap is only named during periodic audits, not when the 3rd symptom arrives.

Design Question B: What triggers the clustering check?

OPTION B1: Post-merge stop gate (after every PR merge)
REJECTED because: Clustering is a forward-looking signal for filing and evaluation, not a backward-looking gate after code ships. Running it after every PR merge adds overhead to the critical path and fires too late — the symptom is already filed and fixed.
Failure mode: adds noise to every merge while providing no value at the moment of filing.

OPTION B2: Skill step additions — SELECTED
Add the cluster check as Step 2b in kaizen-file-issue, as Phase 0.5 in kaizen-evaluate, and as a clustering step in explore-gaps.md and reflect-batch.md. Each check is exactly one gh issue list call (already done in the dedup path) with a threshold evaluation added.
Failure mode if wrong: as L1 instructions in SKILL.md files, these steps can be skipped under time pressure (per the grand synthesis's identified risk). This is mitigated by making the steps low-cost (one CLI call, one threshold check) so there's no incentive to skip, and by making the output useful (the cluster signal saves time by collapsing 5 issues into 1).

OPTION B3: Scheduled/batch (auto-dent pre-pass) — PARTIAL ACCEPT
The Level 3 clustering in explore-gaps.md is batch-triggered by definition. This is included in the plan as a prompt addition, not as a new trigger mechanism.

OPTION B4: New issue creation trigger (webhook/hook)
REJECTED because: No webhook infrastructure exists in kaizen for GitHub issue events. Building a webhook listener to trigger clustering on new issue creation is a new infrastructure component with deployment, reliability, and security implications.
Failure mode: significant build scope for a problem that can be solved in SKILL.md with L1 instructions, per the grand synthesis Tier 1 model.

Design Question C: What is the output when a cluster is detected?

OPTION C1: Admin notification only (surfaced as text in the skill output)
PARTIAL — but insufficient on its own because the signal exists in one session and is invisible to future agents.

OPTION C2: Auto-file a meta-issue linking the clustered issues — SELECTED (for clusters ≥3 in auto-dent)
Why it works: The meta-issue becomes a durable artifact. Future agents running /kaizen-evaluate on any constituent issue will find the meta-issue in their collision detection search. The meta-issue format is already defined in /kaizen-deep-dive. For filing-time clusters (size 2+): the agent is prompted to decide — file the symptom and link it to the cluster, or file a meta-issue instead. The agent makes the call based on the specific context; the system surfaces the signal and asks, but does not auto-file.
Failure mode if wrong: auto-filing a meta-issue for every 3-issue cluster will create noise if keyword search returns false positives. Mitigated by: the threshold is advisory and the agent confirms before filing.

OPTION C3: Add a type:cluster-symptom label to constituent issues
REJECTED because: labeling requires editing N existing issues, which is API-intensive and creates a mess of partially-labeled issues if the session dies mid-way.
Failure mode: partial label application leaves the label system inconsistent, and the label itself does not name the category — it just marks symptoms without telling anyone what the category is.

OPTION C4: Structured report stored on the epic issue — PARTIAL ACCEPT
When a meta-issue is filed (for large clusters), the structured clustering data (constituent issues, root-cause hypothesis, category-level test idea) should be stored as an attachment using write-attachment, not free-form text. This is included in the implementation plan.

Hardest Design Decision: Who asks "what category-level test prevents this whole class?"

This is the highest-value question in the issue (#960 Level 4). The issue says: "What single meta-level test would prevent this entire category from recurring?"

OPTION D1: Ask the question in the meta-issue template — SELECTED
The meta-issue body template (used when a cluster is filed) includes a section: "## Category Prevention Test — What single test would prevent this whole class?" The agent fills this in when filing the meta-issue, using the cluster's root cause as input. This is L1 — it requires the filing agent to reason about it, but the section header makes the question explicit and unavoidable.
Failure mode if wrong: agents write "add tests" or "improve coverage" rather than naming a specific test invariant.
Mitigated by: the section template provides an example ("all .sh files >50 lines must have a .bats file") and requires a machine-checkable formulation, not a prose description.

OPTION D2: Trigger /kaizen-deep-dive automatically when a cluster is detected
REJECTED because: /kaizen-deep-dive runs a full research + meta-issue + implementation cycle. Triggering it automatically on every cluster detection would be too expensive and presumes the cluster warrants a deep dive.
Failure mode: every 3-issue cluster launches a deep-dive subagent that discovers it's not actually a systemic gap — wasted tokens and confusion about what's happening.


Hypothesis

HYPOTHESIS: The core assumption of issue #960 is that surfacing cluster signals at the right moments (filing, evaluation, batch runs) will cause agents to respond with meta-level prevention (a category-level test, a meta-issue, an L2 escalation) rather than another individual fix — and that this will reduce the rate of the same class of problem recurring.

VALIDATION: The cheapest available validation is to check whether the existing partial implementations (kaizen-reflect Steps 2b/2c, kaizen-gaps Phase 1.5) are actually being used and producing category issues. If agents already have the cluster-naming instruction in kaizen-reflect and still file individual symptom issues (as the issue documents with Cluster A's 5 hook-test issues filed months apart), the problem is not "the clustering step doesn't exist" but "the step is buried / not triggered at the right moment."

Run this validation before full implementation:

  1. Check the last 10 closed issues in kaizen — how many were filed as individual symptoms for a class that had a prior closed issue?
  2. Check the last 5 reflect PRs — did Steps 2b/2c fire and produce category issues?

This takes 5 minutes with gh issue list and gh pr list.

IF WRONG: If the kaizen-reflect Steps 2b/2c are already firing correctly and still failing to prevent cluster accumulation, then the hypothesis that "earlier detection = less recurrence" needs refinement. The failure mode might instead be: clusters are detected, meta-issues are filed, but the category-level test (Level 4) is never written. In that case the plan should prioritize Level 4 (the category-level test question) over Levels 1-3 (the detection machinery).

IMPORTANT NOTE: The hypothesis is probably correct but weakly so. The grand synthesis (r5-grand-synthesis.md) says the real problem is L1 steps being skipped under pressure over time. The cluster detection in kaizen-reflect already exists as L1 and is clearly being bypassed. This means the plan must be realistic: Level 1 and Level 2 of the issue's proposed fix are L1 skill-text additions that will work initially and degrade over time. They need either a test harness to verify they're being followed, or eventual escalation to L2 hooks. The plan should include this observation explicitly and file the L2 escalation path as a follow-up.


Implementation Plan

All tasks trace back to DONE WHEN criteria.

Task 1: Add cluster threshold check to kaizen-file-issue Step 2

File: .claude/skills/kaizen-file-issue/SKILL.md
What: Extend the existing duplicate search (Step 2) with a threshold evaluation. After the existing gh issue list --search "<keywords>" call, count matches. If 2+ matches found: surface cluster signal. Present the matches, name the category hypothesis, and ask: "These look related — is this a symptom of a category? Consider filing a meta-issue instead."
Traces to: DONE WHEN #1 — "produces a visible cluster signal before the issue is filed."
Key detail: The cluster signal is advisory. The agent decides whether to file the symptom or the meta-issue. The skill text must make clear: filing the symptom is acceptable (add a cross-reference), filing the meta-issue is better (use the meta-issue template from Task 3).

Task 2: Add recurrence escalation to kaizen-evaluate as Phase 0.5

File: .claude/skills/kaizen-evaluate/SKILL.md
What: Insert a new Phase 0.5 between Phase 0 (exact-issue collision detection) and the existing spec check. The new step: after checking if THIS issue is already being worked, check whether THIS CLASS of issue has recurred. Query: gh issue list --repo "$ISSUES_REPO" --state closed --search "<keywords from issue title>" --label "<area_label>" --limit 5. If 1+ match found: emit a recurrence escalation flag: "This class of problem has occurred before (see #N). Consider escalating to L2 enforcement rather than another L1 fix."
Traces to: DONE WHEN #2 — "produces a visible recurrence escalation flag."
Key detail: The recurrence check uses the issue's area label as a filter to reduce false positives. A "hook test coverage" issue should query closed issues with area/testing. The instruction must be specific enough that the agent can construct the right query.

Task 3: Define the meta-issue template with a Category Prevention Test section

File: .claude/skills/kaizen-file-issue/SKILL.md (new section) OR .claude/kaizen/policies.md (if it belongs to general filing policy)
What: Add a meta-issue body template that agents use when a cluster is confirmed. Template sections: "## The Category", "## Constituent Issues (Symptoms)", "## Root Cause", "## Category Prevention Test — What single test would prevent this whole class? [example: all .sh files >50 lines must have a .bats file — checkable by a CI lint]", "## Compound Fix Strategy".
Traces to: DONE WHEN items 1 and 3 (meta-issue output has the Category Prevention Test section) and #960's Level 4 success criterion.
Key detail: The template is in the skill text, not in a separate file, so it's always available without an extra read.

Task 4: Add clustering step to explore-gaps.md prompt

File: prompts/explore-gaps.md
What: Before the "Output" section, add: "Clustering step — before filing any issues: (1) List all candidate issues you plan to file. (2) Group any that share a root cause. (3) For groups of 3+, file 1 meta-issue (using the template from kaizen-file-issue) instead of N individual issues. (4) For groups of 2, file the symptom with a cross-reference to the related issue." Also add: "Output metric: report how many issues were collapsed into meta-issues (meta-issue rate = meta_issues / (meta_issues + individual_issues))."
Traces to: DONE WHEN #3 — "produces at most 2 filed issues" from 5 thematically similar ones.
Key detail: The prompt currently says "look for clusters that suggest an unnamed problem dimension" but gives no structured step for handling them before filing. This task makes the step concrete.

Task 5: Add cluster analysis and meta-issue rate metric to reflect-batch.md

File: prompts/reflect-batch.md
What: Add a "Cluster analysis" section to the structured reflection task. The agent must: (1) Group the issues filed in this batch by shared root cause. (2) Calculate the meta-issue rate. (3) Emit REFLECTION_INSIGHT: meta_issue_rate=<N>/<M> as a structured marker. Also: if meta_issue_rate < 20% and the batch filed 5+ issues, emit REFLECTION_INSIGHT: Clustering step may have been skipped — review filed issues for category opportunities.
Traces to: DONE WHEN #4 — "produces a meta_issue_rate metric in the structured insight markers."
Key detail: The REFLECTION_INSIGHT: format already exists and is consumed by the auto-dent harness. The new marker uses the same format so no harness changes are required.

Task 6: Write E2E test verifying clustering behavior

File: src/e2e/cluster-detection.test.ts (new file)
What: Using the kaizen-test-fixture repo, pre-seed 5 issues with related titles (hook test coverage theme). Run claude -p with the explore-gaps.md prompt against the fixture. Assert: total filed issues ≤ 2 (at most 1 meta-issue + 1 individual), and at least 1 filed issue references the others as constituent issues.
Traces to: DONE WHEN #5 — E2E test verifies outcome #3.
Key detail: This test is necessarily an LLM-calling E2E test (per CLAUDE.md: "SKILL.md / prompt changes — the only real test is claude -p with the skill invoked"). It will be slow (~30-60 seconds) and should run in the E2E suite, not in unit tests. Mock the gh calls or use the fixture repo.
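The assertion portion of that E2E test could be factored as a pure check, sketched below. FiledIssue and the helper name are hypothetical — the real test would read filed issues from the fixture repo via gh:

```typescript
// Hypothetical shape of an issue filed during the E2E run.
interface FiledIssue {
  number: number;
  body: string;
}

// DONE WHEN #5: at most 2 filed issues, and at least one of them
// cross-references other issues (the meta-issue's constituent list).
function clusteringOutcomeHolds(filed: FiledIssue[]): boolean {
  if (filed.length === 0 || filed.length > 2) return false;
  return filed.some((issue) => /#\d+/.test(issue.body));
}
```

Keeping the check pure means the threshold logic gets fast unit coverage, while the slow claude -p invocation only has to produce the issue list.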

Task 7: File a follow-up issue for L2 escalation path

What: Per the Scope Reduction Discipline gate: Tasks 1-5 are L1 additions. The grand synthesis says L1 additions work initially and degrade under pressure. The L2 escalation path is: a stop-gate hook that reads the REFLECTION_INSIGHT: meta_issue_rate=... marker from the latest reflection and warns when meta_issue_rate drops below a threshold for 3+ consecutive batches. File this as a kaizen issue before the PR ships. The filed issue is the mechanism that makes "escalate to L2 later" not a promise without enforcement.
Traces to: Scope Reduction Discipline gate — this is required to permit the L1-only scope of Tasks 1-5.


Testability Seam Map

BEHAVIOR: Cluster threshold check at filing time
LIVES IN: .claude/skills/kaizen-file-issue/SKILL.md — Step 2b (new sub-step)
TESTED IN: The behavior is LLM-driven skill text; the unit-testable part is the threshold decision (2+ matches → surface signal). An extraction seam would be a helper function evaluateClusterSignal(matches: Issue[]): ClusterSignal | null in a new src/cluster-detection.ts module.
TEST APPROACH: Unit test for evaluateClusterSignal (pure function, threshold logic). E2E test for the full skill invocation behavior.
SEAM: evaluateClusterSignal(matches) — injected with a static issue list, no GitHub calls in the unit test.
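That seam could look like the sketch below. The Issue and ClusterSignal shapes are assumptions for illustration, not the repo's actual types:

```typescript
// Hypothetical types — illustrative only, not the repo's definitions.
interface Issue {
  number: number;
  title: string;
  state: "open" | "closed";
}

interface ClusterSignal {
  size: number;
  members: number[]; // constituent issue numbers
}

// Advisory threshold: 2+ matches surface a cluster signal; otherwise null.
// Pure function — unit-testable with a static issue list, no GitHub calls.
function evaluateClusterSignal(matches: Issue[]): ClusterSignal | null {
  if (matches.length < 2) return null;
  return { size: matches.length, members: matches.map((m) => m.number) };
}
```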

BEHAVIOR: Recurrence escalation at evaluation time
LIVES IN: .claude/skills/kaizen-evaluate/SKILL.md — new Phase 0.5
TESTED IN: E2E only for the full skill invocation; the query logic is gh CLI. If a helper function buildRecurrenceQuery(issue: {title, labels}) is extracted to src/cluster-detection.ts, it can be unit tested.
TEST APPROACH: Unit test buildRecurrenceQuery — given an issue title and area label, it produces the correct gh search query string. E2E test for the full evaluation behavior against the fixture.
SEAM: buildRecurrenceQuery(issue) — pure function, testable without gh calls.
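A sketch of that query builder, assuming a simple keyword heuristic (drop short words, keep the first few distinctive title terms) — the heuristic and the exact flag set are assumptions, not decided interfaces:

```typescript
// Hypothetical helper — keyword selection here is an assumption.
function buildRecurrenceQuery(issue: { title: string; areaLabel: string }): string {
  // Keep the distinctive title terms: drop words of 3 characters or fewer.
  const keywords = issue.title
    .split(/\s+/)
    .filter((word) => word.length > 3)
    .slice(0, 4)
    .join(" ");
  return `gh issue list --state closed --search "${keywords}" --label "${issue.areaLabel}" --limit 5`;
}
```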

BEHAVIOR: Clustering step in explore-gaps prompt
LIVES IN: prompts/explore-gaps.md
TESTED IN: src/e2e/cluster-detection.test.ts
TEST APPROACH: E2E only — this is LLM-driven behavior in a prompt file. Per CLAUDE.md: cannot be unit tested.
SEAM: Pre-seeded fixture issues in Garsson-io/kaizen-test-fixture. The test runs claude -p with the prompt and asserts on filed-issue count and structure.

BEHAVIOR: meta_issue_rate metric emission
LIVES IN: prompts/reflect-batch.md
TESTED IN: Unit test for the structured marker parsing (if the harness parses meta_issue_rate=N/M); E2E test for the full reflection behavior.
TEST APPROACH: Unit test for parsing logic. E2E for the emission behavior.
SEAM: The REFLECTION_INSIGHT: parser in the auto-dent harness already exists. If meta_issue_rate=N/M is parsed, the seam is the parser function. If the harness doesn't parse this format, parsing must be added — but the marker emission is still testable via regex on the raw output.
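If the harness does need to parse the marker, the parsing seam could look like this sketch; the marker grammar is assumed from the format described above, not taken from the existing harness:

```typescript
// Parse "REFLECTION_INSIGHT: meta_issue_rate=<N>/<M>" into its two counts.
// Returns null for lines that are not this marker.
function parseMetaIssueRate(line: string): { metaIssues: number; totalFiled: number } | null {
  const m = /^REFLECTION_INSIGHT:\s*meta_issue_rate=(\d+)\/(\d+)\s*$/.exec(line.trim());
  if (!m) return null;
  return { metaIssues: Number(m[1]), totalFiled: Number(m[2]) };
}
```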

BEHAVIOR: Category Prevention Test section in meta-issue body
LIVES IN: .claude/skills/kaizen-file-issue/SKILL.md — meta-issue template
TESTED IN: E2E only — LLM-driven content generation.
TEST APPROACH: E2E — verify that a filed meta-issue body contains a ## Category Prevention Test section with non-empty, machine-checkable content.
SEAM: The filed issue body is readable via gh issue view — assertions on the section content are possible without reading the LLM internals.

RED FLAGS CHECKED:

  • kaizen-file-issue/SKILL.md is a prompt file, not TypeScript — no "5 imports" red flag applies.
  • Tasks 1-5 are prompt/skill additions. The only TypeScript is Task 6 (E2E test) and the optional helper functions in src/cluster-detection.ts.
  • src/e2e/cluster-detection.test.ts follows the existing E2E pattern in src/e2e/ (see setup-live.test.ts, issue-routing.test.ts).
  • No behavior is placed in a CLI entry point's main().

Design Decisions Persistence

What was decided

  1. Threshold-based keyword clustering over semantic similarity — the gh issue list --search API provides keyword matching at zero LLM cost. False positives are acceptable because the output is advisory, not blocking.

  2. Skill-text additions (L1) with a filed follow-up for L2 escalation — per the grand synthesis, L1 additions are the minimum viable change. They require no infrastructure, work immediately, and are faster to ship. The L2 escalation (stop-gate on meta_issue_rate) is filed as a kaizen issue, not deferred informally.

  3. Meta-issue template embedded in skill text — not in a separate file. Reduces the number of places agents need to look, and ensures the template is always read as part of the filing context.

  4. Category Prevention Test as a required meta-issue section — the question "what single test prevents this whole class?" is the highest-value output of #960. It must be in the meta-issue template, not left as an optional question.

  5. E2E test against kaizen-test-fixture — required because skill/prompt changes cannot be verified by unit tests alone (CLAUDE.md: behavioral tests for LLM-driven skills require claude -p invocations).

What was considered and rejected

  • LLM-based semantic similarity (Design Question A, Option A3): adds latency and cost to a step that should be fast. Rejected in favor of keyword search.
  • Post-merge stop gate trigger (Design Question B, Option B1): fires too late and too often. The right moment for cluster detection is before filing, not after merging.
  • Auto-labeling constituent issues with type:cluster-symptom (Design Question C, Option C3): API-intensive, partial failure leaves inconsistent labels. Rejected in favor of a meta-issue that links the symptoms.
  • Auto-triggering /kaizen-deep-dive on cluster detection (Design Question C, Option D2): too expensive; the cluster signal should prompt a human/agent decision, not automatically launch a full deep-dive cycle.

Honest assessment of the hypothesis

The hypothesis is correct but fragile. The kaizen-reflect Steps 2b/2c already contain category-naming logic, and yet Cluster A (5 hook-test issues filed months apart) still accumulated. This is evidence that L1 skill-text additions degrade under pressure. The L1 plan is the right first step (per grand synthesis Tier 1 sequencing), but an implementor should not assume it will permanently solve the recurrence problem. The L2 escalation follow-up (Task 7) is not optional — it is the mechanism that makes "escalate later" credible.

The issue's Level 4 output (the category-level test question) is probably the highest-leverage piece — not the detection machinery (Levels 1-3). An agent who files 5 issues and never asks "what single test would prevent the whole class?" has failed at the meta-level even if the individual issues are well-formed. The meta-issue template's Category Prevention Test section is the implementation of Level 4, and it must be non-optional.

Issue #962: CRITICAL: no mechanism to capture and file tool failures that occur inside hooks or during blocked sessions


Pre-Planning Research

Collision detection: No open PRs with titles matching "hook", "fail", "capture", "incident", "loss", or "silent". No in-flight work touches this domain.

Related issues read:

  • #934 — confirmed: Claude Code hard-checks stat(cwd) before shell spawn; deleted-CWD sessions cannot run any tool call including Stop hooks. This is one of three dead-zone scenarios.
  • #929 — review gate escape ergonomics (partially addressed by #961 context). Separate from #962.
  • #961 — MODULE_NOT_FOUND on npx tsx -e with relative imports. PostToolUse fires with exit_code=1 for this failure type, but all PostToolUse hooks return early when exit_code != 0.

Incidents from issue comment (2026-03-26): Two concrete failures during PR #956 — MODULE_NOT_FOUND on gate clear and simultaneous gate deadlock — were not auto-captured. Both required explicit user instruction to file issues.

Recent fix patterns from git log: No recent commits touching failure capture, canary, or session-alive patterns. The .kaizen-deferred-items.json pattern (gate-manager.ts) for KAIZEN_UNFINISHED was introduced in #775 — this is the closest existing analog.


DONE WHEN

GOAL: Tool failures (MODULE_NOT_FOUND, hook crashes, simultaneous gate deadlock, deleted-CWD session death) that currently vanish silently are automatically captured and surfaced to the kaizen improvement loop.

DONE WHEN: When a tool call fails with exit_code != 0 during an active session, the failure is written to a session-scoped failure queue file. When the next session starts (via the kaizen-check-wip.sh SessionStart hook), any unprocessed failures from prior sessions are displayed to the agent with instructions to file kaizen issues for each one.

An operator can verify this by: (1) running a command that produces a non-zero exit code, (2) starting a fresh session, and (3) observing that the SessionStart output includes an "UNRESOLVED TOOL FAILURES" section listing the failed command, its stderr, and the timestamp — without any manual intervention.


Information Retrieved (Codebase Survey)

  • src/hooks/hook-io.ts:HookInput — tool_response.exit_code, tool_response.stderr, and session_id are all present in the PostToolUse JSON payload. Session ID is available to scope captures to a session.
  • src/hooks/kaizen-reflect.ts:321-323 — if (exitCode !== '0') return null; — confirmed: all PostToolUse hooks skip processing when exit_code is non-zero. No hook currently does anything with failed tool calls.
  • src/hooks/pr-review-loop.ts:215-216 — same: return decide('ignore', 'non_zero_exit', null, { exitCode }); — failures are explicitly ignored.
  • src/hooks/lib/gate-manager.ts:handleUnfinishedEscape() — writes .kaizen-deferred-items.json to state dir. This is the existing cross-session persistence pattern. A failure queue can follow the same pattern.
  • src/hooks/lib/gate-manager.ts:readDeferredItems() / clearDeferredItems() — read/clear lifecycle for deferred items at SessionStart. Failure queue read/clear will mirror this.
  • .claude/hooks/kaizen-check-wip.sh — already reads and displays .kaizen-deferred-items.json at SessionStart and clears it after display. The surfacing mechanism already exists; we extend it, not replace it.
  • src/hooks/session-telemetry.ts:emitSessionEvent() — JSONL append pattern, injectable telemetryDir, best-effort, never blocks. This is the right model for failure queue writes.
  • .claude-plugin/plugin.json:PostToolUse[Bash] — six PostToolUse hooks fire on every Bash call. Adding a seventh for failure capture is the right registration point.
  • src/hooks/session-cleanup.ts — second SessionStart hook after kaizen-check-wip.sh; cleans stale state files. Failure queue cleanup belongs in this hook or in kaizen-check-wip.sh.
  • Issue #934 confirmed: CWD-deleted session — Claude Code hard-checks stat(cwd) before every shell spawn. PreToolUse fires before the command, so when CWD is deleted the Bash tool call never executes — meaning PostToolUse never fires for those commands. This means the PostToolUse failure capture approach only covers exit_code != 0 failures (MODULE_NOT_FOUND, command errors), NOT the CWD-deleted scenario. CWD deletion is handled separately via the heartbeat/canary pattern proposed in the issue.

What does NOT exist that must be built:

  1. A capture-tool-failure.ts PostToolUse hook that writes exit_code != 0 events to a failure queue.
  2. A failure queue file format and read/write/clear API in gate-manager.ts (or a new failure-queue.ts).
  3. A SessionFailureEvent type in session-telemetry.ts (or inline in the new module).
  4. Extension of kaizen-check-wip.sh (or session-cleanup.ts) to display and clear the failure queue at SessionStart.
  5. A bash shim kaizen-capture-tool-failure-ts.sh and registration in plugin.json.
  6. A heartbeat file written each N tool calls (PreToolUse) and checked at SessionStart, to detect CWD-deleted silent deaths — this covers the dead-zone scenario that PostToolUse capture cannot reach.

Design Alternatives Considered

Design Question A: Where is the failure captured?

OPTION A1: PostToolUse hook that fires on every Bash tool call and writes failures when exit_code != 0 — SELECTED.

Why it works: PostToolUse already fires after every Bash tool call and receives exit_code, stderr, tool_input.command, and session_id in its JSON payload. A dedicated capture hook can filter for exit_code != 0 and append to a failure queue without affecting other hooks.

Failure mode if wrong: If Claude Code's Bash tool invocation in a failing state (MODULE_NOT_FOUND) does not deliver a PostToolUse event at all, the capture will never fire. However, based on issue #934's confirmed behavior (only CWD-deleted calls skip PostToolUse), MODULE_NOT_FOUND and other non-zero exit failures do fire PostToolUse — confirmed by pr-review-loop.ts deliberately ignoring them.

OPTION A2: At the stop gate — read hook exit codes and stderr retroactively — REJECTED.

Rejected because: by the time Stop fires, hooks have already returned; there is no mechanism to retroactively gather stderr from previously-run tool calls. The stop gate operates on gate state files, not on a transcript of tool executions. Additionally, sessions that are stuck (CWD deleted, gate deadlock) never reach Stop — the core dead-zone problem. A retroactive reader at Stop can only see what Stop can reach; it cannot recover from sessions that never reach Stop.

OPTION A3: A dedicated PreToolUse hook that wraps every hook call — REJECTED.

Rejected because: PreToolUse fires BEFORE the tool executes; it cannot observe the outcome of a tool call. It can only inspect the command to be run. Failure information (exit_code, stderr) is only available in PostToolUse. A PreToolUse failure-capture hook would need a second mechanism to correlate its pre-call observation with the post-call outcome, adding complexity for no gain over A1.

OPTION A4: Claude's stdout/stderr buffering — REJECTED.

Rejected because: hooks cannot reliably capture the orchestrating Claude process's stdout/stderr. The hook receives only what Claude Code passes in the JSON payload. There is no buffering API.

Design Question B: Where is the failure stored?

OPTION B1: A session-scoped JSONL file (e.g., $STATE_DIR/session-failures.jsonl) — SELECTED.

Why it works: The state dir (/tmp/.pr-review-state/) is already the canonical location for cross-hook ephemeral state. Files there survive hook restarts, are not committed, and can be read at SessionStart. JSONL (one JSON object per line) allows concurrent appends without file locking (append is atomic for small writes on Linux). The existing emitSessionEvent() pattern uses the same append-JSONL approach. Session ID from HookInput scopes each entry to its originating session.

Failure mode if wrong: If /tmp is cleared between sessions (reboot, tmpfs flush), failures from the prior session are lost before SessionStart can surface them. However: (a) the issue's most frequent scenario is same-machine session restart within hours, not cross-reboot; (b) KAIZEN_UNFINISHED uses the same /tmp location and has not been flagged for this; (c) the alternative (project-local file in .kaizen/) is a git-adjacent write from a hook in a worktree — dirty file detection hooks would flag it.

OPTION B2: New gate type in gate-manager.ts — REJECTED.

Rejected because: gate types (needs_review, needs_pr_kaizen, needs_post_merge) are workflow gates that block the Stop hook and require agent action to clear. Tool failures are not gates — they don't block work, they are advisory notifications that need to be filed as kaizen issues. Mixing gate semantics (blocking) with failure notification semantics (advisory) in the same data structure conflates two concepts and creates a risk that a high volume of failures (a session with 20 erroring commands) would produce 20 stop gate blocks.

OPTION B3: Posted directly to a GitHub issue — REJECTED.

Rejected because: posting requires network access, GitHub auth, and a working gh CLI — none of which are guaranteed during the failure event (e.g., MODULE_NOT_FOUND suggests the TS/Node environment is broken, and git/CWD issues may preclude gh as well). Writing to a local file is robust to partial environment failures, whereas a network-dependent write would itself fail silently in the scenarios we're trying to capture.

OPTION B4: Written to stderr for the orchestrating claude process — REJECTED.

Rejected because: stderr from hooks is discarded (the trampoline uses 2>/dev/null). Even if it weren't, there is no mechanism to collect stderr from individual hooks and replay it at the next session start.

Design Question C: When/how does it surface to the admin?

OPTION C1: On next SessionStart via kaizen-check-wip.sh extension — SELECTED.

Why it works: kaizen-check-wip.sh already reads and displays .kaizen-deferred-items.json at session start, following the exact pattern we need. The admin (agent) sees unresolved failures immediately on the next session, with enough context (command, stderr, timestamp) to file kaizen issues. The agent is in a clean state and can act on the notifications immediately. The DEFERRED_ITEMS display pattern is already proven.

Failure mode if wrong: If the admin never starts a new session after a dead session (the worktree stays orphaned indefinitely), the failures are never surfaced. This is an acceptable limitation: the issue's goal is "close the loop after restart," not "capture failures in a dead session."

OPTION C2: Via kaizen-reflect at session end — REJECTED.

Rejected because: kaizen-reflect fires after gh pr create/merge — not at session end. It runs inside the gated reflection flow that requires a PR to exist. A session that dies mid-implementation (CWD deleted, MODULE_NOT_FOUND before PR creation) never triggers kaizen-reflect. This is exactly the dead zone the issue describes.

OPTION C3: Via a new PostToolUse hook after every tool call — REJECTED.

Rejected because: surfacing notifications via PostToolUse advisory output during the same session creates noise — the agent already sees the failure as a tool error in its context. The goal is cross-session persistence for failures that were not manually filed. PostToolUse is the right capture point, not the right surfacing point.

OPTION C4: Only when explicitly queried — REJECTED.

Rejected because: this recreates the current state (failures are already "visible" if you know where to look). The issue explicitly requires automated surfacing without manual user direction.

Design Question D: Heartbeat/canary for CWD-deleted dead sessions (separate from PostToolUse capture)

OPTION D1: PreToolUse hook writes a timestamp-based heartbeat file on every Bash call; SessionStart checks if a prior session's heartbeat is stale (older than N minutes) with no corresponding Stop event — SELECTED for dead-session detection.

Why it works: When CWD is deleted, PostToolUse never fires (Claude Code hard-checks stat(cwd) first per #934). A heartbeat written in PreToolUse (which fires before the tool call and before the CWD check) captures that the session was alive. If the next SessionStart finds a heartbeat file older than the session timeout with no corresponding clean stop, it knows to surface a "prior session may have died unexpectedly" notice.

Failure mode if wrong: If Claude Code also hard-checks CWD before PreToolUse (not just before executing the bash subprocess), heartbeats would also stop firing on CWD deletion. However: PreToolUse is a hook that Claude Code evaluates, not a shell spawn — the CWD check in #934 is specifically for the Bash tool shell spawn. PreToolUse hooks (TypeScript via trampoline) use an absolute path and don't depend on CWD. Acceptable uncertainty; if wrong, the heartbeat is a no-op, not a regression.

OPTION D2: Scope the minimum viable implementation to PostToolUse capture only; defer heartbeat — SELECTED for V1 scope.

The three concrete incidents from the issue body split into two categories:

  1. MODULE_NOT_FOUND and gate deadlock: PostToolUse fires, exit_code != 0. Covered by capture hook.
  2. CWD-deleted silent death: PostToolUse never fires. Requires heartbeat.

The heartbeat requires a new PreToolUse hook with state-dir writes, a session-scoped state file, and SessionStart correlation logic. This is a second independent mechanism. For V1, implement PostToolUse capture (addresses 2 of 3 incidents) and document the heartbeat as a follow-on (addresses 1 of 3).


Hypothesis

HYPOTHESIS: The proposed fix assumes that tool failures currently emit no persistent signal that survives session termination. If a PostToolUse hook captures failures with exit_code != 0 and writes them to a session-scoped file, and if SessionStart reads and displays those files, then failures that were previously invisible will be surfaced to the next session without manual user intervention.

VALIDATION: Empirically confirmed without building: (1) All PostToolUse hooks (kaizen-reflect.ts:323, pr-review-loop.ts:216, pr-kaizen-clear.ts:468) return early (return null or return decide('ignore', ...)) when exit_code != 0. Zero hooks currently do anything with failed tool calls. (2) There is no session-failures.jsonl or equivalent file anywhere in /tmp/.pr-review-state/ or .kaizen/. (3) Session telemetry (data/telemetry/events.jsonl) contains only session.stop_gate events — no failure events of any kind. (4) The issue's concrete incidents (MODULE_NOT_FOUND, gate deadlock) are confirmed to have required explicit user direction to file issues, consistent with zero automated capture.

IF WRONG: Evidence would be the existence of a failure queue file or a session.tool_failure event type in session-telemetry.ts that was missed in the survey. None found. The hypothesis is confirmed: failures are not captured anywhere.

Secondary hypothesis (dead-zone for CWD deletion): HYPOTHESIS: When a worktree's CWD is deleted, PostToolUse does not fire for subsequent Bash tool calls, so a PostToolUse-only capture approach cannot detect these session deaths.

VALIDATION: Confirmed by issue #934 analysis: "Claude Code hard-checks stat(cwd) before spawning shell" — the Bash tool invocation fails before execution begins, meaning PostToolUse never fires. This means the PostToolUse capture approach is necessary but not sufficient for the full dead-zone coverage; a heartbeat mechanism is required for CWD-deletion detection.


Implementation Plan

Phase 1: Failure Queue Infrastructure (addresses MODULE_NOT_FOUND and gate deadlock incidents)

Task 1.1 — Add SessionToolFailureEvent to session-telemetry.ts

Add a new event type session.tool_failure with fields: session_id, command (truncated to 200 chars), exit_code, stderr (truncated to 500 chars), timestamp. This type extension follows the existing SessionEvent union and does not change emitSessionEvent() behavior.

Traces to DONE WHEN: the failure queue entries are typed and structured, enabling SessionStart to render them with enough context to file a kaizen issue.
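A discriminated-union extension of this kind might look like the sketch below. The existing SessionEvent union is paraphrased (only its discriminant is assumed); the new member carries the fields Task 1.1 lists:

```typescript
// Sketch of the Task 1.1 type extension; the existing union member is abbreviated.
interface SessionStopGateEvent {
  type: "session.stop_gate"; // existing event kind, shape abbreviated here
  session_id: string;
  timestamp: string;
}

interface SessionToolFailureEvent {
  type: "session.tool_failure";
  session_id: string;
  timestamp: string;
  command: string; // truncated to 200 chars at capture time
  exit_code: number | string;
  stderr: string; // truncated to 500 chars at capture time
}

type SessionEvent = SessionStopGateEvent | SessionToolFailureEvent;

// Narrowing on the discriminant keeps emitSessionEvent() callers type-safe:
function describeEvent(e: SessionEvent): string {
  return e.type === "session.tool_failure"
    ? `exit ${e.exit_code}: ${e.command}`
    : e.type;
}
```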

Task 1.2 — Create src/hooks/lib/failure-queue.ts

New module following the gate-manager.ts pattern. Exports:

  • appendToolFailure(entry: ToolFailureEntry, queueDir?: string): void — append one JSONL line; best-effort, never throws
  • readToolFailures(queueDir?: string): ToolFailureEntry[] — read and parse all entries; returns [] on error
  • clearToolFailures(queueDir?: string): void — unlink the queue file; best-effort

Storage: $STATE_DIR/session-failures.jsonl (injectable queueDir for test isolation).

ToolFailureEntry shape:

interface ToolFailureEntry {
  session_id: string;
  timestamp: string;
  command: string;      // truncated at 200 chars
  exit_code: number | string;
  stderr: string;       // truncated at 500 chars
  branch: string;
}

Traces to DONE WHEN: the storage API exists and is testable in isolation.

Task 1.3 — Create src/hooks/capture-tool-failure.ts

New PostToolUse hook. Logic:

  1. Read HookInput — if exit_code is 0 or undefined, exit immediately (fast path for the 99% case).
  2. If exit_code != 0: extract session_id, command (from tool_input.command), stderr (from tool_response.stderr), branch (from getCurrentBranch()).
  3. Call appendToolFailure(entry).
  4. Exit 0 (advisory, never blocks).

The hook deliberately does not inspect what the command was (no allowlist filtering) — every non-zero exit on Bash is a potential failure worth capturing. Noise is filtered at surfacing time (SessionStart), not at capture time.

Traces to DONE WHEN: failures are written to the queue after each non-zero exit Bash tool call.
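The extracted core named in Task 2.2 (processCaptureInput) might look like the sketch below. The HookInput field paths follow the codebase survey; the injected write callback stands in for appendToolFailure and is an illustration choice, not the plan's literal signature:

```typescript
// Sketch of the Task 1.3 core logic with the Task 2.2 seam; shapes are assumed.
interface HookInput {
  session_id: string;
  tool_input?: { command?: string };
  tool_response?: { exit_code?: number | string; stderr?: string };
}

interface CapturedEntry {
  session_id: string;
  timestamp: string;
  command: string;
  exit_code: number | string;
  stderr: string;
  branch: string;
}

interface CaptureOptions {
  branch?: string;
  write: (entry: CapturedEntry) => void; // injected so tests need no real FS
}

export function processCaptureInput(input: HookInput, opts: CaptureOptions): boolean {
  const exitCode = input.tool_response?.exit_code;
  // Fast path: success or missing exit code means nothing to capture.
  if (exitCode === undefined || exitCode === 0 || exitCode === "0") return false;
  opts.write({
    session_id: input.session_id,
    timestamp: new Date().toISOString(),
    command: (input.tool_input?.command ?? "").slice(0, 200),
    exit_code: exitCode,
    stderr: (input.tool_response?.stderr ?? "").slice(0, 500),
    branch: opts.branch ?? "unknown",
  });
  return true;
}
```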

Task 1.4 — Create bash shim .claude/hooks/kaizen-capture-tool-failure-ts.sh

Follow the trampoline pattern:

#!/bin/bash
source "$(dirname "$0")/lib/scope-guard.sh"
KAIZEN_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
exec npx --prefix "$KAIZEN_DIR" tsx "$KAIZEN_DIR/src/hooks/capture-tool-failure.ts" 2>/dev/null

Traces to DONE WHEN: the hook is invocable from Claude Code.

Task 1.5 — Register in .claude-plugin/plugin.json

Add to PostToolUse[Bash].hooks array (after the existing 6 hooks):

{
  "type": "command",
  "command": "${CLAUDE_PLUGIN_ROOT}/.claude/hooks/kaizen-capture-tool-failure-ts.sh",
  "timeout": 5
}

Timeout of 5 seconds (capture is a file append — should be <10ms; 5s is headroom for slow /tmp).

Traces to DONE WHEN: the hook fires during real sessions.

Task 1.6 — Extend kaizen-check-wip.sh to read and display failure queue

After the existing deferred-items display block, add:

FAILURE_QUEUE="${STATE_DIR:-/tmp/.pr-review-state}/session-failures.jsonl"
if [ -f "$FAILURE_QUEUE" ]; then
  FAILURE_COUNT=$(wc -l < "$FAILURE_QUEUE" 2>/dev/null || echo 0)
  if [ "$FAILURE_COUNT" -gt 0 ]; then
    echo "UNRESOLVED TOOL FAILURES from prior sessions ($FAILURE_COUNT):"
    echo "These failures were not filed as kaizen issues. File issues for each:"
    echo ""
    # Parse and display with jq (already a dependency in kaizen hooks)
    jq -r '"  [\(.timestamp)] exit=\(.exit_code) branch=\(.branch)\n  cmd: \(.command)\n  stderr: \(.stderr)\n"' \
      "$FAILURE_QUEUE" 2>/dev/null
    echo "File issues: gh issue create --repo \"$ISSUES_REPO\" ..."
    echo ""
    rm -f "$FAILURE_QUEUE"
  fi
fi

Note: the rm -f after display mirrors the .kaizen-deferred-items.json pattern. Once displayed, the agent is responsible for filing issues; the queue is cleared.

Traces to DONE WHEN: the admin sees failures on next SessionStart.

Phase 2: Tests (required before merge — 100% coverage of new code per CLAUDE.md)

Task 2.1 — src/hooks/lib/failure-queue.test.ts

Test file for failure-queue.ts. Invariants to cover:

  • appendToolFailure writes a valid JSONL line to the queue file
  • appendToolFailure creates the queue file if it does not exist
  • appendToolFailure appends (does not overwrite) on repeated calls
  • appendToolFailure is silent when queueDir path is unwriteable (best-effort)
  • readToolFailures returns [] when file does not exist
  • readToolFailures returns [] when file is malformed
  • readToolFailures returns all entries when file has N valid lines
  • clearToolFailures removes the file when it exists
  • clearToolFailures does not throw when file does not exist
  • Commands are truncated to 200 chars; stderr truncated to 500 chars

Seam: injectable queueDir parameter on all three functions — tests use mkdtempSync().

Task 2.2 — src/hooks/capture-tool-failure.test.ts

Test file for capture-tool-failure.ts. Extract the core logic into processCaptureInput(input: HookInput, options: {queueDir?, branch?}): boolean (returns true if a failure was captured, false if skipped). Tests:

  • Returns false (no write) when exit_code is 0
  • Returns false when exit_code is undefined
  • Returns true and writes entry when exit_code is 1
  • Returns true and writes entry when exit_code is non-zero string ("2", "127")
  • Captured entry contains session_id, command (truncated), stderr (truncated), branch
  • Returns false without throwing when queueDir is unwriteable

Seam: processCaptureInput() is a pure function with injected queueDir — no real filesystem needed.

Task 2.3 — src/hooks/session-telemetry.test.ts extension

Add tests for the new SessionToolFailureEvent type (that emitSessionEvent round-trips it correctly). This test file already exists; add a describe block.

Task 2.4 — Bash shim smoke test

Add to src/hooks/wrapper-smoke.test.ts (already tests that bash shims are executable):

  • Verify kaizen-capture-tool-failure-ts.sh exists and is executable
  • Verify it exits 0 when given a non-failure payload (fast-path test)

Testability Seam Map

| Behavior | Lives in file | Test file | Seam (injection point) |
| --- | --- | --- | --- |
| Append failure to queue | src/hooks/lib/failure-queue.ts: appendToolFailure() | src/hooks/lib/failure-queue.test.ts | queueDir parameter — tests pass mkdtempSync() path |
| Read failure queue | src/hooks/lib/failure-queue.ts: readToolFailures() | src/hooks/lib/failure-queue.test.ts | queueDir parameter — tests pre-populate temp dir |
| Clear failure queue | src/hooks/lib/failure-queue.ts: clearToolFailures() | src/hooks/lib/failure-queue.test.ts | queueDir parameter |
| Filter and capture non-zero exits | src/hooks/capture-tool-failure.ts: processCaptureInput() (extracted) | src/hooks/capture-tool-failure.test.ts | options.queueDir and options.branch injected — no real git or FS |
| Display failures at SessionStart | .claude/hooks/kaizen-check-wip.sh | .claude/hooks/tests/test-check-wip.sh (extend or create) | STATE_DIR env var — tests set it to a temp dir |
| Truncate long command/stderr | src/hooks/lib/failure-queue.ts: appendToolFailure() | src/hooks/lib/failure-queue.test.ts | Direct function call with long strings |
| Register hook in plugin | .claude-plugin/plugin.json | src/hooks/bash-ts-parity.test.ts (extend) | File read + JSON parse — no seam needed |

Red flag check: capture-tool-failure.ts main() is a CLI entry point — if the core logic is inline in main(), it would require mocking process.stdin and filesystem. The plan explicitly adds an extraction step (Task 2.2's processCaptureInput) before the test is written. The extraction is not optional.


What Was Ruled Out and Why

Heartbeat/canary pattern for CWD-deleted sessions: The issue proposes this as a fix for the dead-zone where Stop never fires. The survey confirmed it is necessary for one of the three concrete incident types (CWD deleted). It was ruled out of V1 scope because: (1) it requires a PreToolUse hook (new hook position — different registration than PostToolUse) writing a heartbeat file on every single Bash call, plus SessionStart correlation logic, plus a "stale heartbeat" detector with session timeout heuristics; (2) the two higher-priority incidents (MODULE_NOT_FOUND, gate deadlock) are covered by PostToolUse capture; (3) the heartbeat adds write overhead to every Bash PreToolUse call (currently 8 hooks fire on every Bash PreToolUse). File a follow-on issue referencing this plan.

Posting directly to GitHub on failure: Rejected (Design Question B, Option B3) — network writes from inside a failing environment are themselves likely to fail, and the hook must remain best-effort.

Adding a new gate type to gate-manager.ts: Rejected (Design Question B, Option B2) — failures are notifications, not blocking gates. Adding failure entries to the gate report would block Stop for every non-zero command exit, including routine ones like grep returning exit code 1.

Filtering which commands count as "real" failures: Decided against. A filter (e.g., only capture if exit_code >= 2, or only commands matching certain patterns) would be a policy decision that is wrong for some failures and correct for others. The SessionStart display can include the command and stderr so the agent can determine relevance when filing issues. Noise at capture is acceptable; filtering at display is better.

Using the existing hooks.jsonl (hook-telemetry.sh) for failure storage: hooks.jsonl captures hook execution performance (hook exit codes, not tool exit codes). The hook exit codes it records are whether the kaizen hook itself succeeded, not whether the Bash command the agent ran succeeded. These are different things. A separate session-failures.jsonl is correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment