Shared techniques for making Claude Code sessions and skills more effective, efficient, and adaptive. Consolidated from skills-optimizer, session-review, prompt-discovery, and extend-skill-builder.
A system is viable when it sustains three properties simultaneously:
| Dimension | Question | Failure mode |
|---|---|---|
| Effective | Is it achieving the right goals? | Doing the wrong thing (misunderstood intent, wrong approach) |
| Efficient | Is it consuming only what's necessary? | Doing the right thing expensively (token waste, wrong tools) |
| Adaptive | Is it learning and evolving? | Learning the wrong lessons — overfitting to noise, contradictory rules, ossifying on stale patterns |
Remove any one of the three and the system degrades. Effective but inefficient bankrupts you. Efficient but non-adaptive ossifies. Adaptive but ineffective optimizes the wrong thing.
The Plan-Act-Reflect loop sustains all three: Plan sets intention (effectiveness), Act executes through minimal-overhead tools (efficiency), Reflect evaluates and improves (adaptivity).
The single highest-ROI optimization. Before touching any prompt, audit for leaked plumbing.
Scripts are plumbing: fetch, transform, validate, count, parse, compare, extract, format. Deterministic. Zero token cost at runtime.
Prompts are intelligence: interpret, classify with ambiguity, create, decide with incomplete info, evaluate quality, synthesize meaning. Token-expensive but necessary.
The test: Could a domain expert define the complete mapping from inputs to outputs in advance? Two questions:
- Is the correct output fully determined by the input? (no external judgment needed)
- Could you enumerate the rules exhaustively? (not "would it be tedious" but "is it possible in principle")
Both yes → plumbing. Either no → intelligence.
This handles edge cases well: regex-for-meaning fails not because it's nondeterministic, but because you can't enumerate the mapping — natural language meaning isn't a closed set. Format templates pass because the mapping is fully specifiable even if it's complex. The secondary check "could you write a unit test?" is useful but needs sharpening: the real version is "could you write tests that cover ALL valid inputs" — for intelligence work, the answer is always no.
- Format templates the model is rendering that a script could produce (e.g., a `--format markdown` flag)
- Data transformation: parsing structured data (NDJSON, JSON, CSV) and reformatting it
- Aggregation: counting items, computing dates/ages, summarizing data
- Inline `python3 -c` calls that repeat the same shape 2+ times
- CLI wrappers: inline Python that wraps `gh`, `jq`, `curl` is double waste
- Anything requiring judgment, interpretation, or context-dependent decisions
- Domain framing, "why" explanations, and examples (these enable judgment)
- Classification of meaning (regex for meaning = worst of both worlds)
- NDJSON on stdout, structured errors on stderr, exit codes 0/1/2
- `--help` as a concise contract
- Zero interactivity, no decoration on stdout
- One script does one thing well
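A minimal sketch of a script that honors this contract. The script name, the aggregation field, and the use of exit code 2 for bad input are illustrative assumptions, not taken from an existing skill:

```python
#!/usr/bin/env python3
"""count_by_field.py -- illustrative plumbing script following the contract above."""
import argparse
import json
import sys
from collections import Counter


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Read NDJSON records on stdin; emit one NDJSON count per value of --field on stdout."
    )
    parser.add_argument("--field", default="status", help="field to aggregate on")
    args = parser.parse_args()

    counts = Counter()
    for line_no, line in enumerate(sys.stdin, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            # Structured error on stderr; exit code 2 reserved for bad input (assumed convention).
            print(json.dumps({"error": "invalid NDJSON", "line": line_no, "detail": str(exc)}), file=sys.stderr)
            return 2
        counts[record.get(args.field, "unknown")] += 1

    # NDJSON on stdout, no decoration, nothing interactive.
    for value, count in sorted(counts.items(), key=lambda kv: str(kv[0])):
        print(json.dumps({args.field: value, "count": count}))
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Invoked as plumbing (`cat findings.ndjson | scripts/count_by_field.py --field status`), the counting costs zero tokens at runtime.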
Expected impact: 10-20% cost reduction. In tested optimizations, plumbing extraction consistently outperformed all prompt compression combined.
Context loads in tiers. Each tier costs more. Push knowledge to the cheapest tier that works.
| Tier | When it loads | Cost | Examples |
|---|---|---|---|
| Always | Every turn | Highest | System prompt, tool schemas, CLAUDE.md, MEMORY.md (first 200 lines), skill descriptions |
| On demand | When triggered | Medium | SKILL.md body, .claude/rules/*.md, memory topic files |
| Isolated | Never enters main context | Zero | Subagent context (only summary returns), worktree file state |
- Push down tiers. If it doesn't need to be in every turn, move it out of CLAUDE.md. If it doesn't need to be in the main context, delegate to a subagent.
- Delegate reading before analysis. If you're about to read files then analyze, delegate the reading. Parent stays lean.
- Subagents are isolated context windows. A subagent can read 50 files and the parent only sees the summary. This is their primary value.
- CLAUDE.md is the most expensive real estate. Keep it under 200 lines. Move domain-specific rules to skills or path-scoped rules (`.claude/rules/*.md`).
- MEMORY.md is a routing index. Never inline detailed patterns. Point to topic files.
- Skills cost 0 tokens when idle. The most efficient place to store domain expertise.
You cannot predict how an LLM will interpret instructions. The only way to know what a prompt does is to run it.
- Start with examples. 2-3 realistic scenarios from the user.
- Derive success criteria from examples (don't ask users to define them abstractly).
- Write the minimal seed. Role + core task + one output example if format matters. Nothing else.
- Test all cases. Note HOW each failure occurs.
- Fix one failure with the minimum addition. Rerun ALL tests.
- Repeat until all pass. Most converge in 1-3 rounds.
- Compress. Delete each instruction one at a time. Still pass? Keep it deleted.
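A sketch of what the test-and-rerun steps can look like when driven from the `anthropic` Python SDK; the seed prompt, the cases, the binary checks, and the model name are all illustrative placeholders, not prescribed by this guide:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SEED_PROMPT = "You are a release-notes writer. Summarize the diff into user-facing bullet points."

# Each case pairs a realistic input with a binary check -- pass/fail, no scales.
CASES = [
    {"input": "diff adding a --dry-run flag", "check": lambda out: "--dry-run" in out},
    {"input": "diff fixing a typo in internal comments", "check": lambda out: "no user-facing" in out.lower()},
]


def run_all(prompt: str) -> list[tuple[int, bool, str]]:
    """Run every case against the current prompt and keep the raw output for failure analysis."""
    results = []
    for i, case in enumerate(CASES):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            system=prompt,
            messages=[{"role": "user", "content": case["input"]}],
        )
        output = response.content[0].text
        results.append((i, case["check"](output), output))
    return results


for case_id, passed, output in run_all(SEED_PROMPT):
    print(f"case {case_id}: {'PASS' if passed else 'FAIL'}")
    if not passed:
        print(f"  failure mode: {output[:200]}")  # note HOW it failed before fixing anything
```

After each one-instruction fix, rerun the whole set, never just the case that failed.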
Skills should tell the model what to achieve and why, not how step-by-step. The model already knows how to use most tools — your job is to define the goal and the reasoning behind constraints. This aligns with Anthropic's skill-builder approach: outcome-driven prompts with context on why.
"Why" explanations are load-bearing. Overview sections and domain framing in skills aren't filler — they enable the model to exercise judgment on edge cases it hasn't seen. A model that knows "we do X because of compliance requirement Y" can generalize to novel situations. A model that only knows "do X" can't. When optimizing for efficiency, compress procedures first and protect the "why."
Skills that survived optimization consistently kept three things: outcomes (what success looks like), why (reasoning behind constraints), and guardrails (non-obvious gotchas discovered through test failures). Everything else was removable.
- Remove instructions that teach the model things it already knows (e.g., how to use `gh`, `curl`, `jq`)
- Replace verbose step-by-step procedures with outcome statements (keep the "why", cut the "how")
- Remove defensive padding ("Make sure to...", "Remember to...", "It is important that...")
- Compress repetitive instructions into a single rule
- Remove examples the model doesn't need to produce correct output
- Rewriting the entire skill from scratch
- Adding 10 new rules at once
- Adding vague instructions ("be more creative", "handle appropriately")
- Removing domain framing, overview sections, or "why" explanations — these are load-bearing, not filler. They enable judgment on edge cases and generalization to novel situations. Compress procedures, protect the "why."
- Compressing format templates (confirmed load-bearing across multiple skills -- the model needs the full example to reproduce the format)
- RFC modal verbs: MUST, SHOULD, MAY, MUST NOT (activates specification-class training data)
- Numbered lists over prose for multi-step instructions
- Imperative voice: "Read the file" not "You should read the file"
- Specificity over abstraction: exact paths, exact formats, exact names
- Positives over negatives: "Always use snake_case" not "Don't use camelCase" (negation is weaker than assertion -- "don't do X" makes X the most salient token)
| Pattern | Problem |
|---|---|
| Wall of Text | 500-line prompt before any testing. Contains contradictions and instructions the model would follow anyway. |
| Defensive prompting | "Make sure to...", "Don't forget to..." signals anxiety, not clarity. |
| Cargo-culting | Copying patterns that worked elsewhere without testing if needed here. |
| Explaining the model to itself | "You are an LLM that processes text..." -- the model knows. Say what to DO. |
| Premature abstraction | Handling 10 scenarios when you've encountered 2. Handle the 2. |
| Success criteria bloat | Too many criteria program the exact path, eliminating the model's ability to find creative solutions. |
Model selection is a forked search, not a single swap. A single search can't explore model+prompt combinations without bundling (loses isolation) or sequencing (hits local maxima where an Opus-optimized prompt fails on Sonnet not because Sonnet can't, but because the prompt was tuned for a different model).
- Optimize prompt fully on current model first. Get it stable and converged.
- Fork per candidate model. Run independent optimization loops — each model gets its own prompt adaptations.
- Bail early if baseline pass rate < 70% on a candidate model.
- Compare converged results across forks. Best cost/effectiveness tradeoff wins. Compare WHICH evals fail, not just how many — failure mode shifts matter.
- Haiku warning: Qualitatively different failure modes — hallucinated tool names, skipped multi-step reasoning, missed conditional logic. It's not "cheaper Sonnet."
- Ship it: Add `model: <model>` to SKILL.md frontmatter. Per-turn routing -- only skill invocations use the cheaper model.
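Shipping it can be as small as one frontmatter line. The skill name, description, and model identifier below are illustrative, and the exact frontmatter schema depends on your Claude Code version:

```yaml
---
name: changelog-formatter
description: Turn merged PR titles into release-changelog bullets
model: claude-haiku-4-5   # winner of the forked search; only this skill's invocations use it
---
```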
No scales, no vibes. Every check is pass or fail. Scales compound variability. 3-6 evals is the sweet spot -- more than that and the skill starts gaming the checklist.
Never bundle unrelated changes. If the result regresses, you can't isolate which change caused it. A hypothesis can involve coupled changes (model + prompt are a single unit), but plumbing extraction + a behavioral fix are two separate experiments.
Never trade pass rate for tokens. Pass rate drop > 5% = automatic discard. A skill that's correct and expensive is better than one that's cheap and wrong.
Never characterize waste as "minor", "small", or "acceptable." Count the wasted calls. State the token cost. Propose the fix. Let the user decide severity.
| Waste Type | How to report |
|---|---|
| 1-2 wasted tool calls | Count them, name them, propose the fix |
| 3+ wasted tool calls | Flag prominently with approximate token cost |
| Wasted conversation turn | Highest severity -- user time is the scarcest resource |
| Redundant file reads | Count occurrences, estimate tokens wasted |
| Wrong tool for the job | Name the correct tool, count occurrences |
Route learnings to the cheapest, most appropriate layer.
| Priority | Layer | Cost | Use when |
|---|---|---|---|
| 1 | Script | Zero (deterministic) | Repeatable plumbing, data transformation |
| 2 | Skill | Zero when idle | Reusable workflow, domain expertise (3+ occurrences) |
| 3 | Hook | Zero (OS-level) | Rule that keeps being violated despite advisory |
| 4 | Path-scoped rule | Zero when out of scope | Rules specific to file types or directories |
| 5 | CLAUDE.md | High (always loaded) | Cross-cutting rules for every session |
| 6 | Memory topic file | Medium (on demand) | Facts, preferences, patterns that don't fit elsewhere |
| 7 | MEMORY.md | High (always loaded) | Routing pointers only, never detailed patterns |
Routing questions:
| Question | If yes |
|---|---|
| Deterministic operation? | Script |
| Applies to every session? | CLAUDE.md |
| Specific to a domain with an existing skill? | Skill update |
| Rule that keeps being violated? | Hook |
| Specific to certain file types? | Path-scoped rule |
| Everything else? | Memory topic file |
Every rule, skill, hook, and memory entry increases system complexity. A system that only accumulates exceeds its own capacity to regulate itself.
- Subsumption. A new learning generalizes over existing rules. Three specific rules become one principle. The count goes down.
- Delegation. Moving knowledge into a skill hides complexity from the central system. CLAUDE.md doesn't need details, just when to invoke.
- Pruning. Rules the model follows without needing the instruction get removed. Memory encoded into skills gets deleted. Hooks guarding solved problems get retired.
- Composition. One rule handling a class of problems beats ten rules handling ten cases. Prefer general principles over specific instructions.
The healthiest optimization output may be a net reduction in system complexity.
For every finding, ask "why?" at least twice:
- Surface: "Ran a nonexistent command" -> Why? "Guessed instead of reading docs" -> Why? "Skill docs reference a command that was never built" -> Fix: update the skill docs
- If the root cause is a broken tool, wrong docs, or missing automation, the fix is code/config -- not a note about being more careful.
After cataloging findings, ask: "Do any share a root cause?" Group by shared upstream cause. If N symptoms trace to one cause, generate one fix -- not N. Fewer, higher-leverage fixes beat many scattered patches.
Extended thinking consumes output-priced tokens. Control it.
- Effort-level overrides: Set `effort: low` in a skill's frontmatter to suppress extended thinking for mechanical tasks (scripts, utilities, classification). Thinking tokens can be 10k-50k per request; suppressing them halves response cost.
- Plan mode shifts cost: Plan mode uses read-only exploration without extended thinking charges. Shifts thinking cost from exploration (cheap) to implementation (where it matters).
- Skill-level effort overrides session effort for that invocation only. Mechanical skills should almost always be `effort: low`.
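As a sketch, the frontmatter of a mechanical skill might carry the override like this; the name, description, and wording are illustrative:

```yaml
---
name: run-report
description: Run the report script and return its output verbatim
effort: low   # mechanical task -- no extended thinking for this invocation
---
```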
- `/clear` between unrelated tasks. Stale context from previous work wastes tokens on every subsequent message. `/rename` before clearing to save the session, then `/resume` later if needed. Particularly effective after context-heavy work (large PRs) followed by quick tasks.
- Custom compaction instructions: `/compact Focus on code samples and API usage` tells Claude what to preserve during summarization. Add a `# Compact instructions` section to CLAUDE.md to make it permanent.
- Early compaction: `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=50` triggers compaction at 50% instead of 95% for proactive management.
- Subagent compaction is independent. It doesn't affect the main conversation; long-running researcher subagents can explore freely without dragging down your context.
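A permanent version of that compaction guidance could look like the following CLAUDE.md section; only the heading comes from the tip above, the wording inside is an assumed example to adapt to your project:

```markdown
# Compact instructions

When compacting, preserve code samples, API usage notes, and the list of files already modified.
Drop exploratory file reads that did not lead to an edit.
```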
CLAUDE.md is cached after first load and reused across messages in the same session. For multi-turn conversations, the amortized cost of instruction loading approaches zero. Subagents of the same type benefit from prefix caching across invocations — the stable system prompt prefix hits the cache on repeat calls.
Issue multiple Read/Glob/Grep calls in a single response instead of bundling into one Bash command. Same per-turn cost, avoids shell escaping, and lets the model inspect intermediate results. When a file's location is uncertain, fire all plausible Globs in parallel — never sequentially.
PreToolUse hooks can filter large outputs before they enter context. A hook on Bash that filters test output to show only failures reduces context by 20-50x. Filtering at the tool boundary means Claude never sees noise — hook cost is low-level, not model-context cost.
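A sketch of such a filter, assuming pytest-style output is piped through it; the failure markers are assumptions, and the hook wiring itself (which settings file, which matcher syntax) is omitted because it varies by Claude Code version:

```python
#!/usr/bin/env python3
"""Forward only failure-related lines so passing-test noise never enters context."""
import sys

# Markers assumed for pytest-style output; adjust for your test runner.
KEEP_MARKERS = ("FAILED", "ERROR", "Traceback", "assert")

kept = [line for line in sys.stdin if any(marker in line for marker in KEEP_MARKERS)]
sys.stdout.writelines(kept)
sys.exit(0)  # filtering only -- don't mask the test command's own exit status
```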
When MCP tool descriptions exceed 10% of context, Claude auto-defers them. Set `ENABLE_TOOL_SEARCH=auto:5` to trigger deferral at 5% for early relief. Deferred tools only load when actually referenced.
`gh`, `aws`, `gcloud` don't consume persistent context like MCP tool definitions do. MCP tools load their definitions into every message even when idle. Prefer CLI tools for one-off operations.
Include test cases, expected output, or screenshots in your prompt upfront. Claude can verify its own work immediately instead of iterating. Early catch saves tokens on re-runs and compaction. Specific prompts ("Add input validation to login function in auth.ts") trigger focused exploration; vague ones ("Improve this codebase") trigger broad scanning — specificity reduces the file search/read footprint by 5-10x.
Subagents are completely isolated context windows — separate API requests, separate KV caches, no shared conversation history. Only the prompt you pass flows in; only the final message flows back. A subagent can read 50 files and the parent only sees the summary.
- Haiku: classification, filtering, simple analysis (~1/3 cost of Sonnet)
- Sonnet: coordination, synthesis, structured output
- Opus: only when the subagent needs deep reasoning
Setting `tools: Read, Grep, Glob` in a subagent definition prevents Claude from using Edit/Write/Bash. Faster execution, lower cost, clearer semantics than relying on permission prompts.
`skills: [convention-guide, api-patterns]` in subagent frontmatter injects skill content at startup. Gives domain knowledge without mid-execution discovery overhead.
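Putting the two fields together, a read-only researcher definition might look like this sketch; the name, body text, and model alias are illustrative, and the exact frontmatter schema depends on your Claude Code version:

```markdown
---
name: codebase-researcher
description: Read and summarize code relevant to a question; never edit
model: haiku
tools: Read, Grep, Glob
skills: [convention-guide, api-patterns]
---

Answer the question using only files you read. Return a short summary with file
paths and line references; do not paste whole files back to the caller.
```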
Each teammate runs its own context window. Cost scales ~7x with team size. Keep teams small (2-3 members) and scope tasks narrowly. Active teammates consume tokens even when idle — no automatic timeout.
Define `.claude/rules/frontend.md` with matching paths so API rules don't load when editing React, and vice versa. Zero cost when you're not working in their scope. Same file can target multiple path patterns. Move verbose domain knowledge from CLAUDE.md to path-scoped rules to reduce baseline token cost.
When optimizing a skill or session, work these levers in order of typical ROI:
- Plumbing extraction -- move deterministic work to scripts (10-20% cost reduction)
- Thinking control -- effort-level overrides on mechanical skills (40-60% per invocation)
- Context tier management -- push knowledge down to cheaper tiers
- Prompt compression -- stabilize the prompt on the current model
- Model selection -- fork the search per candidate model (40-60% for mechanical tasks, requires stable prompt)
- Output verbosity -- reduce unnecessary output tokens
Model selection comes after prompt stability because it's a forked search — you need a converged prompt as the starting point for each model's optimization branch.