@turlockmike
Last active March 23, 2026

Optimization Playbook

Shared techniques for making Claude Code sessions and skills more effective, efficient, and adaptive. Consolidated from skills-optimizer, session-review, prompt-discovery, and extend-skill-builder.


1. The Viability Model

A system is viable when it sustains three properties simultaneously:

| Dimension | Question | Failure mode |
|---|---|---|
| Effective | Is it achieving the right goals? | Doing the wrong thing (misunderstood intent, wrong approach) |
| Efficient | Is it consuming only what's necessary? | Doing the right thing expensively (token waste, wrong tools) |
| Adaptive | Is it learning and evolving? | Learning the wrong lessons: overfitting to noise, contradictory rules, ossifying on stale patterns |

Remove any leg and the system degrades. Effective but inefficient bankrupts you. Efficient but non-adaptive ossifies. Adaptive but ineffective optimizes the wrong thing.

The Plan-Act-Reflect loop sustains all three: Plan sets intention (effectiveness), Act executes through minimal-overhead tools (efficiency), Reflect evaluates and improves (adaptivity).


2. Intelligence vs. Plumbing

The single highest-ROI optimization. Before touching any prompt, audit for leaked plumbing.

Scripts are plumbing: fetch, transform, validate, count, parse, compare, extract, format. Deterministic. Zero token cost at runtime.

Prompts are intelligence: interpret, classify with ambiguity, create, decide with incomplete info, evaluate quality, synthesize meaning. Token-expensive but necessary.

The test: Could a domain expert define the complete mapping from inputs to outputs in advance? Two questions:

  1. Is the correct output fully determined by the input? (no external judgment needed)
  2. Could you enumerate the rules exhaustively? (not "would it be tedious" but "is it possible in principle")

Both yes → plumbing. Either no → intelligence.

This handles edge cases well: regex-for-meaning fails not because it's nondeterministic, but because you can't enumerate the mapping — natural language meaning isn't a closed set. Format templates pass because the mapping is fully specifiable even if it's complex. The secondary check "could you write a unit test?" is useful but needs sharpening: the real version is "could you write tests that cover ALL valid inputs" — for intelligence work, the answer is always no.

What to extract

  • Format templates the model is rendering that a script could produce (e.g., --format markdown flag)
  • Data transformation: parsing structured data (NDJSON, JSON, CSV) and reformatting it
  • Aggregation: counting items, computing dates/ages, summarizing data
  • Inline python3 -c calls that repeat the same shape 2+ times
  • CLI wrappers: wrapping gh, jq, or curl in inline Python is a double waste

What NOT to extract

  • Anything requiring judgment, interpretation, or context-dependent decisions
  • Domain framing, "why" explanations, and examples (these enable judgment)
  • Classification of meaning (regex for meaning = worst of both worlds)

Script conventions

  • NDJSON on stdout, structured errors on stderr, exit codes 0/1/2
  • --help as a concise contract
  • Zero interactivity, no decoration on stdout
  • One script does one thing well
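These conventions can be sketched in a few lines. A hypothetical `age-report` script (the name, record fields, and transformation are illustrative, not from any real skill):

```python
import json
import sys
from datetime import date, datetime

def age_days(rec, today):
    """Pure transform: add an age_days field computed from rec['created'] (YYYY-MM-DD)."""
    created = datetime.strptime(rec["created"], "%Y-%m-%d").date()
    return {**rec, "age_days": (today - created).days}

def main():
    if "--help" in sys.argv[1:]:
        # --help is a concise contract, not a manual.
        print('usage: age-report < records.ndjson   # each line: {"id":..., "created":"YYYY-MM-DD"}')
        return 0
    try:
        for line in sys.stdin:
            if not line.strip():
                continue
            try:
                # Clean NDJSON on stdout, nothing else.
                print(json.dumps(age_days(json.loads(line), date.today())))
            except (ValueError, KeyError) as e:
                # Structured error on stderr; exit 1 for bad input.
                print(json.dumps({"error": "bad_record", "detail": str(e)}), file=sys.stderr)
                return 1
        return 0
    except Exception as e:
        # Exit 2 for internal failure.
        print(json.dumps({"error": "internal", "detail": str(e)}), file=sys.stderr)
        return 2

# A real script would end with: sys.exit(main())
```

Zero interactivity, one job, and a machine-parseable contract on both streams.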

Expected impact: 10-20% cost reduction. In tested optimizations, plumbing extraction consistently outperformed all prompt compression combined.


3. Context Budget Management

Context loads in tiers. Each tier costs more. Push knowledge to the cheapest tier that works.

| Tier | When it loads | Cost | Examples |
|---|---|---|---|
| Always | Every turn | Highest | System prompt, tool schemas, CLAUDE.md, MEMORY.md (first 200 lines), skill descriptions |
| On demand | When triggered | Medium | SKILL.md body, .claude/rules/*.md, memory topic files |
| Isolated | Never enters main context | Zero | Subagent context (only the summary returns), worktree file state |

Techniques

  • Push down tiers. If it doesn't need to be in every turn, move it out of CLAUDE.md. If it doesn't need to be in the main context, delegate to a subagent.
  • Delegate reading before analysis. If you're about to read files then analyze, delegate the reading. Parent stays lean.
  • Subagents are isolated context windows. A subagent can read 50 files and the parent only sees the summary. This is their primary value.
  • CLAUDE.md is the most expensive real estate. Keep it under 200 lines. Move domain-specific rules to skills or path-scoped rules (.claude/rules/*.md).
  • MEMORY.md is a routing index. Never inline detailed patterns. Point to topic files.
  • Skills cost 0 tokens when idle. The most efficient place to store domain expertise.

4. Prompt Optimization Techniques

Prompts are discovered, not designed

You cannot predict how an LLM will interpret instructions. The only way to know what a prompt does is to run it.

  1. Start with examples. 2-3 realistic scenarios from the user.
  2. Derive success criteria from examples (don't ask users to define them abstractly).
  3. Write the minimal seed. Role + core task + one output example if format matters. Nothing else.
  4. Test all cases. Note HOW each failure occurs.
  5. Fix one failure with the minimum addition. Rerun ALL tests.
  6. Repeat until all pass. Most converge in 1-3 rounds.
  7. Compress. Delete each instruction one at a time. Still pass? Keep it deleted.
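The compression step (7) can be mechanized. A sketch, where `run_evals` stands in for your own harness that renders the prompt and runs every binary eval:

```python
def compress(instructions, run_evals):
    """Delete one instruction at a time; keep the deletion if all evals still pass.

    instructions: list of prompt lines/rules.
    run_evals: callable taking a prompt string, returning True iff every binary eval passes.
    """
    kept = list(instructions)
    for inst in list(kept):          # iterate over a snapshot; mutate `kept`
        candidate = [i for i in kept if i is not inst]
        if run_evals("\n".join(candidate)):
            kept = candidate         # still passes without it: keep it deleted
    return kept
```

Each deletion is re-validated against the full suite, so only genuinely load-bearing instructions survive.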

Outcomes + Why > Procedures

Skills should tell the model what to achieve and why, not how step-by-step. The model already knows how to use most tools — your job is to define the goal and the reasoning behind constraints. This aligns with Anthropic's skill-builder approach: outcome-driven prompts with context on why.

"Why" explanations are load-bearing. Overview sections and domain framing in skills aren't filler — they enable the model to exercise judgment on edge cases it hasn't seen. A model that knows "we do X because of compliance requirement Y" can generalize to novel situations. A model that only knows "do X" can't. When optimizing for efficiency, compress procedures first and protect the "why."

Skills that survived optimization consistently kept three things: outcomes (what success looks like), why (reasoning behind constraints), and guardrails (non-obvious gotchas discovered through test failures). Everything else was removable.

Good prompt mutations (efficiency)

  • Remove instructions that teach the model things it already knows (e.g., how to use gh, curl, jq)
  • Replace verbose step-by-step procedures with outcome statements (keep the "why", cut the "how")
  • Remove defensive padding ("Make sure to...", "Remember to...", "It is important that...")
  • Compress repetitive instructions into a single rule
  • Remove examples the model doesn't need to produce correct output

Bad prompt mutations

  • Rewriting the entire skill from scratch
  • Adding 10 new rules at once
  • Adding vague instructions ("be more creative", "handle appropriately")
  • Removing domain framing, overview sections, or "why" explanations -- as noted above, these are load-bearing, not filler; they enable judgment on edge cases. Compress procedures, protect the "why."
  • Compressing format templates (confirmed load-bearing across multiple skills -- the model needs the full example to reproduce the format)

Language patterns that work

  • RFC modal verbs: MUST, SHOULD, MAY, MUST NOT (activates specification-class training data)
  • Numbered lists over prose for multi-step instructions
  • Imperative voice: "Read the file" not "You should read the file"
  • Specificity over abstraction: exact paths, exact formats, exact names
  • Positives over negatives: "Always use snake_case" not "Don't use camelCase" (negation is weaker than assertion -- "don't do X" makes X the most salient token)

Anti-patterns

| Pattern | Problem |
|---|---|
| Wall of text | A 500-line prompt written before any testing. Contains contradictions and instructions the model would follow anyway. |
| Defensive prompting | "Make sure to...", "Don't forget to..." signal anxiety, not clarity. |
| Cargo-culting | Copying patterns that worked elsewhere without testing whether they're needed here. |
| Explaining the model to itself | "You are an LLM that processes text..." The model knows. Say what to DO. |
| Premature abstraction | Handling 10 scenarios when you've encountered 2. Handle the 2. |
| Success criteria bloat | Too many criteria program the exact path, eliminating the model's ability to find creative solutions. |

5. Model Selection (Forked Search)

Model selection is a forked search, not a single swap. A single search can't explore model+prompt combinations: bundling them loses isolation, and sequencing them hits local maxima, where an Opus-optimized prompt fails on Sonnet not because Sonnet can't do the task, but because the prompt was tuned for a different model.

  1. Optimize prompt fully on current model first. Get it stable and converged.
  2. Fork per candidate model. Run independent optimization loops — each model gets its own prompt adaptations.
  3. Bail early if baseline pass rate < 70% on a candidate model.
  4. Compare converged results across forks. Best cost/effectiveness tradeoff wins. Compare WHICH evals fail, not just how many — failure mode shifts matter.
  5. Haiku warning: Qualitatively different failure modes — hallucinated tool names, skipped multi-step reasoning, missed conditional logic. It's not "cheaper Sonnet."
  6. Ship it: Add model: <model> to SKILL.md frontmatter. Per-turn routing -- only skill invocations use the cheaper model.
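The forked search can be sketched as a loop, assuming you already have an `optimize` routine (the discovery loop from section 4) and a `pass_rate` harness; both names are placeholders for your own tooling:

```python
def forked_model_search(base_prompt, models, optimize, pass_rate, bail_below=0.70):
    """Run an independent optimization fork per candidate model.

    optimize(prompt, model)  -> converged prompt for that model.
    pass_rate(prompt, model) -> fraction of binary evals passing.
    Returns {model: (converged_prompt, final_pass_rate)} for candidates
    that cleared the bail threshold.
    """
    results = {}
    for model in models:
        if pass_rate(base_prompt, model) < bail_below:
            continue  # bail early: this candidate can't run the baseline prompt at all
        tuned = optimize(base_prompt, model)  # each fork gets its own prompt adaptations
        results[model] = (tuned, pass_rate(tuned, model))
    return results
```

The final comparison across `results` is still manual: look at which evals fail per fork, not just the counts.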

6. Measurement Discipline

Binary evals only

No scales, no vibes. Every check is pass or fail. Scales compound variability. 3-6 evals is the sweet spot -- more than that and the skill starts gaming the checklist.
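A minimal shape for binary evals, with hypothetical checks for an imagined changelog-writing skill; each check is a strict True/False predicate, never a score:

```python
import json

def _is_json(out):
    try:
        json.loads(out)
        return True
    except ValueError:
        return False

def run_binary_evals(output, checks):
    """Each check is a named predicate returning strictly True/False; no partial credit."""
    return {name: bool(check(output)) for name, check in checks.items()}

# Hypothetical checks; 3-6 of these is the sweet spot.
CHECKS = {
    "valid_json": _is_json,
    "has_summary": lambda out: "summary" in out,
    "under_2000_chars": lambda out: len(out) < 2000,
}
```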

One hypothesis at a time

Never bundle unrelated changes. If the result regresses, you can't isolate which change caused it. A hypothesis can involve coupled changes (model + prompt are a single unit), but plumbing extraction + a behavioral fix are two separate experiments.

Effectiveness > efficiency

Never trade pass rate for tokens. Pass rate drop > 5% = automatic discard. A skill that's correct and expensive is better than one that's cheap and wrong.

Quantify, don't qualify

Never characterize waste as "minor", "small", or "acceptable." Count the wasted calls. State the token cost. Propose the fix. Let the user decide severity.

| Waste type | How to report |
|---|---|
| 1-2 wasted tool calls | Count them, name them, propose the fix |
| 3+ wasted tool calls | Flag prominently with approximate token cost |
| Wasted conversation turn | Highest severity: user time is the scarcest resource |
| Redundant file reads | Count occurrences, estimate tokens wasted |
| Wrong tool for the job | Name the correct tool, count occurrences |
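Counting rather than characterizing can itself be plumbing. A sketch that tallies redundant Read calls from a tool-call log; the log format and the per-read token estimate are assumptions, not measurements:

```python
from collections import Counter

def redundant_reads(tool_calls, tokens_per_read=1500):
    """Count repeat Read calls on the same path and estimate wasted tokens.

    tool_calls: iterable of (tool_name, argument) pairs, e.g. ("Read", "src/auth.ts").
    tokens_per_read: rough placeholder cost per file read.
    """
    reads = Counter(arg for tool, arg in tool_calls if tool == "Read")
    wasted = {path: n - 1 for path, n in reads.items() if n > 1}
    return {
        "redundant_reads": sum(wasted.values()),
        "by_path": wasted,
        "approx_tokens_wasted": sum(wasted.values()) * tokens_per_read,
    }
```

The output is a count, a list of offenders, and a token estimate: exactly what the table above asks you to report.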

7. Where Knowledge Belongs

Route learnings to the cheapest, most appropriate layer.

| Priority | Layer | Cost | Use when |
|---|---|---|---|
| 1 | Script | Zero (deterministic) | Repeatable plumbing, data transformation |
| 2 | Skill | Zero when idle | Reusable workflow, domain expertise (3+ occurrences) |
| 3 | Hook | Zero (OS-level) | A rule that keeps being violated despite advisory text |
| 4 | Path-scoped rule | Zero when out of scope | Rules specific to file types or directories |
| 5 | CLAUDE.md | High (always loaded) | Cross-cutting rules for every session |
| 6 | Memory topic file | Medium (on demand) | Facts, preferences, patterns that don't fit elsewhere |
| 7 | MEMORY.md | High (always loaded) | Routing pointers only, never detailed patterns |

Routing questions:

| Question | If yes |
|---|---|
| Deterministic operation? | Script |
| Applies to every session? | CLAUDE.md |
| Specific to a domain with an existing skill? | Skill update |
| A rule that keeps being violated? | Hook |
| Specific to certain file types? | Path-scoped rule |
| Everything else? | Memory topic file |

8. Complexity Management

Every rule, skill, hook, and memory entry increases system complexity. A system that only accumulates exceeds its own capacity to regulate itself.

Four mechanisms for reducing complexity

  1. Subsumption. A new learning generalizes over existing rules. Three specific rules become one principle. The count goes down.
  2. Delegation. Moving knowledge into a skill hides complexity from the central system. CLAUDE.md doesn't need details, just when to invoke.
  3. Pruning. Rules the model follows without needing the instruction get removed. Memory encoded into skills gets deleted. Hooks guarding solved problems get retired.
  4. Composition. One rule handling a class of problems beats ten rules handling ten cases. Prefer general principles over specific instructions.

The healthiest optimization output may be a net reduction in system complexity.


9. Root-Cause Analysis

For every finding, ask "why?" at least twice:

  • Surface: "Ran a nonexistent command" -> Why? "Guessed instead of reading docs" -> Why? "Skill docs reference a command that was never built" -> Fix: update the skill docs
  • If the root cause is a broken tool, wrong docs, or missing automation, the fix is code/config -- not a note about being more careful.

Root-cause convergence

After cataloging findings, ask: "Do any share a root cause?" Group by shared upstream cause. If N symptoms trace to one cause, generate one fix -- not N. Fewer, higher-leverage fixes beat many scattered patches.
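Convergence is a grouping operation. A sketch, assuming each finding has already been traced to a root cause by asking "why?" at least twice:

```python
from collections import defaultdict

def converge(findings):
    """Group findings by shared root cause: N symptoms with one cause get one fix, not N.

    findings: list of dicts like {"symptom": ..., "root_cause": ...}.
    Returns {root_cause: [symptoms]}; generate one fix per key.
    """
    by_cause = defaultdict(list)
    for f in findings:
        by_cause[f["root_cause"]].append(f["symptom"])
    return dict(by_cause)
```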


10. Thinking & Effort Management

Extended thinking consumes output-priced tokens. Control it.

  • Effort-level overrides: Set effort: low in a skill's frontmatter to suppress extended thinking for mechanical tasks (scripts, utilities, classification). Thinking tokens can be 10k-50k per request; suppressing them halves response cost.
  • Plan mode shifts cost: Plan mode uses read-only exploration without extended thinking charges. Shifts thinking cost from exploration (cheap) to implementation (where it matters).
  • Skill-level effort overrides session effort for that invocation only. Mechanical skills should almost always be effort: low.
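A hypothetical SKILL.md frontmatter for a mechanical skill, using the `effort` field described above (the skill name and description are invented for illustration):

```yaml
---
name: changelog-formatter        # hypothetical mechanical skill
description: Format merged PR titles into a changelog entry
effort: low                      # suppress extended thinking for this invocation only
---
```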

11. Session & Compaction Management

Context clearing

  • /clear between unrelated tasks. Stale context from previous work wastes tokens on every subsequent message.
  • /rename before clearing to save the session, then /resume later if needed.
  • Particularly effective after context-heavy work (large PRs) followed by quick tasks.

Compaction control

  • Custom compaction instructions: /compact Focus on code samples and API usage tells Claude what to preserve during summarization. Add a # Compact instructions section to CLAUDE.md to make it permanent.
  • Early compaction: CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=50 triggers compaction at 50% instead of 95% for proactive management.
  • Subagent compaction is independent. Subagent compaction doesn't affect the main conversation. Long-running researcher subagents can explore freely without dragging down your context.

Prefix caching

CLAUDE.md is cached after first load and reused across messages in the same session. For multi-turn conversations, the amortized cost of instruction loading approaches zero. Subagents of the same type benefit from prefix caching across invocations — the stable system prompt prefix hits the cache on repeat calls.


12. Tool & Exploration Patterns

Parallel tool calls

Issue multiple Read/Glob/Grep calls in a single response instead of bundling into one Bash command. Same per-turn cost, avoids shell escaping, and lets the model inspect intermediate results. When a file's location is uncertain, fire all plausible Globs in parallel — never sequentially.

Hooks as preprocessors

PreToolUse hooks can filter large outputs before they enter context. A hook on Bash that filters test output to show only failures reduces context by 20-50x. Filtering at the tool boundary means Claude never sees the noise: the hook runs as an ordinary OS process, so the filtering itself costs no model context.
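A sketch of such a filter, assuming a hook wiring where the tool's raw output arrives on stdin and the hook's stdout replaces it; the actual hook protocol and matcher configuration live in your settings and aren't shown here:

```python
#!/usr/bin/env python3
"""Sketch of a test-output filter for a Bash hook (hook wiring is assumed, not shown)."""
import sys

def keep_failures(lines, markers=("FAIL", "ERROR", "✗")):
    """Drop passing-test noise; keep only lines that indicate a failure."""
    return [ln for ln in lines if any(m in ln for m in markers)]

if __name__ == "__main__" and not sys.stdin.isatty():
    for line in keep_failures(sys.stdin.readlines()):
        print(line, end="")
```

A 2000-line test run with three failures collapses to three lines before the model ever sees it.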

Tool search thresholds

When MCP tool descriptions exceed 10% of context, Claude auto-defers them. Set ENABLE_TOOL_SEARCH=auto:5 to trigger deferral at 5% for earlier relief. Deferred tools load only when actually referenced.

CLI tools over MCP servers

gh, aws, gcloud don't consume persistent context like MCP tool definitions do. MCP tools load their definitions into every message even when idle. Prefer CLI tools for one-off operations.

Verification targets

Include test cases, expected output, or screenshots in your prompt upfront. Claude can verify its own work immediately instead of iterating; catching errors early saves tokens on re-runs and compaction.

Specific prompts ("Add input validation to the login function in auth.ts") trigger focused exploration; vague ones ("Improve this codebase") trigger broad scanning. Specificity reduces the file search/read footprint by 5-10x.


13. Subagent Patterns

Context isolation

Subagents are completely isolated context windows — separate API requests, separate KV caches, no shared conversation history. Only the prompt you pass flows in; only the final message flows back. A subagent can read 50 files and the parent only sees the summary.

Model selection per subagent

  • Haiku: classification, filtering, simple analysis (~1/3 cost of Sonnet)
  • Sonnet: coordination, synthesis, structured output
  • Opus: only when the subagent needs deep reasoning

Tool restriction

tools: Read, Grep, Glob in subagent definition prevents Claude from using Edit/Write/Bash. Faster execution, lower cost, clearer semantics than relying on permission prompts.

Preload skills via frontmatter

skills: [convention-guide, api-patterns] in subagent frontmatter injects skill content at startup. Gives domain knowledge without mid-execution discovery overhead.

Agent teams cost scaling

Each teammate runs its own context window. Cost scales ~7x with team size. Keep teams small (2-3 members) and scope tasks narrowly. Active teammates consume tokens even when idle — no automatic timeout.


14. Path-Scoped Rules

Define rules like .claude/rules/frontend.md with matching path patterns so that backend API rules don't load when you're editing React code, and vice versa. They cost zero when you're not working in their scope. The same file can target multiple path patterns. Move verbose domain knowledge from CLAUDE.md into path-scoped rules to reduce baseline token cost.


15. The Optimization Order

When optimizing a skill or session, work these levers in order of typical ROI:

  1. Plumbing extraction -- move deterministic work to scripts (10-20% cost reduction)
  2. Thinking control -- effort-level overrides on mechanical skills (40-60% per invocation)
  3. Context tier management -- push knowledge down to cheaper tiers
  4. Prompt compression -- stabilize the prompt on the current model
  5. Model selection -- fork the search per candidate model (40-60% for mechanical tasks, requires stable prompt)
  6. Output verbosity -- reduce unnecessary output tokens

Model selection comes after prompt stability because it's a forked search — you need a converged prompt as the starting point for each model's optimization branch.
