Hidden Token Costs in Claude Code: How to Audit Your Cache Efficiency
You might be burning through your Claude Code usage faster than you think. After digging into Claude Code's local stats file, I traced a major silent token sink that most users don't know about.
Claude Code stores per-project session stats in ~/.claude.json.tmp.*. This file contains token breakdowns including cache hit/miss ratios that reveal how efficiently you're using your subscription.
By cross-referencing these stats with session transcripts, I found the culprit behind anomalous sessions with poor cache performance.
Claude Code's auto-memory system writes to ~/.claude/projects/<project>/memory/MEMORY.md, which gets embedded in your prompt. The problem:
- Any session that writes a memory invalidates the cache prefix for all other active sessions on that project
- You don't control when Claude decides to write a memory — it happens silently mid-conversation
- Power users accumulate large memory files — I had 99 files across 16 projects (~94KB total)
- Every turn in every session pays for this content whether you use it or not
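To gauge that overhead on your own machine, here's a minimal sketch that totals your auto-memory files. The glob layout and the ~4-characters-per-token estimate are my assumptions, not Claude Code internals:

```python
import glob
import os

def memory_overhead(root: str = '~/.claude/projects') -> tuple[int, int]:
    """Return (file_count, total_bytes) for auto-memory markdown files.

    Assumes the memory layout described above: <root>/<project>/memory/.
    """
    pattern = os.path.join(os.path.expanduser(root), '*', 'memory', '**', '*.md')
    paths = glob.glob(pattern, recursive=True)
    return len(paths), sum(os.path.getsize(p) for p in paths)

if __name__ == '__main__':
    count, size = memory_overhead()
    # ~4 chars/token is a rough heuristic, not Claude's actual tokenizer
    print(f'{count} memory files, {size / 1024:.1f} KB (~{size // 4:,} tokens per cold read)')
```

In my case (99 files, ~94KB) that's roughly 24K tokens re-read on every cache miss.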
One session showed a cache invalidation event during a 43-minute idle gap. During that gap, a parallel Claude Code session wrote new entries to MEMORY.md. When the first session resumed, its cache prefix no longer matched — every subsequent turn was a cold read.
This is especially brutal if you run multiple terminals. One session's innocent memory write silently taxes every other active session.
```sh
ls ~/.claude.json.tmp.*
cat ~/.claude.json.tmp.* | python3 -c "
import json, sys
data = json.load(sys.stdin)
for path, info in data.get('projects', {}).items():
    reads = info.get('lastTotalCacheReadInputTokens', 0)
    creates = info.get('lastTotalCacheCreationInputTokens', 0)
    raw = info.get('lastTotalInputTokens', 0)
    total = reads + creates + raw
    if total == 0:
        continue
    hit_rate = reads / total * 100
    cost = info.get('lastCost', 0)
    print(f'{hit_rate:5.1f}% cache hit | \${cost:7.2f} | {raw:>10,} raw input | {path}')
    for model, usage in info.get('lastModelUsage', {}).items():
        m_reads = usage.get('cacheReadInputTokens', 0)
        m_raw = usage.get('inputTokens', 0)
        m_creates = usage.get('cacheCreationInputTokens', 0)
        m_total = m_reads + m_raw + m_creates
        m_rate = m_reads / m_total * 100 if m_total else 0
        print(f'  {m_rate:5.1f}% {model}: {m_raw:>10,} raw, {m_reads:>12,} cached')
" 2>/dev/null | sort -t'|' -k1 -n
```

Red flags:
- Cache hit rate below 80% on Opus
- High `cacheCreationInputTokens` with low `cacheReadInputTokens` — cache is being rebuilt, not reused
- Sessions with high cost but low output — something went wrong
Healthy session:
- 95%+ cache hit rate on Opus
- Cache reads >> cache creates
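These rules of thumb fold into a quick classifier. The thresholds below are the ones above — judgment calls from my own sessions, not official guidance:

```python
def cache_health(reads: int, creates: int, raw: int) -> str:
    """Classify a session's cache performance using the heuristics above."""
    total = reads + creates + raw
    if total == 0:
        return 'no data'
    hit_rate = reads / total
    if hit_rate >= 0.95 and reads > creates:
        return 'healthy'      # warm prefix reused on nearly every turn
    if hit_rate < 0.80 or creates > reads:
        return 'red flag'     # prefix being rebuilt instead of reused
    return 'ok'
```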
Match `lastSessionId` from the stats to the transcript at:

```
~/.claude/projects/<mangled-project-path>/<session-id>.jsonl
```
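If you want to script that lookup, here's a sketch. It assumes Claude Code mangles the project path by replacing `/` with `-` when naming the transcript directory — true on my machine, but verify on yours:

```python
import json
import os
from typing import Optional

def transcript_path(stats_file: str, project: str,
                    root: str = '~/.claude/projects') -> Optional[str]:
    """Resolve a project's last session transcript from the stats file.

    Assumes the '/' -> '-' path mangling described above (an observation,
    not documented behavior).
    """
    with open(stats_file) as f:
        data = json.load(f)
    session = data.get('projects', {}).get(project, {}).get('lastSessionId')
    if not session:
        return None
    mangled = project.replace('/', '-')
    return os.path.join(os.path.expanduser(root), mangled, f'{session}.jsonl')
```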
Add to `~/.claude/CLAUDE.md` (global, applies to all projects):

```markdown
## Auto-Memory Disabled
Do not use auto-memory. Do not create or write to memory files.
```

Then clean up existing memories:

```sh
rm -rf ~/.claude/projects/*/memory/
```

Instead of one long session that accumulates context and risks cache invalidation:
- Break work into discrete tasks
- Execute each in a focused session or subagent
- Each gets a clean, stable cache prefix
The stats file contains a feature flag: `"tengu_tool_search_unsupported_models": ["haiku"]`. Tool search is the mechanism that defers loading MCP tool schemas until they're actually needed — instead of stuffing every tool definition into the prompt upfront.
Haiku doesn't get this optimization. If you have MCP servers with many tools registered at user scope, every Haiku subagent pays for the full tool schemas in its context. Scope tool-heavy MCP servers to specific projects if only those projects need them.
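One way to do that scoping is a project-level `.mcp.json` at the repo root, so the server is only registered for that project. A minimal sketch — the server name, package, and command are placeholders, not a real registration:

```json
{
  "mcpServers": {
    "tool-heavy-server": {
      "command": "npx",
      "args": ["-y", "@example/mcp-server"]
    }
  }
}
```

Removing the same server from your user-scope config keeps its schemas out of every other project's (and every Haiku subagent's) context.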
These are less impactful than auto-memory but worth knowing:
- Switching models mid-session (`/model`) — different cache namespace
- Toggling thinking mode — may change prompt shape
- Resuming a session after 1+ hours — cache TTL expires
- Editing settings/MCP configs mid-session — anything loaded into the prompt prefix (settings.json, MCP server registrations, CLAUDE.md) will invalidate the cache if changed while a session is active
- `/clear` — same as starting a new session (cache warms up from scratch). Not inherently bad, just don't do it unnecessarily early in a session
Check ~/.claude.json.tmp.* for your cache hit rates. If they're bad, the most likely cause is auto-memory silently rewriting your prompt prefix — especially if you run parallel sessions. Disable it, delete the memory files, and your cache efficiency should improve immediately.