Harness Comparison Matrix

Date: 2025-12-06
Last Audit: 2025-12-08
Purpose: Compare AI coding agent harnesses for Squad integration requirements
Status: 2 of N harnesses analyzed (Claude Code, Codex CLI)

Self-Update Protocol

INSTRUCTION TO CLAUDE: When you verify your own capabilities or discover changes to your harness, you MUST update this document. This is a living comparison matrix, not a static snapshot.

When to Update This Document

After any capability audit — If you verify claims in claude-code-vs-squad-harness-user-stories.md
When you discover new tools — Check if they affect any rating in this matrix
When you hit a limitation — Document it as evidence
When you successfully use a capability — Update if rating was too low

How to Update

Change the Last Audit date in the header
Update the relevant rating in the comparison tables
Add an entry to the Changelog at the bottom
Cite evidence — tool name, parameter, or test result

Evidence Requirements

Each rating change must include:

Tool/Parameter: What capability enables this?
Test Result: Did you verify it works?
Rating Change: Old → New with justification
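The three evidence fields above can be captured as a small record with a check that rejects incomplete entries. This is a minimal sketch only: the `RatingChange` type, its field names, and `validateRatingChange` are hypothetical and are not part of any harness or of Squad.

```typescript
// Hypothetical shape for one rating-change record; field names are illustrative.
interface RatingChange {
  capability: string;      // row in the ratings table
  toolOrParameter: string; // what capability enables this
  testResult: string;      // what was observed when it was tested
  oldRating: number;       // 1-5 scale
  newRating: number;       // 1-5 scale
  justification: string;
}

// Returns a list of problems; an empty list means the record is acceptable.
function validateRatingChange(change: RatingChange): string[] {
  const problems: string[] = [];
  const inScale = (n: number): boolean => Number.isInteger(n) && n >= 1 && n <= 5;
  if (change.toolOrParameter.trim() === "") problems.push("missing tool/parameter");
  if (change.testResult.trim() === "") problems.push("missing test result");
  if (!inScale(change.oldRating) || !inScale(change.newRating)) {
    problems.push("rating outside 1-5 scale");
  }
  if (change.justification.trim() === "") problems.push("missing justification");
  return problems;
}

// Example built from the True Parallelism changelog entry.
const parallelismChange: RatingChange = {
  capability: "True parallelism",
  toolOrParameter: "run_in_background + AgentOutputTool",
  testResult: "Live test: run_in_background + AgentOutputTool works",
  oldRating: 2,
  newRating: 3,
  justification: "Verified by live test on 2025-12-08",
};

console.log(validateRatingChange(parallelismChange)); // prints []
```

A record with an empty test result is rejected, which matches the lesson recorded in the Changelog: test before trusting documentation.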

Executive Summary

Dimension	Claude Code	Codex CLI	Factory Droid	Squad Vision
Model	Claude Opus 4.5	GPT-5-based	Claude Sonnet 4.5	Any (harness-agnostic)
Subagent Spawning	✅ Yes (Task tool)	❌ No	❌ No	✅ Yes (persistent)
Subagent Persistence	❌ Ephemeral (resume FAILS)	N/A	N/A	✅ Persistent
True Parallelism	✅ Background (verified)	⚠️ Limited (multi-execute)	⚠️ Limited (parallel tool calls)	✅ True async
Memory Across Sessions	❌ No (resume FAILS)	❌ No (MCP optional)	❌ No	✅ Checkpoints
Context Refresh	❌ Stale accumulation	❌ Stale accumulation	❌ Stale accumulation	✅ Hot-swap zone
MCP Support	✅ Yes	✅ Yes	❌ No	✅ Yes (aggregator)
Token Budget	~200K	Unknown	200K (explicit)	Managed by Context Manager

Capability Ratings Comparison (1-5 scale)

Capability	Claude Code	Codex CLI	Squad Target	Notes
Can spawn persistent sub-agents	2	1 (none)	5	`resume` tested — DOES NOT WORK
Sub-agents can talk to each other	1	1	5	No change
Sub-agents spawn their own agents	1	1	5	No change
True parallelism	3 (was 2)	3	5	✅ VERIFIED: `run_in_background` + `AgentOutputTool` works
Detect external changes	2 (poll only)	3 (poll only)	5 (event-driven)	No change
Memory persists between sessions	1	2	5	`resume` tested — DOES NOT WORK
Context survives compaction intact	2	3	5	No change

Key Insight (CORRECTED 2025-12-08): Only TRUE PARALLELISM improved (2→3). Resume does not restore agent context. Claude Code's average rating across the seven capabilities is ~1.7. Squad value proposition strongly validated.
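The average can be reproduced directly from the table rows. The sketch below copies the seven ratings per harness from the table above; the `average` helper and the variable names are illustrative only.

```typescript
// Ratings copied from the Capability Ratings Comparison table (1-5 scale),
// in table order: persistent sub-agents, lateral talk, recursive spawning,
// parallelism, external change detection, session memory, compaction.
const claudeCodeRatings: number[] = [2, 1, 1, 3, 2, 1, 2];
const codexCliRatings: number[] = [1, 1, 1, 3, 3, 2, 3];

// Arithmetic mean of a non-empty list of ratings.
function average(ratings: number[]): number {
  if (ratings.length === 0) throw new Error("no ratings");
  return ratings.reduce((sum, r) => sum + r, 0) / ratings.length;
}

console.log(average(claudeCodeRatings).toFixed(1)); // prints "1.7" (12 / 7)
console.log(average(codexCliRatings).toFixed(1));   // prints "2.0" (14 / 7)
```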

Context Window Analysis

Claude Code

Size: ~200K tokens
Stale Accumulation:
- 28+ <system-reminder> tags claiming "running" for dead processes
- File read cache (frozen content)
- Git status snapshot (from session start)
- Tool output accumulation
- Conversation history (no selective forgetting)
Token Waste: ~2000 tokens/response on stale system-reminders (95% noise)
Compaction: Summary-based, loses tool execution order, failure patterns, intermediate reasoning

Codex CLI

Size: Unknown (reports "93% context left" after introspection)
Stale Accumulation:
- Long instruction blocks (agents.md, README)
- Tool specs (large but static)
- Past tool outputs (bloat after acted on)
- Repo file excerpts (stale if changed)
Token Waste: Moderate (tool specs dominate)
Compaction: "Some trimming possible; large prompts may drop older content"

Squad Vision

Dynamic Zone: Hot-swap ~4K tokens every turn
Static Zone: System prompt, CLAUDE.md, ADRs (~15K)
Conversation Zone: Grows until checkpoint (~180K remaining)
No Stale Accumulation: Context Manager polls actual state
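The zone sizes above imply a simple budget calculation. This sketch assumes a 200K-token window (borrowed from the Claude Code figure, since the Squad window size is not stated here) and a hypothetical `shouldCheckpoint` rule; neither is a confirmed Squad design.

```typescript
// Zone sizes from the Squad Vision list. The total window is an assumption.
const totalWindowTokens = 200_000; // assumed, matches the Claude Code figure
const staticZoneTokens = 15_000;   // system prompt, CLAUDE.md, ADRs
const dynamicZoneTokens = 4_000;   // hot-swapped every turn

// What is left for the conversation zone: 181,000, i.e. the "~180K remaining".
const conversationBudgetTokens =
  totalWindowTokens - staticZoneTokens - dynamicZoneTokens;

// Hypothetical rule: checkpoint once the conversation zone fills its budget.
function shouldCheckpoint(conversationTokens: number): boolean {
  return conversationTokens >= conversationBudgetTokens;
}

console.log(conversationBudgetTokens);  // prints 181000
console.log(shouldCheckpoint(100_000)); // prints false
console.log(shouldCheckpoint(181_000)); // prints true
```

Because the dynamic zone is replaced rather than appended each turn, only the conversation zone grows, so it is the only term that needs a checkpoint trigger.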

Tag/Marker Comparison

Claude Code Tags

Tag	Purpose	Accumulates?
`<system-reminder>`	Process status, alerts	YES (major problem)
`<env>`	Working dir, platform	No (static)
`<functions>`	Tool schemas	No (static)
`<function_calls>`	My tool invocations	YES
`<function_results>`	Tool outputs	YES (major)
`<examples>`	Few-shot examples	No (static)

Codex CLI Tags

Tag	Purpose	Accumulates?
`system/developer/user`	Instructions, environment	YES (until trimmed)
`<environment_context>`	cwd, sandbox mode	No (static)
`<INSTRUCTIONS>`	Repo protocols	No (static)
Tool definitions	JSON specs	No (static)
Channels	Response routing	No (static)

Key Difference

Claude Code: Explicit XML-style tags with clear structure
Codex CLI: Message-based (system/developer/user) with embedded context

Tool Inventory Comparison

Shared Capabilities

Capability	Claude Code Tool	Codex CLI Tool
File read	Read	shell_command (cat)
File write	Write, Edit	apply_patch
File search	Glob, Grep	shell_command (find, grep)
Shell execution	Bash	shell_command
Web fetch	WebFetch	shell_command (curl)
Browser automation	N/A	mcp__chrome-devtools__*
MCP tools	mcp__rube__*	mcp__rube__*
Supabase	mcp__supabase__*	mcp__supabase__*
Memory	mcp__supermemory__*	mcp__supermemory__*

Unique to Claude Code

Tool	Purpose
Task	Spawn sub-agents (Explore, Plan, software-engineer, etc.)
TodoWrite	Track task progress
AskUserQuestion	Multi-choice user queries
Skill	Execute slash commands
EnterPlanMode	Structured planning flow
NotebookEdit	Jupyter notebook editing
BashOutput/KillShell	Background process management

Unique to Codex CLI

Tool	Purpose
update_plan	Track plan steps
view_image	Attach local images
list_mcp_resources	MCP resource discovery
read_mcp_resource	MCP resource reading

Critical Difference: Subagent Spawning

Claude Code: Task tool can spawn 15+ specialized agents (Explore, Plan, software-engineer, qa-engineer, etc.)
Codex CLI: NO subagent spawning capability

Squad Adapter Requirements

Claude Code Adapter

interface ClaudeCodeAdapter {
  // Invocation
  cli: 'claude -p --output-format stream-json --mcp-config {config}'

  // Capabilities to leverage (VERIFIED 2025-12-08)
  subagents: true           // Task tool for delegation
  parallelism: 'background' // ✅ VERIFIED: run_in_background + AgentOutputTool works
  agentResume: false        // ❌ TESTED: resume parameter DOES NOT WORK

  // Limitations to work around
  staleness: 'high'         // Need Context Manager integration
  memory: 'none'            // ❌ TESTED: resume does not restore prior context
  ephemeral: true           // ❌ TESTED: Subagents fully ephemeral, no persistence

  // Unique features
  todoTracking: true        // TodoWrite for progress visibility
  planMode: true            // EnterPlanMode for complex tasks
  backgroundShells: true    // ✅ VERIFIED: BashOutput/KillShell work
}

Codex CLI Adapter

interface CodexCliAdapter {
  // Invocation
  cli: 'codex exec --json'

  // Capabilities to leverage
  subagents: false  // No spawning, Squad must manage
  parallelism: 'limited'  // RUBE_MULTI_EXECUTE_TOOL

  // Limitations to work around
  staleness: 'moderate'  // Less noisy than Claude Code
  memory: 'minimal'  // MCP memory optional
  sandboxing: true  // Must respect approval_policy

  // Unique features
  browserAutomation: true  // chrome-devtools MCP
  patchEditing: true  // apply_patch for file changes
}
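The two adapter descriptions share enough fields that Squad could read them through one common shape and decide who owns delegation. The sketch below is an assumption about how that might look: `HarnessProfile` and `delegationOwner` are hypothetical names, and only the CLI strings and flag values are taken from the interfaces above.

```typescript
type Parallelism = "background" | "limited";

// Hypothetical common subset of the two adapter interfaces above.
interface HarnessProfile {
  name: string;
  cli: string;          // invocation string, copied from the adapter
  subagents: boolean;   // can the harness spawn sub-agents itself?
  parallelism: Parallelism;
}

const claudeCode: HarnessProfile = {
  name: "Claude Code",
  cli: "claude -p --output-format stream-json --mcp-config {config}",
  subagents: true,
  parallelism: "background",
};

const codexCli: HarnessProfile = {
  name: "Codex CLI",
  cli: "codex exec --json",
  subagents: false,
  parallelism: "limited",
};

// If the harness cannot spawn sub-agents, Squad must manage all delegation.
function delegationOwner(harness: HarnessProfile): "harness" | "squad" {
  return harness.subagents ? "harness" : "squad";
}

console.log(delegationOwner(claudeCode)); // prints "harness"
console.log(delegationOwner(codexCli));   // prints "squad"
```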

What Squad Provides (Value Add)

For both harnesses, Squad fills these gaps:

Gap	Current State	Squad Solution
Persistent Agents	Ephemeral or none	Persistent sessions with memory
True Parallelism	Sequential or fake	Async channels with DAG execution
Inter-Agent Communication	Not supported	Lateral channels
Context Freshness	Stale accumulation	Hot-swap dynamic zone (ADR-023)
Memory Across Sessions	Context reset	Checkpoint system (ADR-017)
External Change Detection	Poll only	Parallel Monitor integration
Recursive Spawning	Not supported	Engineers spawn their own Scouts

Integration Priority

Based on analysis:

Claude Code (HIGH) - Has subagent spawning (can delegate), needs Squad for persistence + context management
Codex CLI (MEDIUM) - No subagent spawning (Squad must manage all delegation), good MCP support
Factory CLI (PENDING) - Awaiting introspection report
Gemini CLI (PENDING) - Awaiting introspection report
Jules CLI (PENDING) - Awaiting introspection report
Cursor/Windsurf (PENDING) - GUI-based, different integration pattern

Appendix: Raw Capability Scores

Claude Code Self-Assessment (VERIFIED 2025-12-08)

Can spawn persistent sub-agents: 2 (resume TESTED - DOES NOT WORK)
Sub-agents can talk to each other: 1 (no lateral channels)
Sub-agents spawn their own agents: 1 (only Manager spawns)
True parallelism: 3 (run_in_background + AgentOutputTool) ✅ VERIFIED
Detect external changes: 2 (must poll, no events)
Memory persists between sessions: 1 (resume TESTED - DOES NOT WORK)
Context survives compaction: 2 (summary only)

Average: 1.7 (only parallelism improved)

Codex CLI Self-Assessment

Can spawn persistent sub-agents: 1 (no sub-agent tools)
Sub-agents can talk to each other: 1 (not supported)
Sub-agents spawn their own agents: 1 (not supported)
True parallelism: 3 (limited via multi-execute)
Detect external changes: 3 (must poll via shell)
Memory persists between sessions: 2 (MCP optional)
Context survives compaction: 3 (some trimming)

Next Steps:

Collect Factory CLI introspection
Collect remaining harness introspections (Gemini, Jules, Cursor, Windsurf)
Build adapter interfaces per harness
Implement ADR-009 (Harness-Agnostic Adapters)

Document Author: Claude (Manager role, Claude Code instance)
Related: ADR-009, harness-introspection-prompt.md

Deep Dive: Agent Resume Capability

Added 2025-12-08 after user requested deeper investigation.
UPDATED 2025-12-08: Live testing shows resume DOES NOT WORK as documented.

What `resume` Claims To Do

From the Task tool definition:

resume: string
"Optional agent ID to resume from. If provided, the agent will
continue from the previous execution transcript."

Live Test Results (2025-12-08)

Test procedure:

Spawned background agent f0501caf → returned "BACKGROUND_TEST_SUCCESS"
Attempted to resume with new agent using resume: "f0501caf"
Asked new agent: "What was your previous response?"

Result:

"NO_PREVIOUS_CONTEXT - This is the start of our conversation -
I have no record of previous messages or responses from earlier in this session."

Conclusion: Resume Works for TRANSCRIPT, Not CONTEXT

Documented Behavior	Actual Behavior
"Continue from previous execution transcript"	⚠️ MISLEADING — continues LOGGING to same file
Implies agent has memory	Agent has NO access to previous responses

What resume actually does (verified via transcript inspection):

agent-f0501caf.jsonl contains:
  Line 1: "BACKGROUND_TEST_SUCCESS..." (original run)
  Line 2: "NO_PREVIOUS_CONTEXT..."     (resumed run, SAME FILE)

✅ Both runs append to the SAME transcript file
✅ Both runs use the SAME agentId
❌ The resumed agent does NOT see its previous output in context

This is AUDIT TRAIL persistence, not MEMORY persistence.

The transcript is for the parent session's reference, not the subagent's context window.
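The observed behavior can be restated as a toy model: the transcript is an append-only log keyed by agent ID, while every run starts with an empty context. This is an illustration of the test result only, not the harness implementation; `TranscriptStore` and `runAgent` are hypothetical names.

```typescript
// Toy model of the observed behavior, not the harness implementation.
class TranscriptStore {
  private files = new Map<string, string[]>();

  // Append one line to the transcript for this agent ID.
  append(agentId: string, line: string): void {
    const lines = this.files.get(agentId) ?? [];
    lines.push(line);
    this.files.set(agentId, lines);
  }

  // Return a copy of the transcript lines for this agent ID.
  read(agentId: string): string[] {
    return [...(this.files.get(agentId) ?? [])];
  }
}

// Each run gets a fresh, empty context; only the transcript file is shared.
function runAgent(
  store: TranscriptStore,
  agentId: string,
  respond: (context: string[]) => string,
): string {
  const context: string[] = []; // empty on every run, even when resuming
  const reply = respond(context);
  store.append(agentId, reply);
  return reply;
}

const store = new TranscriptStore();
runAgent(store, "f0501caf", () => "BACKGROUND_TEST_SUCCESS");
const resumed = runAgent(store, "f0501caf", (context) =>
  context.length === 0 ? "NO_PREVIOUS_CONTEXT" : context.join("\n"),
);

console.log(resumed);                       // prints "NO_PREVIOUS_CONTEXT"
console.log(store.read("f0501caf").length); // prints 2
```

The parent can read both lines from the store, but the resumed run never sees the first one.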

Rating Justification (CORRECTED)

Rating: 2 — Subagents ARE ephemeral, resume does not provide persistence
Previous upgrade to 3 was based on documentation, not testing
Live test disproves the documented behavior

Subagents remain ephemeral:

No persistent memory
No context continuation on resume (tested and failed; only the transcript file is appended to)
No lateral communication
No recursive spawning

Changelog

Date	Auditor	Changes	Evidence
2025-12-08	Claude (Opus 4.5)	DISCOVERY: `resume` = transcript persistence, NOT context persistence	Transcript inspection: both runs in same file, but agent has no memory
2025-12-08	Claude (Opus 4.5)	Subagent Persistence: Confirmed ❌ Ephemeral	Live test disproved resume functionality
2025-12-08	Claude (Opus 4.5)	Memory: Confirmed rating 1 (no persistence)	Live test disproved resume functionality
2025-12-08	Claude (Opus 4.5)	True Parallelism: ✅ VERIFIED rating 3	Live test: `run_in_background` + `AgentOutputTool` works
2025-12-08	Claude (Opus 4.5)	Background Shells: ✅ VERIFIED	Live test: `BashOutput` retrieved shell output
2025-12-08	Claude (Opus 4.5)	Added Self-Update Protocol section	User instruction
2025-12-08	Claude (Opus 4.5)	Added Deep Dive: Agent Resume section	User requested deeper investigation
2025-12-08	Claude (Opus 4.5)	~~Updated ratings based on docs~~ (REVERTED)	Lesson: Test before trusting documentation
2025-12-06	Claude (Opus 4.5)	Initial document creation	Harness introspection exercise
