Harness Comparison Matrix

Date: 2025-12-06
Last Audit: 2025-12-08
Purpose: Compare AI coding agent harnesses for Squad integration requirements
Status: 2 of N harnesses analyzed (Claude Code, Codex CLI)


Self-Update Protocol

INSTRUCTION TO CLAUDE: When you verify your own capabilities or discover changes to your harness, you MUST update this document. This is a living comparison matrix, not a static snapshot.

When to Update This Document

  1. After any capability audit — If you verify claims in claude-code-vs-squad-harness-user-stories.md
  2. When you discover new tools — Check if they affect any rating in this matrix
  3. When you hit a limitation — Document it as evidence
  4. When you successfully use a capability — Update if rating was too low

How to Update

  1. Change the Last Audit date in the header
  2. Update the relevant rating in the comparison tables
  3. Add an entry to the Changelog at the bottom
  4. Cite evidence — tool name, parameter, or test result

Evidence Requirements

Each rating change must include:

  • Tool/Parameter: What capability enables this?
  • Test Result: Did you verify it works?
  • Rating Change: Old → New with justification
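The evidence requirements above could be captured as a small record type with a completeness check. This is only an illustrative sketch: `RatingChange` and `validateRatingChange` are hypothetical names, not part of any harness, and the sample entry reuses the True Parallelism change recorded in the Changelog.

```typescript
// Hypothetical record for one rating change. Field names are assumptions
// that mirror the three evidence requirements listed above.
interface RatingChange {
  capability: string;
  toolOrParameter: string; // What capability enables this?
  testResult: string;      // Did you verify it works?
  oldRating: number;       // 1-5 scale
  newRating: number;       // 1-5 scale
  justification: string;
}

// Returns a list of problems; an empty list means the entry is complete.
function validateRatingChange(change: RatingChange): string[] {
  const problems: string[] = [];
  const inScale = (n: number): boolean => Math.floor(n) === n && n >= 1 && n <= 5;

  if (change.toolOrParameter.trim() === "") problems.push("missing tool/parameter");
  if (change.testResult.trim() === "") problems.push("missing test result");
  if (change.justification.trim() === "") problems.push("missing justification");
  if (!inScale(change.oldRating) || !inScale(change.newRating)) {
    problems.push("ratings must be integers from 1 to 5");
  }
  return problems;
}

// Sample entry based on the Changelog: True Parallelism moved from 2 to 3.
const parallelism: RatingChange = {
  capability: "True parallelism",
  toolOrParameter: "run_in_background + AgentOutputTool",
  testResult: "Live test: run_in_background + AgentOutputTool works",
  oldRating: 2,
  newRating: 3,
  justification: "Verified by live test on 2025-12-08",
};

console.log(validateRatingChange(parallelism)); // prints []
```

An entry with an empty `testResult` would be rejected, which matches the "test before trusting documentation" lesson recorded in the Changelog.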

Executive Summary

| Dimension | Claude Code | Codex CLI | Factory Droid | Squad Vision |
| --- | --- | --- | --- | --- |
| Model | Claude Opus 4.5 | GPT-5-based | Claude Sonnet 4.5 | Any (harness-agnostic) |
| Subagent Spawning | ✅ Yes (Task tool) | ❌ No | ❌ No | ✅ Yes (persistent) |
| Subagent Persistence | ❌ Ephemeral (resume FAILS) | N/A | N/A | ✅ Persistent |
| True Parallelism | ✅ Background (verified) | ⚠️ Limited (multi-execute) | ⚠️ Limited (parallel tool calls) | ✅ True async |
| Memory Across Sessions | ❌ No (resume FAILS) | ❌ No (MCP optional) | ❌ No | ✅ Checkpoints |
| Context Refresh | ❌ Stale accumulation | ❌ Stale accumulation | ❌ Stale accumulation | ✅ Hot-swap zone |
| MCP Support | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes (aggregator) |
| Token Budget | ~200K | Unknown | 200K (explicit) | Managed by Context Manager |

Capability Ratings Comparison (1-5 scale)

| Capability | Claude Code | Codex CLI | Squad Target | Notes |
| --- | --- | --- | --- | --- |
| Can spawn persistent sub-agents | 2 | 1 (none) | 5 | resume tested: DOES NOT WORK |
| Sub-agents can talk to each other | 1 | 1 | 5 | No change |
| Sub-agents spawn their own agents | 1 | 1 | 5 | No change |
| True parallelism | 3 (was 2) | 3 | 5 | ✅ VERIFIED: run_in_background + AgentOutputTool works |
| Detect external changes | 2 (poll only) | 3 (poll only) | 5 (event-driven) | No change |
| Memory persists between sessions | 1 | 2 | 5 | resume tested: DOES NOT WORK |
| Context survives compaction intact | 2 | 3 | 5 | No change |

Key Insight (CORRECTED 2025-12-08): Only TRUE PARALLELISM improved (2→3). Resume does not work. Claude Code's average rating across the seven capabilities is ~1.7 (12/7). Squad value proposition strongly validated.


Context Window Analysis

Claude Code

  • Size: ~200K tokens
  • Stale Accumulation:
    • 28+ <system-reminder> tags claiming "running" for dead processes
    • File read cache (frozen content)
    • Git status snapshot (from session start)
    • Tool output accumulation
    • Conversation history (no selective forgetting)
  • Token Waste: ~2000 tokens/response on stale system-reminders (95% noise)
  • Compaction: Summary-based, loses tool execution order, failure patterns, intermediate reasoning

Codex CLI

  • Size: Unknown (reports "93% context left" after introspection)
  • Stale Accumulation:
    • Long instruction blocks (agents.md, README)
    • Tool specs (large but static)
    • Past tool outputs (bloat after acted on)
    • Repo file excerpts (stale if changed)
  • Token Waste: Moderate (tool specs dominate)
  • Compaction: "Some trimming possible; large prompts may drop older content"

Squad Vision

  • Dynamic Zone: Hot-swap ~4K tokens every turn
  • Static Zone: System prompt, CLAUDE.md, ADRs (~15K)
  • Conversation Zone: Grows until checkpoint (~180K remaining)
  • No Stale Accumulation: Context Manager polls actual state
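The zone sizes above add up as a simple budget. The sketch below works through the arithmetic, assuming a ~200K window (borrowed from the Claude Code figure; the document does not state Squad's total). The function names `conversationBudget` and `needsCheckpoint` are hypothetical.

```typescript
// Zone sizes in tokens, taken from the Squad Vision bullets above.
// WINDOW_TOKENS is an assumption borrowed from the Claude Code window size.
const WINDOW_TOKENS = 200_000;
const STATIC_ZONE = 15_000; // system prompt, CLAUDE.md, ADRs
const DYNAMIC_ZONE = 4_000; // hot-swapped every turn

// Tokens left for conversation once the fixed zones are reserved.
function conversationBudget(
  windowTokens: number,
  staticZone: number,
  dynamicZone: number,
): number {
  return windowTokens - staticZone - dynamicZone;
}

// True when the conversation zone is full and a checkpoint is due.
function needsCheckpoint(conversationTokens: number, budget: number): boolean {
  return conversationTokens >= budget;
}

const budget = conversationBudget(WINDOW_TOKENS, STATIC_ZONE, DYNAMIC_ZONE);
console.log(budget);                           // prints 181000
console.log(needsCheckpoint(120_000, budget)); // prints false
```

The result, 181,000 tokens, is consistent with the "~180K remaining" figure quoted for the conversation zone.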

Tag/Marker Comparison

Claude Code Tags

Tag Purpose Accumulates?
<system-reminder> Process status, alerts YES (major problem)
<env> Working dir, platform No (static)
<functions> Tool schemas No (static)
<function_calls> My tool invocations YES
<function_results> Tool outputs YES (major)
<examples> Few-shot examples No (static)

Codex CLI Tags

Tag Purpose Accumulates?
system/developer/user Instructions, environment YES (until trimmed)
<environment_context> cwd, sandbox mode No (static)
<INSTRUCTIONS> Repo protocols No (static)
Tool definitions JSON specs No (static)
Channels Response routing No (static)

Key Difference

  • Claude Code: Explicit XML-style tags with clear structure
  • Codex CLI: Message-based (system/developer/user) with embedded context

Tool Inventory Comparison

Shared Capabilities

| Capability | Claude Code Tool | Codex CLI Tool |
| --- | --- | --- |
| File read | Read | shell_command (cat) |
| File write | Write, Edit | apply_patch |
| File search | Glob, Grep | shell_command (find, grep) |
| Shell execution | Bash | shell_command |
| Web fetch | WebFetch | shell_command (curl) |
| Browser automation | N/A | mcp__chrome-devtools__* |
| MCP tools | mcp__rube__* | mcp__rube__* |
| Supabase | mcp__supabase__* | mcp__supabase__* |
| Memory | mcp__supermemory__* | mcp__supermemory__* |

Unique to Claude Code

| Tool | Purpose |
| --- | --- |
| Task | Spawn sub-agents (Explore, Plan, software-engineer, etc.) |
| TodoWrite | Track task progress |
| AskUserQuestion | Multi-choice user queries |
| Skill | Execute slash commands |
| EnterPlanMode | Structured planning flow |
| NotebookEdit | Jupyter notebook editing |
| BashOutput/KillShell | Background process management |

Unique to Codex CLI

| Tool | Purpose |
| --- | --- |
| update_plan | Track plan steps |
| view_image | Attach local images |
| list_mcp_resources | MCP resource discovery |
| read_mcp_resource | MCP resource reading |

Critical Difference: Subagent Spawning

  • Claude Code: Task tool can spawn 15+ specialized agents (Explore, Plan, software-engineer, qa-engineer, etc.)
  • Codex CLI: NO subagent spawning capability

Squad Adapter Requirements

Claude Code Adapter

interface ClaudeCodeAdapter {
  // Invocation
  cli: 'claude -p --output-format stream-json --mcp-config {config}'

  // Capabilities to leverage (VERIFIED 2025-12-08)
  subagents: true           // Task tool for delegation
  parallelism: 'background' // ✅ VERIFIED: run_in_background + AgentOutputTool works
  agentResume: false        // ❌ TESTED: resume parameter DOES NOT WORK

  // Limitations to work around
  staleness: 'high'         // Need Context Manager integration
  memory: 'none'            // ❌ TESTED: resume does not restore prior context
  ephemeral: true           // ❌ TESTED: Subagents fully ephemeral, no persistence

  // Unique features
  todoTracking: true        // TodoWrite for progress visibility
  planMode: true            // EnterPlanMode for complex tasks
  backgroundShells: true    // ✅ VERIFIED: BashOutput/KillShell work
}

Codex CLI Adapter

interface CodexCliAdapter {
  // Invocation
  cli: 'codex exec --json'

  // Capabilities to leverage
  subagents: false  // No spawning, Squad must manage
  parallelism: 'limited'  // RUBE_MULTI_EXECUTE_TOOL

  // Limitations to work around
  staleness: 'moderate'  // Less noisy than Claude Code
  memory: 'minimal'  // MCP memory optional
  sandboxing: true  // Must respect approval_policy

  // Unique features
  browserAutomation: true  // chrome-devtools MCP
  patchEditing: true  // apply_patch for file changes
}
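The two adapter descriptions share several fields, so the delegation decision they imply can be sketched with a common shape. Everything here is illustrative: `HarnessCapabilities`, `Delegation`, and `chooseDelegation` are hypothetical names, and the field values are copied from the two interfaces above.

```typescript
// Hypothetical common shape distilled from the two adapter descriptions above.
interface HarnessCapabilities {
  name: string;
  subagents: boolean;
  parallelism: "background" | "limited";
  staleness: "high" | "moderate";
}

const claudeCode: HarnessCapabilities = {
  name: "Claude Code",
  subagents: true,
  parallelism: "background",
  staleness: "high",
};

const codexCli: HarnessCapabilities = {
  name: "Codex CLI",
  subagents: false,
  parallelism: "limited",
  staleness: "moderate",
};

// Who performs delegation: the harness's own spawning tool, or Squad.
type Delegation = "harness-task-tool" | "squad-managed";

// If the harness cannot spawn sub-agents, Squad must manage all delegation.
function chooseDelegation(harness: HarnessCapabilities): Delegation {
  return harness.subagents ? "harness-task-tool" : "squad-managed";
}

console.log(chooseDelegation(claudeCode)); // prints harness-task-tool
console.log(chooseDelegation(codexCli));   // prints squad-managed
```

Keeping the decision in one function means a new harness only needs a capabilities record, not a new code path.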

What Squad Provides (Value Add)

For both harnesses, Squad fills these gaps:

| Gap | Current State | Squad Solution |
| --- | --- | --- |
| Persistent Agents | Ephemeral or none | Persistent sessions with memory |
| True Parallelism | Sequential or fake | Async channels with DAG execution |
| Inter-Agent Communication | Not supported | Lateral channels |
| Context Freshness | Stale accumulation | Hot-swap dynamic zone (ADR-023) |
| Memory Across Sessions | Context reset | Checkpoint system (ADR-017) |
| External Change Detection | Poll only | Parallel Monitor integration |
| Recursive Spawning | Not supported | Engineers spawn their own Scouts |

Integration Priority

Based on analysis:

  1. Claude Code (HIGH) - Has subagent spawning (can delegate), needs Squad for persistence + context management
  2. Codex CLI (MEDIUM) - No subagent spawning (Squad must manage all delegation), good MCP support
  3. Factory CLI (PENDING) - Awaiting introspection report
  4. Gemini CLI (PENDING) - Awaiting introspection report
  5. Jules CLI (PENDING) - Awaiting introspection report
  6. Cursor/Windsurf (PENDING) - GUI-based, different integration pattern

Appendix: Raw Capability Scores

Claude Code Self-Assessment (VERIFIED 2025-12-08)

Can spawn persistent sub-agents: 2 (resume TESTED - DOES NOT WORK)
Sub-agents can talk to each other: 1 (no lateral channels)
Sub-agents spawn their own agents: 1 (only Manager spawns)
True parallelism: 3 (run_in_background + AgentOutputTool) ✅ VERIFIED
Detect external changes: 2 (must poll, no events)
Memory persists between sessions: 1 (resume TESTED - DOES NOT WORK)
Context survives compaction: 2 (summary only)

Average: 1.7 (only parallelism improved)

Codex CLI Self-Assessment

Can spawn persistent sub-agents: 1 (no sub-agent tools)
Sub-agents can talk to each other: 1 (not supported)
Sub-agents spawn their own agents: 1 (not supported)
True parallelism: 3 (limited via multi-execute)
Detect external changes: 3 (must poll via shell)
Memory persists between sessions: 2 (MCP optional)
Context survives compaction: 3 (some trimming)
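The averages can be checked directly from the raw scores listed above. This short sketch only reproduces the arithmetic; the array and function names are illustrative.

```typescript
// Raw scores copied from the two self-assessments above, in listed order.
const claudeCodeScores = [2, 1, 1, 3, 2, 1, 2];
const codexCliScores = [1, 1, 1, 3, 3, 2, 3];

// Arithmetic mean of a non-empty list of ratings.
function mean(scores: number[]): number {
  return scores.reduce((sum, score) => sum + score, 0) / scores.length;
}

console.log(mean(claudeCodeScores).toFixed(1)); // prints 1.7 (12 / 7)
console.log(mean(codexCliScores).toFixed(1));   // prints 2.0 (14 / 7)
```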

Next Steps:

  1. Collect Factory CLI introspection
  2. Collect remaining harness introspections (Gemini, Jules, Cursor, Windsurf)
  3. Build adapter interfaces per harness
  4. Implement ADR-009 (Harness-Agnostic Adapters)

Document Author: Claude (Manager role, Claude Code instance)
Related: ADR-009, harness-introspection-prompt.md


Deep Dive: Agent Resume Capability

Added 2025-12-08 after user requested deeper investigation.
UPDATED 2025-12-08: Live testing shows resume DOES NOT WORK as documented.

What resume Claims To Do

From the Task tool definition:

resume: string
"Optional agent ID to resume from. If provided, the agent will
continue from the previous execution transcript."

Live Test Results (2025-12-08)

Test procedure:

  1. Spawned background agent f0501caf → returned "BACKGROUND_TEST_SUCCESS"
  2. Attempted to resume with new agent using resume: "f0501caf"
  3. Asked new agent: "What was your previous response?"

Result:

"NO_PREVIOUS_CONTEXT - This is the start of our conversation -
I have no record of previous messages or responses from earlier in this session."

Conclusion: Resume Works for TRANSCRIPT, Not CONTEXT

| Documented Behavior | Actual Behavior |
| --- | --- |
| "Continue from previous execution transcript" | ⚠️ MISLEADING: continues LOGGING to same file |
| Implies agent has memory | Agent has NO access to previous responses |

What resume actually does (verified via transcript inspection):

agent-f0501caf.jsonl contains:
  Line 1: "BACKGROUND_TEST_SUCCESS..." (original run)
  Line 2: "NO_PREVIOUS_CONTEXT..."     (resumed run, SAME FILE)

  • ✅ Both runs append to the SAME transcript file
  • ✅ Both runs use the SAME agentId
  • ❌ The resumed agent does NOT see its previous output in context

This is AUDIT TRAIL persistence, not MEMORY persistence.

The transcript is for the parent session's reference, not the subagent's context window.
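The distinction between audit-trail persistence and memory persistence can be shown with a toy model. The sketch below does not call Claude Code or read real transcript files; `TranscriptStore` and `runAgent` are hypothetical names that only mimic the behavior observed in the live test: the same agent ID appends to the same log, but each run starts with an empty context.

```typescript
// Toy model only. It mimics the observed behavior; it is not the harness's
// implementation and does not touch real transcript files.
class TranscriptStore {
  private files: Record<string, string[]> = {};

  // Append one line to the transcript for this agent ID.
  append(agentId: string, line: string): void {
    if (!this.files[agentId]) this.files[agentId] = [];
    this.files[agentId].push(line);
  }

  // Return a copy of the transcript (readable by the parent session).
  read(agentId: string): string[] {
    return (this.files[agentId] || []).slice();
  }
}

// Observed semantics: same agentId, same transcript file, fresh context.
function runAgent(
  store: TranscriptStore,
  agentId: string,
  reply: (context: string[]) => string,
): string {
  const context: string[] = []; // prior transcript lines are NOT loaded
  const output = reply(context);
  store.append(agentId, output);
  return output;
}

const store = new TranscriptStore();
runAgent(store, "f0501caf", () => "BACKGROUND_TEST_SUCCESS");
const resumed = runAgent(store, "f0501caf", (context) =>
  context.length === 0 ? "NO_PREVIOUS_CONTEXT" : context[context.length - 1],
);

console.log(resumed);                       // prints NO_PREVIOUS_CONTEXT
console.log(store.read("f0501caf").length); // prints 2
```

Memory persistence would require `runAgent` to seed `context` from `store.read(agentId)` before calling `reply`, which is the step the live test showed to be missing.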

Rating Justification (CORRECTED)

  • Rating: 2 — Subagents ARE ephemeral, resume does not provide persistence
  • Previous upgrade to 3 was based on documentation, not testing
  • Live test disproves the documented behavior

Subagents remain ephemeral:

  • No persistent memory
  • No context continuation (tested and failed: the transcript file is appended to, but the resumed agent cannot see it)
  • No lateral communication
  • No recursive spawning

Changelog

| Date | Auditor | Changes | Evidence |
| --- | --- | --- | --- |
| 2025-12-08 | Claude (Opus 4.5) | DISCOVERY: resume = transcript persistence, NOT context persistence | Transcript inspection: both runs in same file, but agent has no memory |
| 2025-12-08 | Claude (Opus 4.5) | Subagent Persistence: Confirmed ❌ Ephemeral | Live test disproved resume functionality |
| 2025-12-08 | Claude (Opus 4.5) | Memory: Confirmed rating 1 (no persistence) | Live test disproved resume functionality |
| 2025-12-08 | Claude (Opus 4.5) | True Parallelism: ✅ VERIFIED rating 3 | Live test: run_in_background + AgentOutputTool works |
| 2025-12-08 | Claude (Opus 4.5) | Background Shells: ✅ VERIFIED | Live test: BashOutput retrieved shell output |
| 2025-12-08 | Claude (Opus 4.5) | Added Self-Update Protocol section | User instruction |
| 2025-12-08 | Claude (Opus 4.5) | Added Deep Dive: Agent Resume section | User requested deeper investigation |
| 2025-12-08 | Claude (Opus 4.5) | Updated ratings based on docs (REVERTED) | Lesson: Test before trusting documentation |
| 2025-12-06 | Claude (Opus 4.5) | Initial document creation | Harness introspection exercise |