Claude Code Session Manager — Architecture Design

Clean-sheet redesign of Claude Code session management for the Baymax agent platform. Replaces the custom tmux bridge and trigger daemon with the Claude Agent SDK + Bus integration.

Problem with Current Architecture

Lab → Bus (skill) → Trigger Daemon (port 3099) → claude -p subprocess
                                                    ↓ events
                                                  Bus (passive listener) → Lab callback

Three services sit in the critical path. The trigger daemon holds canonical state in memory, so that state is lost on restart. The Bus only listens and reacts, and Lab has to poll for status. The trigger daemon itself is a 600-line mjs file that handles subprocess management, the HTTP API, SSE streaming, worktree management, and session persistence, all of which are responsibilities that should be separated.

Key Insight: Agent SDK, Not Remote Control

"Remote Control" (claude remote-control) is a UI feature for controlling a local CLI session from claude.ai or the mobile app; it is not a programmatic API. The programmatic interface is the Claude Agent SDK (@anthropic-ai/claude-agent-sdk):

import { query } from "@anthropic-ai/claude-agent-sdk";

for await (const message of query({
  prompt: "Find and fix the bug in auth.py",
  options: {
    allowedTools: ["Read", "Edit", "Bash"],
    maxBudgetUsd: 5.00,
    maxTurns: 50,
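    // Hooks are registered as arrays of matchers, each holding async callbacks
    hooks: { Stop: [{ hooks: [async (input) => { /* ... */ return {}; }] }] },
  }
})) {
  // Streamed messages: system/init (carries session_id), assistant (may contain tool_use blocks), result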
}

The SDK handles session lifecycle, tool permissions, budget limits, and structured output natively — eliminating the need for subprocess management, terminal scraping, and custom watchdogs.
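As a rough illustration of what consuming that stream could look like, the sketch below folds a sequence of SDK-style messages into the fields a session manager would want to keep. The message shapes are a simplified, hand-written subset (an assumption, not the SDK's exported types), and `StreamMessage`, `SessionSummary`, `foldMessage`, and `foldMessages` are hypothetical names.

```typescript
// Simplified stand-ins for the messages yielded by query(). The real SDK
// messages carry more fields; these shapes are assumptions for illustration.
type StreamMessage =
  | { type: "system"; subtype: "init"; session_id: string }
  | { type: "assistant"; toolUses: string[] }
  | { type: "result"; is_error: boolean; result?: string; total_cost_usd: number; num_turns: number };

interface SessionSummary {
  agentSessionId?: string;
  toolCalls: number;
  status: "running" | "completed" | "failed";
  result?: string;
  costUsd?: number;
  turns?: number;
}

// Fold one message into the running summary. Pure, so it is easy to test.
function foldMessage(summary: SessionSummary, msg: StreamMessage): SessionSummary {
  switch (msg.type) {
    case "system":
      // The init message is where the resumable session ID first appears.
      return { ...summary, agentSessionId: msg.session_id };
    case "assistant":
      return { ...summary, toolCalls: summary.toolCalls + msg.toolUses.length };
    case "result":
      return {
        ...summary,
        status: msg.is_error ? "failed" : "completed",
        result: msg.result,
        costUsd: msg.total_cost_usd,
        turns: msg.num_turns,
      };
  }
}

function foldMessages(messages: StreamMessage[]): SessionSummary {
  const initial: SessionSummary = { toolCalls: 0, status: "running" };
  return messages.reduce(foldMessage, initial);
}

const summary = foldMessages([
  { type: "system", subtype: "init", session_id: "abc-123" },
  { type: "assistant", toolUses: ["Read", "Edit"] },
  { type: "result", is_error: false, result: "Fixed auth.py", total_cost_usd: 0.42, num_turns: 7 },
]);
console.log(summary);
```

In a real integration the same fold would run inside the `for await` loop, one message at a time, with each step also publishing an event.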

New Architecture

Embed the session manager in the Bus. The Bus already has a persistent session registry (SQLite), an event log with replay, WebSocket pub/sub, and durable workflows (OpenWorkflow).

Any caller (Lab, Hank, API)
    → Bus API: POST /sessions/spawn
        → Bus Session Manager (Agent SDK query())
            → streams events to Bus WebSocket
            → persists state to Bus SQLite
            → publishes session.* events
        → caller subscribes to events via WebSocket

Core Components

1. Session Manager (Bus service)

import { query } from "@anthropic-ai/claude-agent-sdk";

// API surface only; method bodies are omitted in this design
declare class SessionManager {
  // Spawn: creates session record, starts query(), streams results
  spawn(config: SessionConfig): Promise<Session>;

  // Send: resumes an idle session with a follow-up
  send(sessionId: string, message: string): Promise<void>;

  // Kill: aborts the running query, marks session killed
  kill(sessionId: string): Promise<void>;

  // List: reads from SQLite (survives restarts)
  list(filter?: SessionFilter): Promise<Session[]>;
}

2. Session Config (what callers provide)

interface SessionConfig {
  prompt: string;
  workDir: string;

  // Identity
  name?: string;           // human-readable session name
  owner?: string;          // agent or user who spawned it

  // Linkage
  runIdentifier?: string;  // Lab run (RUN-*) for completion callback
  parentSessionId?: string; // for session chains

  // Limits
  maxBudgetUsd?: number;   // default: 5.00
  maxTurns?: number;       // default: 50
  idleTimeoutMs?: number;  // default: 120_000

  // Agent SDK options
  allowedTools?: string[];
  permissionMode?: string;
  claudeMdContent?: string; // injected as project context

  // Callback routing
  notify?: {
    channel: "telegram" | "slack" | "bus";
    target: string;        // chat ID, channel ID, or bus topic
  };
}
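The defaults noted in the comments above have to be applied somewhere before a session record is written. The following is a minimal sketch of one way to do that. `normalizeConfig`, `MinimalSessionConfig`, and `DEFAULT_LIMITS` are hypothetical names, only a subset of `SessionConfig` is modelled, and the validation rules (non-empty prompt, absolute POSIX `workDir`, positive limits) are assumptions rather than part of the design.

```typescript
// Hypothetical helper: fills in the documented defaults and rejects
// obviously invalid configs before a session record is created.
interface SessionLimits {
  maxBudgetUsd?: number;
  maxTurns?: number;
  idleTimeoutMs?: number;
}

// Subset of SessionConfig that this sketch needs.
interface MinimalSessionConfig extends SessionLimits {
  prompt: string;
  workDir: string;
}

const DEFAULT_LIMITS: Required<SessionLimits> = {
  maxBudgetUsd: 5.0,
  maxTurns: 50,
  idleTimeoutMs: 120_000,
};

function normalizeConfig(
  config: MinimalSessionConfig,
): MinimalSessionConfig & Required<SessionLimits> {
  if (config.prompt.trim() === "") throw new Error("prompt must not be empty");
  // Assumption: sessions always run on a POSIX host with absolute paths.
  if (!config.workDir.startsWith("/")) throw new Error("workDir must be an absolute path");

  // `??` keeps an explicit 0 visible to the checks below, unlike `||`.
  const merged = {
    ...config,
    maxBudgetUsd: config.maxBudgetUsd ?? DEFAULT_LIMITS.maxBudgetUsd,
    maxTurns: config.maxTurns ?? DEFAULT_LIMITS.maxTurns,
    idleTimeoutMs: config.idleTimeoutMs ?? DEFAULT_LIMITS.idleTimeoutMs,
  };

  if (merged.maxBudgetUsd <= 0) throw new Error("maxBudgetUsd must be positive");
  if (!Number.isInteger(merged.maxTurns) || merged.maxTurns < 1) {
    throw new Error("maxTurns must be a positive integer");
  }
  return merged;
}

const normalized = normalizeConfig({ prompt: "Fix the bug", workDir: "/tmp/repo", maxBudgetUsd: 3 });
console.log(normalized.maxBudgetUsd, normalized.maxTurns, normalized.idleTimeoutMs); // prints: 3 50 120000
```

Persisting the normalized config rather than the raw request means the stored record shows the limits that were actually enforced.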

3. Session Lifecycle Events (all on Bus WebSocket)

Event	When
`session.created`	Config validated, session record created
`session.queued`	At capacity; waiting for a free slot (see Concurrency & Queueing)
`session.spawned`	Agent SDK `query()` started
`session.output`	Streaming chunks (opt-in subscription)
`session.tool_use`	Tool call started (name, args)
`session.tool_done`	Tool call completed (result summary)
`session.idle`	No activity for `idleTimeoutMs`
`session.resumed`	Follow-up message sent
`session.completed`	`query()` finished, result captured
`session.failed`	Error or budget exceeded
`session.killed`	Manually terminated
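The events above imply a small state machine behind each session. A sketch of how the allowed transitions might be encoded follows. The state names and, in particular, which transitions are legal are assumptions of this sketch; the design does not spell them out. `SessionState`, `TRANSITIONS`, `canTransition`, `transition`, and `isTerminal` are hypothetical names.

```typescript
// States implied by the lifecycle events. Output and tool events do not
// change state, so they are not modelled here.
type SessionState =
  | "created" | "queued" | "running" | "idle"
  | "completed" | "failed" | "killed";

// Assumed transition table: terminal states have no outgoing edges.
const TRANSITIONS: Record<SessionState, SessionState[]> = {
  created: ["queued", "running", "failed", "killed"],
  queued: ["running", "failed", "killed"], // "failed" covers a queue timeout
  running: ["idle", "completed", "failed", "killed"],
  idle: ["running", "completed", "failed", "killed"], // idle -> running is session.resumed
  completed: [],
  failed: [],
  killed: [],
};

function canTransition(from: SessionState, to: SessionState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move is not allowed.
function transition(from: SessionState, to: SessionState): SessionState {
  if (!canTransition(from, to)) {
    throw new Error(`illegal session transition: ${from} -> ${to}`);
  }
  return to;
}

function isTerminal(state: SessionState): boolean {
  return TRANSITIONS[state].length === 0;
}

console.log(canTransition("running", "idle"));      // true
console.log(canTransition("completed", "running")); // false
console.log(isTerminal("killed"));                  // true
```

Guarding every status write with a check like this keeps a late `result` message from overwriting a session that was already killed.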

4. Persistence (Bus SQLite)

Extend the existing Bus sessions table:

ALTER TABLE sessions ADD COLUMN agent_session_id TEXT;  -- Agent SDK session ID for resume
ALTER TABLE sessions ADD COLUMN config JSON;            -- full SessionConfig
ALTER TABLE sessions ADD COLUMN result TEXT;             -- final output
ALTER TABLE sessions ADD COLUMN cost_usd REAL;          -- actual cost
ALTER TABLE sessions ADD COLUMN turns INTEGER;          -- actual turns used
ALTER TABLE sessions ADD COLUMN error TEXT;             -- error message if failed
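SQLite's `ALTER TABLE ... ADD COLUMN` has no `IF NOT EXISTS` form, so re-running these statements on an already-migrated database fails with a duplicate-column error. One way to keep the migration idempotent is to compare against the existing columns first, sketched below. `pendingMigrations` and `NEW_COLUMNS` are hypothetical names, and the list of existing columns is assumed to come from `PRAGMA table_info(sessions)`.

```typescript
interface ColumnDef {
  name: string;
  type: string;
}

// Columns this design adds to the existing sessions table.
const NEW_COLUMNS: ColumnDef[] = [
  { name: "agent_session_id", type: "TEXT" },
  { name: "config", type: "JSON" },
  { name: "result", type: "TEXT" },
  { name: "cost_usd", type: "REAL" },
  { name: "turns", type: "INTEGER" },
  { name: "error", type: "TEXT" },
];

// Returns only the ALTER statements that still need to run.
// existingColumns would be read from PRAGMA table_info(sessions) in practice.
function pendingMigrations(existingColumns: string[]): string[] {
  // SQLite column names are case-insensitive, so compare in lower case.
  const existing = new Set(existingColumns.map((c) => c.toLowerCase()));
  return NEW_COLUMNS
    .filter((col) => !existing.has(col.name))
    .map((col) => `ALTER TABLE sessions ADD COLUMN ${col.name} ${col.type};`);
}

// A database where two of the six columns were already added.
const pending = pendingMigrations(["id", "status", "agent_session_id", "config"]);
console.log(pending);
```

Running the returned statements inside a single transaction would keep a partially applied migration from being left behind.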

5. Concurrency & Queueing

const LIMITS = {
  maxConcurrent: 4,        // total active sessions
  maxPerOwner: 2,          // per agent/user
  queueTimeout: 300_000,   // 5 min queue wait before failing
};

// When at capacity: queue with position, publish session.queued event
// When slot opens: dequeue next, publish session.spawned
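The admission logic described in the comments above can be sketched as a small in-memory class. This is an illustration under assumptions, not the Bus implementation: `AdmissionQueue` is a hypothetical name, the queue timeout is left out, and a waiting session whose owner is at its per-owner cap is skipped in favour of later ones, which is one possible policy among several.

```typescript
interface QueueLimits {
  maxConcurrent: number;
  maxPerOwner: number;
}

interface PendingSession {
  id: string;
  owner: string;
}

class AdmissionQueue {
  private limits: QueueLimits;
  private active = new Map<string, string>(); // sessionId -> owner
  private waiting: PendingSession[] = [];

  constructor(limits: QueueLimits) {
    this.limits = limits;
  }

  private activeFor(owner: string): number {
    let count = 0;
    this.active.forEach((o) => { if (o === owner) count++; });
    return count;
  }

  private canStart(owner: string): boolean {
    return this.active.size < this.limits.maxConcurrent
      && this.activeFor(owner) < this.limits.maxPerOwner;
  }

  // Returns "spawned" if a slot was free, otherwise the 1-based queue position.
  submit(id: string, owner: string): "spawned" | number {
    if (this.canStart(owner)) {
      this.active.set(id, owner);
      return "spawned";
    }
    this.waiting.push({ id, owner });
    return this.waiting.length;
  }

  // Frees a slot, then starts waiting sessions in order wherever limits allow.
  // Returns the ids that were started as a result.
  release(id: string): string[] {
    this.active.delete(id);
    const started: string[] = [];
    let i = 0;
    while (i < this.waiting.length) {
      const next = this.waiting[i];
      if (this.canStart(next.owner)) {
        this.active.set(next.id, next.owner);
        started.push(next.id);
        this.waiting.splice(i, 1); // removed, so do not advance i
      } else {
        i++;
      }
    }
    return started;
  }
}

const queue = new AdmissionQueue({ maxConcurrent: 4, maxPerOwner: 2 });
queue.submit("s1", "hank");
queue.submit("s2", "hank");
const third = queue.submit("s3", "hank");  // hank is at the per-owner cap
const fourth = queue.submit("s4", "lab");
const started = queue.release("s1");
console.log(third, fourth, started); // 1 "spawned" ["s3"]
```

In the Bus, `submit` returning a position would correspond to publishing `session.queued`, and each id returned from `release` to publishing `session.spawned`.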

6. Completion Hooks (replaces session-lifecycle handler)

manager.onComplete(async (session) => {
  // If linked to a Lab run, complete the stage
  if (session.config.runIdentifier) {
    await labApi.completeStage(session.config.runIdentifier, {
      summary: session.result,
      cost: session.costUsd,
    });
  }

  // Notify via configured channel
  if (session.config.notify) {
    await bus.publish(`notification.${session.config.notify.channel}`, {
      target: session.config.notify.target,
      message: `Session ${session.name} completed`,
    });
  }
});

What This Eliminates

Current	New
Trigger daemon (task-trigger.mjs, port 3099)	Gone — absorbed into Bus
tmux / claude -p subprocess management	Agent SDK `query()` handles it
sessions.json file persistence	Bus SQLite (already exists)
Lab sessions API (proxy to trigger)	Direct Bus API
session-lifecycle.ts (passive handler)	SessionManager completion hooks
implement-run skill (spawn wrapper)	SessionManager.spawn() directly
Worktree management in trigger daemon	Agent SDK `spawn: "worktree"` option or pre-spawn hook

What Callers Look Like

Hank (builder agent) spawning a session:

const session = await bus.post("/sessions/spawn", {
  prompt: "Implement the auth refactor per the plan...",
  workDir: "/home/sumit/projects/archie-core",
  runIdentifier: "RUN-5",
  owner: "hank",
  maxBudgetUsd: 3.00,
  notify: { channel: "telegram", target: "sumitngupta" },
});

Lab pipeline:

await bus.post("/sessions/spawn", {
  prompt: buildImplementPrompt(run, artifacts),
  workDir: resolveWorkDir(run),
  runIdentifier: run.identifier,
  owner: "lab",
});

Monitoring (any WebSocket subscriber):

bus.subscribe("session.*", (event) => {
  // Real-time session lifecycle — Mission Control, Lab UI, CLI all get same stream
});
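The `session.*` pattern above relies on wildcard topic matching. The Bus's actual matching rules are not described here, so the sketch below assumes a common convention in which `*` matches exactly one dot-separated segment. `topicMatches` is a hypothetical helper name.

```typescript
// Assumed semantics: "*" matches exactly one dot-separated segment,
// and every other segment must match literally.
function topicMatches(pattern: string, topic: string): boolean {
  const patternParts = pattern.split(".");
  const topicParts = topic.split(".");
  if (patternParts.length !== topicParts.length) return false;
  return patternParts.every((part, i) => part === "*" || part === topicParts[i]);
}

console.log(topicMatches("session.*", "session.completed"));     // true
console.log(topicMatches("session.*", "notification.telegram")); // false
console.log(topicMatches("session.*", "session.tool.done"));     // false
```

Under this convention, a subscriber interested only in terminal outcomes could register `session.completed`, `session.failed`, and `session.killed` individually instead of filtering the full stream.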

Migration Path

1. Add the Agent SDK to the Bus: npm install @anthropic-ai/claude-agent-sdk
2. Build the SessionManager class in the Bus with spawn/send/kill/list
3. Add Bus HTTP endpoints: POST /sessions/spawn, GET /sessions, etc.
4. Wire Hank's tools to call the Bus instead of the trigger daemon
5. Wire the Lab implement handler to call the Bus SessionManager
6. Move completion hooks from session-lifecycle.ts into SessionManager
7. Deprecate the trigger daemon: stop the systemd service
8. Clean up: remove the archie-core claude-code tools that proxy to the trigger daemon

The Big Win

One service (the Bus) owns the session lifecycle end-to-end. State survives restarts because it lives in SQLite. Any service can subscribe to events over WebSocket. The Agent SDK handles the Claude interaction through a typed, streaming API, replacing hand-rolled subprocess management.

Baymax Agent Platform — 2026-03-28

sumitngupta/claude-code-session-manager-design.md
