
@sumitngupta
Created March 28, 2026 22:28

Claude Code Session Manager — Architecture Design

Clean-sheet redesign of Claude Code session management for the Baymax agent platform. Replaces the custom tmux bridge and trigger daemon with the Claude Agent SDK + Bus integration.

Problem with Current Architecture

Lab → Bus (skill) → Trigger Daemon (port 3099) → claude -p subprocess
                                                    ↓ events
                                                  Bus (passive listener) → Lab callback

Three services sit in the critical path. The trigger daemon holds canonical session state in memory, so that state is lost on restart. The Bus is passive and only reacts to events, and Lab has to poll for status. The trigger daemon itself is a 600-line .mjs file that handles subprocess management, the HTTP API, SSE streaming, worktree management, and session persistence, all responsibilities that should be separated.

Key Insight: Agent SDK, Not Remote Control

"Remote Control" (claude remote-control) is a UI feature for controlling local CLI from claude.ai/mobile — not a programmatic API. The actual programmatic interface is the Claude Agent SDK (@anthropic-ai/claude-agent-sdk):

import { query } from "@anthropic-ai/claude-agent-sdk";

for await (const message of query({
  prompt: "Find and fix the bug in auth.py",
  options: {
    allowedTools: ["Read", "Edit", "Bash"],
    maxBudgetUsd: 5.00,
    maxTurns: 50,
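    // Each hook event takes an array of matchers, each holding async callbacks
    hooks: { Stop: [{ hooks: [async (input) => { /* ... */ return {}; }] }] },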
  }
})) {
  // Streaming messages: init (session_id), tool_use, result
}

The SDK handles session lifecycle, tool permissions, budget limits, and structured output natively — eliminating the need for subprocess management, terminal scraping, and custom watchdogs.

New Architecture

Embed the session manager in the Bus. The Bus already has a persistent session registry (SQLite), an event log with replay, WebSocket pub/sub, and durable workflows (OpenWorkflow).

Any caller (Lab, Hank, API)
    → Bus API: POST /sessions/spawn
        → Bus Session Manager (Agent SDK query())
            → streams events to Bus WebSocket
            → persists state to Bus SQLite
            → publishes session.* events
        → caller subscribes to events via WebSocket

Core Components

1. Session Manager (Bus service)

import { query } from "@anthropic-ai/claude-agent-sdk";

class SessionManager {
  // Spawn: creates session record, starts query(), streams results
  async spawn(config: SessionConfig): Promise<Session>

  // Send: resumes an idle session with a follow-up
  async send(sessionId: string, message: string): Promise<void>

  // Kill: aborts the running query, marks session killed (publishes session.killed)
  async kill(sessionId: string): Promise<void>

  // List: reads from SQLite (survives restarts)
  async list(filter?: SessionFilter): Promise<Session[]>
}
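
One way to read these four methods is as transitions over a small set of session states. The sketch below is an assumption rather than the Bus implementation: the status names are inferred from the session.* lifecycle events in this design, and `Status`, `Action`, `TRANSITIONS`, and `applyAction` are hypothetical names introduced for illustration.

```typescript
// Hypothetical status set, inferred from the session.* events in this design.
type Status = "queued" | "running" | "idle" | "completed" | "failed" | "killed";
type Action = "start" | "goIdle" | "send" | "finish" | "fail" | "kill";

// For each status: which actions are legal, and which status they lead to.
const TRANSITIONS: Record<Status, Partial<Record<Action, Status>>> = {
  queued: { start: "running", fail: "failed", kill: "killed" },
  running: { goIdle: "idle", finish: "completed", fail: "failed", kill: "killed" },
  idle: { send: "running", fail: "failed", kill: "killed" },
  completed: {}, // terminal
  failed: {},    // terminal
  killed: {},    // terminal
};

// Returns the next status, or throws if the action is not allowed from `current`.
function applyAction(current: Status, action: Action): Status {
  const next = TRANSITIONS[current][action];
  if (next === undefined) {
    throw new Error(`Cannot ${action} a session that is ${current}`);
  }
  return next;
}
```

With a guard like this, `send()` can reject follow-ups to sessions that are not idle, and `kill()` on an already finished session fails loudly instead of rewriting history.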

2. Session Config (what callers provide)

interface SessionConfig {
  prompt: string;
  workDir: string;

  // Identity
  name?: string;           // human-readable session name
  owner?: string;          // agent or user who spawned it

  // Linkage
  runIdentifier?: string;  // Lab run (RUN-*) for completion callback
  parentSessionId?: string; // for session chains

  // Limits
  maxBudgetUsd?: number;   // default: 5.00
  maxTurns?: number;       // default: 50
  idleTimeoutMs?: number;  // default: 120_000

  // Agent SDK options
  allowedTools?: string[];
  permissionMode?: string;
  claudeMdContent?: string; // injected as project context

  // Callback routing
  notify?: {
    channel: "telegram" | "slack" | "bus";
    target: string;        // chat ID, channel ID, or bus topic
  };
}
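
The defaults noted in the comments above could be applied in one place before a session is queued. This is a minimal sketch under that assumption: `LimitConfig`, `DEFAULT_LIMITS`, and `resolveLimits` are hypothetical names, and the positive-number check is an added suggestion, not something the design specifies.

```typescript
// Subset of SessionConfig that covers limits; field names follow the interface above.
interface LimitConfig {
  maxBudgetUsd?: number;
  maxTurns?: number;
  idleTimeoutMs?: number;
}

// Default values taken from the comments in SessionConfig.
const DEFAULT_LIMITS: Required<LimitConfig> = {
  maxBudgetUsd: 5.0,
  maxTurns: 50,
  idleTimeoutMs: 120_000,
};

// Hypothetical helper: fills in missing limits and rejects non-positive values.
function resolveLimits(config: LimitConfig): Required<LimitConfig> {
  const resolved: Required<LimitConfig> = {
    maxBudgetUsd: config.maxBudgetUsd ?? DEFAULT_LIMITS.maxBudgetUsd,
    maxTurns: config.maxTurns ?? DEFAULT_LIMITS.maxTurns,
    idleTimeoutMs: config.idleTimeoutMs ?? DEFAULT_LIMITS.idleTimeoutMs,
  };
  const keys = ["maxBudgetUsd", "maxTurns", "idleTimeoutMs"] as const;
  for (const key of keys) {
    const value = resolved[key];
    if (!Number.isFinite(value) || value <= 0) {
      throw new Error(`${key} must be a positive number`);
    }
  }
  return resolved;
}
```

Storing the resolved values (rather than the caller's partial config) means a later change to the defaults does not silently alter sessions that are already recorded.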

3. Session Lifecycle Events (all on Bus WebSocket)

| Event | When |
| --- | --- |
| session.created | Config validated, queued |
| session.spawned | Agent SDK query() started |
| session.output | Streaming chunks (opt-in subscription) |
| session.tool_use | Tool call started (name, args) |
| session.tool_done | Tool call completed (result summary) |
| session.idle | No activity for idleTimeoutMs |
| session.resumed | Follow-up message sent |
| session.completed | query() finished, result captured |
| session.failed | Error or budget exceeded |
| session.killed | Manually terminated |
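
To make the event list concrete, here is a sketch of how streamed messages might be translated into Bus events. The `StreamMessage` shapes are simplified stand-ins invented for this example and are not the Agent SDK's actual message types; `SessionEvent` and `toSessionEvent` are likewise hypothetical.

```typescript
// Simplified stand-ins for streamed messages. NOT the SDK's real type definitions.
type StreamMessage =
  | { kind: "init"; sessionId: string }
  | { kind: "text"; text: string }
  | { kind: "tool_use"; name: string; args: unknown }
  | { kind: "tool_result"; name: string; summary: string }
  | { kind: "result"; ok: boolean; output: string };

interface SessionEvent {
  topic: string;
  payload: Record<string, unknown>;
}

// Maps one streamed message to the lifecycle event the Bus would publish.
function toSessionEvent(id: string, msg: StreamMessage): SessionEvent {
  switch (msg.kind) {
    case "init":
      // Keep the agent-side session ID so the session can be resumed later.
      return { topic: "session.spawned", payload: { id, agentSessionId: msg.sessionId } };
    case "text":
      return { topic: "session.output", payload: { id, chunk: msg.text } };
    case "tool_use":
      return { topic: "session.tool_use", payload: { id, name: msg.name, args: msg.args } };
    case "tool_result":
      return { topic: "session.tool_done", payload: { id, name: msg.name, summary: msg.summary } };
    case "result":
      return msg.ok
        ? { topic: "session.completed", payload: { id, result: msg.output } }
        : { topic: "session.failed", payload: { id, error: msg.output } };
  }
}
```

The events session.created, session.idle, session.resumed, and session.killed do not come from the message stream; in this reading they would be published by the manager itself (on validation, on a timer, on send, and on kill).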

4. Persistence (Bus SQLite)

Extend the existing Bus sessions table:

ALTER TABLE sessions ADD COLUMN agent_session_id TEXT;  -- Agent SDK session ID for resume
ALTER TABLE sessions ADD COLUMN config JSON;            -- full SessionConfig
ALTER TABLE sessions ADD COLUMN result TEXT;             -- final output
ALTER TABLE sessions ADD COLUMN cost_usd REAL;          -- actual cost
ALTER TABLE sessions ADD COLUMN turns INTEGER;          -- actual turns used
ALTER TABLE sessions ADD COLUMN error TEXT;             -- error message if failed
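
A sketch of how a finished session could be written into these columns follows. SQLite stores JSON as ordinary text, so the config is serialized with `JSON.stringify`. The `id` key column, the `FinishedSession` shape, and the helpers `toRowPatch` and `buildUpdate` are assumptions made for illustration; the statement is only built here, not executed against a database.

```typescript
// Hypothetical in-memory shape of a finished session.
interface FinishedSession {
  agentSessionId: string | null;
  config: Record<string, unknown>;
  result: string | null;
  costUsd: number | null;
  turns: number | null;
  error: string | null;
}

// Column names match the ALTER TABLE statements above.
interface SessionRowPatch {
  agent_session_id: string | null;
  config: string; // JSON text
  result: string | null;
  cost_usd: number | null;
  turns: number | null;
  error: string | null;
}

function toRowPatch(s: FinishedSession): SessionRowPatch {
  return {
    agent_session_id: s.agentSessionId,
    config: JSON.stringify(s.config),
    result: s.result,
    cost_usd: s.costUsd,
    turns: s.turns,
    error: s.error,
  };
}

// Builds a parameterized UPDATE. Assumes the sessions table is keyed by `id`.
function buildUpdate(id: string, patch: SessionRowPatch): { sql: string; params: unknown[] } {
  const columns = Object.keys(patch) as (keyof SessionRowPatch)[];
  const assignments = columns.map((c) => `${c} = ?`).join(", ");
  const values: unknown[] = columns.map((c) => patch[c]);
  return {
    sql: `UPDATE sessions SET ${assignments} WHERE id = ?`,
    params: values.concat([id]),
  };
}
```

Using placeholders keeps prompt text and error messages out of the SQL string, which matters because both can contain arbitrary characters.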

5. Concurrency & Queueing

const LIMITS = {
  maxConcurrent: 4,        // total active sessions
  maxPerOwner: 2,          // per agent/user
  queueTimeout: 300_000,   // 5 min queue wait before failing
};

// When at capacity: queue with position, publish session.queued event
// When slot opens: dequeue next, publish session.spawned
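
The queueing rules above can be sketched as a small admission queue. This is one possible reading, not the Bus implementation: `AdmissionQueue` and its method names are hypothetical, time is passed in explicitly so the logic stays testable, and the choice to let a queued session skip ahead of one whose owner is at its cap is an assumption.

```typescript
const LIMITS = { maxConcurrent: 4, maxPerOwner: 2, queueTimeout: 300_000 };

interface Pending { id: string; owner: string; enqueuedAt: number; }

class AdmissionQueue {
  private active = new Map<string, string>(); // session id -> owner
  private waiting: Pending[] = [];

  private canRun(owner: string): boolean {
    let ownerCount = 0;
    this.active.forEach((o) => { if (o === owner) ownerCount++; });
    return this.active.size < LIMITS.maxConcurrent && ownerCount < LIMITS.maxPerOwner;
  }

  // Returns "spawned" if a slot was free, otherwise the 1-based queue position.
  submit(id: string, owner: string, now: number): "spawned" | number {
    if (this.canRun(owner)) {
      this.active.set(id, owner);
      return "spawned";
    }
    this.waiting.push({ id, owner, enqueuedAt: now });
    return this.waiting.length;
  }

  // Frees a slot and returns the ids of queued sessions that can now start.
  // Entries whose owner is still at its cap are skipped, not blocked on.
  release(id: string): string[] {
    this.active.delete(id);
    const started: string[] = [];
    const stillWaiting: Pending[] = [];
    for (const p of this.waiting) {
      if (this.canRun(p.owner)) {
        this.active.set(p.id, p.owner);
        started.push(p.id);
      } else {
        stillWaiting.push(p);
      }
    }
    this.waiting = stillWaiting;
    return started;
  }

  // Drops entries that have waited queueTimeout or longer; returns their ids.
  expire(now: number): string[] {
    const timedOut = (p: Pending) => now - p.enqueuedAt >= LIMITS.queueTimeout;
    const expired = this.waiting.filter(timedOut).map((p) => p.id);
    this.waiting = this.waiting.filter((p) => !timedOut(p));
    return expired;
  }
}
```

In this sketch the caller would publish session.queued when `submit` returns a position, session.spawned for each id returned by `release`, and session.failed for each id returned by `expire`.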

6. Completion Hooks (replaces session-lifecycle handler)

manager.onComplete(async (session) => {
  // If linked to a Lab run, complete the stage
  if (session.config.runIdentifier) {
    await labApi.completeStage(session.config.runIdentifier, {
      summary: session.result,
      cost: session.costUsd,
    });
  }

  // Notify via configured channel
  if (session.config.notify) {
    await bus.publish(`notification.${session.config.notify.channel}`, {
      target: session.config.notify.target,
      message: `Session ${session.name} completed`,
    });
  }
});
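
One way to keep a hook like this easy to test is to split the decision from the side effects: a pure function decides which follow-up actions a completed session needs, and the hook then executes them against the Lab API and the Bus. The sketch below covers only the decision half. `CompletedSession`, `CompletionAction`, and `planCompletion` are hypothetical names, and the field layout is inferred from the hook above.

```typescript
// Hypothetical shape, inferred from the fields the completion hook reads.
interface CompletedSession {
  name: string;
  result: string;
  costUsd: number;
  config: {
    runIdentifier?: string;
    notify?: { channel: "telegram" | "slack" | "bus"; target: string };
  };
}

type CompletionAction =
  | { kind: "completeStage"; runIdentifier: string; summary: string; cost: number }
  | { kind: "publish"; topic: string; target: string; message: string };

// Pure function: lists what should happen, performs no I/O itself.
function planCompletion(session: CompletedSession): CompletionAction[] {
  const actions: CompletionAction[] = [];
  const { runIdentifier, notify } = session.config;

  // If linked to a Lab run, the stage should be completed with the result.
  if (runIdentifier) {
    actions.push({
      kind: "completeStage",
      runIdentifier,
      summary: session.result,
      cost: session.costUsd,
    });
  }

  // If a notification route is configured, publish on the matching topic.
  if (notify) {
    actions.push({
      kind: "publish",
      topic: `notification.${notify.channel}`,
      target: notify.target,
      message: `Session ${session.name} completed`,
    });
  }
  return actions;
}
```

The executor can then run each action in its own try/catch, so a failing Lab callback does not suppress the notification.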

What This Eliminates

| Current | New |
| --- | --- |
| Trigger daemon (task-trigger.mjs, port 3099) | Gone, absorbed into Bus |
| tmux / claude -p subprocess management | Agent SDK query() handles it |
| sessions.json file persistence | Bus SQLite (already exists) |
| Lab sessions API (proxy to trigger) | Direct Bus API |
| session-lifecycle.ts (passive handler) | SessionManager completion hooks |
| implement-run skill (spawn wrapper) | SessionManager.spawn() directly |
| Worktree management in trigger daemon | Agent SDK spawn: "worktree" option or pre-spawn hook |

What Callers Look Like

Hank (builder agent) spawning a session:

const session = await bus.post("/sessions/spawn", {
  prompt: "Implement the auth refactor per the plan...",
  workDir: "/home/sumit/projects/archie-core",
  runIdentifier: "RUN-5",
  owner: "hank",
  maxBudgetUsd: 3.00,
  notify: { channel: "telegram", target: "sumitngupta" },
});

Lab pipeline:

await bus.post("/sessions/spawn", {
  prompt: buildImplementPrompt(run, artifacts),
  workDir: resolveWorkDir(run),
  runIdentifier: run.identifier,
  owner: "lab",
});

Monitoring (any WebSocket subscriber):

bus.subscribe("session.*", (event) => {
  // Real-time session lifecycle — Mission Control, Lab UI, CLI all get same stream
});
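
The `session.*` subscription implies some form of wildcard topic matching. How the Bus actually matches patterns is not specified here, so the sketch below assumes `*` matches exactly one dot-separated segment. `topicMatches`, `TopicEvent`, and `LocalBus` are hypothetical, and `LocalBus` is an in-process stand-in rather than the WebSocket transport.

```typescript
// Assumption: "*" matches exactly one dot-separated segment.
function topicMatches(pattern: string, topic: string): boolean {
  const p = pattern.split(".");
  const t = topic.split(".");
  if (p.length !== t.length) return false;
  return p.every((segment, i) => segment === "*" || segment === t[i]);
}

interface TopicEvent { topic: string; payload: unknown; }
type TopicHandler = (event: TopicEvent) => void;

// In-process stand-in for the Bus pub/sub, useful for tests and local tools.
class LocalBus {
  private subs: { pattern: string; handler: TopicHandler }[] = [];

  // Registers a handler and returns a function that removes it again.
  subscribe(pattern: string, handler: TopicHandler): () => void {
    const entry = { pattern, handler };
    this.subs.push(entry);
    return () => {
      this.subs = this.subs.filter((s) => s !== entry);
    };
  }

  // Delivers to every matching subscriber; returns how many were notified.
  publish(topic: string, payload: unknown): number {
    const matched = this.subs.filter((s) => topicMatches(s.pattern, topic));
    matched.forEach((s) => s.handler({ topic, payload }));
    return matched.length;
  }
}
```

Under this rule a subscriber to `session.*` receives session.completed and session.tool_use but not notification.telegram.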

Migration Path

  1. Add Agent SDK to Bus: npm install @anthropic-ai/claude-agent-sdk
  2. Build SessionManager class in Bus with spawn/send/kill/list
  3. Add Bus HTTP endpoints: POST /sessions/spawn, GET /sessions, etc.
  4. Wire Hank's tools to call Bus instead of trigger daemon
  5. Wire Lab implement handler to call Bus SessionManager
  6. Move completion hooks from session-lifecycle.ts into SessionManager
  7. Deprecate trigger daemon — stop the systemd service
  8. Clean up — remove archie-core claude-code tools that proxy to trigger daemon

The Big Win

One service, the Bus, owns the session lifecycle end-to-end. State survives restarts because it lives in SQLite. Any service can subscribe to events over WebSocket. The Agent SDK handles the Claude interaction itself, so the hand-rolled subprocess management goes away.


Baymax Agent Platform — 2026-03-28
