@sjenning
Created June 4, 2026 21:13

Agent Substrate: How It Works in a Kubernetes Cluster

Agent Substrate implements a virtual actor model on Kubernetes. The core idea: decouple long-lived stateful actors from physical pods so that thousands of suspended actors can share a small pool of warm worker pods. An actor's full memory and filesystem state is checkpointed to object storage (GCS/S3) when idle, and restored onto any available worker pod in milliseconds when traffic arrives.

Here's how each component contributes.


atecontroller — The Declarative Foundation

The controller is the entry point for platform operators. It reconciles two CRDs:

WorkerPool — defines a pool of warm compute capacity. The controller creates a Kubernetes Deployment (via server-side apply) with replicas pods, each running the ateom-gvisor container image. It owns the Deployment via owner references so deleting the WorkerPool cascades. It syncs the Deployment's actual replica count back into WorkerPool status.
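
As a rough illustration of the object the controller reconciles, the sketch below builds a Deployment manifest as a plain Python dict. The `ate.dev/v1alpha1` API version, the image reference, and the use of the pool name as the value of the `ate.dev/worker-pool` label are assumptions made for the example. Only the label key, the ateom-gvisor container, the privileged setting, and the owner-reference behavior are taken from this document.

```python
def worker_deployment(pool_name, pool_uid, namespace, replicas, image):
    """Build a Deployment manifest for one WorkerPool (illustrative only)."""
    # Label key comes from the document; using the pool name as its value is assumed.
    labels = {"ate.dev/worker-pool": pool_name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {
            "name": pool_name,
            "namespace": namespace,
            "labels": labels,
            # The owner reference is what lets Kubernetes garbage-collect the
            # Deployment when the WorkerPool is deleted.
            "ownerReferences": [{
                "apiVersion": "ate.dev/v1alpha1",  # hypothetical CRD group/version
                "kind": "WorkerPool",
                "name": pool_name,
                "uid": pool_uid,
                "controller": True,
                "blockOwnerDeletion": True,
            }],
        },
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{
                    "name": "ateom-gvisor",
                    "image": image,  # caller-supplied; the real image ref is not given here
                    "securityContext": {"privileged": True},
                }]},
            },
        },
    }
```

A real controller would submit this with server-side apply and then copy the Deployment's observed replica count back into the WorkerPool status.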

ActorTemplate — defines a workload specification (container images, environment, snapshot storage location, gVisor binary config). The controller runs a multi-phase initialization state machine:

  1. Creates a temporary "golden actor" via ateapi's CreateActor RPC
  2. Resumes it via ResumeActor to boot the workload fresh
  3. Waits 20 seconds for initialization, then suspends it via SuspendActor
  4. Stores the resulting snapshot as the golden snapshot — a pre-warmed checkpoint that all future actors of this template clone from, avoiding cold-start costs
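
The four phases above can be sketched as a small re-entrant state machine that advances at most one phase per reconcile call. The phase names, the `FakeAteapi` stand-in, and the idea that suspending returns a snapshot location are all assumptions for illustration; only the RPC names and the 20-second wait come from the description.

```python
from dataclasses import dataclass, field

WARMUP_SECONDS = 20.0  # wait between resume and suspend, per the steps above


@dataclass
class FakeAteapi:
    """Hypothetical in-memory stand-in for an ateapi client; it only records calls."""
    calls: list = field(default_factory=list)

    def create_actor(self, template):
        self.calls.append("CreateActor")
        return f"golden-{template}"

    def resume_actor(self, actor_id):
        self.calls.append("ResumeActor")

    def suspend_actor(self, actor_id):
        self.calls.append("SuspendActor")
        # Assumed: the caller learns where the snapshot was stored.
        return f"snapshots/{actor_id}"


def reconcile(state, api, now):
    """Advance the template's initialization by at most one phase."""
    phase = state["phase"]
    if phase == "Pending":
        state["actor_id"] = api.create_actor(state["template"])
        state["phase"] = "Created"
    elif phase == "Created":
        api.resume_actor(state["actor_id"])
        state["resumed_at"] = now
        state["phase"] = "Warming"
    elif phase == "Warming":
        # Stay in this phase until the warm-up window has elapsed.
        if now - state["resumed_at"] >= WARMUP_SECONDS:
            state["golden_snapshot"] = api.suspend_actor(state["actor_id"])
            state["phase"] = "Ready"
    # "Ready" is terminal: nothing left to do.
    return state
```

Keeping the phase in persisted state rather than in a long-running function means a controller restart mid-initialization can pick up where it left off.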

The controller is the only component that both manages worker Deployments through the Kubernetes API and drives the golden actor lifecycle through ateapi.


ateapi — The Stateful Control Plane

ateapi is a gRPC server whose own processes are stateless: all actor and worker state lives in Redis/Valkey, not in etcd. This is a deliberate design choice, because actors are high-churn, high-volume objects (target: 1 billion per cluster) that would overwhelm the Kubernetes API server.

What it stores in Redis

  • Actor records: ID, status (SUSPENDED/RESUMING/RUNNING/SUSPENDING), version, template reference, assigned worker, snapshot paths
  • Worker records: namespace, pool, pod name, IP, assigned actor
  • Distributed locks for multi-step workflows (actor resume/suspend)

Key RPCs

  • CreateActor — writes a new actor record in SUSPENDED state
  • ResumeActor — orchestrates a multi-step workflow: find a free worker in the right pool → mark it assigned → tell atelet to restore the snapshot → mark actor RUNNING
  • SuspendActor — tell atelet to checkpoint → upload state → free the worker → mark actor SUSPENDED
  • DeleteActor — only works on SUSPENDED actors
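
A minimal sketch of the resume workflow and the delete precondition, using plain dicts in place of Redis records. The `pool` field on the actor, the rollback on a failed restore, and the `NoFreeWorker` exception are assumptions for the example; the status names and the order of steps follow the list above.

```python
class NoFreeWorker(Exception):
    """Stands in for a gRPC FAILED_PRECONDITION error (assumed mapping)."""


def resume_actor(actor, workers, restore):
    """Assign a free worker, restore the snapshot, and mark the actor RUNNING."""
    if actor["status"] == "RUNNING":
        return actor["worker"]  # already running: nothing to do
    if actor["status"] != "SUSPENDED":
        raise RuntimeError(f"cannot resume from {actor['status']}")

    # Find a free worker in the right pool.
    free = next(
        (w for w in workers if w["pool"] == actor["pool"] and w["actor"] is None),
        None,
    )
    if free is None:
        raise NoFreeWorker(actor["pool"])

    free["actor"] = actor["id"]  # mark the worker assigned
    actor["status"] = "RESUMING"
    try:
        restore(free, actor)  # stands in for the call to atelet
    except Exception:
        # Assumed behavior: undo the assignment so the worker is reusable.
        free["actor"] = None
        actor["status"] = "SUSPENDED"
        raise
    actor["status"] = "RUNNING"
    actor["worker"] = free["pod"]
    return free["pod"]


def can_delete(actor):
    """DeleteActor only works on SUSPENDED actors."""
    return actor["status"] == "SUSPENDED"
```

SuspendActor would walk the same states in the opposite direction: RUNNING, then SUSPENDING, then SUSPENDED once the checkpoint has been uploaded and the worker freed.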

How it finds workers and atelets

It runs Kubernetes informers watching worker pods (by label ate.dev/worker-pool), atelet pods (by label app=atelet in ate-system), and ActorTemplate CRDs. When resuming an actor, it looks up which node the target worker pod is on, finds the atelet pod on that node, and dials it directly via gRPC.
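
The node-based lookup can be sketched with two dicts standing in for the informer caches. The record shapes (`node`, `ip`) are hypothetical, and the port ateapi dials is not stated in this document, so the function returns only an IP.

```python
def atelet_ip_for_worker(worker_pod, worker_pods, atelet_pods):
    """Return the IP of the atelet pod on the worker's node, or None if unknown."""
    worker = worker_pods.get(worker_pod)
    if worker is None:
        return None  # worker not (yet) in the informer cache
    for pod in atelet_pods.values():
        # atelet is a DaemonSet, so at most one pod should match per node.
        if pod["node"] == worker["node"]:
            return pod["ip"]
    return None
```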

Concurrency model

Optimistic versioning on Redis records plus distributed locks (Redis SETNX with TTL) for multi-step resume/suspend workflows.
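
To make the two mechanisms concrete, here is a minimal in-memory sketch. `FakeStore` is a hypothetical stand-in for Redis, not ateapi's actual code: `acquire_lock` mimics set-if-absent with a TTL, and `put_if_version` mimics a compare-and-set on a record's version field. The token-checked release is a common companion pattern and is an assumption here.

```python
import time
import uuid


class FakeStore:
    """Single-threaded, in-memory stand-in for Redis (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._locks = {}  # key -> (token, expiry time)
        self._clock = clock

    def acquire_lock(self, key, ttl):
        """Succeed only if no unexpired lock exists; return a token or None."""
        now = self._clock()
        holder = self._locks.get(key)
        if holder is not None and holder[1] > now:
            return None
        token = uuid.uuid4().hex
        self._locks[key] = (token, now + ttl)
        return token

    def release_lock(self, key, token):
        """Release only if the caller still holds the lock."""
        holder = self._locks.get(key)
        if holder is not None and holder[0] == token:
            del self._locks[key]
            return True
        return False

    def get(self, key):
        record = self._data.get(key)
        return dict(record) if record is not None else None

    def put_if_version(self, key, record, expected_version):
        """Write only if the stored version matches; bump the version on success."""
        current = self._data.get(key)
        current_version = current["version"] if current is not None else 0
        if current_version != expected_version:
            return False  # someone else wrote first; caller should re-read
        self._data[key] = dict(record, version=expected_version + 1)
        return True
```

The TTL bounds how long a crashed holder can block a workflow, while the version check catches writers that raced outside the lock or outlived their TTL.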


atelet — The Node-Level Supervisor

atelet runs as a privileged DaemonSet — one pod per node. It does not create or destroy worker pods (that's the controller's job). Instead, it's the bridge between the control plane and the physical sandbox runtime.

gRPC RPCs (the AteomHerder service)

ateapi calls these three RPCs:

  1. Run — boot a workload from scratch (no snapshot):

    • Downloads and SHA256-verifies the gVisor runsc binary from GCS/S3
    • Pulls container images (with an in-memory cache) and extracts them into OCI bundle rootfs directories
    • Generates OCI config.json specs with proper namespace configuration
    • Calls ateom-gvisor's RunWorkload RPC via Unix socket
  2. Restore — resume from a checkpoint:

    • Downloads checkpoint files (checkpoint.img, pages.img, pages_meta.img) from GCS/S3 with zstd decompression
    • Prepares OCI bundles (same as Run)
    • Calls ateom-gvisor's RestoreWorkload RPC
  3. Checkpoint — freeze and save state:

    • Calls ateom-gvisor's CheckpointWorkload RPC
    • Uploads checkpoint artifacts to GCS/S3 with zstd compression
    • Resets the actor's directories for the next workload

Shared filesystem

atelet and ateom-gvisor coordinate via a host-mounted directory at /run/ateom-gvisor/. This contains:

  • static-files/ — downloaded runsc binaries
  • ateoms/<pod-uid>/ateom.sock — Unix socket for gRPC
  • actors/<ns>:<template>:<id>/ — OCI bundles, checkpoint state, PID files, runsc state
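
The layout above maps onto a few small path helpers. The helper names and the rejection of `/` and `:` inside components are assumptions for the sketch; the directory structure itself is the one listed.

```python
from pathlib import PurePosixPath

ROOT = PurePosixPath("/run/ateom-gvisor")


def _component(value):
    """Reject values that would escape the directory or break the ns:template:id key."""
    if not value or "/" in value or ":" in value or value in (".", ".."):
        raise ValueError(f"unsafe path component: {value!r}")
    return value


def static_files_dir():
    """Directory holding downloaded runsc binaries."""
    return ROOT / "static-files"


def socket_path(pod_uid):
    """Unix socket that the worker pod's ateom-gvisor serves gRPC on."""
    return ROOT / "ateoms" / _component(pod_uid) / "ateom.sock"


def actor_dir(namespace, template, actor_id):
    """Per-actor directory for OCI bundles, checkpoint state, and runsc state."""
    key = ":".join(_component(p) for p in (namespace, template, actor_id))
    return ROOT / "actors" / key
```

For example, `actor_dir("default", "web", "a1")` yields `/run/ateom-gvisor/actors/default:web:a1`.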

ateom-gvisor — The In-Pod Sandbox Executor

ateom-gvisor is the primary container in each worker pod. It's the only thing that actually calls runsc commands. It runs privileged because it needs to manipulate network namespaces.

Startup sequence

  1. Creates a Unix socket at /run/ateom-gvisor/ateoms/<pod-uid>/ateom.sock
  2. Captures the pod's eth0 network configuration (addresses and routes)
  3. Creates an interior network namespace (ateom:<pod-uid>) for gVisor sandboxes
  4. Starts a child process reaper (since it's not PID 1)
  5. Serves the Ateom gRPC service

Three RPCs (each guarded by a mutex — one sandbox operation at a time)

  • RunWorkload: Creates a pause container + application containers via runsc create + runsc start. Moves eth0 from the pod netns into the interior netns so gVisor can use the pod's network identity.

  • CheckpointWorkload: Calls runsc checkpoint on the pause container (which captures the entire sandbox including all application containers). Then deletes all containers and moves eth0 back to the pod netns — leaving the worker clean for the next actor.

  • RestoreWorkload: Calls runsc restore with flags -background -direct -detach for fast resume. The -direct flag loads checkpoint pages straight into memory; -background returns immediately while demand-paging continues asynchronously.

Division of responsibility

ateom-gvisor only executes runsc commands. It does not pull images, download checkpoints, or upload state — that's all atelet's job. atelet prepares everything on the shared filesystem before calling ateom.


atenet — The Network Data Plane

atenet provides the magic that makes curl http://<actor-id>.actors.resources.substrate.ate.dev/ work, including auto-resuming suspended actors on first request.

Two subcommands (typically two separate pods)

atenet dns — Runs a CoreDNS instance that resolves *.actors.resources.substrate.ate.dev to the router's ClusterIP. It also patches the cluster's kube-dns ConfigMap to add a stub domain so all pods in the cluster can resolve actor hostnames. Reconciles every 10 seconds.

atenet router — The request routing brain, built on Envoy with External Processing:

  1. Manages an Envoy Deployment + Service via the Kubernetes API
  2. Runs an xDS server that dynamically configures Envoy's listeners, clusters, and routes
  3. Runs an ExtProc server (Envoy External Processing filter) that intercepts every request:
    • Extracts the actor ID from the Host header
    • Calls ateapi.ResumeActor() — this is a no-op if the actor is already running, or triggers a full restore if it's suspended
    • Uses singleflight to deduplicate concurrent resume calls for the same actor
    • Mutates the Host header to the worker pod's IP address
    • Envoy's dynamic forward proxy then routes the request to that IP
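
Two of the ExtProc steps lend themselves to a short sketch: pulling the actor ID out of the Host header, and collapsing concurrent resumes for the same actor into one call. This is a Python approximation of the singleflight pattern, not the router's actual code; `SingleFlight`, `waiting`, and the parsing rules (single-label IDs, optional port) are assumptions.

```python
import threading

ACTOR_SUFFIX = ".actors.resources.substrate.ate.dev"


def actor_id_from_host(host):
    """Return the actor ID encoded in a Host header, or None if it does not match."""
    hostname = host.split(":", 1)[0].lower().rstrip(".")  # drop port and trailing dot
    if not hostname.endswith(ACTOR_SUFFIX):
        return None
    actor_id = hostname[: -len(ACTOR_SUFFIX)]
    # Assumed: an actor ID is a single non-empty DNS label.
    if not actor_id or "." in actor_id:
        return None
    return actor_id


class SingleFlight:
    """Run fn once per key at a time; concurrent callers share the result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}

    def do(self, key, fn):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "result": None,
                        "error": None, "dups": 0}
                self._inflight[key] = call
            else:
                call["dups"] += 1
        if leader:
            try:
                call["result"] = fn()
            except Exception as exc:
                call["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call["done"].set()  # wake every waiting duplicate
        else:
            call["done"].wait()
        if call["error"] is not None:
            raise call["error"]
        return call["result"]

    def waiting(self, key):
        """Number of duplicate callers currently waiting on key."""
        with self._lock:
            call = self._inflight.get(key)
            return call["dups"] if call is not None else 0
```

With this shape, a burst of requests to one suspended actor triggers a single restore, and every request in the burst is routed to the same worker IP once it completes.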

The complete request path

Client DNS lookup → CoreDNS (stub domain) → returns router ClusterIP
Client HTTP request → Envoy listener (port 8080)
  → ext_proc filter → ExtProc gRPC server → ateapi.ResumeActor()
  → Host header rewritten to worker pod IP
  → dynamic_forward_proxy → worker pod → actor handles request

Actors are therefore resumed on demand: the first request to a suspended actor triggers a restore, and subsequent requests route directly to the running worker. The router translates gRPC errors from the resume call into HTTP statuses: FAILED_PRECONDITION (no workers) → 503, DEADLINE_EXCEEDED → 504, NOT_FOUND → 404, etc.
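
The error mapping can be pictured as a small lookup table. Only the three pairs named above are taken from this document; the fallback to 500 for any other code is an assumption made so the sketch is total.

```python
# gRPC status name -> HTTP status, limited to the pairs named in the text.
GRPC_TO_HTTP = {
    "FAILED_PRECONDITION": 503,  # no free workers
    "DEADLINE_EXCEEDED": 504,
    "NOT_FOUND": 404,
}


def http_status_for(grpc_code, default=500):
    """Map a gRPC status name to an HTTP status; the default is an assumption."""
    return GRPC_TO_HTTP.get(grpc_code.upper(), default)
```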


Component Interaction Summary

                    ┌─────────────────┐
                    │  atecontroller   │
                    │  (Deployment)    │
                    └────────┬────────┘
                             │ gRPC: Create/Resume/SuspendActor
                             │ K8s: Manage Deployments
                             ▼
┌──────────┐  gRPC   ┌─────────────┐  K8s Informers  ┌──────────────┐
│  atenet   │────────▶│   ateapi     │◀───────────────▶│  K8s API     │
│  router   │ Resume  │  (Redis)     │  Pods, CRDs     │  Server      │
└──────────┘  Actor   └──────┬───────┘                 └──────────────┘
                             │ gRPC: Run/Checkpoint/Restore
                             ▼
                      ┌─────────────┐
                      │   atelet     │
                      │  (DaemonSet) │
                      └──────┬───────┘
                             │ gRPC over Unix socket
                             │ + shared /run/ateom-gvisor filesystem
                             ▼
                      ┌──────────────┐
                      │ ateom-gvisor │
                      │ (in worker   │
                      │  pod)        │
                      └──────┬───────┘
                             │ exec: runsc create/start/checkpoint/restore
                             ▼
                      ┌──────────────┐
                      │   gVisor     │
                      │  sandbox     │
                      └──────────────┘

Key design insight: Kubernetes manages infrastructure (pods, deployments, services) through the controller, but actor lifecycle is managed entirely through Redis + direct gRPC calls, bypassing etcd for the hot path. This lets the system scale to millions of actors while keeping resume latency under 100ms.
