Micro-VM (substrate PR #287) on bigbox — runbook

Stand up the agent-substrate micro-VM runtime (cloud-hypervisor + kata-agent, PR #287) on a kind cluster on bigbox and run two demos, then tear it down:

Part A — counter-microvm: in-RAM counter → suspend (VM memory snapshot) → resume on another worker → count continues. The runtime's own demo; pure substrate.
Part B — helpdesk-microvm: the OpenShell helpdesk agent running as a micro-VM actor → /status + /chat (real gpt-oss:20b-cloud completions via Ollama Cloud) → suspend → resume → state continues. Proves a real ~3 GB agent workload boots, does name-based egress, snapshots, and restores under the micro-VM runtime.

Validated top-to-bottom on bigbox 2026-06-25 (3 consecutive clean runs). Run everything as root on bigbox (ssh bigbox; Ubuntu 24.04 with /dev/kvm). Each command block is meant to be pasted into the same shell session (later steps use shell variables set by earlier ones). kubectl / kubectl ate use your current kube-context, which create-kind-cluster.sh (A1) sets to kind-kind. Times are warm-cache; a cold first run adds a few minutes (kata assets, kind node image, cargo build).

Branches (minimal, consolidated)

repo (clone URL)	branch / rev	needed for
`https://github.com/dims/substrate`	`fix/microvm-guest-dns` (`96db6d3`) — PR #287 + multi-GB-image fix + guest-DNS fix (the last is needed for `/chat`)	A + B
`https://github.com/dims/OpenShell`	`e8ec295` — skip-config supervisor (PR #1549)	B
`https://github.com/dims/openshell-driver-substrate`	`main` (pins openshell-core `e8ec295`) — helpdesk image recipe + supervisor build	B

No gateway / driver crate is needed — both actors are driven directly with kubectl ate.

Requirements (bigbox already has all of these — verify; install if on a fresh host)

ls -l /dev/kvm && cat /sys/module/kvm_amd/parameters/nested   # device present, nested = 1
go version          # go 1.26.x
docker buildx version >/dev/null && echo ok
git --version ; aws --version                                 # aws stages assets into in-cluster rustfs (S3)
systemctl is-active shorewall                                 # must be: inactive   (else Gotcha 1)
# Part B also:
cargo --version || curl https://sh.rustup.rs -sSf | sh        # rust 1.96
protoc --version || apt-get install -y protobuf-compiler      # build-image.sh hard-requires protoc
test -s ~/.config/ollama/key && echo "ollama key present"     # /chat needs a real Ollama Cloud key here (free: https://ollama.com/settings)

kind itself is built on demand by the repo (hack/run-tool.sh) — not needed on PATH.
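
The individual checks above can be folded into one preflight pass. A minimal sketch, assuming only that the tools named in this section are looked up on PATH; `missing_tools` is a hypothetical helper, not something shipped by any of the repos.

```shell
# Print the names from the argument list that are not found on PATH.
missing_tools() {
  mt_missing=""
  for mt_tool in "$@"; do
    command -v "$mt_tool" >/dev/null 2>&1 || mt_missing="$mt_missing $mt_tool"
  done
  echo "${mt_missing# }"
}

# Part A needs go, docker, git and aws; Part B also needs cargo and protoc.
m=$(missing_tools go docker git aws cargo protoc)
if [ -n "$m" ]; then
  echo "missing tools: $m"
else
  echo "all tools present"
fi
```

This only covers presence on PATH; the /dev/kvm, nested-virt, shorewall and Ollama key checks still need the explicit commands above.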

0. Clone the repos (once)

git clone --filter=blob:none https://github.com/dims/substrate.git                 ~/go/src/github.com/agent-substrate/substrate
git clone --filter=blob:none https://github.com/dims/openshell-driver-substrate.git ~/go/src/github.com/dims/openshell-driver-substrate   # Part B
git clone --filter=blob:none https://github.com/dims/OpenShell.git                  ~/go/src/github.com/dims/OpenShell                     # Part B

git -C ~/go/src/github.com/agent-substrate/substrate checkout fix/microvm-guest-dns
git -C ~/go/src/github.com/dims/OpenShell fetch origin e8ec295906582edaf51fda0c077759d4437e30a6
git -C ~/go/src/github.com/dims/OpenShell checkout e8ec295906582edaf51fda0c077759d4437e30a6
# the driver stays on main

(On bigbox these may already exist — then just git -C <dir> fetch and re-run the checkouts.)
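
To confirm the checkouts landed on the revs listed in the branches table, something like the following may help. It is a sketch: `rev_matches` and `check_rev` are hypothetical helpers, and the substrate check assumes the `fix/microvm-guest-dns` branch tip is still `96db6d3` (it will report a mismatch if the branch has moved on).

```shell
# Succeeds when the full commit hash ($2) starts with the expected short rev ($1).
rev_matches() {
  case "$2" in
    "$1"*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Compare a checkout's HEAD against the short rev from the branches table.
check_rev() {
  cr_head=$(git -C "$1" rev-parse HEAD 2>/dev/null) || { echo "$1: not a git checkout"; return 1; }
  if rev_matches "$2" "$cr_head"; then
    echo "$1: ok ($2)"
  else
    echo "$1: at $cr_head, expected $2"
    return 1
  fi
}

check_rev ~/go/src/github.com/agent-substrate/substrate 96db6d3 || true
check_rev ~/go/src/github.com/dims/OpenShell e8ec295 || true
```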

Part A — counter-microvm

A1. Cluster + control plane

cd ~/go/src/github.com/agent-substrate/substrate
hack/create-kind-cluster.sh                    # kind + registry:5001 + mounts /dev/kvm + labels node sandboxClass=microvm
hack/install-ate-kind.sh --deploy-ate-system   # CRDs, apiserver, atelet, atenet, valkey, rustfs — BLOCKS until Ready
kubectl get pods -n ate-system                 # sanity: all Running (6 valkey-cluster pods 1/1, rustfs 1/1, apiserver, controller, atelet, atenet)
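
The sanity check above is a visual scan. A scripted version might look like this; it is a sketch that assumes the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS, ...), and `not_ready_pods` is a hypothetical helper.

```shell
# Read `kubectl get pods --no-headers` rows on stdin and print the names of pods
# that are not Running or whose READY column is not n/n.
not_ready_pods() {
  awk '{ split($2, r, "/"); if ($3 != "Running" || r[1] != r[2]) print $1 }'
}

if command -v kubectl >/dev/null 2>&1; then
  if pods=$(kubectl get pods -n ate-system --no-headers 2>/dev/null) && [ -n "$pods" ]; then
    bad=$(printf '%s\n' "$pods" | not_ready_pods)
    if [ -n "$bad" ]; then echo "not ready: $bad"; else echo "all ate-system pods Running"; fi
  else
    echo "could not list pods in ate-system"
  fi
fi
```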

A2. Build assets + deploy the demo

hack/run-microvm-demo-kind.sh

Builds ateom-base (debian-slim + e2fsprogs), assembles the 4 assets (cloud-hypervisor v52.0 + kata-static 3.31.0 kernel/rootfs/config), stages them to rustfs ate-snapshots/kata-assets/, re-installs (idempotent), applies demos/counter/counter-microvm.yaml.tmpl. Run after A1 — the wrapper stages to rustfs before its own install step, so rustfs must already exist.

A3. Golden snapshot + drive

cd ~/go/src/github.com/agent-substrate/substrate
kubectl wait --for=condition=Ready actortemplate/counter-microvm -n ate-demo-counter-microvm --timeout=600s
export PATH=$PATH:$(go env GOPATH)/bin
go install ./cmd/kubectl-ate
H=my-counter-1.actors.resources.substrate.ate.dev
kubectl ate create actor my-counter-1 --template ate-demo-counter-microvm/counter-microvm
kubectl port-forward -n ate-system svc/atenet-router 8000:80 >/tmp/pf.log 2>&1 &
curl -s -X POST -H "Host: $H" http://localhost:8000   # preserved memory count: 1
curl -s -X POST -H "Host: $H" http://localhost:8000   # 2
curl -s -X POST -H "Host: $H" http://localhost:8000   # 3
kubectl ate suspend actor my-counter-1                # -> STATUS_SUSPENDED (cloud-hypervisor VM snapshot)
curl -s -X POST -H "Host: $H" http://localhost:8000   # preserved memory count: 4  ← count continued across the snapshot
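
The "count continued" check can be made mechanical instead of eyeballed. A sketch, assuming the response body has the form `preserved memory count: N` as the comments above suggest; `parse_count` and `count_continued` are hypothetical helpers.

```shell
# Extract N from a response such as "preserved memory count: 4".
parse_count() {
  sed -n 's/.*count: *\([0-9][0-9]*\).*/\1/p'
}

# Succeeds when the count after resume ($2) is exactly one more than before suspend ($1).
count_continued() {
  [ "$2" -eq $(( $1 + 1 )) ]
}

# Usage against the live demo (requires H and the port-forward from A3):
#   before=$(curl -s -X POST -H "Host: $H" http://localhost:8000 | parse_count)
#   kubectl ate suspend actor my-counter-1
#   after=$(curl -s -X POST -H "Host: $H" http://localhost:8000 | parse_count)
#   count_continued "$before" "$after" && echo "count survived the snapshot"
```

A count that restarts at 1 after the suspend would mean the actor was cold-started rather than restored from the memory snapshot.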

Part B — helpdesk-microvm (OpenShell)

Reuses Part A's cluster: the counter-microvm SandboxConfig + staged kata assets + the ateom-microvm image. Only the workload (an OpenShell supervisor + helpdesk agent) changes. Run B1→B3 in one shell — $SUP/$HELP/$ATEOM carry across the steps.

B1. Build the supervisor + helpdesk images (→ local registry :5001)

cd ~/go/src/github.com/dims/openshell-driver-substrate
SUP=$(OPENSHELL_REPO=~/go/src/github.com/dims/OpenShell tests/integration/build-image.sh)   # cargo build + docker + push
echo "$SUP"   # localhost:5001/oshl-feature-test@sha256:...

cd examples/helpdesk
# Source the key from an env var (api_key_env) — no secret baked into the image; the key is injected at B2.
sed 's|api_key: <your-ollama-cloud-key>|api_key_env: OPENSHELL_INFERENCE_API_KEY|' routes.yaml > routes.local.yaml
docker build --build-arg BASE="$SUP" -t localhost:5001/oshl-helpdesk:latest -f helpdesk.Dockerfile .
# (a cosmetic "InvalidDefaultArgInFrom: ... ARG ${BASE}" warning here is expected — BASE is supplied via --build-arg)
docker push localhost:5001/oshl-helpdesk:latest
HELP=$(docker inspect --format='{{index .RepoDigests 0}}' localhost:5001/oshl-helpdesk:latest)
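
Before moving to B2 it can be worth checking that both image references are digest-pinned and that the rewritten routes file no longer carries an inline key. A sketch with two hypothetical helpers; the routes check assumes the key line in routes.yaml looks like the one the sed expression above targets.

```shell
# Succeeds when an image reference is pinned by digest (name@sha256:<64 hex chars>).
is_digest_ref() {
  printf '%s\n' "$1" | grep -Eq '^[^@]+@sha256:[0-9a-f]{64}$'
}

# Succeeds when a routes file reads the key from the environment and no longer
# carries an inline api_key entry.
routes_use_env_key() {
  grep -q 'api_key_env: OPENSHELL_INFERENCE_API_KEY' "$1" && ! grep -q 'api_key: ' "$1"
}

for ref in "${SUP:-}" "${HELP:-}"; do
  if is_digest_ref "$ref"; then echo "pinned: $ref"; else echo "not digest-pinned: '$ref'"; fi
done
```

Run `routes_use_env_key routes.local.yaml` from examples/helpdesk; a non-zero exit means the sed pattern did not match and the placeholder is still in the file.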

B2. Apply the micro-VM manifest (reuses counter's SandboxConfig + ateom image)

HELP=$(docker inspect --format='{{index .RepoDigests 0}}' localhost:5001/oshl-helpdesk:latest)   # re-derived so B2 stands alone
ATEOM=$(kubectl get workerpool counter-microvm -n ate-demo-counter-microvm -o jsonpath='{.spec.ateomImage}')
KEY=$(cat ~/.config/ollama/key)                  # real Ollama key, threaded as env (never baked into the image)
cat <<YAML | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata: { name: ate-demo-helpdesk-microvm }
---
apiVersion: ate.dev/v1alpha1
kind: WorkerPool
metadata:
  name: helpdesk-microvm
  namespace: ate-demo-helpdesk-microvm
  labels: { workload: helpdesk-microvm }
spec:
  replicas: 2
  sandboxClass: microvm
  sandboxConfigName: counter-microvm        # reuse Part A's cluster-scoped SandboxConfig (kata assets)
  ateomImage: ${ATEOM}                       # reuse the ateom-microvm image built in A2 (has e2fsprogs)
---
apiVersion: ate.dev/v1alpha1
kind: ActorTemplate
metadata:
  name: helpdesk-microvm
  namespace: ate-demo-helpdesk-microvm
spec:
  sandboxClass: microvm
  pauseImage: registry.k8s.io/pause:3.10.2@sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4
  containers:
  - name: supervisor
    image: ${HELP}
    command: ["/usr/local/bin/openshell-sandbox","--policy-rules","/etc/openshell/policy.rego","--policy-data","/etc/openshell/data.yaml","--inference-routes","/etc/openshell/routes.yaml","--log-level","info","--","python3","/opt/helpdesk/agent.py"]
    env:
    - { name: OPENSHELL_SKIP_BOOTSTRAP, value: "all" }
    - { name: OPENSHELL_INFERENCE_API_KEY, value: "${KEY}" }   # consumed by routes.yaml api_key_env
  workerSelector:
    matchLabels: { workload: helpdesk-microvm }
  snapshotsConfig:
    location: gs://ate-snapshots/ate-demo-helpdesk-microvm/
YAML

kubectl wait --for=condition=Ready actortemplate/helpdesk-microvm -n ate-demo-helpdesk-microvm --timeout=600s

Golden boot of the ~3 GB image takes ~75s (builds the ext4 rootfs + boots CLH + checkpoints).
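
The manifest above is rendered through an unquoted heredoc, so an empty $HELP, $ATEOM or $KEY is substituted silently and only surfaces later as a failed golden boot or a /chat auth error. A small guard to run just before the `cat <<YAML` step could look like this; `require_vars` is a hypothetical helper.

```shell
# Report any of the named shell variables that are unset or empty.
require_vars() {
  rv_missing=""
  for rv_name in "$@"; do
    eval "rv_val=\${$rv_name:-}"
    [ -n "$rv_val" ] || rv_missing="$rv_missing $rv_name"
  done
  if [ -n "$rv_missing" ]; then
    echo "unset or empty:$rv_missing" >&2
    return 1
  fi
}

require_vars HELP ATEOM KEY || echo "set these before piping the manifest to kubectl apply"
```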

B3. Drive — create → /status → suspend → resume

export PATH=$PATH:$(go env GOPATH)/bin
H=hd-user-1.actors.resources.substrate.ate.dev
kubectl ate create actor hd-user-1 --template ate-demo-helpdesk-microvm/helpdesk-microvm
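kubectl port-forward -n ate-system svc/atenet-router 8000:80 >/tmp/pf.log 2>&1 &   # if Part A's forward is still up, this one fails to bind :8000 and exits; the existing forward keeps serving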

# First request kicks the resume; it's ASYNC (~80s for 3 GB) — poll for RUNNING, then read.
curl -s -m 5 -H "Host: $H" http://localhost:8000/status   # triggers resume (may return empty)
until [ "$(kubectl ate get actor hd-user-1 | awk 'NR==2{print $4}')" = STATUS_RUNNING ]; do sleep 3; done
curl -s -H "Host: $H" http://localhost:8000/status        # {"history_turns": 0, "uptime_seconds": ..., "model": "gpt-oss:20b-cloud"}
curl -s -XPOST -H "Host: $H" http://localhost:8000/chat -d '{"message":"say hello in three words"}'   # {"reply": "...", "history_turns": 1}  ← real LLM completion

kubectl ate suspend actor hd-user-1                        # -> STATUS_SUSPENDED (only when RUNNING)
curl -s -m 5 -H "Host: $H" http://localhost:8000/status   # kick resume
until [ "$(kubectl ate get actor hd-user-1 | awk 'NR==2{print $4}')" = STATUS_RUNNING ]; do sleep 3; done
curl -s -H "Host: $H" http://localhost:8000/status        # uptime_seconds CONTINUED ← VM memory snapshot round-tripped
curl -s -XPOST -H "Host: $H" http://localhost:8000/chat -d '{"message":"name one primary color"}'    # reply + history_turns keeps climbing ← chat state survived too

Both /status and /chat work; suspend/resume preserves the in-VM chat history (history_turns keeps climbing).
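
The `until` loops above poll forever if the actor never reaches STATUS_RUNNING. A bounded variant is sketched below. It keeps the same assumption as the loops (status is row 2, column 4 of `kubectl ate get actor`); `wait_for_status` and the `GET_ACTOR` override are hypothetical names, the latter only there so the loop can be exercised without a cluster.

```shell
# Pull the status column out of `kubectl ate get actor <name>` output (row 2, column 4).
status_from_table() {
  awk 'NR==2{print $4}'
}

# Poll until the actor ($1) reaches the wanted status ($2) or the timeout in
# seconds ($3, default 300) expires.
wait_for_status() {
  wfs_actor=$1 wfs_want=$2 wfs_timeout=${3:-300} wfs_waited=0
  while :; do
    wfs_have=$(${GET_ACTOR:-kubectl ate get actor} "$wfs_actor" 2>/dev/null | status_from_table)
    [ "$wfs_have" = "$wfs_want" ] && return 0
    if [ "$wfs_waited" -ge "$wfs_timeout" ]; then
      echo "timed out waiting for $wfs_want; last status: ${wfs_have:-unknown}" >&2
      return 1
    fi
    sleep 3
    wfs_waited=$((wfs_waited + 3))
  done
}

# Usage: wait_for_status hd-user-1 STATUS_RUNNING 180
```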

Teardown (leave bigbox clean)

cd ~/go/src/github.com/agent-substrate/substrate
export PATH=$PATH:$(go env GOPATH)/bin
for a in my-counter-1 hd-user-1; do            # kubectl-ate resolves actors by ID
  kubectl ate suspend actor "$a" 2>/dev/null; sleep 4
  kubectl ate delete  actor "$a" 2>/dev/null
done
kill %1 2>/dev/null                                  # stop the port-forward
hack/delete-kind-cluster.sh || true                  # deletes the cluster + kind-registry
docker rm -f kind-control-plane kind-registry 2>/dev/null || true   # belt-and-suspenders (a non-interactive delete can leave these)
rm -rf /var/lib/ateom-* /run/ateom-* /run/vc

delete-kind-cluster.sh removes everything regardless, so the actor steps are just tidy-up. Verify: docker ps -a | grep kind (none), ls /var/lib/ateom-* (none).
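
The verification can be scripted so a clean teardown prints a single word. A sketch; `leftover_kind` and `report_leftovers` are hypothetical helpers, and the name filter assumes all kind containers are named with a `kind-` prefix, as the two named above are.

```shell
# Read container names on stdin (one per line) and print those that belong to kind.
leftover_kind() {
  grep -E '^kind-' || true
}

# Print "clean" or list what the teardown left behind.
report_leftovers() {
  rl_found=""
  if command -v docker >/dev/null 2>&1; then
    rl_found=$(docker ps -a --format '{{.Names}}' 2>/dev/null | leftover_kind)
  fi
  for rl_path in /var/lib/ateom-* /run/ateom-* /run/vc; do
    [ -e "$rl_path" ] && rl_found="$rl_found $rl_path"
  done
  if [ -n "$rl_found" ]; then echo "left behind: $rl_found"; else echo "clean"; fi
}

report_leftovers
```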

Gotchas

1. shorewall firewall (NVIDIA hosts). If active it drops the docker bridge and the kube API hangs (TLS handshake timeout). One-time: systemctl mask --now shorewall; shorewall clear; iptables -P INPUT ACCEPT; iptables -P FORWARD ACCEPT; iptables -P OUTPUT ACCEPT; iptables -F FORWARD; systemctl restart docker (already masked on bigbox).
2. Run order: run-microvm-demo-kind.sh stages to rustfs before installing; do A1 before A2.
3. aws CLI is an undocumented prereq of asset staging; protoc is an undocumented prereq of the supervisor build.
4. Nothing else may hold :9000. A stray minio/registry there makes the rustfs port-forward fail with a misleading InvalidAccessKeyId.
5. kubectl ate delete actor needs STATUS_SUSPENDED: suspend first (or just delete the cluster).
6. Large workload images (helpdesk ~3 GB) used to get mkfs.ext4 SIGKILLed on resume. Fixed (deadline-detached rootfs build at 3 sites + bumped resume budgets) and carried by the fix/microvm-guest-dns branch; counter is tiny and unaffected.
7. Micro-VM resume is async (~80s first time for 3 GB). The first request triggers resume and may return empty; poll kubectl ate get actor for STATUS_RUNNING, then read. Don't suspend mid-resume (Aborted: another operation in progress).
8. /chat egress (DNS + key): fixed, not a gap. Root cause was NOT the proxy (it works under skip=all): the guest had interface + route + NAT but no /etc/resolv.conf, so the supervisor's proxy could reach IPs but not resolve ollama.com. The fix/microvm-guest-dns commit writes the worker pod's resolv.conf into the guest rootfs (golden + restore paths), exactly like the kata shim's getDNS. The real key is threaded via env: routes.yaml api_key_env: OPENSHELL_INFERENCE_API_KEY plus the ActorTemplate env rendered from ~/.config/ollama/key. No secret is baked into the image, but the key does land in the Actor CR; keeping it in a k8s Secret would need ActorTemplate valueFrom, a future enhancement.
9. One workload container per micro-VM actor (pause excluded); helpdesk (pause + supervisor) fits.
10. ko prints git is in a dirty state … ?? ateom-microvm. Cosmetic; only affects version stamping.
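
The :9000 gotcha is cheap to rule out before A2. A sketch that assumes iproute2's `ss` is available and that its `-ltn` output keeps the local address in column 4; `listeners_on` is a hypothetical helper.

```shell
# Read `ss -ltn` output on stdin and print listening sockets bound to the given port.
listeners_on() {
  awk -v port="$1" 'NR > 1 && $4 ~ (":" port "$") { print $4 }'
}

if command -v ss >/dev/null 2>&1; then
  busy=$(ss -ltn | listeners_on 9000)
  if [ -n "$busy" ]; then echo ":9000 already in use by $busy"; else echo ":9000 is free"; fi
fi
```

Run it before asset staging starts; once the rustfs port-forward itself is up, the port will of course show as in use.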

