Stand up the agent-substrate micro-VM runtime (cloud-hypervisor + kata-agent, PR #287) on a kind cluster on bigbox and run two demos, then tear it down:
- Part A (counter-microvm): in-RAM counter → suspend (VM memory snapshot) → resume on another worker → count continues. The runtime's own demo; pure substrate.
- Part B (helpdesk-microvm): the OpenShell helpdesk agent running as a micro-VM actor → /status + /chat (real gpt-oss:20b-cloud completions via Ollama Cloud) → suspend → resume → state continues. Proves that a real ~3 GB agent workload boots, does name-based egress, snapshots, and restores under the micro-VM runtime.
Validated top-to-bottom on bigbox 2026-06-25 (3 consecutive clean runs). Run everything as
root on bigbox (ssh bigbox; Ubuntu 24.04 with /dev/kvm). Each command block is meant to be
pasted into the same shell session (later steps use shell variables set by earlier ones). kubectl / kubectl ate use your current kube-context, which create-kind-cluster.sh (A1) sets to kind-kind.
Times are warm-cache; a cold first run adds a few minutes (kata assets, kind node image, cargo build).
| repo (clone URL) | branch / rev | needed for |
|---|---|---|
| https://github.com/dims/substrate | fix/microvm-guest-dns (96db6d3): PR #287 + multi-GB-image fix + guest-DNS fix (the last is needed for /chat) | A + B |
| https://github.com/dims/OpenShell | e8ec295: skip-config supervisor (PR #1549) | B |
| https://github.com/dims/openshell-driver-substrate | main (pins openshell-core e8ec295): helpdesk image recipe + supervisor build | B |
No gateway / driver crate is needed — both actors are driven directly with kubectl ate.
ls -l /dev/kvm && cat /sys/module/kvm_amd/parameters/nested # device present, nested = 1
go version # go 1.26.x
docker buildx version >/dev/null && echo ok
git --version ; aws --version # aws stages assets into in-cluster rustfs (S3)
systemctl is-active shorewall # must be: inactive (else Gotcha 1)
# Part B also:
cargo --version || curl https://sh.rustup.rs -sSf | sh # rust 1.96
protoc --version || apt-get install -y protobuf-compiler # build-image.sh hard-requires protoc
test -s ~/.config/ollama/key && echo "ollama key present" # /chat needs a real Ollama Cloud key here (free: https://ollama.com/settings)

kind itself is built on demand by the repo (hack/run-tool.sh), so it is not needed on PATH.
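
The individual checks above can be folded into one pass that names everything that is absent. This is a minimal sketch, assuming a POSIX shell; `missing_tools` is a hypothetical helper name, and the tool list is simply the set of binaries the checks above mention. It only tests presence on PATH, not versions.

```shell
# Hypothetical helper: print each named command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tool list taken from the checks above; cargo and protoc matter only for Part B.
missing=$(missing_tools go docker git aws cargo protoc)
if [ -n "$missing" ]; then
  # Unquoted on purpose so the newline-separated names print on one line.
  echo "missing tools:" $missing
else
  echo "all tools present"
fi
```

The /dev/kvm, nested-virtualization, shorewall, and Ollama key checks are not covered by this sketch and still need the commands above.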
git clone --filter=blob:none https://github.com/dims/substrate.git ~/go/src/github.com/agent-substrate/substrate
git clone --filter=blob:none https://github.com/dims/openshell-driver-substrate.git ~/go/src/github.com/dims/openshell-driver-substrate # Part B
git clone --filter=blob:none https://github.com/dims/OpenShell.git ~/go/src/github.com/dims/OpenShell # Part B
git -C ~/go/src/github.com/agent-substrate/substrate checkout fix/microvm-guest-dns
git -C ~/go/src/github.com/dims/OpenShell fetch origin e8ec295906582edaf51fda0c077759d4437e30a6
git -C ~/go/src/github.com/dims/OpenShell checkout e8ec295906582edaf51fda0c077759d4437e30a6
# the driver stays on main

(On bigbox these may already exist; if so, just git -C <dir> fetch and re-run the checkouts.)
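
The clone-or-fetch choice can be made automatic so the same lines work on a fresh host and on one where the directories already exist. A minimal sketch, assuming a POSIX shell; `repo_action` and `sync_repo` are hypothetical names, not scripts from any of the three repos.

```shell
# Hypothetical helper: decide between clone and fetch for one checkout directory.
repo_action() {
  if [ -d "$1/.git" ]; then echo fetch; else echo clone; fi
}

# Hypothetical wrapper: clone when the directory is missing, otherwise fetch.
sync_repo() {
  url=$1
  dir=$2
  case $(repo_action "$dir") in
    fetch) git -C "$dir" fetch origin ;;
    clone) git clone --filter=blob:none "$url" "$dir" ;;
  esac
}

# Example call (not executed here):
# sync_repo https://github.com/dims/substrate.git ~/go/src/github.com/agent-substrate/substrate
```

The pinned-SHA fetch for OpenShell and the checkout commands above still have to run afterwards; this sketch only replaces the three clone lines.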
cd ~/go/src/github.com/agent-substrate/substrate
hack/create-kind-cluster.sh # kind + registry:5001 + mounts /dev/kvm + labels node sandboxClass=microvm
hack/install-ate-kind.sh --deploy-ate-system # CRDs, apiserver, atelet, atenet, valkey, rustfs — BLOCKS until Ready
kubectl get pods -n ate-system # sanity: all Running (6 valkey-cluster pods 1/1, rustfs 1/1, apiserver, controller, atelet, atenet)

hack/run-microvm-demo-kind.sh

Builds ateom-base (debian-slim + e2fsprogs), assembles the 4 assets (cloud-hypervisor
v52.0 + kata-static 3.31.0 kernel/rootfs/config), stages them to rustfs
ate-snapshots/kata-assets/, re-installs (idempotent), and applies
demos/counter/counter-microvm.yaml.tmpl. Run after A1: the wrapper stages to rustfs
before its own install step, so rustfs must already exist.
cd ~/go/src/github.com/agent-substrate/substrate
kubectl wait --for=condition=Ready actortemplate/counter-microvm -n ate-demo-counter-microvm --timeout=600s
export PATH=$PATH:$(go env GOPATH)/bin
go install ./cmd/kubectl-ate
H=my-counter-1.actors.resources.substrate.ate.dev
kubectl ate create actor my-counter-1 --template ate-demo-counter-microvm/counter-microvm
kubectl port-forward -n ate-system svc/atenet-router 8000:80 >/tmp/pf.log 2>&1 &
curl -s -X POST -H "Host: $H" http://localhost:8000 # preserved memory count: 1
curl -s -X POST -H "Host: $H" http://localhost:8000 # 2
curl -s -X POST -H "Host: $H" http://localhost:8000 # 3
kubectl ate suspend actor my-counter-1 # -> STATUS_SUSPENDED (cloud-hypervisor VM snapshot)
curl -s -X POST -H "Host: $H" http://localhost:8000 # preserved memory count: 4 ← count continued across the snapshot

Reuses Part A's cluster: the counter-microvm SandboxConfig + staged kata assets + the
ateom-microvm image. Only the workload (an OpenShell supervisor + helpdesk agent) changes.
Run B1→B3 in one shell — $SUP/$HELP/$ATEOM carry across the steps.
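
The B2 manifest is rendered through an unquoted heredoc, so an empty shell variable expands to an empty string without any error. A small guard can catch that before kubectl apply. This is a sketch assuming a POSIX shell; `require_vars` is a hypothetical name, not part of any repo.

```shell
# Hypothetical guard: fail if any named shell variable is unset or empty.
require_vars() {
  rc=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (names are trusted input).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset or empty: $name" >&2
      rc=1
    fi
  done
  return $rc
}

# Example call before the heredoc in B2 (not executed here):
# require_vars HELP ATEOM KEY || echo "re-run the earlier steps in this shell"
```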
cd ~/go/src/github.com/dims/openshell-driver-substrate
SUP=$(OPENSHELL_REPO=~/go/src/github.com/dims/OpenShell tests/integration/build-image.sh) # cargo build + docker + push
echo "$SUP" # localhost:5001/oshl-feature-test@sha256:...
cd examples/helpdesk
# Source the key from an env var (api_key_env) — no secret baked into the image; the key is injected at B2.
sed 's|api_key: <your-ollama-cloud-key>|api_key_env: OPENSHELL_INFERENCE_API_KEY|' routes.yaml > routes.local.yaml
docker build --build-arg BASE="$SUP" -t localhost:5001/oshl-helpdesk:latest -f helpdesk.Dockerfile .
# (a cosmetic "InvalidDefaultArgInFrom: ... ARG ${BASE}" warning here is expected — BASE is supplied via --build-arg)
docker push localhost:5001/oshl-helpdesk:latest
HELP=$(docker inspect --format='{{index .RepoDigests 0}}' localhost:5001/oshl-helpdesk:latest)

HELP=$(docker inspect --format='{{index .RepoDigests 0}}' localhost:5001/oshl-helpdesk:latest) # re-derived so B2 stands alone
ATEOM=$(kubectl get workerpool counter-microvm -n ate-demo-counter-microvm -o jsonpath='{.spec.ateomImage}')
KEY=$(cat ~/.config/ollama/key) # real Ollama key, threaded as env (never baked into the image)
cat <<YAML | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata: { name: ate-demo-helpdesk-microvm }
---
apiVersion: ate.dev/v1alpha1
kind: WorkerPool
metadata:
  name: helpdesk-microvm
  namespace: ate-demo-helpdesk-microvm
  labels: { workload: helpdesk-microvm }
spec:
  replicas: 2
  sandboxClass: microvm
  sandboxConfigName: counter-microvm # reuse Part A's cluster-scoped SandboxConfig (kata assets)
  ateomImage: ${ATEOM} # reuse the ateom-microvm image built in A2 (has e2fsprogs)
---
apiVersion: ate.dev/v1alpha1
kind: ActorTemplate
metadata:
  name: helpdesk-microvm
  namespace: ate-demo-helpdesk-microvm
spec:
  sandboxClass: microvm
  pauseImage: registry.k8s.io/pause:3.10.2@sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4
  containers:
  - name: supervisor
    image: ${HELP}
    command: ["/usr/local/bin/openshell-sandbox","--policy-rules","/etc/openshell/policy.rego","--policy-data","/etc/openshell/data.yaml","--inference-routes","/etc/openshell/routes.yaml","--log-level","info","--","python3","/opt/helpdesk/agent.py"]
    env:
    - { name: OPENSHELL_SKIP_BOOTSTRAP, value: "all" }
    - { name: OPENSHELL_INFERENCE_API_KEY, value: "${KEY}" } # consumed by routes.yaml api_key_env
  workerSelector:
    matchLabels: { workload: helpdesk-microvm }
  snapshotsConfig:
    location: gs://ate-snapshots/ate-demo-helpdesk-microvm/
YAML
kubectl wait --for=condition=Ready actortemplate/helpdesk-microvm -n ate-demo-helpdesk-microvm --timeout=600s

Golden boot of the ~3 GB image takes ~75s (builds the ext4 rootfs + boots CLH + checkpoints).
export PATH=$PATH:$(go env GOPATH)/bin
H=hd-user-1.actors.resources.substrate.ate.dev
kubectl ate create actor hd-user-1 --template ate-demo-helpdesk-microvm/helpdesk-microvm
kubectl port-forward -n ate-system svc/atenet-router 8000:80 >/tmp/pf.log 2>&1 &
# First request kicks the resume; it's ASYNC (~80s for 3 GB) — poll for RUNNING, then read.
curl -s -m 5 -H "Host: $H" http://localhost:8000/status # triggers resume (may return empty)
until [ "$(kubectl ate get actor hd-user-1 | awk 'NR==2{print $4}')" = STATUS_RUNNING ]; do sleep 3; done
curl -s -H "Host: $H" http://localhost:8000/status # {"history_turns": 0, "uptime_seconds": ..., "model": "gpt-oss:20b-cloud"}
curl -s -XPOST -H "Host: $H" http://localhost:8000/chat -d '{"message":"say hello in three words"}' # {"reply": "...", "history_turns": 1} ← real LLM completion
kubectl ate suspend actor hd-user-1 # -> STATUS_SUSPENDED (only when RUNNING)
curl -s -m 5 -H "Host: $H" http://localhost:8000/status # kick resume
until [ "$(kubectl ate get actor hd-user-1 | awk 'NR==2{print $4}')" = STATUS_RUNNING ]; do sleep 3; done
curl -s -H "Host: $H" http://localhost:8000/status # uptime_seconds CONTINUED ← VM memory snapshot round-tripped
curl -s -XPOST -H "Host: $H" http://localhost:8000/chat -d '{"message":"name one primary color"}' # reply + history_turns keeps climbing ← chat state survived too

Both /status and /chat work; suspend/resume preserves the in-VM chat history (history_turns keeps climbing).
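
To check "keeps climbing" mechanically rather than by eye, the history_turns value can be pulled out of each /chat reply and compared across the suspend. A minimal sketch, assuming replies have the flat `"key": number` shape shown in the comments above; `json_int` is a hypothetical helper and a sed pattern, not a real JSON parser. The two payloads below are placeholders in that shape, not captured output.

```shell
# Hypothetical helper: extract an integer field from a flat JSON object on stdin.
# Assumes the "key": number layout shown in the runbook's comments.
json_int() {
  sed -n "s/.*\"$1\": *\([0-9][0-9]*\).*/\1/p"
}

# Placeholder payloads; in practice these would be the two /chat responses.
before='{"reply": "...", "history_turns": 1}'
after='{"reply": "...", "history_turns": 2}'

b=$(printf '%s\n' "$before" | json_int history_turns)
a=$(printf '%s\n' "$after" | json_int history_turns)
if [ "$a" -gt "$b" ]; then
  echo "state survived: $b -> $a"
else
  echo "history_turns did not increase: $b -> $a" >&2
fi
```

In a live run, `before` and `after` would be set from `$(curl -s -XPOST ... /chat -d ...)` on either side of the suspend.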
cd ~/go/src/github.com/agent-substrate/substrate
export PATH=$PATH:$(go env GOPATH)/bin
for a in my-counter-1 hd-user-1; do # kubectl-ate resolves actors by ID
kubectl ate suspend actor "$a" 2>/dev/null; sleep 4
kubectl ate delete actor "$a" 2>/dev/null
done
kill %1 2>/dev/null # stop the port-forward
hack/delete-kind-cluster.sh || true # deletes the cluster + kind-registry
docker rm -f kind-control-plane kind-registry 2>/dev/null || true # belt-and-suspenders (a non-interactive delete can leave these)
rm -rf /var/lib/ateom-* /run/ateom-* /run/vc

delete-kind-cluster.sh removes everything regardless, so the actor steps are just tidy-up.
Verify: docker ps -a | grep kind (none), ls /var/lib/ateom-* (none).
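
The filesystem half of that verification can be scripted so that silence means clean. A minimal sketch, assuming a POSIX shell; `leftovers` is a hypothetical helper, and the paths are the ones removed by the rm -rf above.

```shell
# Hypothetical helper: print every argument that still exists on disk.
# An unmatched glob stays literal, fails the -e test, and prints nothing.
leftovers() {
  for p in "$@"; do
    if [ -e "$p" ]; then printf '%s\n' "$p"; fi
  done
}

found=$(leftovers /var/lib/ateom-* /run/ateom-* /run/vc)
if [ -n "$found" ]; then
  echo "still present:"
  echo "$found"
else
  echo "no ateom state left"
fi
```

The docker check (docker ps -a | grep kind) is left as the manual command above.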
- shorewall firewall (NVIDIA hosts). If active, it drops the docker bridge and the kube API hangs (TLS handshake timeout). One-time fix (already masked on bigbox):
  systemctl mask --now shorewall; shorewall clear; iptables -P INPUT ACCEPT; iptables -P FORWARD ACCEPT; iptables -P OUTPUT ACCEPT; iptables -F FORWARD; systemctl restart docker
- Run order: run-microvm-demo-kind.sh stages to rustfs before installing; do A1 before A2.
- aws CLI is an undocumented prereq of asset staging; protoc is an undocumented prereq of the supervisor build.
- Nothing else may hold :9000: a stray minio/registry there makes the rustfs port-forward fail with a misleading InvalidAccessKeyId.
- kubectl ate delete actor needs STATUS_SUSPENDED: suspend first (or just delete the cluster).
- Large workload images (helpdesk ~3 GB) used to get mkfs.ext4 SIGKILLed on resume. Fixed (deadline-detached rootfs build at 3 sites + bumped resume budgets) and carried by the fix/microvm-guest-dns branch; counter is tiny and unaffected.
- Micro-VM resume is async (~80s first time for 3 GB). The first request triggers resume and may return empty; poll kubectl ate get actor for STATUS_RUNNING, then read. Don't suspend mid-resume (Aborted: another operation in progress).
- /chat egress (DNS + key): fixed, not a gap. Root cause was NOT the proxy (it works under skip=all): the guest had interface + route + NAT but no /etc/resolv.conf, so the supervisor's proxy could reach IPs but not resolve ollama.com. The fix/microvm-guest-dns commit writes the worker pod's resolv.conf into the guest rootfs (golden + restore paths), exactly like the kata shim's getDNS. The real key is threaded via env: routes.yaml api_key_env: OPENSHELL_INFERENCE_API_KEY + the ActorTemplate env rendered from ~/.config/ollama/key. No secret is baked into the image, but the key does land in the Actor CR; a k8s Secret would need ActorTemplate valueFrom, a future enhancement.
- One workload container per micro-VM actor (pause excluded); helpdesk (pause + supervisor) fits.
- ko prints "git is in a dirty state … ?? ateom-microvm"; this is cosmetic and only affects version stamping.
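
The async-resume gotcha above relies on an until loop that never gives up. A bounded variant is sketched here, assuming a POSIX shell; `wait_for_status` and `actor_status` are hypothetical names, and the 300-second budget in the example call is an assumption, not a measured limit. The status lookup is copied from the runbook's own loop.

```shell
# Hypothetical helper: poll a command until it prints the wanted value,
# giving up after a deadline instead of looping forever.
# usage: wait_for_status WANT TIMEOUT_SECONDS INTERVAL_SECONDS CMD [ARGS...]
wait_for_status() {
  want=$1
  timeout=$2
  interval=$3
  shift 3
  elapsed=0
  while [ "$("$@")" != "$want" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s waiting for $want" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Status lookup as used in B3 (row 2, column 4 of the kubectl ate output).
actor_status() {
  kubectl ate get actor "$1" | awk 'NR==2{print $4}'
}

# Example call (not executed here):
# wait_for_status STATUS_RUNNING 300 3 actor_status hd-user-1
```

A non-zero return lets the caller stop before issuing a suspend against an actor that is still mid-resume.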