Kubernetes was designed for Google-scale web services in 2013. Slurm dates back to 2003. Both are built on antiquated technology stacks that predate modern kernel capabilities — io_uring, eBPF, cgroups v2, idmapped mounts — none of which they can take advantage of without layers of bolt-on complexity.
Most teams aren't running Google-scale workloads. But they're stuck with Google's decade-old tooling.
- 15+ components to run GPU workloads on K8s
- Days to set up a production GPU cluster
- 2-3 full-time engineers just to maintain it
The Kubernetes GPU stack: kubelet + kube-proxy + etcd + CNI + GPU Operator + device plugin + KAI Scheduler + RDMA plugin + Multus + cert-manager + ...
Every team running AI workloads faces the same impossible choice: Kubernetes or Slurm. K8s needs 15 components just to schedule a GPU. Slurm is from 2003 — no containers, no secrets, no TLS. Both require a dedicated platform team. That's 15-20% of a small company's headcount just babysitting infrastructure.
This isn't a pitch for something we're going to build. This is built.
yoq is written from scratch in Zig — a modern systems language that compiles to a single static binary with zero runtime dependencies. Zig gives us direct access to modern Linux kernel interfaces (io_uring, eBPF, cgroups v2, idmapped mounts) without the layers of abstraction that Go and C++ impose. The result: native kernel integration that Kubernetes architecturally cannot achieve.
| Metric | Value |
|---|---|
| Lines of Zig | 55,000 |
| Tests passing | 1,035 |
| Binary size | <15MB |
| Dependencies | 0 |
Capabilities:
- Full container runtime — namespaces, cgroups v2, overlayfs, seccomp
- OCI image pull/push/build — Dockerfile + TOML format
- eBPF networking — DNS, load balancing, network policy (no kube-proxy, no iptables)
- io_uring async I/O — zero-copy networking, native kernel event loop
- Raft clustering — consensus, SWIM gossip, WireGuard mesh
- Encrypted secrets — XChaCha20-Poly1305, rotation
- TLS termination — ACME/Let's Encrypt, auto-renewal
- Rolling deploys — health checks, automatic rollback
- Security audited — all critical/high issues resolved
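The runtime capabilities above lean on two kernel facilities in particular: cgroup v2 (unified hierarchy) and io_uring. A quick way to check whether a host exposes both — a generic Linux probe sketch, not part of yoq itself:

```python
# Probe for the kernel facilities the capabilities list relies on.
# Generic Linux checks; nothing here is yoq-specific.
import os
import platform

def has_cgroup_v2() -> bool:
    # A cgroup v2 mount exposes cgroup.controllers at the hierarchy root.
    return os.path.exists("/sys/fs/cgroup/cgroup.controllers")

def kernel_at_least(major: int, minor: int) -> bool:
    # platform.release() looks like "6.8.0-49-generic"; compare major.minor.
    rel = platform.release().split("-")[0].split(".")
    return (int(rel[0]), int(rel[1])) >= (major, minor)

print("cgroup v2 mounted:", has_cgroup_v2())
print("io_uring-capable kernel (>= 5.1):", kernel_at_least(5, 1))
```

io_uring landed in kernel 5.1 and the unified cgroup hierarchy is the default on most post-2020 distributions, so on a current host both checks typically pass.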
$ scp yoq node-01:/usr/local/bin/
$ ssh node-01 "yoq serve --init"
server running on :7700, cluster token: ak7f...x2p
$ ssh node-02 "yoq join node-01:7700"
joined cluster, node_id=2, overlay=10.40.0.2
$ yoq up manifest.toml
deploying 3 services...
✓ db running 10.42.1.2:5432
✓ api running 10.42.2.3:8080
✓ web running 10.42.1.4:3000 → :443 (TLS)
[service.db]
image = "postgres:16"
env = ["POSTGRES_PASSWORD=${DB_PASS}"]
volumes = ["data:/var/lib/postgresql/data"]
[service.api]
image = "myapp/api:latest"
depends_on = ["db"]
health_check = { http = { path = "/health", port = 8080 } }
[service.web]
image = "myapp/web:latest"
depends_on = ["api"]
tls = { domain = "app.example.com", acme = true }

Same simplicity. Applied to the fastest-growing infrastructure market.
Kubernetes + GPU Stack: GPU Operator, NVIDIA Device Plugin, KAI Scheduler, RDMA Device Plugin, Multus CNI, DCGM Exporter, Network Operator, Custom NCCL configs
yoq:
[service.training]
image = "pytorch-dist:latest"
replicas = 100
gpu = { count = 1, model = "H100" }
gpu.mesh = { enabled = true, backend = "nccl" }
[service.training.checkpoint]
path = "/mnt/storage/checkpoints"
interval = "30m"

GPU detection, InfiniBand RDMA, NCCL topology, gang scheduling, checkpointing, fault recovery. All in the binary. All from the manifest.
$7-8B container orchestration market, growing 30%+ YoY
Two wedges:
- Broad — Kubernetes replacement for 10-500 node teams
- Deep — GPU orchestration for AI training & inference
Who needs this:
- AI/ML — Teams training on 50-500 GPUs
- SaaS — Companies overpaying for managed K8s
- On-prem — Regulated industries (finance, defense, healthcare)
- Edge — GPU inference at the edge, no cloud dependency
The "overserved by K8s" segment is 20-30% of total market. Capturing 1-2% = $15-30M ARR.
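The sizing above checks out as back-of-envelope arithmetic, taking the stated ranges at face value:

```python
# Back-of-envelope check of the market sizing, using only the
# ranges stated in the text.
TAM = (7e9, 8e9)         # container orchestration market, $/year
SEGMENT = (0.20, 0.30)   # share of market "overserved by K8s"
CAPTURE = (0.01, 0.02)   # targeted capture rate

segment_low = TAM[0] * SEGMENT[0]     # $1.4B
segment_high = TAM[1] * SEGMENT[1]    # $2.4B
arr_low = segment_low * CAPTURE[0]    # $14M
arr_high = segment_high * CAPTURE[1]  # $48M

print(f"segment: ${segment_low/1e9:.1f}B - ${segment_high/1e9:.1f}B")
print(f"ARR at 1-2% capture: ${arr_low/1e6:.0f}M - ${arr_high/1e6:.0f}M")
```

The quoted $15-30M ARR sits at the conservative end of the computed $14-48M band.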
| | yoq | Kubernetes | Slurm | Nomad |
|---|---|---|---|---|
| Setup time | Minutes | Days | Hours | Hours |
| Components | 1 binary | 15+ | 2 daemons | 3 (+ Consul + Vault) |
| GPU scheduling | Built-in | 3 add-ons | Native | Basic plugin |
| InfiniBand | Built-in | RDMA plugin + Multus | Native | None |
| Service discovery | eBPF (built-in) | CoreDNS | None | Requires Consul |
| TLS + Secrets | Built-in | 2 add-ons | None | Requires Vault |
| Fault recovery | Auto checkpoint | Pod eviction | Manual requeue | Restart only |
yoq is the only self-contained GPU orchestrator that doesn't need another orchestrator underneath it.
Kacy — Founder & CEO
- Google Cloud (Current) — Lead, Cloud Alerting & Cloud Notifications. Owns the alerting and notification systems for all of Google Cloud Platform.
- Google Distributed Cloud — One of the Engineering Leads. Built and shipped private cloud infrastructure for enterprises. Billion-dollar deals. Saw firsthand what happens when you architect cloud infrastructure poorly — and what it costs to fix.
- Fitbit — Owned cloud infrastructure for several years, running Fitbit's Kubernetes clusters at scale. Lived the pain of K8s operations from the operator side — the very pain yoq eliminates.
- Why yoq — After years of watching teams drown in Kubernetes complexity — both as a builder and an operator — I decided to build the infrastructure I wish existed. 55,000 lines of working code is the proof.
Open Source — Full orchestrator. Runtime, networking, clustering, GPU mesh. Free forever.
Enterprise — Multi-cluster federation, audit logging, SSO/RBAC, SLAs. $500-2K/node/year.
Cloud — Managed yoq clusters. One-click GPU training infra.
Comparable outcomes:
- HashiCorp (Nomad) — Acquired by IBM for $6.4B — same OSS-core model
- CoreWeave — $35B valuation running GPU infrastructure
- Replicated — $1B+ valuation, enterprise K8s tooling
Use of funds:
- Engineering — 2 systems engineers (Zig/Linux/eBPF)
- Infrastructure — 500-node scale validation
- Security — GPU isolation audit
- Customers — 5-10 design partners
Milestones:
- Month 3 — GPU mesh working, design partner agreements
- Month 6 — v1.0 shipped, 500-node validated
- Month 9 — First enterprise contracts
- Year 1 — $500K-1M ARR from 5-10 customers
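The Year 1 revenue goal is consistent with the enterprise pricing above. The ~100-node average cluster size below is an assumption added for illustration, not a figure from the plan:

```python
# Sanity check: Year 1 ARR goal vs. enterprise pricing.
# avg_nodes is a HYPOTHETICAL average cluster size, not stated in the plan.
price_per_node = (500, 2000)   # $/node/year, from the pricing section
customers = (5, 10)            # Year 1 customer target
avg_nodes = 100                # assumed average nodes per customer

arr_low = customers[0] * avg_nodes * price_per_node[0]   # $250K
arr_high = customers[1] * avg_nodes * price_per_node[1]  # $2,000K

print(f"ARR range: ${arr_low/1e3:.0f}K - ${arr_high/1e3:.0f}K")
```

Under that assumption, the stated $500K-1M goal lands comfortably inside the $250K-$2M band.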
The GPU infrastructure market is massive, chaotic, and hungry for simplicity.
We're building the obvious answer for the 90% of teams that don't need Kubernetes.
55,000 lines of working code. Zero dependencies. Ship it with scp.