@kacy
Created March 10, 2026 22:29

yoq — GPU Infrastructure Without the Kubernetes Tax


The Problem

Kubernetes was designed for Google-scale web services in 2013. Slurm dates back to 2003. Both are built on antiquated technology stacks that predate modern kernel capabilities — io_uring, eBPF, cgroups v2, idmapped mounts — none of which they can take advantage of without layers of bolt-on complexity.

Most teams don't run Google. But they're stuck with Google's decade-old tooling.

  • 15+ components to run GPU workloads on K8s
  • Days to set up a production GPU cluster
  • 2-3 full-time engineers just to maintain it

The Kubernetes GPU stack: kubelet + kube-proxy + etcd + CNI + GPU Operator + device plugin + KAI Scheduler + RDMA plugin + Multus + cert-manager + ...

Every team running AI workloads faces the same impossible choice: Kubernetes or Slurm. K8s needs 15+ components just to schedule a GPU. Slurm dates to 2003: no native container runtime, no secrets management, no built-in TLS. Both require a dedicated platform team. That's 15-20% of a small company's headcount just babysitting infrastructure.


What yoq Does Today

This isn't a pitch for something we're going to build. This is built.

Written from scratch in Zig — a modern systems language that compiles to a single static binary with zero runtime dependencies. Zig gives us direct access to modern Linux kernel interfaces (io_uring, eBPF, cgroups v2, idmapped mounts) without the layers of abstraction that Go and C++ impose. The result: native kernel integration that Kubernetes, as architected, cannot match.

| Metric        | Value  |
| ------------- | ------ |
| Lines of Zig  | 55,000 |
| Tests passing | 1,035  |
| Binary size   | <15 MB |
| Dependencies  | 0      |

Capabilities:

  • Full container runtime — namespaces, cgroups v2, overlayfs, seccomp
  • OCI image pull/push/build — Dockerfile + TOML format
  • eBPF networking — DNS, load balancing, network policy (no kube-proxy, no iptables)
  • io_uring async I/O — zero-copy networking, native kernel event loop
  • Raft clustering — consensus, SWIM gossip, WireGuard mesh
  • Encrypted secrets — XChaCha20-Poly1305, rotation
  • TLS termination — ACME/Let's Encrypt, auto-renewal
  • Rolling deploys — health checks, automatic rollback
  • Security audited — all critical/high issues resolved
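
Several of these capabilities surface directly in the manifest. Below is a hedged sketch of how they might combine for a single service: the `health_check` and `tls` fields appear in yoq's own examples, but the `update` stanza and its field names are illustrative assumptions, not confirmed API.

```toml
# Illustrative fragment. `health_check` and `tls` match the shipped examples;
# the `update` table is a hypothetical name for the rolling-deploy knobs.
[service.api]
image = "myapp/api:latest"
env = ["API_TOKEN=${API_TOKEN}"]   # interpolated from the encrypted secrets store
health_check = { http = { path = "/health", port = 8080 } }
tls = { domain = "api.example.com", acme = true }   # ACME cert with auto-renewal
update = { strategy = "rolling", max_unavailable = 1, rollback_on_failure = true }
```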

Deploy in 4 Commands

```console
$ scp yoq node-01:/usr/local/bin/
$ ssh node-01 "yoq serve --init"
  server running on :7700, cluster token: ak7f...x2p

$ ssh node-02 "yoq join node-01:7700"
  joined cluster, node_id=2, overlay=10.40.0.2

$ yoq up manifest.toml
  deploying 3 services...
  ✓ db   running  10.42.1.2:5432
  ✓ api  running  10.42.2.3:8080
  ✓ web  running  10.42.1.4:3000  → :443 (TLS)
```

The manifest.toml behind those three services:

```toml
[service.db]
image = "postgres:16"
env = ["POSTGRES_PASSWORD=${DB_PASS}"]
volumes = ["data:/var/lib/postgresql/data"]

[service.api]
image = "myapp/api:latest"
depends_on = ["db"]
health_check = { http = { path = "/health", port = 8080 } }

[service.web]
image = "myapp/web:latest"
depends_on = ["api"]
tls = { domain = "app.example.com", acme = true }
```

The Business: GPU Mesh

Same simplicity. Applied to the fastest-growing infrastructure market.

Kubernetes + GPU Stack: GPU Operator, NVIDIA Device Plugin, KAI Scheduler, RDMA Device Plugin, Multus CNI, DCGM Exporter, Network Operator, Custom NCCL configs

yoq:

```toml
[service.training]
image = "pytorch-dist:latest"
replicas = 100

[service.training.gpu]
count = 1
model = "H100"                                # schedule onto H100 nodes
mesh = { enabled = true, backend = "nccl" }   # NCCL mesh across replicas

[service.training.checkpoint]
path = "/mnt/storage/checkpoints"
interval = "30m"
```

GPU detection, InfiniBand RDMA, NCCL topology, gang scheduling, checkpointing, fault recovery. All in the binary. All from the manifest.
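
To make the recovery story concrete, here is a hedged extension of the checkpoint stanza above. Only `path` and `interval` appear in the real example; `restore` and `max_restarts` are illustrative names for the claimed fault-recovery behavior, not confirmed manifest fields.

```toml
[service.training.checkpoint]
path = "/mnt/storage/checkpoints"
interval = "30m"       # snapshot every 30 minutes
restore = "latest"     # hypothetical: resume the gang from the newest snapshot after a node failure
max_restarts = 3       # hypothetical: stop rescheduling after repeated failures
```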


The Market

$7-8B container orchestration market, growing 30%+ YoY

Two wedges:

  • Broad — Kubernetes replacement for 10-500 node teams
  • Deep — GPU orchestration for AI training & inference

Who needs this:

  • AI/ML — Teams training on 50-500 GPUs
  • SaaS — Companies overpaying for managed K8s
  • On-prem — Regulated industries (finance, defense, healthcare)
  • Edge — GPU inference at the edge, no cloud dependency

The "overserved by K8s" segment is 20-30% of the total market ($1.4-2.4B). Capturing 1-2% of that segment = $15-30M ARR.


Competition

|                   | yoq             | Kubernetes           | Slurm          | Nomad               |
| ----------------- | --------------- | -------------------- | -------------- | ------------------- |
| Setup time        | Minutes         | Days                 | Hours          | Hours               |
| Components        | 1 binary        | 15+                  | 2 daemons      | 3 (+ Consul + Vault)|
| GPU scheduling    | Built-in        | 3 add-ons            | Native         | Basic plugin        |
| InfiniBand        | Built-in        | RDMA plugin + Multus | Native         | None                |
| Service discovery | eBPF (built-in) | CoreDNS              | None           | Requires Consul     |
| TLS + Secrets     | Built-in        | 2 add-ons            | None           | Requires Vault      |
| Fault recovery    | Auto checkpoint | Pod eviction         | Manual requeue | Restart only        |

yoq is the only self-contained GPU orchestrator that doesn't need another orchestrator underneath it.


Why Me

Kacy — Founder & CEO

  • Google Cloud (Current) — Lead, Cloud Alerting & Cloud Notifications. Owns the alerting and notification systems for all of Google Cloud Platform.
  • Google Distributed Cloud — One of the Engineering Leads. Built and shipped private cloud infrastructure for enterprises. Billion-dollar deals. Saw firsthand what happens when you architect cloud infrastructure poorly — and what it costs to fix.
  • Fitbit — Owned cloud infrastructure for several years, operating Fitbit's Kubernetes clusters at scale. Lived the pain of K8s operations from the operator side: the very pain yoq eliminates.
  • Why yoq — After years of watching teams drown in Kubernetes complexity — both as a builder and an operator — I decided to build the infrastructure I wish existed. 55,000 lines of working code is the proof.

Business Model

Open Source — Full orchestrator. Runtime, networking, clustering, GPU mesh. Free forever.

Enterprise — Multi-cluster federation, audit logging, SSO/RBAC, SLAs. $500-2K/node/year.

Cloud — Managed yoq clusters. One-click GPU training infra.

Comparable outcomes:

  • HashiCorp (Nomad) — Acquired by IBM for $6.4B — same OSS-core model
  • CoreWeave — $35B valuation running GPU infrastructure
  • Replicated — $1B+ valuation, enterprise K8s tooling

The Ask

$XM Seed — 6 months to v1.0. First paying customers.

Use of funds:

  • Engineering — 2 systems engineers (Zig/Linux/eBPF)
  • Infrastructure — 500-node scale validation
  • Security — GPU isolation audit
  • Customers — 5-10 design partners

Milestones:

  • Month 3 — GPU mesh working, design partner agreements
  • Month 6 — v1.0 shipped, 500-node validated
  • Month 9 — First enterprise contracts
  • Year 1 — $500K-1M ARR from 5-10 customers

Closing

The GPU infrastructure market is massive, chaotic, and hungry for simplicity.

We're building the obvious answer for the 90% of teams that don't need Kubernetes.

55,000 lines of working code. Zero dependencies. Ship it with scp.
