@dims
Last active March 16, 2026 22:20
CNCF K8s AI Conformance Analysis - 2026-03-16 (SHA: 223d15f)

CNCF Kubernetes AI Conformance - Full Analysis

Date: 2026-03-16
Repository: github.com/cncf/k8s-ai-conformance
Commit SHA: 223d15f97434ea478f1440d73901435d16503682
Branch: main


Table of Contents

  1. Overview
  2. Submission Matrix by Version
  3. Software Stack Matrix
  4. Open Source Project Usage Matrix
  5. Open Source Project Reference
  6. NVIDIA Ecosystem Summary
  7. Key Findings

Overview

The CNCF Kubernetes AI Conformance program certifies that Kubernetes platforms can reliably run AI/ML workloads (training, inference, agentic). It is currently a self-assessment process, with automated tests planned for 2026. Certifications are issued per Kubernetes version and are valid for one year. Existing Kubernetes conformance is a prerequisite.

Conformance categories (MUST level):

  1. Accelerators - DRA support (SHOULD in v1.33, MUST in v1.34+)
  2. Networking - Gateway API for AI inference
  3. Scheduling - Gang scheduling, cluster autoscaling, pod autoscaling (HPA)
  4. Observability - Accelerator metrics, AI service metrics
  5. Security - Secure accelerator access / isolation
  6. Operator - Robust AI controller/CRD (e.g., Ray, Kubeflow)

v1.35 additions (SHOULD level): driver_runtime_management, gpu_sharing, virtualized_accelerator

Current submissions: 11 (v1.33) + 15 (v1.34) + 2 (v1.35) = 28 total
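
The per-version counts can be re-derived from a checkout of the repository. The sketch below assumes each submission lives at `v1.<minor>/<directory>/PRODUCT.yaml`, which is the layout implied by the commands in the metadata section of this analysis; the function name `count_by_version` is made up for illustration.

```shell
# Count submissions per Kubernetes version from the repository root.
# Assumption: one PRODUCT.yaml per submission, at v1.<minor>/<directory>/PRODUCT.yaml.
count_by_version() {
  find v1.* -name PRODUCT.yaml -type f \
    | cut -d/ -f1 \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ printf "%s %d\n", $2, $1; total += $1 }
           END { printf "total %d\n", total }'
}
```

Run from the repository root as `count_by_version`; it prints one `<version> <count>` line per version directory followed by a `total` line.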


Submission Matrix by Version

v1.33 (11 submissions)

| # | Directory | Vendor | Product | Platform Version | K8s Version | Cloud/On-Prem |
|---|-----------|--------|---------|------------------|-------------|---------------|
| 1 | chinaunicom-csk | Chinaunicom Cloud | CSK | v1.33 | v1.33 | Cloud |
| 2 | cks | CoreWeave | CoreWeave Kubernetes Service | v1.33 | v1.33 | Cloud (GPU) |
| 3 | daocloud | DaoCloud | DaoCloud Enterprise | v5.0 | v1.33 | On-Prem |
| 4 | gardener | NeoNephos Foundation | Gardener | v1.130.0 | v1.33 | Multi-Cloud |
| 5 | giantswarm | Giant Swarm | Giant Swarm Platform | 1.33.0 | v1.33 | AWS |
| 6 | jdcloud | JD Cloud | JCS for Kubernetes | v1.33.3 | v1.33 | Cloud |
| 7 | jdos | JD.com | JDOS | v3.0 | v1.33 | On-Prem |
| 8 | openshift | Red Hat | OpenShift Container Platform | 4.20 | v1.33 | Hybrid |
| 9 | palette | Spectro Cloud | Spectro Cloud Palette | 4.8.x | v1.33 | AWS |
| 10 | rke2 | SUSE | RKE2 | v1.33 | v1.33 | Any |
| 11 | talos | Sidero Labs | Talos Linux | 1.11.3 | v1.33 | Bare Metal |

v1.34 (15 submissions)

| # | Directory | Vendor | Product | Platform Version | K8s Version | Cloud/On-Prem |
|---|-----------|--------|---------|------------------|-------------|---------------|
| 1 | ack | Alibaba Cloud | ACK | 1.34.1-aliyun.1 | v1.34 | Cloud |
| 2 | aks | Microsoft | Azure Kubernetes Service | v1.34 | v1.34 | Cloud |
| 3 | baidu_cce | Baidu Cloud | CCE | 1.34 | v1.34 | Cloud |
| 4 | cks | CoreWeave | CoreWeave Kubernetes Service | v1.34 | v1.34 | Cloud (GPU) |
| 5 | eks | AWS | Amazon EKS | 1.34.1-eks.4 | v1.34 | Cloud |
| 6 | gardener | NeoNephos Foundation | Gardener | v1.134.2 | v1.34 | Multi-Cloud |
| 7 | gke | Google | Google Kubernetes Engine | 1.34.0-gke.1662000 | v1.34 | Cloud |
| 8 | kubermatic | Kubermatic | Kubermatic Kubernetes Platform | v2.29 | v1.34 | Multi-Cloud |
| 9 | lke | Akamai | Linode Kubernetes Engine | v1.34 | v1.34 | Cloud |
| 10 | OKE | Oracle | OCI Kubernetes Engine | v1.34 | v1.34 | Cloud |
| 11 | openshift | Red Hat | OpenShift Container Platform | 4.21 | v1.34 | Hybrid |
| 12 | ovh | OVHcloud | OVHcloud Managed Kubernetes | 1.0 | v1.34 | Cloud |
| 13 | rke2 | SUSE | RKE2 | v1.34 | v1.34 | Any |
| 14 | talos | Sidero Labs | Talos Linux | 1.11.3 | v1.34 | Bare Metal |
| 15 | vks | Broadcom | vSphere Kubernetes Service | v3.5.0 | v1.34 | On-Prem (VMware) |

v1.35 (2 submissions)

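| # | Directory | Vendor | Product | Platform Version | K8s Version | Cloud/On-Prem |
|---|-----------|--------|---------|------------------|-------------|---------------|
| 1 | cks | CoreWeave | CoreWeave Kubernetes Service | v1.35 | v1.35 | Cloud (GPU) |
| 2 | gke | Google | Google Kubernetes Engine | 1.35.0-gke.2232000 | v1.35 | Cloud |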

Cross-Version Presence

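| Vendor | Product | v1.33 | v1.34 | v1.35 |
|--------|---------|-------|-------|-------|
| CoreWeave | CKS | X | X | X |
| Sidero Labs | Talos Linux | X | X | |
| NeoNephos Foundation | Gardener | X | X | |
| Red Hat | OpenShift | X | X | |
| SUSE | RKE2 | X | X | |
| Google | GKE | | X | X |
| Alibaba Cloud | ACK | | X | |
| Microsoft | AKS | | X | |
| AWS | EKS | | X | |
| Oracle | OKE | | X | |
| Akamai | LKE | | X | |
| OVHcloud | MKS | | X | |
| Kubermatic | KKP | | X | |
| Broadcom | VKS | | X | |
| Baidu Cloud | CCE | | X | |
| Chinaunicom Cloud | CSK | X | | |
| DaoCloud | DaoCloud Enterprise | X | | |
| Giant Swarm | Giant Swarm Platform | X | | |
| JD Cloud | JCS for Kubernetes | X | | |
| JD.com | JDOS | X | | |
| Spectro Cloud | Palette | X | | |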
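
Cross-version presence can also be checked mechanically. This is a rough sketch that counts how many version directories contain each submission directory, again assuming the `v1.<minor>/<directory>/PRODUCT.yaml` layout. It matches on directory name only, so a submission whose directory was renamed between versions would be missed; `multi_version_dirs` is an illustrative name.

```shell
# List submission directories that appear under more than one version,
# most versions first. Output format: "<versions> <directory>".
multi_version_dirs() {
  find v1.* -name PRODUCT.yaml -type f \
    | awk -F/ '{ n[$2]++ }
               END { for (d in n) if (n[d] > 1) print n[d], d }' \
    | LC_ALL=C sort -k1,1nr -k2,2
}
```

Run from the repository root as `multi_version_dirs`. Directories are a proxy for products here, not for vendors, so the result should be read alongside the Vendor column above.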

Software Stack Matrix

Gang Scheduling Solutions

| Solution | v1.33 Users | v1.34 Users | v1.35 Users | Total Stacks |
|----------|-------------|-------------|-------------|--------------|
| Kueue | gardener, giantswarm, jdos, daocloud, cks, openshift (RH build) | aks, cks, gardener, gke, kubermatic, OKE, openshift (RH build), talos | cks, gke | 18 |
| Volcano | jdcloud | ovh, eks | | 3 |
| Kai Scheduler | palette | eks | | 2 |
| SUNK/Slurm | cks | cks | cks | 3 |
| Yunikorn | | eks | | 1 |
| LeaderWorkerSet | | eks | | 1 |
| AWS Batch | | eks | | 1 |
| Not specified | chinaunicom-csk, rke2, talos | ack, baidu_cce, lke, rke2, vks | | 7 |

Gateway / Ingress Solutions

| Solution | v1.33 Users | v1.34 Users | v1.35 Users | Total |
|----------|-------------|-------------|-------------|-------|
| K-Gateway / GW API Inference Ext | cks | cks | cks, gke | 4 |
| Traefik | gardener, jdos, talos | gardener, talos | | 5 |
| Istio | daocloud | aks, OKE, vks | | 4 |
| KGateway / Kong | palette | | | 1 |
| GKE Gateway | | gke | gke | 2 |
| KubeLB | | kubermatic | | 1 |
| AWS ALB Controller | | eks | | 1 |
| JD Cloud API GW | jdcloud | | | 1 |
| Cilium | talos | talos | | 2 |
| Gateway API (impl unspecified) | chinaunicom-csk, giantswarm, openshift, rke2 | ack, baidu_cce, lke, openshift, rke2 | | 9 |

AI Operators / Frameworks

| Operator | v1.33 Users | v1.34 Users | v1.35 Users | Total |
|----------|-------------|-------------|-------------|-------|
| KubeRay | gardener (v1.3.0), giantswarm (v1.0.0), jdos (v1.3.0), palette, talos (v1.4.2) | gardener, ovh (v1.5.1), talos (v1.4.2) | | 8 |
| Kubeflow / Training Operator | openshift (Trainer V1), palette | aks, gke, kubermatic, OKE, openshift (Trainer V1) | gke | 8 |
| Ray (framework, not operator) | jdcloud | ack, cks, eks, gke, OKE | cks, gke | 8 |
| PyTorch Operator | jdcloud, openshift | openshift | | 3 |
| vLLM | | eks | | 1 |
| AIBrix | | eks | | 1 |
| NVIDIA Triton | | eks | | 1 |
| KAITO (MS AI Toolchain) | | aks | | 1 |
| DeepSpeed | openshift | openshift | | 2 |
| Not specified | chinaunicom-csk, cks, rke2 | baidu_cce, cks, lke, rke2, vks | cks | 8 |

GPU / Accelerator Stack

| Component | v1.33 Users | v1.34 Users | v1.35 Users | Total |
|-----------|-------------|-------------|-------------|-------|
| NVIDIA GPU Operator | daocloud, gardener, giantswarm (v1.0.1), openshift, palette (v25.3.4) | gardener, kubermatic, openshift, ovh | | 9 |
| NVIDIA DCGM Exporter | gardener, jdcloud, jdos, openshift | eks, gardener, openshift, ovh | | 8 |
| NVIDIA Device Plugin | talos (v0.14.5) | talos (v0.14.5) | | 2 |
| NVIDIA DRA Driver | giantswarm (v25.3.0) | ovh | | 2 |
| NVIDIA Container Toolkit | talos | talos | | 2 |
| AMD GPU Operator (ROCm) | openshift | openshift | | 2 |
| SUSE AI (integrated stack) | rke2 | rke2 | | 2 |
| OCI GPU Plugin | | OKE | | 1 |
| Managed/unspecified GPU stack | chinaunicom-csk, cks | ack, aks, baidu_cce, cks, eks, gke, lke, vks | cks, gke | 12 |

Cluster Autoscaling Solutions

| Solution | v1.33 Users | v1.34 Users | v1.35 Users | Total |
|----------|-------------|-------------|-------------|-------|
| Kubernetes Cluster Autoscaler | chinaunicom-csk, gardener, giantswarm, rke2 | gardener, OKE, ovh, rke2 | | 8 |
| Karpenter | giantswarm, palette | eks | | 3 |
| Platform-native autoscaler | jdcloud | ack, aks, baidu_cce, cks, gke, kubermatic, lke, openshift, vks | cks, gke | 12 |
| N/A (bare metal/on-prem) | daocloud, talos | talos | | 3 |

Pod Autoscaling (HPA) Solutions

| Solution | v1.33 Users | v1.34 Users | v1.35 Users | Total |
|----------|-------------|-------------|-------------|-------|
| KEDA | giantswarm (v3.1.0), openshift | aks, eks, openshift | | 5 |
| prometheus-adapter | gardener, jdos | gardener, ovh | | 4 |
| DCGM + custom metrics | gardener, jdos, palette | eks, gardener, ovh | | 6 |
| Neuron Monitor (AWS) | | eks | | 1 |
| CronHPA | jdcloud | | | 1 |
| Standard HPA / metrics-server | chinaunicom-csk, cks, daocloud, rke2, talos | ack, baidu_cce, cks, gke, kubermatic, lke, OKE, rke2, talos, vks | cks, gke | 16 |

Observability Stack

| Component | v1.33 Users | v1.34 Users | v1.35 Users | Total |
|-----------|-------------|-------------|-------------|-------|
| Prometheus | daocloud, gardener, jdcloud, jdos, openshift, palette | aks, eks, gardener, kubermatic, lke, OKE, openshift, ovh, vks | | 15 |
| Grafana | cks, palette | cks, eks, lke | cks | 6 |
| OpenTelemetry | daocloud, jdcloud | eks (ADOT) | | 3 |
| CloudWatch | | eks | | 1 |
| Azure Monitor | | aks | | 1 |
| GKE Observability | | gke | gke | 2 |

Container Runtime / CNI / OS (where specified)

Component Submissions
OS: Ubuntu 22.04 palette (v1.33)
OS: Amazon Linux 2023 eks (v1.34)
OS: Bottlerocket eks (v1.34)
OS: Talos Linux talos (v1.33, v1.34)
OS: RHCOS openshift (v1.33, v1.34)
OS: Photon OS vks (v1.34)
CNI: Calico palette (v3.30.3)
CNI: Cilium talos (v1.33, v1.34)
CNI: Canal (Calico+Flannel) rke2
Runtime: CRI-O openshift
Runtime: containerd rke2, talos, vks

Hardware (where specified)

GPU Model Submissions
NVIDIA A100 giantswarm/v1.33 (p4d.24xlarge), ovh/v1.34 (MIG ref)
NVIDIA A10G palette/v1.33
NVIDIA Tesla T4 giantswarm/v1.33, kubermatic/v1.34
NVIDIA Tesla V100 ovh/v1.34 (V100-PCIE-16GB)
NVIDIA Quadro P1000 talos/v1.33, talos/v1.34
Google TPU gke/v1.34, gke/v1.35
AWS Trainium eks/v1.34
AWS Inferentia eks/v1.34
AMD GPUs (ROCm) openshift/v1.33, openshift/v1.34

Open Source Project Usage Matrix

This matrix shows which open source projects are used across all 28 AI-conformant stacks.

Legend

  • M = Explicitly mentioned with version
  • X = Referenced/used (version not specified)
  • - = Not mentioned

v1.33 Submissions

Project chinaunicom-csk cks daocloud gardener giantswarm jdcloud jdos openshift palette rke2 talos
Kueue - X X M M - M X - - M
Volcano - - - - - X - - - - -
Kai Scheduler - - - - - - - - X - -
KubeRay - - - M M - M - X - M
Kubeflow - X - - - - - X X - X
Ray - - - X - X X - - - X
NVIDIA GPU Operator - - X X M - X X M - -
NVIDIA DCGM Exporter - - - X - X X X - - -
NVIDIA Device Plugin - - - - - - - - - - M
NVIDIA DRA Driver - - - - M - - - - - -
NVIDIA Container Toolkit - - - - - - - - - - X
AMD GPU Operator - - - - - - - X - - -
Prometheus - - X X - X X X X - -
Grafana - X - - - X - - X - -
OpenTelemetry - - X - - X - - - - -
Istio - - X - - - - - - - -
Traefik - - - X - - X - - - M
Cilium - - - - - - - - - - X
Calico - - - - - - - - M - -
Gateway API - X - X M - X X X - M
KEDA - - - - M - - X - - -
Karpenter - - - - X - - - X - -
K8s Cluster Autoscaler X - - X X - - - X X -
metrics-server - X - - - - - - - - -
prometheus-adapter - - X X - - X - - - -
Flux - - - - X - - - - - -
JobSet - - - - M - - - - - -
SUNK/Slurm - X - - - - - - - - -
DeepSpeed - - - - - - - X - - -
Sonobuoy - - X - X - - - - - -
SUSE AI - - - - - - - - - X -
KGateway/Kong - - - - - - - - X - -
PyTorch Operator - - - - - X - X - - -

v1.34 Submissions

Project ack aks baidu_cce cks eks gardener gke kubermatic lke OKE openshift ovh rke2 talos vks
Kueue - X - X - M X X - X X - - M -
Volcano - - - - X - - - - - - X - - -
Kai Scheduler - - - - X - - - - - - - - - -
KubeRay - - - - - X - - - - - M - M -
Kubeflow - X - X X - X X - X X - - - -
Ray X X - X X X X - - X - X - X -
NVIDIA GPU Operator - - - - - X - X - - X X - - -
NVIDIA DCGM Exporter - - - - X X - - - - X X - - -
NVIDIA Device Plugin - - - - - - - - - X - - - M -
NVIDIA DRA Driver - - - - - - - - - - - X - - -
NVIDIA Container Toolkit - - - - - - - - - - - - - X -
AMD GPU Operator - - - - - - - - - - X - - - -
Prometheus X X - - X X - X X X X X - - X
Grafana - - - X X - - - X - - - - - -
OpenTelemetry - - - - X - - - - - - - - - -
Istio - X - - - - - - - X - - - - X
Traefik - - - - - X - - - - - - - X -
Cilium - - - - - - - - - - - - - X -
Gateway API - X - X X X X - - X X - - M -
KEDA - X - - X - - - - - X - - - -
Karpenter - - - - X - - - - - - - - - -
K8s Cluster Autoscaler - - - - - X - X - X - X X - -
metrics-server - - - X - - - - - - - - - - -
prometheus-adapter - - - - - X - - - - - X - - -
vLLM - - - - X - - - - - - - - - -
AIBrix - - - - X - - - - - - - - - -
NVIDIA Triton - - - - X - - - - - - - - - -
KAITO - X - - - - - - - - - - - - -
KubeLB - - - - - - - X - - - - - - -
Yunikorn - - - - X - - - - - - - - - -
LeaderWorkerSet - - - - X - - - - - - - - - -
SUNK/Slurm - - - X - - - - - - - - - - -
DeepSpeed - - - - - - - - - - X - - - -
SUSE AI - - - - - - - - - - - - X - -
PyTorch Operator - - - - - - - - - - X - - - -

v1.35 Submissions

Project cks gke
Kueue X X
Kubeflow X X
Ray X X
K-Gateway X X
metrics-server X -
SUNK/Slurm X -
Grafana X -
Gateway API - X

Open Source Project Adoption Summary

Total unique stacks using each project across all 28 submissions:

# Project Stacks (of 28) % CNCF Status Category
1 Kueue 18 / 28 64% K8s SIG Scheduling
2 Gateway API 17 / 28 61% K8s SIG Networking
3 Prometheus 15 / 28 54% Graduated Observability
4 Ray 14 / 28 50% - AI Framework
5 Kubeflow 13 / 28 46% Incubating AI Platform
6 NVIDIA GPU Operator 9 / 28 32% - Accelerator
7 KubeRay 8 / 28 29% - AI Operator
8 NVIDIA DCGM Exporter 8 / 28 29% - Observability
9 K8s Cluster Autoscaler 8 / 28 29% K8s Core Autoscaling
10 Grafana 6 / 28 21% - Observability
11 KEDA 5 / 28 18% Graduated Autoscaling
12 Traefik 5 / 28 18% - Networking
13 Istio 4 / 28 14% Graduated Networking
14 K-Gateway (GW API Inf. Ext) 4 / 28 14% K8s SIG Networking
15 prometheus-adapter 4 / 28 14% K8s SIG Observability
16 SUNK/Slurm 3 / 28 11% - (Proprietary) Scheduling
17 Karpenter 3 / 28 11% K8s SIG Autoscaling
18 Volcano 3 / 28 11% Incubating Scheduling
19 OpenTelemetry 3 / 28 11% Incubating Observability
20 PyTorch Operator 3 / 28 11% - AI Operator
21 metrics-server 3 / 28 11% K8s SIG Observability
22 DeepSpeed 2 / 28 7% - AI Framework
23 Cilium 2 / 28 7% Graduated Networking
24 NVIDIA Device Plugin 2 / 28 7% - Accelerator
25 NVIDIA DRA Driver 2 / 28 7% - Accelerator
26 NVIDIA Container Toolkit 2 / 28 7% - Accelerator
27 AMD GPU Operator 2 / 28 7% - Accelerator
28 Kai Scheduler 2 / 28 7% Sandbox Scheduling
29 SUSE AI 2 / 28 7% - AI Platform
30 Sonobuoy 2 / 28 7% - Testing
31 Flux 1 / 28 4% Graduated GitOps
32 Calico 1 / 28 4% - Networking
33 JobSet 1 / 28 4% K8s SIG Scheduling
34 KubeLB 1 / 28 4% - Networking
35 Yunikorn 1 / 28 4% ASF Scheduling
36 LeaderWorkerSet 1 / 28 4% K8s SIG Scheduling
37 vLLM 1 / 28 4% - AI Inference
38 AIBrix 1 / 28 4% - AI Inference
39 NVIDIA Triton 1 / 28 4% - AI Inference
40 KAITO 1 / 28 4% - AI Operator
41 KGateway/Kong 1 / 28 4% - Networking

Open Source Project Reference

Project GitHub Repo License CNCF Status Description
Kueue kubernetes-sigs/kueue Apache-2.0 K8s SIG Kubernetes-native job queueing
Volcano volcano-sh/volcano Apache-2.0 Incubating Batch scheduling for K8s
Kai Scheduler kai-scheduler/KAI-Scheduler Apache-2.0 Sandbox GPU-optimized AI scheduler
KubeRay ray-project/kuberay Apache-2.0 - Ray on Kubernetes operator
Kubeflow kubeflow/kubeflow Apache-2.0 Incubating ML platform for K8s
Kubeflow Trainer kubeflow/training-operator Apache-2.0 Incubating Distributed training operator
Ray ray-project/ray Apache-2.0 - Distributed AI framework
vLLM vllm-project/vllm Apache-2.0 - LLM inference engine
AIBrix vllm-project/aibrix Apache-2.0 - GenAI inference components
DeepSpeed microsoft/DeepSpeed Apache-2.0 - Distributed training library
NVIDIA GPU Operator NVIDIA/gpu-operator Apache-2.0 - GPU lifecycle management
NVIDIA DCGM Exporter NVIDIA/dcgm-exporter Apache-2.0 - GPU metrics for Prometheus
NVIDIA Device Plugin NVIDIA/k8s-device-plugin Apache-2.0 - GPU device plugin for K8s
NVIDIA DRA Driver NVIDIA/k8s-dra-driver Apache-2.0 - DRA driver for GPUs
NVIDIA Container Toolkit NVIDIA/nvidia-container-toolkit Apache-2.0 - GPU container runtime
AMD GPU Operator ROCm/gpu-operator Apache-2.0 - AMD GPU management
Prometheus prometheus/prometheus Apache-2.0 Graduated Monitoring system
Grafana grafana/grafana AGPL-3.0 - Observability platform
OpenTelemetry open-telemetry/opentelemetry-collector Apache-2.0 Incubating Telemetry collection
Istio istio/istio Apache-2.0 Graduated Service mesh
Traefik traefik/traefik MIT - Cloud-native proxy
Cilium cilium/cilium Apache-2.0 Graduated eBPF networking
Calico projectcalico/calico Apache-2.0 - K8s networking
KEDA kedacore/keda Apache-2.0 Graduated Event-driven autoscaling
Karpenter kubernetes-sigs/karpenter Apache-2.0 K8s SIG Node autoscaling
K8s Cluster Autoscaler kubernetes/autoscaler Apache-2.0 K8s Core Cluster autoscaling
metrics-server kubernetes-sigs/metrics-server Apache-2.0 K8s SIG Resource metrics
prometheus-adapter kubernetes-sigs/prometheus-adapter Apache-2.0 K8s SIG Custom metrics API
Gateway API kubernetes-sigs/gateway-api Apache-2.0 K8s SIG K8s networking API
K-Gateway kubernetes-sigs/gateway-api-inference-extension Apache-2.0 K8s SIG AI inference gateway
Flux fluxcd/flux2 Apache-2.0 Graduated GitOps toolkit
JobSet kubernetes-sigs/jobset Apache-2.0 K8s SIG Multi-job orchestration
LeaderWorkerSet kubernetes-sigs/lws Apache-2.0 K8s SIG LLM inference sharding
KubeLB kubermatic/kubelb Apache-2.0 - Centralized load balancing
Yunikorn apache/yunikorn-core Apache-2.0 ASF Resource scheduler
KAITO microsoft/kaito MIT - AI toolchain operator
NVIDIA Triton triton-inference-server/server BSD-3-Clause - Inference server
Gardener gardener/gardener Apache-2.0 - K8s cluster management
Talos Linux siderolabs/talos MPL-2.0 - Minimal K8s OS
RKE2 rancher/rke2 Apache-2.0 - Secure K8s distribution
Omni siderolabs/omni BSL-1.1 - Talos cluster management
podinfo stefanprodan/podinfo Apache-2.0 - K8s test microservice
CubeFS (fmr ContainerFS) cubefs/cubefs Apache-2.0 Graduated Distributed storage

NVIDIA Ecosystem Summary

NVIDIA dominates the accelerator layer across the AI conformance program. Every single submission relies on NVIDIA technology in some form -- either directly via open-source NVIDIA projects or indirectly through managed cloud GPU services built on NVIDIA hardware.

NVIDIA Project Adoption Across All 28 Stacks

NVIDIA Project GitHub Repo Stacks (of 28) % of 28 Versions Observed
NVIDIA GPU Operator NVIDIA/gpu-operator 9 32% v1.0.1 (Giant Swarm), v25.3.4 (Palette)
NVIDIA DCGM Exporter NVIDIA/dcgm-exporter 8 29% (versions not specified)
NVIDIA Device Plugin NVIDIA/k8s-device-plugin 2 7% v0.14.5 (Talos)
NVIDIA DRA Driver NVIDIA/k8s-dra-driver 2 7% v25.3.0 (Giant Swarm)
NVIDIA Container Toolkit NVIDIA/nvidia-container-toolkit 2 7% (version not specified)
NVIDIA Triton Inference Server triton-inference-server/server 1 4% (version not specified)
Kai Scheduler (originally NVIDIA) kai-scheduler/KAI-Scheduler 2 7% (version not specified)

NVIDIA Project Usage by Submission (detailed)

Submission Version GPU Operator DCGM Exporter Device Plugin DRA Driver Container Toolkit Triton Kai Scheduler NVIDIA Projects Used
chinaunicom-csk v1.33 - - - - - - - 0
cks v1.33 - - - - - - - 0*
daocloud v1.33 X - - - - - - 1
gardener v1.33 X X - - - - - 2
giantswarm v1.33 X - - X - - - 2
jdcloud v1.33 - X - - - - - 1
jdos v1.33 X X - - - - - 2
openshift v1.33 X X - - - - - 2
palette v1.33 X - - - - - X 2
rke2 v1.33 - - - - - - - 0**
talos v1.33 - - X - X - - 2
ack v1.34 - - - - - - - 0*
aks v1.34 - - - - - - - 0*
baidu_cce v1.34 - - - - - - - 0*
cks v1.34 - - - - - - - 0*
eks v1.34 - X - - - X X 3
gardener v1.34 X X - - - - - 2
gke v1.34 - - - - - - - 0*
kubermatic v1.34 X - - - - - - 1
lke v1.34 - - - - - - - 0*
OKE v1.34 - - X - - - - 1
openshift v1.34 X X - - - - - 2
ovh v1.34 X X - X - - - 3
rke2 v1.34 - - - - - - - 0**
talos v1.34 - - X - X - - 2
vks v1.34 - - - - - - - 0*
cks v1.35 - - - - - - - 0*
gke v1.35 - - - - - - - 0*

* Managed cloud services use NVIDIA GPUs but don't disclose specific NVIDIA software components in their submissions.

** SUSE AI stack likely includes NVIDIA components but bundles them under its own umbrella.
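
The share of submissions that explicitly name at least one NVIDIA project follows from the last column of the per-submission table. The sketch below transcribes that column (footnote markers dropped) and counts the non-zero rows; `nvidia_explicit` is an illustrative name, not part of any tooling in the repository.

```shell
# Summarise how many submissions explicitly name at least one NVIDIA project.
# stdin: one "<submission> <version> <projects-used>" record per line.
nvidia_explicit() {
  awk '$3 > 0 { hit++ }
       END { if (NR > 0) printf "%d / %d (%.0f%%)\n", hit, NR, 100 * hit / NR }'
}

# Counts transcribed from the "NVIDIA Projects Used" column.
nvidia_explicit <<'EOF'
chinaunicom-csk v1.33 0
cks v1.33 0
daocloud v1.33 1
gardener v1.33 2
giantswarm v1.33 2
jdcloud v1.33 1
jdos v1.33 2
openshift v1.33 2
palette v1.33 2
rke2 v1.33 0
talos v1.33 2
ack v1.34 0
aks v1.34 0
baidu_cce v1.34 0
cks v1.34 0
eks v1.34 3
gardener v1.34 2
gke v1.34 0
kubermatic v1.34 1
lke v1.34 0
OKE v1.34 1
openshift v1.34 2
ovh v1.34 3
rke2 v1.34 0
talos v1.34 2
vks v1.34 0
cks v1.35 0
gke v1.35 0
EOF
# prints: 15 / 28 (54%)
```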

NVIDIA Hardware Referenced in Submissions

GPU Model Submissions Notes
NVIDIA A100 giantswarm/v1.33 (AWS p4d.24xlarge), ovh/v1.34 (MIG reference) High-end training/inference
NVIDIA A10G palette/v1.33 (AWS) Mid-range inference
NVIDIA Tesla T4 giantswarm/v1.33, kubermatic/v1.34 Inference-optimized
NVIDIA Tesla V100 ovh/v1.34 (V100-PCIE-16GB) Training workhorse
NVIDIA Quadro P1000 talos/v1.33, talos/v1.34 Entry-level workstation

NVIDIA Aggregate Counts

Metric Count
Total NVIDIA open-source projects used across all stacks 7
Stacks explicitly referencing at least 1 NVIDIA project 15 / 28 (54%)
Stacks using NVIDIA GPUs (explicit + managed cloud) 28 / 28 (100%)
Stacks using NVIDIA GPU Operator 9 (32%)
Stacks using NVIDIA DCGM Exporter 8 (29%)
Stacks using NVIDIA DRA Driver (new DRA path) 2 (7%)
Stacks using NVIDIA Device Plugin (legacy path) 2 (7%)
Stacks referencing NVIDIA Triton 1 (4%)
Stacks using Kai Scheduler (NVIDIA-originated) 2 (7%)
Unique NVIDIA GPU models documented 5 (A100, A10G, T4, V100, Quadro P1000)
Submissions with NVIDIA as only GPU vendor 23 / 28 (82%)
Submissions also supporting non-NVIDIA accelerators 5, from 3 vendors (OpenShift v1.33/v1.34: AMD, GKE v1.34/v1.35: TPU, EKS v1.34: Trainium/Inferentia)

NVIDIA vs Non-NVIDIA Accelerator Support

Accelerator Ecosystem Stacks Supporting Vendors
NVIDIA GPUs only 23 All except OpenShift, GKE, EKS
NVIDIA + AMD (ROCm) 2 Red Hat (OpenShift v1.33, v1.34)
NVIDIA + Google TPU 2 Google (GKE v1.34, v1.35)
NVIDIA + AWS Trainium/Inferentia 1 AWS (EKS v1.34)

Key NVIDIA Takeaways

  1. NVIDIA has 100% market penetration -- every single AI-conformant Kubernetes stack uses NVIDIA GPUs, making it the only universal hardware dependency in the program.

  2. The GPU Operator + DCGM Exporter duo is the de facto standard for self-managed platforms (32% and 29% explicit adoption), while managed clouds abstract these away.

  3. DRA Driver adoption is nascent -- only 2 stacks (Giant Swarm, OVH) explicitly use the new k8s-dra-driver, vs 2 still on the legacy Device Plugin. Most managed clouds haven't disclosed their DRA implementation details.

  4. NVIDIA Triton appears in only 1 submission (EKS) despite being the leading inference server, suggesting most vendors treat inference serving as user-deployed rather than platform-provided.

  5. Kai Scheduler, originally an NVIDIA project (now CNCF Sandbox), is used by 2 stacks (Palette, EKS), positioning NVIDIA in the scheduling layer as well.

  6. Only 5 of 28 stacks, from 3 vendors, support any non-NVIDIA accelerator -- OpenShift (AMD; v1.33, v1.34), GKE (TPU; v1.34, v1.35), and EKS (Trainium/Inferentia; v1.34). This makes NVIDIA the single point of hardware dependency for the other 23 stacks (82%).


Key Findings

1. Market Participation

  • 28 unique submissions from 21 distinct vendors across 3 Kubernetes versions
  • 6 vendors have submitted for multiple versions (CoreWeave leads with all 3; Sidero Labs, NeoNephos Foundation, Red Hat, SUSE, and Google have 2 each)
  • v1.34 has the most submissions (15), suggesting the program gained traction after launch
  • v1.35 is early with only 2 submissions (CoreWeave and Google)

2. Dominant Open Source Stack

The "default" AI-conformant Kubernetes stack converges on:

  • Kueue for gang scheduling (18/28 stacks, 64%)
  • Prometheus for observability (15/28, 54%)
  • Gateway API for networking (17/28, 61%)
  • Ray/KubeRay for AI operators (14/28, 50%)
  • NVIDIA GPU Operator + DCGM Exporter for accelerator management (9/28 and 8/28 stacks, respectively)
  • Kubernetes Cluster Autoscaler or Karpenter for scaling

3. Scheduling Landscape

  • Kueue dominates gang scheduling with 64% adoption
  • Volcano is a distant second (3 stacks: JD Cloud, OVHcloud, and EKS)
  • AWS EKS stands out by supporting the most schedulers (Volcano, Kai, Yunikorn, LeaderWorkerSet, AWS Batch)

4. Accelerator Diversity

  • NVIDIA GPUs are universal -- every submission uses NVIDIA in some form
  • AMD GPUs (ROCm): Only Red Hat OpenShift
  • Google TPU: Only GKE
  • AWS Trainium/Inferentia: Only EKS
  • GPU sharing/MIG: Only GKE (v1.35) explicitly covers SHOULD-level requirements

5. CNCF Project Penetration

Among the most-used projects:

  • 5 CNCF Graduated: Prometheus, KEDA, Istio, Cilium, Flux
  • 3 CNCF Incubating: Kubeflow, Volcano, OpenTelemetry
  • 1 CNCF Sandbox: Kai Scheduler
  • 7 Kubernetes SIG projects: Kueue, Gateway API, Karpenter, Cluster Autoscaler, metrics-server, prometheus-adapter, K-Gateway
  • NVIDIA projects dominate the accelerator layer (not CNCF)

6. Notable Gaps

  • Most managed cloud services (ACK, AKS, CCE, GKE, EKS, OKE, LKE) don't specify internal stack details
  • Container runtime and CNI are rarely documented (only 5-6 submissions specify these)
  • Test artifacts (e2e logs) are rare -- only DaoCloud provides actual test output
  • The v1.35 SHOULD-level requirements (GPU sharing, vGPU, driver management) are only addressed by GKE

7. Vendor Differentiation

  • AWS EKS: Broadest accelerator and scheduler ecosystem (Trainium, Inferentia, 5 schedulers, vLLM, AIBrix, Triton)
  • Red Hat OpenShift: Only dual-GPU vendor (NVIDIA + AMD), most complete operator detail (Kubeflow Trainer V1)
  • CoreWeave CKS: Only vendor with all 3 versions, GPU-native cloud, proprietary SUNK scheduler
  • Google GKE: Only TPU support, only v1.35 submission covering all SHOULD requirements
  • Sidero Labs Talos: Only bare-metal-first OS submission, most transparent about hardware limitations

Analysis Metadata

Use this file to detect deltas in future analyses.

Analysis Date: 2026-03-16
Repository: github.com/cncf/k8s-ai-conformance
Commit SHA: 223d15f97434ea478f1440d73901435d16503682
Branch: main

Submission Inventory at Time of Analysis

v1.33 (11 submissions)

  • chinaunicom-csk
  • cks
  • daocloud
  • gardener
  • giantswarm
  • jdcloud
  • jdos
  • openshift
  • palette
  • rke2
  • talos

v1.34 (15 submissions)

  • OKE
  • ack
  • aks
  • baidu_cce
  • cks
  • eks
  • gardener
  • gke
  • kubermatic
  • lke
  • openshift
  • ovh
  • rke2
  • talos
  • vks

v1.35 (2 submissions)

  • cks
  • gke

Version Directories Present

  • v1.33
  • v1.34
  • v1.35

Total Submissions: 28

Total Unique Vendors: 21

How to Detect Deltas

# From repo root, compare current submissions against this baseline:
git log --oneline 223d15f..HEAD -- 'v1.*/*/PRODUCT.yaml'

# List all current submissions:
find v1.* -name PRODUCT.yaml -type f | sort

# New version directories:
ls -d v1.* | sort
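
To turn the inventory above into an explicit added/removed report, one option is to keep it in a baseline file and compare it against the current tree. This is a sketch under two assumptions: the baseline file (called `baseline.txt` here, a made-up name) holds one `<version>/<directory>` entry per line, such as `v1.33/cks`, and submissions still live at `v1.<minor>/<directory>/PRODUCT.yaml`.

```shell
# Report submissions added or removed relative to a baseline inventory file.
# $1: path to a file with one "<version>/<directory>" entry per line.
diff_inventory() {
  baseline=$1
  base=$(mktemp)
  cur=$(mktemp)
  # comm requires both inputs sorted in the same collation order.
  LC_ALL=C sort "$baseline" > "$base"
  find v1.* -name PRODUCT.yaml -type f \
    | sed 's|/PRODUCT\.yaml$||' \
    | LC_ALL=C sort > "$cur"
  LC_ALL=C comm -13 "$base" "$cur" | sed 's/^/added /'
  LC_ALL=C comm -23 "$base" "$cur" | sed 's/^/removed /'
  rm -f "$base" "$cur"
}
```

Run from the repository root as `diff_inventory baseline.txt`. An empty result means the set of submissions is unchanged, although the contents of individual submissions may still have changed; the `git log` command above covers that case.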