
@dims
Created April 21, 2026 17:04
CI Coverage Map — sigs.k8s.io/dra-driver-nvidia-gpu (Lambda/GCP-nvkind/mock-nvml providers, BATS suites, TestGrid tabs, GPU_TYPE= resolution, gap analysis)

CI Coverage Map — sigs.k8s.io/dra-driver-nvidia-gpu

As of 2026-04-21. Sources: .github/workflows/, kubernetes/test-infra (config/jobs/kubernetes-sigs/dra-driver-nvidia-gpu/, config/testgrids/nvidia/nvidia.yaml), testgrid.k8s.io/nvidia-gpu, hack/ci/{gcp-nvkind,lambda,mock-nvml}, tests/bats/, test/e2e/.

TL;DR

  • 3 execution surfaces: GitHub Actions (lint/unit/mock-e2e only), Prow on Lambda Cloud (real GPUs, BATS), Prow on GCP-nvkind (T4 GCE, Ginkgo).
  • 7 Prow jobs on this repo: 3 e2e presubmits + 3 e2e periodics + 1 image-push postsubmit.
  • Only Lambda/arm64 (GH200) gives real arm64 GPU coverage. GCP-nvkind is amd64/T4 only.
  • Nothing is truly a required check. GitHub branch protection on main and release-25.8 lists EasyCLA as the only required status. No rulesets configured. Every CI signal above — GH Actions lint/unit/mock-e2e and all 3 Prow e2e presubmits (optional: true) — posts status but cannot block merge. Merge gating is effectively: EasyCLA + tide/OWNERS approval.
  • No CI ever runs tests-cd (ComputeDomain full suite on real NVLink fabric). Only tests-mock-nvml and tests-gpu-single are wired.
  • DynMIG is exercised in CI: test_gpu_dynmig.bats is in tests-gpu-single, and hack/ci/lambda/e2e-test.sh leaves DynMIG enabled on *h100*|*gh200*|*b200*, so every Lambda GH200 run does hit a dynamic-MIG path. Static MIG (test_gpu_mig.bats) still never runs in CI.
  • Lambda x86 jobs use GPU_TYPE="" → lambdactl watch picks the cheapest SKU currently available in any region. Last 10 periodic runs: 10/10 gpu_1x_a10. Last 10 presubmit runs: 5× A10, 2× A100 SXM4, 3× blocked on gpu_8x_v100_n quota-exceeded (non-retryable; contributes to the 50% presubmit flake).

Table 1 — Master CI job matrix

Every distinct job/workflow that runs against this repo. Housekeeping bots (stale, cherry-pick, issue-triage) are excluded; see end of section.

Columns: "Gates" = Prow-level configuration only (e.g. always_run, optional, max_concurrency). No job in this table is a merge-required check — see TL;DR on branch protection.

| # | Job / Workflow | Platform | Type | Trigger / Cadence | Provider | GPU | Arch | K8s | Suite / Target | TestGrid tab | Gates | Status 2026-04-21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ci.yaml → golang check | GH Actions | PR + push to main/release-* | every PR/push | none | | amd64 | | make golangci-lint, generated-code check, go mod validate | | | |
| 2 | ci.yaml → golang test | GH Actions | PR + push | every PR/push | none | | amd64 | | make test (Go unit) | | | |
| 3 | ci.yaml → golang build | GH Actions | PR + push | every PR/push | none | | amd64 | | make build | | | |
| 4 | ci.yaml → image | GH Actions | PR + push | every PR/push | none | | amd64+arm64 (QEMU) | | make build multi-arch OCI (no push) | | | |
| 5 | ci.yaml → chart | GH Actions | PR + push | every PR/push | none | | amd64 | | helm lint + package | | | |
| 6 | code_scanning.yaml | GH Actions | called from basic-checks | every PR/push | none | | amd64 | | CodeQL Go | | | |
| 7 | mock-nvml-e2e.yaml | GH Actions | PR (paths-filtered) + push main | on-PR | mock-nvml (Kind + mocked NVML) | virtual 8×GB200 | amd64 | latest stable | BATS tests-mock-nvml | | | |
| 8 | tests.yaml | GH Actions | workflow_dispatch only | manual | placeholder (echoes "bats runs on Prow") | | | | noop | | | |
| 9 | pull-dra-driver-nvidia-gpu-e2e-lambda-gpu | Prow | presubmit | every PR (skip release-*) | Lambda Cloud (kubeadm on bare metal) | GPU_TYPE="" → cheapest-available (see §GPU selection). Recent: A10 71%, A100 SXM4 29% | amd64 | latest stable | BATS tests-gpu-single | pull-dra-driver-nvidia-gpu-lambda | always_run:true, optional:true, max_concurrency:1, 2h | FLAKY 50% |
| 10 | pull-dra-driver-nvidia-gpu-e2e-lambda-gpu-gh200 | Prow | presubmit | every PR (skip release-*) | Lambda Cloud | GH200 (1×) | arm64 | latest stable | BATS tests-gpu-single | pull-dra-driver-nvidia-gpu-lambda-gh200 | always_run:true, optional:true, max_concurrency:1, 2h | FLAKY 50% |
| 11 | pull-dra-driver-nvidia-gpu-e2e-gcp-nvkind | Prow | presubmit | every PR (skip release-*) | GCP-nvkind (GCE VM → nvkind) | T4 (1×) | amd64 | v1.34.3 (Ubuntu 22.04 DLVM) | Ginkgo test/e2e/ | pull-dra-driver-nvidia-gpu-gcp-nvkind | always_run:true, optional:true, max_concurrency:1, 2h, Boskos gpu-project | PASSING |
| 12 | ci-dra-driver-nvidia-gpu-e2e-lambda-gpu | Prow | periodic | interval: 6h | Lambda Cloud | GPU_TYPE="" → cheapest-available. Recent 10/10: A10 @ us-east-1 | amd64 | latest stable | BATS tests-gpu-single | ci-dra-driver-nvidia-gpu-lambda | 2h | PASSING 100% |
| 13 | ci-dra-driver-nvidia-gpu-e2e-lambda-gpu-gh200 | Prow | periodic | cron: 30 0,6,12,18 * * * (6h, offset) | Lambda Cloud | GH200 (1×) | arm64 | latest stable | BATS tests-gpu-single | ci-dra-driver-nvidia-gpu-lambda-gh200 | 2h | PASSING 100% |
| 14 | ci-dra-driver-nvidia-gpu-e2e-gcp-nvkind | Prow | periodic | interval: 6h | GCP-nvkind | T4 (1×) | amd64 | v1.35.1 (Ubuntu 24.04 DLVM) | Ginkgo test/e2e/ | ci-dra-driver-nvidia-gpu-gcp-nvkind | 2h, Boskos gpu-project | FLAKY 70% (7/10 recent columns; live testgrid snapshot — numbers move) |
| 15 | post-dra-driver-nvidia-gpu-push-images | Prow | postsubmit | merge to main, release-*, SemVer tags | GCB | | | | image-builder run.sh → push to k8s-staging-images | sig-node-image-pushes, sig-k8s-infra-gcb | trusted cluster | |

Excluded (housekeeping bots): cherrypick.yml, issue-triage.yml, stale.yml (daily cron 04:30 UTC).

Notes on the master table:

  • The Prow periodic GCP-nvkind pins v1.35.1 + Ubuntu 24.04 while the presubmit pins v1.34.3 + Ubuntu 22.04 — deliberate drift so periodics smoke-test newer k8s/OS.
  • The GH200 periodic uses a cron (30 0,6,12,18) instead of interval: to offset 3h from the sibling ci-kubernetes-e2e-lambda-device-plugin-gpu-gh200 and avoid GH200 capacity contention.
  • All Lambda jobs carry preset preset-lambda-credential → injects LAMBDA_API_KEY_FILE=/etc/lambda-cred/api-key.
  • All e2e jobs use the same container: us-central1-docker.pkg.dev/k8s-staging-test-infra/images/kubekins-e2e:v20260316-e86cefa561-master.

Table 2 — BATS test × SUITE selector matrix

Which .bats file is passed to bats under each make -f tests/bats/Makefile <target>. Transcribed directly from tests/bats/Makefile (tests-mock-nvml:187, tests-gpu-single:204, tests-gpu:214, tests-cd:225, tests:236). File-included is not the same as test-executed: tests-mock-nvml sets MOCK_NVML=true, under which several tests auto-skip (per-@test guards), and hack/ci/mock-nvml/e2e-test.sh also applies --filter-tags exclusions (!cuda-workload,!dynmig,!mig,!compute-domain,!multi-node,!gpu-busgrind,!version-specific).

| BATS file | tests (full) | tests-gpu | tests-gpu-single | tests-mock-nvml | tests-cd | Hardware requirement |
|---|---|---|---|---|---|---|
| test_basics.bats | ✓ | ? | ? | ? | ? | none (sanity; expects GPU Operator) |
| test_gpu_basic.bats | ✓ | ? | ? | ? | | any GPU |
| test_gpu_extres.bats | ✓ | ? | ? | ? | | K8s ≥1.35 + DRAExtendedResource |
| test_gpu_robustness.bats | | ? | ? | ? | | any GPU |
| test_gpu_stress.bats | ✓ | ✓ | ? | ? | | any GPU |
| test_gpu_updowngrade.bats | ✓ | ✓ | ? | ? | | prior-release image in registry |
| test_gpu_sharing.bats | | ? | ? | ? | | any GPU (real MPS daemon for one case) |
| test_gpu_dynmig.bats | ✓ | ? | ✓ | ? | | MIG-capable GPU + DynamicMIG=true |
| test_gpu_mig.bats | ✓ | ✓ | | ? | | MIG-capable (A100/H100/B200/GB200) |
| test_gpu_cuda_workloads.bats | | ? | ✓ | ✓ (see note: 2 of 4 tests actually run under MOCK) | | real CUDA compute (2 tests); other 2 just use ResourceClaimTemplate semantics |
| test_cd_imex_chan_inject.bats | ✓ | | | ✓ (tests auto-skip on MOCK_NVML=true) | ✓ | IMEX daemon (Blackwell + drv ≥570.158.01) |
| test_cd_logging.bats | ✓ | | | ✓ (auto-skip on MOCK) | ✓ | IMEX daemon |
| test_cd_misc.bats | ✓ | | | ✓ (auto-skip on MOCK) | ✓ | IMEX daemon |
| test_cd_updowngrade.bats | ✓ | | | ✓ (auto-skip on MOCK) | ✓ | IMEX daemon + prior-release image |
| test_cd_failover.bats | ✓ | | | ✓ (auto-skip on MOCK) | ✓ | multi-node NVLink fabric (≥2 nodes, 4 GPU/node) |
| test_cd_mnnvl_workload.bats | ✓ | | | ✓ (auto-skip on MOCK) | ✓ | multi-node NVLink fabric, real NCCL, MPI Operator |
| Files included | 13/16 | 7/16 | 6/16 | 13/16 | 7/16 | |
| Invoked by CI | not invoked | not invoked | Lambda presubmit + periodic (both arch) | GH Actions mock-nvml-e2e | not invoked | |

(? = consult the Makefile selectors cited above for the exact per-file membership.)

Takeaways:

  • tests is not "all 16 bats files" — it excludes robustness, sharing, and cuda_workloads.
  • tests-mock-nvml includes all 6 CD files and cuda_workloads, but:
    • CD files: every @test in test_cd_*.bats starts with a MOCK_NVML skip guard (tests/bats/test_cd_imex_chan_inject.bats:17 etc.), so they contribute ~zero executed assertions on the mock runner.
    • cuda_workloads: the mock-runner filter !cuda-workload is a no-op — no test in that file carries the cuda-workload tag (they're tagged gpu-workloads and fastfeedback). Of the 4 tests in test_gpu_cuda_workloads.bats: the CUDA-demo-suite test (line 31) and the busGrind test (line 118) skip via MOCK_NVML guards; the Job-with-ResourceClaimTemplate and Deployment-2-replicas tests (lines 52, 82) do NOT skip and actually execute on the mock runner. So mock-nvml does exercise RCT/deployment paths, just not real CUDA compute.
  • tests-gpu-single includes test_gpu_dynmig, so dynamic-MIG paths do get exercised in CI on GPUs the Lambda driver leaves unfiltered (H100 / GH200 / B200). Static MIG (test_gpu_mig) is only in tests / tests-gpu, neither of which is wired to CI.
  • The comment in hack/ci/mock-nvml/e2e-test.sh:377-379 ("We skip test_gpu_cuda_workloads.bats because it includes a CUDA demo suite test …") is stale — the file is actually included via tests-mock-nvml, and skipping happens per-test via MOCK_NVML guards, not at the file level.
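
The "!cuda-workload is a no-op" point is mechanical: a tag-exclusion filter only removes tests that actually carry the tag. A minimal sketch, with illustrative test names and the tag sets the takeaway above describes:

```shell
# None of the four tests carries 'cuda-workload', so excluding that tag
# removes nothing (names are illustrative stand-ins, not the real @test names).
excluded_tag="cuda-workload"
kept=0
for entry in \
  "cuda_demo_suite:gpu-workloads,fastfeedback" \
  "job_with_rct:gpu-workloads,fastfeedback" \
  "deployment_2_replicas:gpu-workloads,fastfeedback" \
  "busgrind:gpu-workloads,fastfeedback"
do
  tags=",${entry#*:},"
  case "$tags" in
    *",$excluded_tag,"*) ;;            # would be filtered out
    *) kept=$((kept + 1)) ;;           # survives the filter
  esac
done
echo "tests surviving '!$excluded_tag': $kept of 4"
```

Under these tags the filter keeps all four; only the per-test MOCK_NVML guards then skip the two real-CUDA tests.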

Table 3 — Provider × arch × suite coverage slice

A compact view of what actually runs where.

| Provider | Caller | Arch | GPU model | Real CUDA? | DynMIG? | Static MIG? | IMEX / ComputeDomain? | Multi-GPU? | Multi-node? | Suite run |
|---|---|---|---|---|---|---|---|---|---|---|
| GH Actions runner + mock-nvml | GH Actions PR | amd64 | 8× virtual GB200 | partial (non-tag tests run; the CUDA-demo-suite and busGrind tests auto-skip on MOCK) | ✗ (filter !dynmig) | ✗ (filter !mig) | files included, but every CD test auto-skips on MOCK_NVML=true | ✓ (virtual 8×) | ✗ (filter !multi-node) | tests-mock-nvml (13 files included, many skip at runtime) |
| Lambda (x86, A10) | Prow presubmit + 6h periodic | amd64 | A10 (most common) | ✓ | ✗ (A10 not MIG-capable → !dynmig) | ✗ | ✗ (CD disabled unless `gb200 gb300 b200`) | ✗ (1×) | ✗ | tests-gpu-single (6 files) |
| Lambda (x86, A100) | Prow presubmit + 6h periodic (when A10 unavailable) | amd64 | A100 SXM4 40GB (1×) | ✓ | ✗ (single-GPU A100 → !dynmig per e2e-test.sh:112-120) | ✗ | ✗ (CD disabled unless `gb200 gb300 b200`) | ✗ (1×) | ✗ | tests-gpu-single (6 files) |
| Lambda (arm64) | Prow presubmit + 6h periodic | arm64 | GH200 (1×) | ✓ (no busGrind — arm64 apt limitation) | ✓ (GH200 matches *gh200*, DynMIG enabled) | ✗ | ✗ (CD only on `gb200 gb300 b200`) | ✗ (1×) | ✗ | tests-gpu-single (6 files) |
| GCP-nvkind | Prow presubmit + 6h periodic | amd64 | T4 (1×) | ✗ (runs Ginkgo, not the BATS CUDA tests) | ✗ (T4 not MIG-capable) | ✗ | ✗ | ✗ (1×) | ✗ | Ginkgo test/e2e/ (6 specs) |

Table 4 — Other nvidia-gpu TestGrid tabs (context, not this repo)

The nvidia-gpu rollup on testgrid also displays 10 tabs from the NVIDIA device-plugin (k/k) program. Listed for context only — they do not test this driver but share the dashboard:

| Tab | Job | Status |
|---|---|---|
| ci-kubernetes-e2e-ec2-device-plugin-gpu | periodic | FLAKY 80% |
| ci-lambda-device-plugin-gpu | periodic | PASSING |
| ci-lambda-device-plugin-gpu-gh200 | periodic | FLAKY 70% |
| gce-device-plugin-gpu-{1.33,1.34,1.35,1.36,master} | periodic | PASSING / master FLAKY 90% |
| pull-kubernetes-e2e-ec2-device-plugin-gpu | presubmit | STALE (last run 2026-03-18) |
| pull-lambda-device-plugin-gpu | presubmit | FLAKY 50% |

How GPU_TYPE="" actually resolves on Lambda

The two Prow jobs ci-dra-driver-nvidia-gpu-e2e-lambda-gpu and pull-dra-driver-nvidia-gpu-e2e-lambda-gpu pass GPU_TYPE="". The resolution happens in two layers:

Layer 1 — experiment/lambda/lib/lambda-common.sh (test-infra):

LAMBDA_GPU_TYPE="${GPU_TYPE-gpu_1x_a10}"   # '-' not ':-'  → empty stays empty
...
if [ -n "${LAMBDA_GPU_TYPE}" ]; then
  gpu_args=(--gpu "${LAMBDA_GPU_TYPE}")
fi
lambdactl --json watch "${gpu_args[@]}" --ssh ... --interval 30 --timeout 900 --wait-ssh

When empty, --gpu is omitted entirely — no filter, no region pin.
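
The `-` expansion (rather than the more common `:-`) is what lets an empty GPU_TYPE flow through; a minimal sketch of the difference:

```shell
# '-' applies the default only when the variable is UNSET; an exported empty
# string survives. ':-' applies the default when unset OR empty.
GPU_TYPE=""                                # the Prow jobs export this empty
with_dash="${GPU_TYPE-gpu_1x_a10}"         # stays empty
with_colon_dash="${GPU_TYPE:-gpu_1x_a10}"  # default kicks in
echo "dash=[${with_dash}] colon-dash=[${with_colon_dash}]"
unset GPU_TYPE
after_unset="${GPU_TYPE-gpu_1x_a10}"       # unset: now the default applies
echo "after-unset=[${after_unset}]"
```

So a job that simply doesn't set GPU_TYPE would get gpu_1x_a10; a job that sets it empty gets the unfiltered cheapest-available path.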

Layer 2 — lambdactl watch (dims/lambdactl, cmd/watch.go):

  1. Poll lambdactl types every 30s for up to 900s.
  2. Keep types with at least one region currently showing availability.
  3. Sort by PriceCents ascending, pick candidates[0].
  4. Launch into Regions[0] of that type.
  5. On a retryable capacity error → continue the loop and re-poll. On a quota error → hard-fail (not retryable).

After the launch returns, the script overwrites LAMBDA_GPU_TYPE with the actual provisioned type so BATS capability gating works on what really got allocated.
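
Steps 2-3 of the watch loop reduce to a cheapest-first pick over types with capacity; a toy shell model (catalog lines are illustrative price-cents/SKU/region-count triples, not live lambdactl output):

```shell
# Keep SKUs with at least one available region, sort by price ascending,
# take the head: that is the instance type the watch loop launches.
catalog='199 gpu_1x_a100_sxm4 2
129 gpu_1x_a10 1
218 gpu_2x_a6000 0
229 gpu_1x_gh200 1'
picked=$(printf '%s\n' "$catalog" \
  | awk '$3 > 0' \
  | sort -n -k1,1 \
  | head -n1 \
  | awk '{print $2}')
echo "picked: $picked"
```

With these numbers the A6000 (no capacity) drops out and the A10, as cheapest remaining, wins, which matches the observed 10/10 periodic outcome.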

Lambda instance catalog (from lambdactl types, snapshot 2026-04-21)

Cheapest-first, so first-available is what gets picked:

| Rank | SKU | $/hr | GPU | Arch | Current avail |
|---|---|---|---|---|---|
| 1 | gpu_1x_a10 | $1.29 | A10 24GB PCIe | x86 | 1 region |
| 2 | gpu_1x_a100_sxm4 | $1.99 | A100 40GB SXM4 | x86 | 2 regions |
| 3 | gpu_2x_a6000 | $2.18 | 2×A6000 48GB | x86 | 0 |
| 4 | gpu_1x_gh200 | $2.29 | GH200 96GB | arm64 | 1 region |
| 5 | gpu_1x_h100_pcie | $3.29 | H100 80GB PCIe | x86 | 1 region |
| 6 | gpu_1x_h100_sxm5 | $4.29 | H100 80GB SXM5 | x86 | 1 region |
| 7 | gpu_8x_v100_n | $6.32 | 8×V100 16GB | x86 | 1 region |
| 8 | gpu_1x_b200_sxm6 | $6.99 | B200 180GB SXM6 | x86 | 1 region |
| | (heavier SKUs) | $8.38–$53.52 | 2×/4×/8× H100/B200/A100 | x86 | 0 |

Actual SKUs landed — last 10 runs (as of 2026-04-21)

Periodic ci-dra-driver-nvidia-gpu-e2e-lambda-gpu:

gpu_1x_a10 @ us-east-1   ##########   10/10 (100%)

All ten runs, A10 @ us-east-1. The cheapest SKU has been consistently available during periodic windows.

Presubmit pull-dra-driver-nvidia-gpu-e2e-lambda-gpu (last 10 attempts):

Actually launched (7/10):
  gpu_1x_a10         @ us-east-1   #####   5
  gpu_1x_a100_sxm4   @ us-east-1/us-west-2  ##   2

Pre-launch quota-failed (3/10):
  gpu_8x_v100_n      @ us-south-2   ###   3   ← hard fail, no retry

Three consecutive presubmit failures on 2026-04-18 all hit the same trap: Lambda advertised gpu_8x_v100_n@us-south-2 as available (cheapest-with-capacity at that moment), lambdactl raced to launch it, and the account returned Quota exceeded, which lambdactl treats as non-retryable. This is a real contributor to the 50% flake on the presubmit tab.

One clean example of the capacity-retry path (build 2045306201683529728): gpu_1x_a10 @ us-west-1 hit "Not enough capacity" three times, then gpu_1x_a100_sxm4 @ us-east-1 became cheapest-available on the next poll and launched.

Implications

  • "Lambda x86" ≠ A10. It is A10 most of the time, A100 SXM4 when the A10 pool is tight, and can be any other advertised SKU when the cheap pools are empty (the gpu_8x_v100_n attempts show the picker reaching well past GH200's price point).
  • MIG never fires even when A100 lands: the job invokes tests-gpu-single, which excludes test_gpu_mig, and hack/ci/lambda/e2e-test.sh filters !dynmig on a single-GPU A100. So the rare A100 runs are wasted for MIG coverage.
  • Quota-exceeded on gpu_8x_v100_n is a latent bug. Either the test-infra account gets its V100-8x quota raised, or lambdactl watch needs to learn to treat quota errors as retryable (with a short deny-list for that poll-loop iteration).

Gap analysis — what is missing for this repo

GPU hardware coverage

  • In practice CI lands on: T4 (GCP-nvkind), A10 (Lambda x86, dominant), A100 SXM4 40GB (Lambda x86 fallback, occasional), GH200 (Lambda arm64). Everything else Lambda advertises (H100 PCIe/SXM5, B200 SXM6, V100) can be selected whenever it is momentarily the cheapest SKU with capacity, but in the last ~20 runs every successful launch went to the cheaper SKUs (A10, A100).
  • Only tests-mock-nvml exercises GB200/B200 profiles — all synthetic. On this runner: every CD test skips at runtime (MOCK_NVML guards), and 2 of 4 cuda_workloads tests skip; the remaining 2 (RCT + 2-replica Deployment) do execute.
  • Static MIG (test_gpu_mig.bats) never runs in CI — only appears in tests/tests-gpu, neither of which is invoked.
  • Dynamic MIG (test_gpu_dynmig.bats) runs only on Lambda GH200 — it's in tests-gpu-single, but hack/ci/lambda/e2e-test.sh filters it out except for *h100*|*gh200*|*b200*. The x86 presubmit/periodic (A10, single-GPU A100) always filter !dynmig. If Lambda ever lands H100 PCIe or B200 on the x86 job, those would also exercise DynMIG.

Test-suite coverage

  • tests-cd (full ComputeDomain suite) is not run in any CI — the failover, logging, misc, multi-node workload, and CD-updowngrade tests only run locally or via manual /test overrides if someone wires it up.
  • tests-gpu (full GPU suite, includes MIG / stress / updowngrade) is not run in any CI — Lambda jobs use the -single subset.
  • Static MIG (test_gpu_mig) never executes in CI; dynamic MIG (test_gpu_dynmig) executes only on the Lambda GH200 jobs, because no other wired GPU both is MIG-capable and survives the e2e-test.sh DynMIG filter.
  • Real-CUDA bats tests (the CUDA-demo-suite and busGrind tests inside test_gpu_cuda_workloads.bats) only run on Lambda (A10/A100/GH200). The other two tests in the same file (RCT + 2-replica Deployment) also run on mock-nvml. Nothing in GCP-nvkind exercises this file at all — it runs Ginkgo test/e2e/, not BATS.

Architecture coverage

  • arm64 is covered only by Lambda GH200. mock-nvml-e2e.yaml runs on ubuntu-latest (amd64) with a multi-arch buildx image but the runtime is amd64.
  • GCP-nvkind is hard-coded amd64 (linux-amd64 download in setup-nvkind-node.sh).

Kubernetes version coverage

  • GCP-nvkind periodic alone smoke-tests v1.35.1. Lambda uses "latest stable" unpinned.
  • Release branches (release-*): all three Prow e2e presubmits have skip_branches: [release-\d+\.\d+], so release branches get no e2e presubmit gating. Periodics are main-only (extra_refs: ...@main). Release branches only get GH-Actions lint/unit/mock-nvml.
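
The skip_branches pattern can be sanity-checked as a plain ERE (grep has no \d, so [0-9] stands in; the branch names are examples):

```shell
# Branches matching the skip pattern get no e2e presubmits.
skipped=""
for branch in main release-25.8 release-26.0 feature-x; do
  if printf '%s\n' "$branch" | grep -Eq '^release-[0-9]+\.[0-9]+$'; then
    skipped="$skipped $branch"     # e2e presubmits are skipped here
  fi
done
echo "presubmit-skipped branches:$skipped"
```

Both numbered release branches match, so a cherry-pick to release-25.8 is gated by nothing beyond EasyCLA plus the GH Actions signals.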

Multi-node / NVLink fabric

  • No CI runs multi-node. All ComputeDomain failover/MNNVL tests require ≥2 nodes with 4 GPUs each — nothing in hack/ci/* provisions that topology.

Optionality / blocking

  • All 3 Prow e2e presubmits (pull-*-lambda-gpu, pull-*-lambda-gpu-gh200, pull-*-gcp-nvkind) are optional: true — post status but cannot block.
  • GitHub branch protection on main and release-25.8 lists EasyCLA as the only required status check (verified via gh api); rulesets are empty. That means no GH-Actions job (not lint, not unit, not image, not mock-nvml-e2e) is a required check either. A PR can merge with every CI job red as long as EasyCLA is green and tide/OWNERS approval lands. Effective merge gates: EasyCLA + LGTM/approval.

Stability (live testgrid snapshot 2026-04-21; these numbers move — re-check via curl -s https://testgrid.k8s.io/nvidia-dra/summary)

  • ci-dra-driver-nvidia-gpu-gcp-nvkind periodic: FLAKY 70% (7/10 recent columns).
  • pull-dra-driver-nvidia-gpu-lambda presubmit: FLAKY 50% (5/10).
  • pull-dra-driver-nvidia-gpu-lambda-gh200 presubmit: FLAKY 50% (1/2 — very low sample).
  • All three Lambda periodics: PASSING 100% recent.
  • With optional: true and chronic presubmit flake, signal is weak.
  • No testgrid-alert-email on any DRA-driver tab. Failures do not page anyone.
  • gpu_8x_v100_n quota-exceeded: Lambda account advertises capacity for an SKU it has no quota for; lambdactl watch treats quota errors as non-retryable and hard-fails. Three of the last ten presubmit attempts died this way. Fix options: (a) raise the V100-8x quota, (b) make quota errors retryable with a per-poll deny-list, or (c) set an explicit allow-list on the Prow job (e.g., GPU_TYPE=gpu_1x_a10,gpu_1x_a100_sxm4,gpu_1x_h100_pcie) so V100-8x is never considered.
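
Fix option (b) amounts to a small change in the poll loop; a sketch with a stand-in launch_instance (not the real lambdactl code; the SKU behavior is scripted to mimic the 2026-04-18 failures):

```shell
# On a quota error, deny-list the SKU and keep polling instead of hard-failing.
launch_instance() {
  [ "$1" = "gpu_8x_v100_n" ] && { echo "quota exceeded"; return 1; }
  echo "launched"
}
deny_list=""
result=""
for sku in gpu_8x_v100_n gpu_1x_a10; do     # cheapest-first candidate order
  case " $deny_list " in *" $sku "*) continue ;; esac
  if out=$(launch_instance "$sku"); then
    result="$sku"; break
  elif [ "$out" = "quota exceeded" ]; then
    deny_list="$deny_list $sku"             # later polls skip this SKU
  fi
done
echo "landed on: $result"
```

With this shape, the V100-8x quota failure costs one loop iteration instead of the whole job.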

Secrets / credential surface

  • Lambda API key (k8s secret lambda-ai-api-key) + Boskos-leased GCP project. Both relatively narrow — consistent with "off-cluster heavy lifting" pattern (no DinD or privileged on the Prow pod).

References (raw URLs)

Prow jobs

Testgrid

Repo

  • .github/workflows/mock-nvml-e2e.yaml
  • hack/ci/{lambda,gcp-nvkind,mock-nvml}/e2e-test.sh
  • tests/bats/Makefile (SUITE selectors)
  • test/e2e/ (Ginkgo)