As of 2026-04-21. Sources: .github/workflows/, kubernetes/test-infra (config/jobs/kubernetes-sigs/dra-driver-nvidia-gpu/, config/testgrids/nvidia/nvidia.yaml), testgrid.k8s.io/nvidia-gpu, hack/ci/{gcp-nvkind,lambda,mock-nvml}, tests/bats/, test/e2e/.
- 3 execution surfaces: GitHub Actions (lint/unit/mock-e2e only), Prow on Lambda Cloud (real GPUs, BATS), Prow on GCP-nvkind (T4 GCE, Ginkgo).
- 7 Prow jobs on this repo: 3 e2e presubmits + 3 e2e periodics + 1 image-push postsubmit.
- Only Lambda/arm64 (GH200) gives real arm64 GPU coverage. GCP-nvkind is amd64/T4 only.
- Nothing is truly a required check. GitHub branch protection on `main` and `release-25.8` lists `EasyCLA` as the only required status. No rulesets are configured. Every CI signal above — GH Actions lint/unit/mock-e2e and all 3 Prow e2e presubmits (`optional: true`) — posts status but cannot block merge. Merge gating is effectively: EasyCLA + tide/OWNERS approval.
- No CI ever runs `tests-cd` (the full ComputeDomain suite on a real NVLink fabric). Only `tests-mock-nvml` and `tests-gpu-single` are wired.
- DynMIG is exercised on CI — `test_gpu_dynmig.bats` is in `tests-gpu-single`, and `hack/ci/lambda/e2e-test.sh` leaves DynMIG enabled on `*h100*|*gh200*|*b200*`. So every Lambda GH200 run does hit a dynamic-MIG path. Static MIG (`test_gpu_mig.bats`) still never runs in CI.
- Lambda x86 jobs use `GPU_TYPE=""` → `lambdactl watch` picks cheapest-available-any-region. Last 10 periodic runs: 10/10 `gpu_1x_a10`. Last 10 presubmit runs: 5× A10, 2× A100 SXM4, 3× blocked on `gpu_8x_v100_n` quota-exceeded (non-retryable; contributes to the 50% presubmit flake).
Every distinct job/workflow that runs against this repo. Housekeeping bots (stale, cherry-pick, issue-triage) are excluded; see end of section.
Columns: "Gates" = Prow-level configuration only (e.g. always_run, optional, max_concurrency). No job in this table is a merge-required check — see TL;DR on branch protection.
| # | Job / Workflow | Platform | Type | Trigger / Cadence | Provider | GPU | Arch | K8s | Suite / Target | TestGrid tab | Gates | Status 2026-04-21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | `ci.yaml` → golang check | GH Actions | PR + push to main/release-* | every PR/push | — | none | amd64 | — | `make golangci-lint`, generated-code check, go mod validate | — | — | — |
| 2 | `ci.yaml` → golang test | GH Actions | PR + push | every PR/push | — | none | amd64 | — | `make test` (Go unit) | — | — | — |
| 3 | `ci.yaml` → golang build | GH Actions | PR + push | every PR/push | — | none | amd64 | — | `make build` | — | — | — |
| 4 | `ci.yaml` → image | GH Actions | PR + push | every PR/push | — | none | amd64+arm64 (QEMU) | — | `make build` multi-arch OCI (no push) | — | — | — |
| 5 | `ci.yaml` → chart | GH Actions | PR + push | every PR/push | — | none | amd64 | — | helm lint + package | — | — | — |
| 6 | `code_scanning.yaml` | GH Actions | called from basic-checks | every PR/push | — | none | amd64 | — | CodeQL Go | — | — | — |
| 7 | `mock-nvml-e2e.yaml` | GH Actions | PR (paths-filtered) + push main | on-PR | mock-nvml (Kind + mocked NVML) | virtual 8×GB200 | amd64 | latest stable | BATS `tests-mock-nvml` | — | — | — |
| 8 | `tests.yaml` | GH Actions | workflow_dispatch only | manual | — | — | — | — | placeholder (echoes "bats runs on Prow") | — | — | noop |
| 9 | `pull-dra-driver-nvidia-gpu-e2e-lambda-gpu` | Prow | presubmit | every PR (skip release-*) | Lambda Cloud (kubeadm on bare metal) | `GPU_TYPE=""` → cheapest-available (see §GPU selection). Recent: A10 71%, A100 SXM4 29% | amd64 | latest stable | BATS `tests-gpu-single` | pull-dra-driver-nvidia-gpu-lambda | always_run:true, optional:true, max_concurrency:1, 2h | FLAKY 50% |
| 10 | `pull-dra-driver-nvidia-gpu-e2e-lambda-gpu-gh200` | Prow | presubmit | every PR (skip release-*) | Lambda Cloud | GH200 (1×) | arm64 | latest stable | BATS `tests-gpu-single` | pull-dra-driver-nvidia-gpu-lambda-gh200 | always_run:true, optional:true, max_concurrency:1, 2h | FLAKY 50% |
| 11 | `pull-dra-driver-nvidia-gpu-e2e-gcp-nvkind` | Prow | presubmit | every PR (skip release-*) | GCP-nvkind (GCE VM → nvkind) | T4 (1×) | amd64 | v1.34.3 (Ubuntu 22.04 DLVM) | Ginkgo `test/e2e/` | pull-dra-driver-nvidia-gpu-gcp-nvkind | always_run:true, optional:true, max_concurrency:1, 2h, Boskos gpu-project | PASSING |
| 12 | `ci-dra-driver-nvidia-gpu-e2e-lambda-gpu` | Prow | periodic | interval: 6h | Lambda Cloud | `GPU_TYPE=""` → cheapest-available. Recent 10/10: A10 @ us-east-1 | amd64 | latest stable | BATS `tests-gpu-single` | ci-dra-driver-nvidia-gpu-lambda | 2h | PASSING 100% |
| 13 | `ci-dra-driver-nvidia-gpu-e2e-lambda-gpu-gh200` | Prow | periodic | cron: `30 0,6,12,18 * * *` (6h, offset) | Lambda Cloud | GH200 (1×) | arm64 | latest stable | BATS `tests-gpu-single` | ci-dra-driver-nvidia-gpu-lambda-gh200 | 2h | PASSING 100% |
| 14 | `ci-dra-driver-nvidia-gpu-e2e-gcp-nvkind` | Prow | periodic | interval: 6h | GCP-nvkind | T4 (1×) | amd64 | v1.35.1 (Ubuntu 24.04 DLVM) | Ginkgo `test/e2e/` | ci-dra-driver-nvidia-gpu-gcp-nvkind | 2h, Boskos gpu-project | FLAKY 70% (7/10 recent columns; live testgrid snapshot — numbers move) |
| 15 | `post-dra-driver-nvidia-gpu-push-images` | Prow | postsubmit | merge to main, release-*, SemVer tags | GCB image-builder | — | — | — | `run.sh` → push to k8s-staging-images | sig-node-image-pushes, sig-k8s-infra-gcb | trusted cluster | — |
Excluded (housekeeping bots): cherrypick.yml, issue-triage.yml, stale.yml (daily cron 04:30 UTC).
Notes on the master table:
- The Prow periodic GCP-nvkind pins v1.35.1 + Ubuntu 24.04 while the presubmit pins v1.34.3 + Ubuntu 22.04 — deliberate drift so periodics smoke-test newer k8s/OS.
- The GH200 periodic uses a cron (`30 0,6,12,18`) instead of `interval:` to offset 3h from the sibling `ci-kubernetes-e2e-lambda-device-plugin-gpu-gh200` and avoid GH200 capacity contention.
- All Lambda jobs carry preset `preset-lambda-credential` → injects `LAMBDA_API_KEY_FILE=/etc/lambda-cred/api-key`.
- All e2e jobs use the same container: `us-central1-docker.pkg.dev/k8s-staging-test-infra/images/kubekins-e2e:v20260316-e86cefa561-master`.
Which .bats file is passed to bats under each `make -f tests/bats/Makefile <target>`. Transcribed directly from `tests/bats/Makefile` (`tests-mock-nvml`:187, `tests-gpu-single`:204, `tests-gpu`:214, `tests-cd`:225, `tests`:236). File-included is not the same as test-executed: `tests-mock-nvml` sets `MOCK_NVML=true`, under which several tests auto-skip (per-`@test` guards), and `hack/ci/mock-nvml/e2e-test.sh` also applies `--filter-tags` exclusions (`!cuda-workload,!dynmig,!mig,!compute-domain,!multi-node,!gpu-busgrind,!version-specific`).
| BATS file | `tests` (full) | `tests-gpu` | `tests-gpu-single` | `tests-mock-nvml` | `tests-cd` | Hardware requirement |
|---|---|---|---|---|---|---|
| test_basics.bats | ✓ | ✓ | — | — | ✓ | none (sanity; expects GPU Operator) |
| test_gpu_basic.bats | ✓ | ✓ | ✓ | ✓ | — | any GPU |
| test_gpu_extres.bats | ✓ | ✓ | ✓ | ✓ | — | K8s ≥1.35 + DRAExtendedResource |
| test_gpu_robustness.bats | — | — | ✓ | ✓ | — | any GPU |
| test_gpu_stress.bats | ✓ | ✓ | — | ✓ | — | any GPU |
| test_gpu_updowngrade.bats | ✓ | ✓ | — | ✓ | — | prior-release image in registry |
| test_gpu_sharing.bats | — | — | ✓ | ✓ | — | any GPU (real MPS daemon for one case) |
| test_gpu_dynmig.bats | ✓ | ✓ | ✓ | — | — | MIG-capable GPU + DynamicMIG=true |
| test_gpu_mig.bats | ✓ | ✓ | — | — | — | MIG-capable (A100/H100/B200/GB200) |
| test_gpu_cuda_workloads.bats | — | — | ✓ | ✓ (see note: 2 of 4 tests actually run under MOCK) | — | real CUDA compute (2 tests); other 2 just use ResourceClaimTemplate semantics |
| test_cd_imex_chan_inject.bats | ✓ | — | — | ✓ (tests auto-skip on MOCK_NVML=true) | ✓ | IMEX daemon (Blackwell + drv ≥570.158.01) |
| test_cd_logging.bats | ✓ | — | — | ✓ (auto-skip on MOCK) | ✓ | IMEX daemon |
| test_cd_misc.bats | ✓ | — | — | ✓ (auto-skip on MOCK) | ✓ | IMEX daemon |
| test_cd_updowngrade.bats | ✓ | — | — | ✓ (auto-skip on MOCK) | ✓ | IMEX daemon + prior-release image |
| test_cd_failover.bats | ✓ | — | — | ✓ (auto-skip on MOCK) | ✓ | multi-node NVLink fabric (≥2 nodes, 4 GPU/node) |
| test_cd_mnnvl_workload.bats | ✓ | — | — | ✓ (auto-skip on MOCK) | ✓ | multi-node NVLink fabric, real NCCL, MPI Operator |
| Files included | 13/16 | 7/16 | 6/16 | 13/16 | 7/16 | |
| Invoked by CI | — | — | Lambda presubmit + periodic (both arch) | GH Actions mock-nvml-e2e | — | |
Takeaways:
- `tests` is not "all 16 bats files" — it excludes `robustness`, `sharing`, and `cuda_workloads`.
- `tests-mock-nvml` includes all 6 CD files and `cuda_workloads`, but:
  - CD files: every `@test` in `test_cd_*.bats` starts with a `MOCK_NVML` skip guard (`tests/bats/test_cd_imex_chan_inject.bats:17` etc.), so they contribute ~zero executed assertions on the mock runner.
  - cuda_workloads: the mock-runner filter `!cuda-workload` is a no-op — no test in that file carries the `cuda-workload` tag (they're tagged `gpu-workloads` and `fastfeedback`). Of the 4 tests in `test_gpu_cuda_workloads.bats`, the CUDA-demo-suite test (line 31) and the busGrind test (line 118) skip via `MOCK_NVML` guards; the Job-with-ResourceClaimTemplate and Deployment-2-replicas tests (lines 52, 82) do NOT skip and actually execute on the mock runner. So mock-nvml does exercise RCT/deployment paths, just not real CUDA compute.
- `tests-gpu-single` includes `test_gpu_dynmig`, so dynamic-MIG paths do get exercised in CI on GPUs the Lambda driver leaves unfiltered (H100 / GH200 / B200). Static MIG (`test_gpu_mig`) is only in `tests` / `tests-gpu`, neither of which is wired to CI.
- The comment in `hack/ci/mock-nvml/e2e-test.sh:377-379` ("We skip `test_gpu_cuda_workloads.bats` because it includes a CUDA demo suite test …") is stale — the file is actually included via `tests-mock-nvml`, and skipping happens per-test via `MOCK_NVML` guards, not at the file level.
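The tag-filter no-op is mechanical. A minimal shell re-implementation of negated tag filtering (a sketch only — real `bats --filter-tags` semantics are richer, and `runs_under_filter` is a hypothetical helper, not bats code) shows why `!cuda-workload` can never exclude a test tagged `gpu-workloads`/`fastfeedback`:

```shell
# Sketch of bats-style negated tag filtering: a '!tag' entry only removes
# tests that actually carry that tag.
runs_under_filter() {  # $1: space-separated test tags, $2: comma-separated filter
  local IFS=',' f
  for f in $2; do
    case "$f" in
      '!'*) case " $1 " in *" ${f#!} "*) return 1 ;; esac ;;   # tag present → excluded
      *)    case " $1 " in *" $f "*) ;; *) return 1 ;; esac ;; # required tag absent
    esac
  done
  return 0
}

# The cuda_workloads tests carry gpu-workloads/fastfeedback, so the exclusion
# never matches and they stay selected:
runs_under_filter "gpu-workloads fastfeedback" '!cuda-workload,!dynmig,!mig' && echo selected
# → selected
```

Only the per-`@test` `MOCK_NVML` guards, not the filter, keep the two real-CUDA tests off the mock runner.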
A compact view of what actually runs where.
| Provider | Caller | Arch | GPU model | Real CUDA? | DynMIG? | Static MIG? | IMEX / ComputeDomain? | Multi-GPU? | Multi-node? | Suite run |
|---|---|---|---|---|---|---|---|---|---|---|
| GH Actions runner + mock-nvml | GH Actions PR | amd64 | 8× virtual GB200 | partial (RCT/Deployment tests run; the 2 real-CUDA tests auto-skip on MOCK) | ✗ (filter `!dynmig`) | ✗ (filter `!mig`) | files included, but every CD test auto-skips on MOCK_NVML=true | ✓ (virtual 8×) | ✗ | `tests-mock-nvml` (13 files included, many skip at runtime) |
| Lambda (x86, A10) | Prow presubmit + 6h periodic | amd64 | A10 (most common) | ✓ | ✗ (A10 not MIG-capable → `!dynmig`) | ✗ | ✗ (CD disabled unless gb200/gb300/b200) | ✗ | ✗ | `tests-gpu-single` (6 files) |
| Lambda (x86, A100) | Prow presubmit + 6h periodic (when A10 unavailable) | amd64 | A100 SXM4 40GB (1×) | ✓ | ✗ (single-GPU A100 → `!dynmig` per e2e-test.sh:112-120) | ✗ | ✗ | ✗ | ✗ | `tests-gpu-single` (6 files) |
| Lambda (arm64) | Prow presubmit + 6h periodic | arm64 | GH200 (1×) | ✓ (no busGrind — arm64 apt limitation) | ✓ (GH200 matches `*gh200*`, DynMIG enabled) | ✗ | ✗ (CD only on gb200/gb300/b200) | ✗ | ✗ | `tests-gpu-single` (6 files) |
| GCP-nvkind | Prow presubmit + 6h periodic | amd64 | T4 (1×) | ✓ | ✗ (T4 not MIG-capable) | ✗ | ✗ | ✗ | ✗ | Ginkgo `test/e2e/` (6 specs) |
The nvidia-gpu rollup on testgrid also displays 10 tabs from the NVIDIA device-plugin (k/k) program. Listed for context only — they do not test this driver but share the dashboard:
| Tab | Job | Status |
|---|---|---|
| `ci-kubernetes-e2e-ec2-device-plugin-gpu` | periodic | FLAKY 80% |
| `ci-lambda-device-plugin-gpu` | periodic | PASSING |
| `ci-lambda-device-plugin-gpu-gh200` | periodic | FLAKY 70% |
| `gce-device-plugin-gpu-{1.33,1.34,1.35,1.36,master}` | periodic | PASSING / master FLAKY 90% |
| `pull-kubernetes-e2e-ec2-device-plugin-gpu` | presubmit | STALE (last run 2026-03-18) |
| `pull-lambda-device-plugin-gpu` | presubmit | FLAKY 50% |
The two Prow jobs `ci-dra-driver-nvidia-gpu-e2e-lambda-gpu` and `pull-dra-driver-nvidia-gpu-e2e-lambda-gpu` pass `GPU_TYPE=""`. The resolution happens in two layers:

Layer 1 — `experiment/lambda/lib/lambda-common.sh` (test-infra):

```sh
LAMBDA_GPU_TYPE="${GPU_TYPE-gpu_1x_a10}"   # '-' not ':-' → empty stays empty
...
if [ -n "${LAMBDA_GPU_TYPE}" ]; then
  gpu_args=(--gpu "${LAMBDA_GPU_TYPE}")
fi
lambdactl --json watch "${gpu_args[@]}" --ssh ... --interval 30 --timeout 900 --wait-ssh
```

When empty, `--gpu` is omitted entirely — no filter, no region pin.
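The `-` vs `:-` distinction is what makes an empty-but-set `GPU_TYPE` survive. A standalone demo (variable names illustrative):

```shell
# '-'  substitutes the default only when the variable is UNSET;
# ':-' substitutes when it is unset OR empty.
GPU_TYPE=""                              # Prow passes an empty-but-set variable
echo "dash=[${GPU_TYPE-gpu_1x_a10}]"     # empty survives
echo "colon=[${GPU_TYPE:-gpu_1x_a10}]"   # empty would be replaced
unset GPU_TYPE
echo "unset=[${GPU_TYPE-gpu_1x_a10}]"    # now even '-' applies the default
# prints:
#   dash=[]
#   colon=[gpu_1x_a10]
#   unset=[gpu_1x_a10]
```

Had the script used `:-`, every job would silently pin to `gpu_1x_a10` and the cheapest-available behavior below would never trigger.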
Layer 2 — `lambdactl watch` (dims/lambdactl, `cmd/watch.go`):
- Poll `lambdactl types` every 30s for up to 900s.
- Keep types with at least one region currently showing availability.
- Sort by `PriceCents` ascending, pick `candidates[0]`.
- Launch into `Regions[0]` of that type.
- On a retryable capacity error → `continue` the loop and re-poll. On a quota error → hard-fail (not retryable).
After the launch returns, the script overwrites `LAMBDA_GPU_TYPE` with the actual provisioned type so BATS capability gating works on what really got allocated.
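The watch loop's selection rule reduces to "cheapest with availability wins". A toy sketch of that rule — `pick_cheapest` and its line format are hypothetical, not lambdactl's real data model:

```shell
# stdin: one line per SKU, "<price_cents> <sku> <first-available-region|->",
# where '-' marks no current availability. Sort by price, take the first
# SKU that has a region — the same ordering the price table below implies.
pick_cheapest() {
  sort -n | awk '$3 != "-" { print $2 "@" $3; exit }'
}

pick_cheapest <<'EOF'
129 gpu_1x_a10 -
199 gpu_1x_a100_sxm4 us-east-1
229 gpu_1x_gh200 -
632 gpu_8x_v100_n us-south-2
EOF
# → gpu_1x_a100_sxm4@us-east-1  (A10 shows no capacity, so next-cheapest wins)
```

This is exactly how a presubmit can land on A100 SXM4 — or, when both cheap pools drain, on the V100-8x quota trap.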
Cheapest-first, so first-available is what gets picked:

| Rank | SKU | $/hr | GPU | Arch | Current avail |
|---|---|---|---|---|---|
| 1 | `gpu_1x_a10` | $1.29 | A10 24GB PCIe | x86 | 1 region |
| 2 | `gpu_1x_a100_sxm4` | $1.99 | A100 40GB SXM4 | x86 | 2 regions |
| 3 | `gpu_2x_a6000` | $2.18 | 2×A6000 48GB | x86 | 0 |
| 4 | `gpu_1x_gh200` | $2.29 | GH200 96GB | arm64 | 1 region |
| 5 | `gpu_1x_h100_pcie` | $3.29 | H100 80GB PCIe | x86 | 1 region |
| 6 | `gpu_1x_h100_sxm5` | $4.29 | H100 80GB SXM5 | x86 | 1 region |
| 7 | `gpu_8x_v100_n` | $6.32 | 8×V100 16GB | x86 | 1 region |
| 8 | `gpu_1x_b200_sxm6` | $6.99 | B200 180GB SXM6 | x86 | 1 region |
| — | (heavier SKUs) | $8.38–$53.52 | 2×/4×/8× H100/B200/A100 | x86 | 0 |
Periodic `ci-dra-driver-nvidia-gpu-e2e-lambda-gpu`:

```
gpu_1x_a10 @ us-east-1  ########## 10/10 (100%)
```

All ten runs, A10 @ us-east-1. The cheapest SKU has been consistently available during periodic windows.

Presubmit `pull-dra-driver-nvidia-gpu-e2e-lambda-gpu` (last 10 attempts):

```
Actually launched (7/10):
  gpu_1x_a10       @ us-east-1            ##### 5
  gpu_1x_a100_sxm4 @ us-east-1/us-west-2  ## 2
Pre-launch quota-failed (3/10):
  gpu_8x_v100_n    @ us-south-2           ### 3   ← hard fail, no retry
```
Three consecutive presubmit failures on 2026-04-18 all hit the same trap: Lambda advertised `gpu_8x_v100_n` @ us-south-2 as available (cheapest-with-capacity at that moment), `lambdactl` raced to launch it, and the account returned "Quota exceeded", which `lambdactl` treats as non-retryable. This is a real contributor to the 50% flake on the presubmit tab.

One clean example of the capacity-retry path (build 2045306201683529728): `gpu_1x_a10` @ us-west-1 hit "Not enough capacity" three times, then `gpu_1x_a100_sxm4` @ us-east-1 became cheapest-available on the next poll and launched.
- "Lambda x86" ≠ A10. It is A10 most of the time, A100 SXM4 when the A10 pool is tight, and could be any SKU that polls as momentarily cheapest-with-availability — the V100-8x incident shows pricier SKUs do get picked when the cheap pools drain.
- MIG never fires even when A100 lands — the job invokes `tests-gpu-single`, which excludes `test_gpu_mig`, and the single-GPU A100 path filters `!dynmig`. So the rare A100 runs are wasted for MIG coverage.
- Quota-exceeded on `gpu_8x_v100_n` is a latent bug. Either the test-infra account gets its V100-8x quota raised, or `lambdactl watch` needs to learn to treat quota errors as retryable (with a short deny-list for that poll-loop iteration).
GPU hardware coverage
- In practice CI lands on: T4 (GCP-nvkind), A10 (Lambda x86, dominant), A100 SXM4 40GB (Lambda x86 fallback, occasional), GH200 (Lambda arm64). Everything else Lambda advertises (H100 PCIe/SXM5, B200 SXM6, V100) could be selected whenever it is momentarily the cheapest SKU with availability, but in the last ~20 runs the cheaper pools (A10, A100) always won — or the V100-8x quota trap fired before launch.
- Only `tests-mock-nvml` exercises GB200/B200 profiles — all synthetic. On this runner: every CD test skips at runtime (`MOCK_NVML` guards), and 2 of 4 `cuda_workloads` tests skip; the remaining 2 (RCT + 2-replica Deployment) do execute.
- Static MIG (`test_gpu_mig.bats`) never runs in CI — it only appears in `tests` / `tests-gpu`, neither of which is invoked.
- Dynamic MIG (`test_gpu_dynmig.bats`) runs only on Lambda GH200 — it's in `tests-gpu-single`, but `hack/ci/lambda/e2e-test.sh` filters it out except on `*h100*|*gh200*|*b200*`. The x86 presubmit/periodic (A10, single-GPU A100) always filter `!dynmig`. If Lambda ever lands H100 PCIe or B200 on the x86 job, those would also exercise DynMIG.
Test-suite coverage
- `tests-cd` (full ComputeDomain suite) is not run in any CI — the failover, logging, misc, multi-node workload, and CD-updowngrade tests only run locally or via manual runs/overrides if someone wires them up.
- `tests-gpu` (full GPU suite, includes MIG / stress / updowngrade) is not run in any CI — Lambda jobs use the `-single` subset.
- Static MIG (`test_gpu_mig`) never executes in CI because no MIG-capable GPU is wired to a suite that includes it; dynamic MIG (`test_gpu_dynmig`) executes only on the GH200 job.
- Real-CUDA bats tests (the CUDA-demo-suite and busGrind tests inside `test_gpu_cuda_workloads.bats`) only run on Lambda (A10/A100/GH200). The other two tests in the same file (RCT + 2-replica Deployment) also run on mock-nvml. Nothing in GCP-nvkind exercises this file at all — it runs Ginkgo `test/e2e/`, not BATS.
Architecture coverage
- arm64 is covered only by Lambda GH200.
- `mock-nvml-e2e.yaml` runs on `ubuntu-latest` (amd64) with a multi-arch buildx image, but the runtime is amd64.
- GCP-nvkind is hard-coded amd64 (`linux-amd64` download in `setup-nvkind-node.sh`).
Kubernetes version coverage
- GCP-nvkind alone pins Kubernetes versions: the periodic smoke-tests v1.35.1, the presubmit v1.34.3. Lambda uses "latest stable", unpinned.
- Release branches (`release-*`): all three Prow e2e presubmits have `skip_branches: [release-\d+\.\d+]`, so release branches get no e2e presubmit gating. Periodics are main-only (`extra_refs: ...@main`). Release branches only get GH-Actions lint/unit/mock-nvml.
Multi-node / NVLink fabric
- No CI runs multi-node. All ComputeDomain failover/MNNVL tests require ≥2 nodes with 4 GPUs each — nothing in `hack/ci/*` provisions that topology.
Optionality / blocking
- All 3 Prow e2e presubmits (`pull-*-lambda-gpu`, `pull-*-lambda-gpu-gh200`, `pull-*-gcp-nvkind`) are `optional: true` — they post status but cannot block.
- GitHub branch protection on `main` and `release-25.8` lists `EasyCLA` as the only required status check (verified via `gh api`); rulesets are empty. That means no GH-Actions job (not lint, not unit, not image, not mock-nvml-e2e) is a required check either. A PR can merge with every CI job red as long as EasyCLA is green and tide/OWNERS approval lands. Effective merge gates: EasyCLA + LGTM/approval.
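The branch-protection claim can be re-checked with the gh CLI (sketch; the live call needs read access to the repo's protection settings). A sample payload is piped through the same jq filter to show the EasyCLA-only shape the live call returned:

```shell
# Live check (commented out — requires network + permissions):
#   gh api repos/kubernetes-sigs/dra-driver-nvidia-gpu/branches/main/protection \
#     --jq '.required_status_checks.contexts'
# Same jq filter over a sample payload, showing the expected shape:
echo '{"required_status_checks":{"contexts":["EasyCLA"]}}' \
  | jq -c '.required_status_checks.contexts'
# → ["EasyCLA"]
```

Any context missing from that array — every Prow presubmit, every GH Actions job — is purely advisory.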
Stability (live testgrid snapshot 2026-04-21; these numbers move — re-check via `curl -s https://testgrid.k8s.io/nvidia-dra/summary`)
- `ci-dra-driver-nvidia-gpu-gcp-nvkind` periodic: FLAKY 70% (7/10 recent columns).
- `pull-dra-driver-nvidia-gpu-lambda` presubmit: FLAKY 50% (5/10).
- `pull-dra-driver-nvidia-gpu-lambda-gh200` presubmit: FLAKY 50% (1/2 — very low sample).
- Both Lambda periodics: PASSING 100% recent.
- With `optional: true` and chronic presubmit flake, signal is weak.
- No `testgrid-alert-email` on any DRA-driver tab. Failures do not page anyone.
- `gpu_8x_v100_n` quota-exceeded: the Lambda account advertises capacity for an SKU it has no quota for; `lambdactl watch` treats quota errors as non-retryable and hard-fails. Three of the last ten presubmit attempts died this way. Fix options: (a) raise the V100-8x quota, (b) make quota errors retryable with a per-poll deny-list, or (c) set an explicit allow-list on the Prow job (e.g., `GPU_TYPE=gpu_1x_a10,gpu_1x_a100_sxm4,gpu_1x_h100_pcie`) so V100-8x is never considered.
Secrets / credential surface
- Lambda API key (k8s secret `lambda-ai-api-key`) + a Boskos-leased GCP project. Both relatively narrow — consistent with the "off-cluster heavy lifting" pattern (no DinD or privileged containers on the Prow pod).
Prow jobs
- https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes-sigs/dra-driver-nvidia-gpu/dra-driver-nvidia-gpu-lambda.yaml
- https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes-sigs/dra-driver-nvidia-gpu/dra-driver-nvidia-gpu-gcp-nvkind.yaml
- https://github.com/kubernetes/test-infra/blob/master/config/jobs/image-pushing/k8s-staging-dra-driver-nvidia-gpu.yaml
Testgrid
- https://testgrid.k8s.io/nvidia-gpu (rollup — 16 tabs)
- https://testgrid.k8s.io/nvidia-dra (repo-scoped — 6 tabs)
- https://testgrid.k8s.io/nvidia-arm64 (3 tabs, GH200 only)
- https://testgrid.k8s.io/nvidia-presubmits
- https://testgrid.k8s.io/nvidia-periodics
- https://github.com/kubernetes/test-infra/blob/master/config/testgrids/nvidia/nvidia.yaml
Repo
- `.github/workflows/mock-nvml-e2e.yaml`
- `hack/ci/{lambda,gcp-nvkind,mock-nvml}/e2e-test.sh`
- `tests/bats/Makefile` (SUITE selectors)
- `test/e2e/` (Ginkgo)