Date: 2026-04-12
Scope: Forward-looking roadmap for expanding DRA GPU driver test coverage on Lambda Cloud. Covers only what remains to be done — not what's already landed or in flight.
Prerequisite: PRs #1025, #1027, #1028 should be merged first. After they land, Lambda CI runs 25 tests across 6 test files covering basic GPU allocation, CUDA workloads, Dynamic MIG, TimeSlicing, MPS, DRAExtendedResource, Prometheus metrics, CEL selectors, claim lifecycle, and robustness.
These tests already exist and work. Just add them to the tests-gpu-single Makefile target.
One test: install last-stable chart, run a workload, upgrade to current-dev, verify forward compatibility. Self-contained — pulls TEST_CHART_LASTSTABLE_VERSION from NGC.
Change: Add one line to Makefile. Tradeoff: Adds ~30-60s to presubmit. Pulls a second chart from the registry (network dependency). Instance: Any GPU.
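As a sketch, the one-line change could look like this (the target/variable names in the repo's actual Makefile may differ):

```make
# Hypothetical layout: append the upgrade/downgrade suite to the single-GPU target.
# Adapt to the real target/variable names in the repo's Makefile.
TESTS_GPU_SINGLE += tests/bats/test_gpu_updowngrade.bats
```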
One test: create a shared ResourceClaim, stamp out N pods sharing that GPU, repeat M times. CI defaults: N=20 pods, M=1 iteration.
Change: Add one line to Makefile.
Tradeoff: Adds ~30-45s of heavy scheduling. Could reduce N for Lambda (e.g., TEST_GPU_STRESS_PODS_N=10).
Instance: Any GPU.
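The N override uses the usual shell default-expansion pattern; a minimal self-contained sketch (`TEST_GPU_STRESS_ITERATIONS_M` is a hypothetical name for the M knob, not confirmed from the repo):

```shell
#!/usr/bin/env bash
# Read the stress-test knobs from the environment, defaulting to the CI values.
: "${TEST_GPU_STRESS_PODS_N:=20}"
: "${TEST_GPU_STRESS_ITERATIONS_M:=1}"   # hypothetical variable name for M
echo "running ${TEST_GPU_STRESS_ITERATIONS_M} iteration(s) of ${TEST_GPU_STRESS_PODS_N} pods"
```

Invoking the suite with `TEST_GPU_STRESS_PODS_N=10` in the environment (however the harness forwards env vars) would then halve the scheduling load on Lambda.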
The existing test_basics.bats has 9 tests but assumes GPU Operator. Several are GPU-Operator-independent and useful:
- Helm release name/version validation
- CRD existence check
- Pod readiness check
- SIGUSR2 goroutine dump
Change: Tag the GPU-Operator-dependent tests (e.g., a `gpu-operator` tag) and exclude them via `TEST_FILTER_TAGS`, or extract the portable tests into a new file.
Instance: Any GPU.
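With bats-core 1.8+ tagging, the split can avoid a new file entirely; a sketch (how `TEST_FILTER_TAGS` maps onto the bats invocation is repo-specific):

```bash
# In test_basics.bats, annotate the operator-dependent tests:
# bats test_tags=gpu-operator

# At run time, exclude that tag (bats-core >= 1.8):
bats --filter-tags '!gpu-operator' tests/bats/test_basics.bats
```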
These require new BATS test logic but work on any Lambda GPU instance ($1.29/hr on A10).
Install with featureGates.NVMLDeviceHealthCheck=true. Verify:
- Kubelet-plugin logs contain "Starting NVML event monitor"
- No `DeviceTaint` resources exist (healthy GPUs shouldn't be tainted)
~15 lines of BATS. Cannot simulate real XID errors on Lambda — smoke test only.
Tag: health
Mutual exclusion: NVMLDeviceHealthCheck is mutually exclusive with DynamicMIG and PassthroughSupport. Must be in a separate setup_file that installs the chart with a different feature gate set.
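A sketch of the smoke test; the install helper, pod labels, and the exact taint resource name are assumptions to adapt to the repo's BATS helpers:

```bash
# bats test_tags=health
# Sketch only: install_chart, labels, and the DeviceTaint resource name
# (DeviceTaint vs. DeviceTaintRule, API group/version) must match the repo.

setup_file() {
    install_chart --set featureGates.NVMLDeviceHealthCheck=true
}

@test "kubelet-plugin starts the NVML event monitor" {
    run kubectl logs -n "$TEST_NAMESPACE" -l app=nvidia-dra-driver-gpu-kubelet-plugin --tail=-1
    [ "$status" -eq 0 ]
    [[ "$output" == *"Starting NVML event monitor"* ]]
}

@test "healthy GPUs carry no device taints" {
    run kubectl get devicetaints -o name
    [ "$status" -eq 0 ]
    [ -z "$output" ]
}
```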
Install with controller.leaderElection.enabled=true, controller.replicas=2, resources.computeDomains.enabled=true. Verify:
- Two controller pods become Ready
- Exactly one pod's logs contain "successfully acquired lease"
~20 lines of BATS. Single-node is fine — both replicas run on the same node.
Note: Requires resources.computeDomains.enabled=true so the controller Deployment is created. On non-B200 instances, the compute-domains container in the kubelet-plugin will crash (no IMEX), but the controller itself should still run and elect a leader.
Tag: controller
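A sketch, with the deployment and label names as placeholders for whatever the chart actually renders:

```bash
# bats test_tags=controller
# Sketch only: install_chart, deployment, and label names are illustrative.

setup_file() {
    install_chart \
        --set controller.leaderElection.enabled=true \
        --set controller.replicas=2 \
        --set resources.computeDomains.enabled=true
    kubectl -n "$TEST_NAMESPACE" rollout status deploy/nvidia-dra-driver-gpu-controller --timeout=180s
}

@test "exactly one controller replica acquires the lease" {
    local leaders=0 pod
    for pod in $(kubectl -n "$TEST_NAMESPACE" get pods -l app=nvidia-dra-driver-gpu-controller -o name); do
        kubectl -n "$TEST_NAMESPACE" logs "$pod" | grep -q "successfully acquired lease" && leaders=$((leaders + 1))
    done
    [ "$leaders" -eq 1 ]
}
```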
Install with webhook.enabled=true. Verify admission-time rejection of invalid ResourceClaim parameters:
- Valid opaque config accepted
- Unknown field rejected (HTTP 422)
- Invalid sharing strategy rejected
~40 lines of BATS, plus TLS scaffolding. Options:
- Self-signed cert via `openssl` in test setup (no external deps)
- cert-manager (heavier, needs to be installed)
Tag: webhook
Blocker: TLS certificate generation in the test harness.
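For the openssl route, the scaffolding is small; a sketch (the Service DNS name below is a placeholder and must match the actual webhook Service and namespace):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder: must match the webhook Service's in-cluster DNS name.
SVC_DNS="nvidia-dra-driver-gpu-webhook.nvidia-dra-driver-gpu.svc"
CERT_DIR="$(mktemp -d)"

# One self-signed cert, valid for a day, SAN pinned to the Service name.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout "${CERT_DIR}/tls.key" -out "${CERT_DIR}/tls.crt" \
    -subj "/CN=${SVC_DNS}" \
    -addext "subjectAltName=DNS:${SVC_DNS}"

# Sanity-check what was generated.
openssl x509 -in "${CERT_DIR}/tls.crt" -noout -subject
```

The key/cert pair would be mounted into the webhook via a Secret, and the same cert (base64-encoded) set as the webhook configuration's caBundle, since it is self-signed.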
Install with compute domains enabled. Curl controller's :8080/metrics. Verify:
- `nvidia_dra_compute_domain_info` gauge exists
- Standard client-go REST metrics present
~15 lines of BATS.
Tag: controller
Instance: Cheapest instance where compute domains auto-enable. Currently B200 ($6.99/hr). If the graceful-degradation IMEX fix (already on main) works on all instances, controller metrics should be available anywhere resources.computeDomains.enabled=true is set.
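The check itself is plain text matching on the scrape body; a self-contained sketch (the sample payload below is fabricated for illustration; a real test would set `metrics` from `curl -s http://localhost:8080/metrics`):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative stand-in for a real /metrics scrape.
metrics='# TYPE nvidia_dra_compute_domain_info gauge
nvidia_dra_compute_domain_info{name="cd-0"} 1
rest_client_requests_total{code="200",method="GET"} 42'

# The two assertions from the test plan:
echo "$metrics" | grep -q '^nvidia_dra_compute_domain_info{' && echo "compute-domain gauge present"
echo "$metrics" | grep -q '^rest_client_requests_total' && echo "client-go REST metrics present"
```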
Two QA-plan tests not yet written:
TC-MIG-DYN-004 (over-capacity): With DynamicMIG=true on A100-40GB, request 8 x 1g.5gb MIG profiles (max is 7). Assert the 8th claim stays Pending with a capacity error.
TC-MIG-DYN-005 (fragmentation): Allocate several 1g.5gb profiles, then request a 3g.20gb profile. Assert it stays Pending due to fragmentation.
~25 lines each. Require a multi-GPU MIG instance (gpu_8x_a100 at $15.92/hr or gpu_8x_h100_sxm5 at $31.92/hr) to avoid the single-GPU A100 "In use by another client" issue.
Tag: dynmig,multi-gpu
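TC-MIG-DYN-004 could take roughly this shape (spec paths, pod names, and the pinning of all claims to one physical GPU are placeholders; the real test should also assert on the capacity error message):

```bash
# bats test_tags=dynmig,multi-gpu
# Sketch of TC-MIG-DYN-004; spec paths and names are illustrative, and the
# spec YAML is assumed to pin every claim to the same physical GPU.

@test "over-capacity: 8th 1g.5gb consumer stays Pending" {
    # 7 x 1g.5gb fits on one A100-40GB; the 8th pod must not schedule.
    for i in $(seq 1 8); do
        sed "s/INDEX/$i/" specs/dynmig-1g5gb-pod.yaml | kubectl apply -f -
    done
    for i in $(seq 1 7); do
        kubectl wait --for=condition=Ready "pod/mig-pod-$i" --timeout=180s
    done
    run kubectl get pod mig-pod-8 -o jsonpath='{.status.phase}'
    [ "$output" = "Pending" ]
}
```

TC-MIG-DYN-005 follows the same pattern with mixed profile sizes.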
Three existing tests in test_gpu_mig.bats that use the nvmm helper to exec into GPU Operator's mig-manager pod. Lambda has no GPU Operator.
What needs to change: Replace the nvmm-based mig_create_1g0_on_node() and mig_ensure_teardown_on_all_nodes() helpers in helpers.sh with direct nvidia-smi mig commands. Options:
- Run `nvidia-smi mig` on the host (via SSH from `e2e-test.sh`, like the MIG pre-cleanup)
- Run via a privileged pod that has access to `/dev/nvidia*`
- Create a `tests/bats/lib/lambda/nvmm-direct` script that runs `nvidia-smi` commands on the node
Same "In use" limitation: Single-GPU A100 fails. Use H100/GH200/B200 for single-GPU, or 8x A100 for multi-GPU.
Tag: mig
Effort: Medium — the helper functions need rewriting, but the test logic and spec YAMLs already exist.
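One possible shape for the rewrite, using the SSH option (`remote_exec` is an assumed wrapper over the same SSH path `e2e-test.sh` uses for MIG pre-cleanup; the 1g profile name should match what the existing spec YAMLs expect):

```bash
# Sketch: GPU-Operator-free rewrites of the nvmm-based helpers in helpers.sh.
# remote_exec "<node>" "<cmd>" is assumed to run a command on the node via SSH.

mig_create_1g0_on_node() {
    local node="$1"
    # Create one 1g GPU instance (plus its compute instance) on GPU 0.
    remote_exec "$node" "sudo nvidia-smi mig -i 0 -cgi 1g.5gb -C"
}

mig_ensure_teardown_on_all_nodes() {
    local node
    for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
        # Destroy compute instances first, then GPU instances; tolerate "nothing to do".
        remote_exec "$node" "sudo nvidia-smi mig -dci || true; sudo nvidia-smi mig -dgi || true"
    done
}
```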
Three to five tests from test_cd_imex_chan_inject.bats and test_cd_misc.bats that only need IMEX channels on a single node (not multi-node MNNVL fabric).
Status: Investigated on gpu_8x_a100 — nvidia-fabricmanager is running but /dev/nvidia-caps-imex-channels/ doesn't exist. IMEX channel device nodes need to be created manually via mknod or the NVreg_CreateImexChannel0 kernel module parameter.
Next step: On the next 8x H100 SXM5 run, test:
```bash
modprobe nvidia NVreg_CreateImexChannel0=1
ls /dev/nvidia-caps-imex-channels/
```

If IMEX channels appear, the single-node CD tests become feasible.
Tag: compute-domain
Instance: gpu_8x_h100_sxm5 ($31.92/hr) or gpu_8x_a100 ($15.92/hr)
Currently only a presubmit exists (always_run: true, optional: true). Add a periodic job for tests too slow or expensive for presubmit:
```yaml
# ci-dra-driver-nvidia-gpu-e2e-lambda-gpu-nightly
interval: 6h
GPU_TYPE: ""  # or pin to gpu_1x_h100_sxm5 for DynMIG
```

This periodic would run the full tests-gpu-single plus upgrade/downgrade and stress tests. Define a new Make target tests-gpu-single-extended or pass additional env vars.
Repo: kubernetes/test-infra Effort: One YAML file.
For weekly multi-GPU and MIG-specific runs:
```yaml
# ci-dra-driver-nvidia-gpu-e2e-lambda-gpu-weekly-multigpu
interval: 168h  # weekly
GPU_TYPE: "gpu_8x_h100_sxm5"
```

This would run the full tests-gpu target (which includes static MIG, stress, upgrade/downgrade, and all multi-GPU tests).
Cost: ~$16/run for 30 min on 8x H100.
Lambda now offers B200 instances (1x at $6.99/hr through 8x at $53.52/hr). These are Blackwell architecture (compute capability 10.0) and the only Lambda instances where compute domains auto-enable.
Action: Run the existing presubmit on gpu_1x_b200_sxm6 to validate:
- Driver builds and loads on Blackwell
- Basic GPU allocation works
- Compute domain controller starts (auto-enabled on B200)
- Controller metrics are accessible
No new test code needed — just launch with GPU_TYPE=gpu_1x_b200_sxm6.
| Use Case | Best Instance | $/hr | Why |
|---|---|---|---|
| Presubmit (every PR) | Any available (`""`) | $1.29-4.29 | Cheapest. Covers 15-25 tests depending on GPU. |
| DynMIG (reliable) | gpu_1x_h100_pcie | $3.29 | Cheapest MIG GPU where DynMIG works reliably. |
| Multi-GPU (cheapest) | gpu_2x_h100_sxm5 | $8.38 | Only need 2 GPUs for most multi-GPU tests. |
| Compute domains | gpu_1x_b200_sxm6 | $6.99 | Cheapest B200. CD auto-enables. |
| Full coverage | gpu_8x_h100_sxm5 | $31.92 | Multi-GPU + MIG + NVSwitch. |
| arm64 | gpu_1x_gh200 | $2.29 | Only arm64 option. |
Note: gpu_2x_h100_sxm5 ($8.38/hr) is better value than gpu_8x_a100 ($15.92/hr) for tests that only need 2+ GPUs.
| Feature | Reason | What Would Be Needed |
|---|---|---|
| Multi-node CD workloads (nvbandwidth, failover) | Needs MNNVL fabric across nodes | DGX/HGX cluster |
| CD failover (force-delete across nodes) | Needs multi-node CD | Same |
| CD upgrade/downgrade with running workloads | Needs multi-node CD | Same |
| NVLink fabric error handling (CrashOnNVLinkFabricErrors) | Needs fabric-attached GPUs with errors | Same |
| VFIO passthrough | IOMMU not enabled on Lambda | Needs BIOS/kernel IOMMU support |
| OpenShift 4.21 | Different platform | OCP cluster |
| Real XID fault injection | Needs hardware error simulation | Specialized test hardware |
| Multi-node MIG management | Needs MIG on worker nodes via Operator | Multi-node cluster + GPU Operator |
| # | Item | Effort | New Code? | Blocked? |
|---|---|---|---|---|
| 1 | Add test_gpu_updowngrade.bats to tests-gpu-single | 1 line | No | No |
| 2 | Add test_gpu_stress.bats to tests-gpu-single | 1 line | No | No |
| 3 | NVML health smoke test | ~15 lines | Yes | No |
| 4 | Leader election test | ~20 lines | Yes | No |
| 5 | Create periodic Prow job (nightly) | 1 YAML file | No | No |
| 6 | DynMIG negative tests (TC-MIG-DYN-004/005) | ~50 lines | Yes | Needs multi-GPU MIG instance |
| 7 | Static MIG nvmm replacement | Modify helpers.sh | Partial | Same "In use" A100 limitation |
| 8 | Webhook validation test | ~40 lines + TLS | Yes | TLS scaffolding needed |
| 9 | B200 Blackwell validation run | 0 code | No | Instance availability |
| 10 | Investigate IMEX channels on 8x H100 | 5 seconds on instance | No | Instance availability |
| 11 | Create weekly multi-GPU Prow job | 1 YAML file | No | No |
| 12 | Extract portable test_basics.bats tests | Tag or split file | Minimal | No |
| 13 | Controller metrics test | ~15 lines | Yes | B200 instance |
| 14 | Node reboot recovery (#951) | ~30 lines | Yes | Disruptive, periodic only |
| Instance | GPUs | Arch | MIG | Multi-GPU | CD Auto-Enable | $/hr |
|---|---|---|---|---|---|---|
| gpu_1x_a10 | 1x A10 24GB | amd64 | No | No | No | $1.29 |
| gpu_1x_a100_sxm4 | 1x A100 40GB | amd64 | Yes* | No | No | $1.99 |
| gpu_1x_gh200 | 1x GH200 96GB | arm64 | Yes | No | No | $2.29 |
| gpu_1x_h100_pcie | 1x H100 80GB | amd64 | Yes | No | No | $3.29 |
| gpu_1x_h100_sxm5 | 1x H100 80GB | amd64 | Yes | No | No | $4.29 |
| gpu_1x_b200_sxm6 | 1x B200 180GB | amd64 | Yes | No | Yes | $6.99 |
| gpu_8x_v100_n | 8x V100 16GB | amd64 | No | Yes | No | $6.32 |
| gpu_2x_h100_sxm5 | 2x H100 80GB | amd64 | Yes | Yes | No | $8.38 |
| gpu_2x_b200_sxm6 | 2x B200 180GB | amd64 | Yes | Yes | Yes | $13.78 |
| gpu_8x_a100 | 8x A100 40GB | amd64 | Yes | Yes | No | $15.92 |
| gpu_4x_h100_sxm5 | 4x H100 80GB | amd64 | Yes | Yes | No | $16.36 |
| gpu_8x_a100_80gb | 8x A100 80GB | amd64 | Yes | Yes | No | $22.32 |
| gpu_4x_b200_sxm6 | 4x B200 180GB | amd64 | Yes | Yes | Yes | $27.16 |
| gpu_8x_h100_sxm5 | 8x H100 80GB | amd64 | Yes | Yes | No | $31.92 |
| gpu_8x_b200_sxm6 | 8x B200 180GB | amd64 | Yes | Yes | Yes | $53.52 |
* DynMIG fails on single-GPU A100 ("In use by another client"). Works on multi-GPU A100 (8x) and all H100/GH200/B200.
Features with zero BATS test coverage anywhere in the repo:
| Feature | Feature Gate | Default | What It Does |
|---|---|---|---|
| NVML Device Health | NVMLDeviceHealthCheck | off | XID error monitoring, DeviceTaint creation |
| VFIO Passthrough | PassthroughSupport | off | GPU passthrough via vfio-pci driver |
| Device Metadata | DeviceMetadata | off | Generates metadata files for prepared devices |
| Webhook | webhook.enabled | off | Admission-time validation of ResourceClaim parameters |
| Leader Election | controller.leaderElection.enabled | off | HA controller with Lease-based leader election |
| Network Policies | *.networkPolicy.enabled | off | Kubernetes NetworkPolicy for driver pods |
| Controller Replicas > 1 | controller.replicas | 1 | Multiple controller instances |
| pprof Profiling | controller.metrics.profilePath | "" | Runtime profiling endpoint |
Features with partial coverage (tested in some context but not comprehensively):
| Feature | What's Tested | What's NOT Tested |
|---|---|---|
| DynamicMIG | Basic allocation, multi-container | Over-capacity rejection, fragmentation blocking |
| ComputeDomainCliques | One explicit test in test_cd_misc.bats | Full lifecycle, edge cases |
| IMEXDaemonsWithDNSNames | Implicitly used | Legacy IP mode never tested |
| CrashOnNVLinkFabricErrors | Default-on but never exercised | No test validates crash vs fallback |