
@dims
Created April 12, 2026 20:41
Lambda Cloud GPU Test Coverage: What's Next (v2 roadmap)

Date: 2026-04-12
Scope: Forward-looking roadmap for expanding DRA GPU driver test coverage on Lambda Cloud. Covers only what remains to be done, not what's already landed or in flight.

Prerequisite: PRs #1025, #1027, #1028 should be merged first. After they land, Lambda CI runs 25 tests across 6 test files covering basic GPU allocation, CUDA workloads, Dynamic MIG, TimeSlicing, MPS, DRAExtendedResource, Prometheus metrics, CEL selectors, claim lifecycle, and robustness.


1. Zero-Code Wins: Add Existing Tests to Lambda CI

These tests already exist and work. Just add them to the tests-gpu-single Makefile target.

1.1 Helm Upgrade/Downgrade (test_gpu_updowngrade.bats)

One test: install last-stable chart, run a workload, upgrade to current-dev, verify forward compatibility. Self-contained — pulls TEST_CHART_LASTSTABLE_VERSION from NGC.

Change: Add one line to Makefile. Tradeoff: Adds ~30-60s to presubmit. Pulls a second chart from the registry (network dependency). Instance: Any GPU.
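The change could look like the following Makefile sketch; the existing recipe lines and the `$(BATS)` variable are placeholders for whatever the target already runs:

```make
# Sketch only: the first recipe line stands in for the existing tests.
tests-gpu-single:
	$(BATS) tests/bats/test_gpu_basic.bats        # placeholder for existing tests
	$(BATS) tests/bats/test_gpu_updowngrade.bats  # new: upgrade/downgrade coverage
```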

1.2 Stress Test (test_gpu_stress.bats)

One test: create a shared ResourceClaim, stamp out N pods sharing that GPU, repeat M times. CI defaults: N=20 pods, M=1 iteration.

Change: Add one line to Makefile. Tradeoff: Adds ~30-45s of heavy scheduling. Could reduce N for Lambda (e.g., TEST_GPU_STRESS_PODS_N=10). Instance: Any GPU.
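If presubmit time matters, the pod count can be reduced at invocation time (variable name from this roadmap; default is N=20):

```shell
TEST_GPU_STRESS_PODS_N=10 make tests-gpu-single
```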

1.3 Selective test_basics.bats Tests

The existing test_basics.bats has 9 tests but assumes the GPU Operator. Several are GPU-Operator-independent and still useful:

  • Helm release name/version validation
  • CRD existence check
  • Pod readiness check
  • SIGUSR2 goroutine dump

Change: Tag the GPU-Operator-dependent tests (e.g., gpu-operator) and exclude them via TEST_FILTER_TAGS. Or extract the portable tests into a new file. Instance: Any GPU.
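With bats-core >= 1.8 the split can be done with test tags instead of a new file. A sketch, where the tagged test and the TEST_FILTER_TAGS plumbing are assumptions:

```bash
# In test_basics.bats: mark GPU-Operator-dependent tests.
# bats test_tags=gpu-operator
@test "mig-manager pod is ready" {   # hypothetical GPU-Operator-dependent test
  ...
}
```

In the harness, `TEST_FILTER_TAGS='!gpu-operator'` would then translate to `bats --filter-tags '!gpu-operator' tests/bats/test_basics.bats`, running only the portable tests on Lambda.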


2. New Tests to Write (Any GPU)

These require new BATS test logic but work on any Lambda GPU instance ($1.29/hr on A10).

2.1 NVML Health Monitor Smoke Test

Install with featureGates.NVMLDeviceHealthCheck=true. Verify:

  • Kubelet-plugin logs contain "Starting NVML event monitor"
  • No DeviceTaint resources exist (healthy GPUs shouldn't be tainted)

~15 lines of BATS. Cannot simulate real XID errors on Lambda — smoke test only.

Tag: health
Mutual exclusion: NVMLDeviceHealthCheck is mutually exclusive with DynamicMIG and PassthroughSupport, so this test must live in a separate setup_file that installs the chart with a different feature-gate set.
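A possible shape for the smoke test; the namespace, label selector, and the taint resource name are assumptions, while the log line comes from this roadmap:

```bash
@test "NVML health monitor starts and taints nothing" {
  # Kubelet-plugin should have logged the monitor start line.
  kubectl logs -n nvidia-dra-driver -l app.kubernetes.io/name=kubelet-plugin --tail=-1 \
    | grep -q "Starting NVML event monitor"

  # Healthy GPUs: no device taints should exist (resource name is an assumption).
  run kubectl get devicetaints -o name
  [ -z "$output" ]
}
```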

2.2 Leader Election

Install with controller.leaderElection.enabled=true, controller.replicas=2, resources.computeDomains.enabled=true. Verify:

  • Two controller pods become Ready
  • Exactly one pod's logs contain "successfully acquired lease"

~20 lines of BATS. Single-node is fine — both replicas run on the same node.

Note: Requires resources.computeDomains.enabled=true so the controller Deployment is created. On non-B200 instances, the compute-domains container in the kubelet-plugin will crash (no IMEX), but the controller itself should still run and elect a leader.

Tag: controller
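The "exactly one leader" assertion can be sketched as plain shell, demonstrated here on fabricated log samples; in the real test each file would hold `kubectl logs` output from one controller replica, and only the "successfully acquired lease" string comes from the roadmap:

```shell
# Count how many replicas claim the lease; exactly one must hold it.
set -eu
tmp=$(mktemp -d)
printf 'controller: successfully acquired lease\n' > "$tmp/controller-0.log"
printf 'controller: attempting to acquire lease\n' > "$tmp/controller-1.log"

leaders=0
for f in "$tmp"/controller-*.log; do
  if grep -q "successfully acquired lease" "$f"; then
    leaders=$((leaders + 1))
  fi
done

# The assertion the BATS test would make:
[ "$leaders" -eq 1 ]
echo "leaders=$leaders"
```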

2.3 Webhook Validation

Install with webhook.enabled=true. Verify admission-time rejection of invalid ResourceClaim parameters:

  • Valid opaque config accepted
  • Unknown field rejected (HTTP 422)
  • Invalid sharing strategy rejected

~40 lines of BATS, plus TLS scaffolding. Options:

  • Self-signed cert via openssl in test setup (no external deps)
  • cert-manager (heavier, needs to be installed)

Tag: webhook
Blocker: TLS certificate generation in the test harness.
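The self-signed-cert option needs only openssl (>= 1.1.1 for -addext). A sketch, where the service and namespace names are placeholders:

```shell
# Generate a self-signed serving cert for the webhook in test setup.
set -eu
certdir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout "$certdir/tls.key" -out "$certdir/tls.crt" \
  -subj "/CN=nvidia-dra-webhook.nvidia-dra-driver.svc" \
  -addext "subjectAltName=DNS:nvidia-dra-webhook.nvidia-dra-driver.svc"

# For a self-signed cert, the ValidatingWebhookConfiguration caBundle
# is the cert itself, base64-encoded on one line.
base64 < "$certdir/tls.crt" | tr -d '\n' > "$certdir/caBundle.b64"
openssl x509 -in "$certdir/tls.crt" -noout -subject
```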

2.4 Controller Prometheus Metrics

Install with compute domains enabled. Curl controller's :8080/metrics. Verify:

  • nvidia_dra_compute_domain_info gauge exists
  • Standard client-go REST metrics present

~15 lines of BATS.

Tag: controller
Instance: Cheapest instance where compute domains auto-enable. Currently B200 ($6.99/hr). If the graceful-degradation IMEX fix (already on main) works on all instances, controller metrics should be available anywhere resources.computeDomains.enabled=true is set.
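The check itself is two greps. Demonstrated below against an assumed sample payload; the real test would populate it via `curl :8080/metrics`, and only the nvidia_dra_compute_domain_info gauge name comes from this roadmap (rest_client_requests_total is a standard client-go metric, but the labels shown are fabricated):

```shell
# Assert that both the driver gauge and a client-go REST metric are present.
set -eu
metrics='# TYPE nvidia_dra_compute_domain_info gauge
nvidia_dra_compute_domain_info 1
rest_client_requests_total{code="200",method="GET"} 42'

echo "$metrics" | grep -q '^nvidia_dra_compute_domain_info'
echo "$metrics" | grep -q '^rest_client_requests_total'
echo "controller metrics present"
```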


3. New Tests Requiring Specific Hardware

3.1 DynMIG Negative Tests (Multi-GPU MIG Instance)

Two QA-plan tests not yet written:

TC-MIG-DYN-004 (over-capacity): With DynamicMIG=true on A100-40GB, request 8 x 1g.5gb MIG profiles (max is 7). Assert the 8th claim stays Pending with a capacity error.

TC-MIG-DYN-005 (fragmentation): Allocate several 1g.5gb profiles, then request a 3g.20gb profile. Assert it stays Pending due to fragmentation.

~25 lines each. Require a multi-GPU MIG instance (gpu_8x_a100 at $15.92/hr or gpu_8x_h100_sxm5 at $31.92/hr) to avoid the single-GPU A100 "In use by another client" issue.

Tag: dynmig,multi-gpu
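TC-MIG-DYN-004 could follow the existing claim-per-pod pattern. A sketch, where the spec paths, pod names, and label are placeholders:

```bash
@test "TC-MIG-DYN-004: 8th 1g.5gb claim stays Pending (A100-40GB max is 7)" {
  for i in $(seq 1 8); do
    kubectl apply -f "specs/dynmig-1g5gb-$i.yaml"   # placeholder spec path
  done

  # Give the scheduler time to place the first seven.
  kubectl wait --for=condition=Ready pod -l test=dynmig-over --timeout=120s || true

  # The 8th pod must remain Pending with a capacity error.
  run kubectl get pod dynmig-over-8 -o jsonpath='{.status.phase}'
  [ "$output" = "Pending" ]
}
```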

3.2 Static MIG on Lambda

test_gpu_mig.bats contains three existing tests that use the nvmm helper to exec into the GPU Operator's mig-manager pod; Lambda has no GPU Operator.

What needs to change: Replace the nvmm-based mig_create_1g0_on_node() and mig_ensure_teardown_on_all_nodes() helpers in helpers.sh with direct nvidia-smi mig commands. Options:

  • Run nvidia-smi mig on the host (via SSH from e2e-test.sh, like the MIG pre-cleanup)
  • Run via a privileged pod that has access to /dev/nvidia*
  • Create a tests/bats/lib/lambda/nvmm-direct script that runs nvidia-smi commands on the node

Same "In use" limitation: Single-GPU A100 fails. Use H100/GH200/B200 for single-GPU, or 8x A100 for multi-GPU.

Tag: mig
Effort: Medium — the helper functions need rewriting, but the test logic and spec YAMLs already exist.
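A possible nvmm-free rewrite of the two helpers, assuming the commands run on the node (via SSH or a privileged pod). MIG profile ID 19 corresponds to 1g.5gb on A100-40GB; GPU index 0 is assumed:

```bash
mig_create_1g0_on_node() {
  nvidia-smi -i 0 -mig 1           # enable MIG mode on GPU 0
  nvidia-smi mig -i 0 -cgi 19 -C   # create a 1g.5gb GPU instance + compute instance
}

mig_ensure_teardown_on_all_nodes() {
  nvidia-smi mig -dci || true      # destroy compute instances (ignore "none exist")
  nvidia-smi mig -dgi || true      # destroy GPU instances
  nvidia-smi -mig 0                # disable MIG mode
}
```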

3.3 Single-Node Compute Domain Tests (8x SXM with IMEX)

Three to five tests from test_cd_imex_chan_inject.bats and test_cd_misc.bats that only need IMEX channels on a single node (not multi-node MNNVL fabric).

Status: Investigated on gpu_8x_a100: nvidia-fabricmanager is running, but /dev/nvidia-caps-imex-channels/ doesn't exist. IMEX channel device nodes need to be created manually via mknod or the NVreg_CreateImexChannel0 kernel module parameter.

Next step: On the next 8x H100 SXM5 run, test:

```shell
modprobe nvidia NVreg_CreateImexChannel0=1
ls /dev/nvidia-caps-imex-channels/
```

If IMEX channels appear, the single-node CD tests become feasible.

Tag: compute-domain
Instance: gpu_8x_h100_sxm5 ($31.92/hr) or gpu_8x_a100 ($15.92/hr)


4. Infrastructure Improvements

4.1 Periodic Prow Job

Currently only a presubmit exists (always_run: true, optional: true). Add a periodic job for tests too slow or expensive for presubmit:

```yaml
# ci-dra-driver-nvidia-gpu-e2e-lambda-gpu-nightly
interval: 6h
GPU_TYPE: ""  # or pin to gpu_1x_h100_sxm5 for DynMIG
```

This periodic would run the full tests-gpu-single plus upgrade/downgrade and stress tests. Define a new Make target tests-gpu-single-extended or pass additional env vars.

Repo: kubernetes/test-infra
Effort: One YAML file.
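Expanded into the usual kubernetes/test-infra periodic shape, this might look as follows; everything beyond the job name, interval, and GPU_TYPE is an assumption:

```yaml
periodics:
- name: ci-dra-driver-nvidia-gpu-e2e-lambda-gpu-nightly
  interval: 6h
  decorate: true
  spec:
    containers:
    - image: ...          # same test image the presubmit uses
      command: [make, tests-gpu-single-extended]
      env:
      - name: GPU_TYPE
        value: ""         # any available instance; pin for DynMIG runs
```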

4.2 GPU-Type-Pinned Periodic Jobs

For weekly multi-GPU and MIG-specific runs:

```yaml
# ci-dra-driver-nvidia-gpu-e2e-lambda-gpu-weekly-multigpu
interval: 168h  # weekly
GPU_TYPE: "gpu_8x_h100_sxm5"
```

This would run the full tests-gpu target (which includes static MIG, stress, upgrade/downgrade, and all multi-GPU tests).

Cost: ~$16/run for 30 min on 8x H100.

4.3 B200 Blackwell Validation

Lambda now offers B200 instances (1x at $6.99/hr through 8x at $53.52/hr). These are Blackwell architecture (compute capability 10.0) and the only Lambda instances where compute domains auto-enable.

Action: Run the existing presubmit on gpu_1x_b200_sxm6 to validate:

  • Driver builds and loads on Blackwell
  • Basic GPU allocation works
  • Compute domain controller starts (auto-enabled on B200)
  • Controller metrics are accessible

No new test code needed — just launch with GPU_TYPE=gpu_1x_b200_sxm6.

4.4 Cost-Optimal Instance Selection

| Use Case | Best Instance | $/hr | Why |
|---|---|---|---|
| Presubmit (every PR) | Any available ("") | $1.29-4.29 | Cheapest. Covers 15-25 tests depending on GPU. |
| DynMIG (reliable) | gpu_1x_h100_pcie | $3.29 | Cheapest MIG GPU where DynMIG works reliably. |
| Multi-GPU (cheapest) | gpu_2x_h100_sxm5 | $8.38 | Only need 2 GPUs for most multi-GPU tests. |
| Compute domains | gpu_1x_b200_sxm6 | $6.99 | Cheapest B200. CD auto-enables. |
| Full coverage | gpu_8x_h100_sxm5 | $31.92 | Multi-GPU + MIG + NVSwitch. |
| arm64 | gpu_1x_gh200 | $2.29 | Only arm64 option. |

Note: gpu_2x_h100_sxm5 ($8.38/hr) is better value than gpu_8x_a100 ($15.92/hr) for tests that only need 2+ GPUs.


5. Not Feasible on Lambda

| Feature | Reason | What Would Be Needed |
|---|---|---|
| Multi-node CD workloads (nvbandwidth, failover) | Needs MNNVL fabric across nodes | DGX/HGX cluster |
| CD failover (force-delete across nodes) | Needs multi-node CD | Same |
| CD upgrade/downgrade with running workloads | Needs multi-node CD | Same |
| NVLink fabric error handling (CrashOnNVLinkFabricErrors) | Needs fabric-attached GPUs with errors | Same |
| VFIO passthrough | IOMMU not enabled on Lambda | BIOS/kernel IOMMU support |
| OpenShift 4.21 | Different platform | OCP cluster |
| Real XID fault injection | Needs hardware error simulation | Specialized test hardware |
| Multi-node MIG management | Needs MIG on worker nodes via Operator | Multi-node cluster + GPU Operator |

6. Prioritized Action Items

| # | Item | Effort | New Code? | Blocked? |
|---|---|---|---|---|
| 1 | Add test_gpu_updowngrade.bats to tests-gpu-single | 1 line | No | No |
| 2 | Add test_gpu_stress.bats to tests-gpu-single | 1 line | No | No |
| 3 | NVML health smoke test | ~15 lines | Yes | No |
| 4 | Leader election test | ~20 lines | Yes | No |
| 5 | Create periodic Prow job (nightly) | 1 YAML file | No | No |
| 6 | DynMIG negative tests (TC-MIG-DYN-004/005) | ~50 lines | Yes | Needs multi-GPU MIG instance |
| 7 | Static MIG nvmm replacement | Modify helpers.sh | Partial | Same "In use" A100 limitation |
| 8 | Webhook validation test | ~40 lines + TLS | Yes | TLS scaffolding needed |
| 9 | B200 Blackwell validation run | 0 code | No | Instance availability |
| 10 | Investigate IMEX channels on 8x H100 | 5 seconds on instance | No | Instance availability |
| 11 | Create weekly multi-GPU Prow job | 1 YAML file | No | No |
| 12 | Extract portable test_basics.bats tests | Tag or split file | Minimal | No |
| 13 | Controller metrics test | ~15 lines | Yes | B200 instance |
| 14 | Node reboot recovery (#951) | ~30 lines | Yes | Disruptive, periodic only |

7. Lambda GPU Instance Reference

| Instance | GPUs | Arch | MIG | Multi-GPU | CD Auto-Enable | $/hr |
|---|---|---|---|---|---|---|
| gpu_1x_a10 | 1x A10 24GB | amd64 | No | No | No | $1.29 |
| gpu_1x_a100_sxm4 | 1x A100 40GB | amd64 | Yes* | No | No | $1.99 |
| gpu_1x_gh200 | 1x GH200 96GB | arm64 | Yes | No | No | $2.29 |
| gpu_1x_h100_pcie | 1x H100 80GB | amd64 | Yes | No | No | $3.29 |
| gpu_1x_h100_sxm5 | 1x H100 80GB | amd64 | Yes | No | No | $4.29 |
| gpu_1x_b200_sxm6 | 1x B200 180GB | amd64 | Yes | No | Yes | $6.99 |
| gpu_8x_v100_n | 8x V100 16GB | amd64 | No | Yes | No | $6.32 |
| gpu_2x_h100_sxm5 | 2x H100 80GB | amd64 | Yes | Yes | No | $8.38 |
| gpu_2x_b200_sxm6 | 2x B200 180GB | amd64 | Yes | Yes | Yes | $13.78 |
| gpu_8x_a100 | 8x A100 40GB | amd64 | Yes | Yes | No | $15.92 |
| gpu_4x_h100_sxm5 | 4x H100 80GB | amd64 | Yes | Yes | No | $16.36 |
| gpu_8x_a100_80gb | 8x A100 80GB | amd64 | Yes | Yes | No | $22.32 |
| gpu_4x_b200_sxm6 | 4x B200 180GB | amd64 | Yes | Yes | Yes | $27.16 |
| gpu_8x_h100_sxm5 | 8x H100 80GB | amd64 | Yes | Yes | No | $31.92 |
| gpu_8x_b200_sxm6 | 8x B200 180GB | amd64 | Yes | Yes | Yes | $53.52 |

* DynMIG fails on single-GPU A100 ("In use by another client"). Works on multi-GPU A100 (8x) and all H100/GH200/B200.


8. Untested Driver Features (Complete List)

Features with zero BATS test coverage anywhere in the repo:

| Feature | Gate / Setting | Default | What It Does |
|---|---|---|---|
| NVML Device Health | NVMLDeviceHealthCheck | off | XID error monitoring, DeviceTaint creation |
| VFIO Passthrough | PassthroughSupport | off | GPU passthrough via vfio-pci driver |
| Device Metadata | DeviceMetadata | off | Generates metadata files for prepared devices |
| Webhook | webhook.enabled | off | Admission-time validation of ResourceClaim parameters |
| Leader Election | controller.leaderElection.enabled | off | HA controller with Lease-based leader election |
| Network Policies | *.networkPolicy.enabled | off | Kubernetes NetworkPolicy for driver pods |
| Controller Replicas > 1 | controller.replicas | 1 | Multiple controller instances |
| pprof Profiling | controller.metrics.profilePath | "" | Runtime profiling endpoint |

Features with partial coverage (tested in some context but not comprehensively):

| Feature | What's Tested | What's NOT Tested |
|---|---|---|
| DynamicMIG | Basic allocation, multi-container | Over-capacity rejection, fragmentation blocking |
| ComputeDomainCliques | One explicit test in test_cd_misc.bats | Full lifecycle, edge cases |
| IMEXDaemonsWithDNSNames | Implicitly used | Legacy IP mode never tested |
| CrashOnNVLinkFabricErrors | Default-on but never exercised | No test validates crash vs fallback |