
@dims
Created April 12, 2026 20:41
Lambda Cloud GPU Test Coverage: What's Next (v2 roadmap)

Date: 2026-04-12
Scope: Forward-looking roadmap for expanding DRA GPU driver test coverage on Lambda Cloud. Covers only what remains to be done, not what's already landed or in flight.

Prerequisite: PRs #1025, #1027, #1028 should be merged first. After they land, Lambda CI runs 25 tests across 6 test files covering basic GPU allocation, CUDA workloads, Dynamic MIG, TimeSlicing, MPS, DRAExtendedResource, Prometheus metrics, CEL selectors, claim lifecycle, and robustness.


1. Zero-Code Wins: Add Existing Tests to Lambda CI

These tests already exist and work. Just add them to the tests-gpu-single Makefile target.

1.1 Helm Upgrade/Downgrade (test_gpu_updowngrade.bats)

One test: install last-stable chart, run a workload, upgrade to current-dev, verify forward compatibility. Self-contained — pulls TEST_CHART_LASTSTABLE_VERSION from NGC.

Change: Add one line to Makefile. Tradeoff: Adds ~30-60s to presubmit. Pulls a second chart from the registry (network dependency). Instance: Any GPU.
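The change could look like the following Makefile sketch; the existing recipe lines and the `$(BATS)` variable are placeholders for whatever the target already runs:

```make
# Sketch only: the first recipe line stands in for the existing tests.
tests-gpu-single:
	$(BATS) tests/bats/test_gpu_basic.bats        # placeholder for existing tests
	$(BATS) tests/bats/test_gpu_updowngrade.bats  # new: upgrade/downgrade coverage
```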

1.2 Stress Test (test_gpu_stress.bats)

One test: create a shared ResourceClaim, stamp out N pods sharing that GPU, repeat M times. CI defaults: N=20 pods, M=1 iteration.

Change: Add one line to Makefile. Tradeoff: Adds ~30-45s of heavy scheduling. Could reduce N for Lambda (e.g., TEST_GPU_STRESS_PODS_N=10). Instance: Any GPU.
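If presubmit time matters, the pod count can be reduced at invocation time (variable name from this roadmap; default is N=20):

```shell
TEST_GPU_STRESS_PODS_N=10 make tests-gpu-single
```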

1.3 Selective test_basics.bats Tests

The existing test_basics.bats has 9 tests but assumes the GPU Operator. Several are GPU-Operator-independent and still useful:

  • Helm release name/version validation
  • CRD existence check
  • Pod readiness check
  • SIGUSR2 goroutine dump

Change: Tag the GPU-Operator-dependent tests (e.g., gpu-operator) and exclude them via TEST_FILTER_TAGS. Or extract the portable tests into a new file. Instance: Any GPU.
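With bats-core >= 1.8 the split can be done with test tags instead of a new file. A sketch, where the tagged test and the TEST_FILTER_TAGS plumbing are assumptions:

```bash
# In test_basics.bats: mark GPU-Operator-dependent tests.
# bats test_tags=gpu-operator
@test "mig-manager pod is ready" {   # hypothetical GPU-Operator-dependent test
  ...
}
```

In the harness, `TEST_FILTER_TAGS='!gpu-operator'` would then translate to `bats --filter-tags '!gpu-operator' tests/bats/test_basics.bats`, running only the portable tests on Lambda.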


2. New Tests to Write (Any GPU)

These require new BATS test logic but work on any Lambda GPU instance ($1.29/hr on A10).

2.1 NVML Health Monitor Smoke Test

Install with featureGates.NVMLDeviceHealthCheck=true. Verify:

  • Kubelet-plugin logs contain "Starting NVML event monitor"
  • No DeviceTaint resources exist (healthy GPUs shouldn't be tainted)

~15 lines of BATS. Cannot simulate real XID errors on Lambda — smoke test only.

Tag: health
Mutual exclusion: NVMLDeviceHealthCheck is mutually exclusive with DynamicMIG and PassthroughSupport, so this test must live in a separate setup_file that installs the chart with a different feature-gate set.
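A possible shape for the smoke test; the namespace, label selector, and the taint resource name are assumptions, while the log line comes from this roadmap:

```bash
@test "NVML health monitor starts and taints nothing" {
  # Kubelet-plugin should have logged the monitor start line.
  kubectl logs -n nvidia-dra-driver -l app.kubernetes.io/name=kubelet-plugin --tail=-1 \
    | grep -q "Starting NVML event monitor"

  # Healthy GPUs: no device taints should exist (resource name is an assumption).
  run kubectl get devicetaints -o name
  [ -z "$output" ]
}
```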

2.2 Leader Election

Install with controller.leaderElection.enabled=true, controller.replicas=2, resources.computeDomains.enabled=true. Verify:

  • Two controller pods become Ready
  • Exactly one pod's logs contain "successfully acquired lease"

~20 lines of BATS. Single-node is fine — both replicas run on the same node.

Note: Requires resources.computeDomains.enabled=true so the controller Deployment is created. On non-B200 instances, the compute-domains container in the kubelet-plugin will crash (no IMEX), but the controller itself should still run and elect a leader.

Tag: controller
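The "exactly one leader" assertion can be sketched as plain shell, demonstrated here on fabricated log samples; in the real test each file would hold `kubectl logs` output from one controller replica, and only the "successfully acquired lease" string comes from the roadmap:

```shell
# Count how many replicas claim the lease; exactly one must hold it.
set -eu
tmp=$(mktemp -d)
printf 'controller: successfully acquired lease\n' > "$tmp/controller-0.log"
printf 'controller: attempting to acquire lease\n' > "$tmp/controller-1.log"

leaders=0
for f in "$tmp"/controller-*.log; do
  if grep -q "successfully acquired lease" "$f"; then
    leaders=$((leaders + 1))
  fi
done

# The assertion the BATS test would make:
[ "$leaders" -eq 1 ]
echo "leaders=$leaders"
```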

2.3 Webhook Validation

Install with webhook.enabled=true. Verify admission-time rejection of invalid ResourceClaim parameters:

  • Valid opaque config accepted
  • Unknown field rejected (HTTP 422)
  • Invalid sharing strategy rejected

~40 lines of BATS, plus TLS scaffolding. Options:

  • Self-signed cert via openssl in test setup (no external deps)
  • cert-manager (heavier, needs to be installed)

Tag: webhook
Blocker: TLS certificate generation in the test harness.
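The self-signed-cert option needs only openssl (>= 1.1.1 for -addext). A sketch, where the service and namespace names are placeholders:

```shell
# Generate a self-signed serving cert for the webhook in test setup.
set -eu
certdir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout "$certdir/tls.key" -out "$certdir/tls.crt" \
  -subj "/CN=nvidia-dra-webhook.nvidia-dra-driver.svc" \
  -addext "subjectAltName=DNS:nvidia-dra-webhook.nvidia-dra-driver.svc"

# For a self-signed cert, the ValidatingWebhookConfiguration caBundle
# is the cert itself, base64-encoded on one line.
base64 < "$certdir/tls.crt" | tr -d '\n' > "$certdir/caBundle.b64"
openssl x509 -in "$certdir/tls.crt" -noout -subject
```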

2.4 Controller Prometheus Metrics

Install with compute domains enabled. Curl controller's :8080/metrics. Verify:

  • nvidia_dra_compute_domain_info gauge exists
  • Standard client-go REST metrics present

~15 lines of BATS.

Tag: controller
Instance: Cheapest instance where compute domains auto-enable. Currently B200 ($6.99/hr). If the graceful-degradation IMEX fix (already on main) works on all instances, controller metrics should be available anywhere resources.computeDomains.enabled=true is set.
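The check itself is two greps. Demonstrated below against an assumed sample payload; the real test would populate it via `curl :8080/metrics`, and only the nvidia_dra_compute_domain_info gauge name comes from this roadmap (rest_client_requests_total is a standard client-go metric, but the labels shown are fabricated):

```shell
# Assert that both the driver gauge and a client-go REST metric are present.
set -eu
metrics='# TYPE nvidia_dra_compute_domain_info gauge
nvidia_dra_compute_domain_info 1
rest_client_requests_total{code="200",method="GET"} 42'

echo "$metrics" | grep -q '^nvidia_dra_compute_domain_info'
echo "$metrics" | grep -q '^rest_client_requests_total'
echo "controller metrics present"
```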


3. New Tests Requiring Specific Hardware

3.1 DynMIG Negative Tests (Multi-GPU MIG Instance)

Two QA-plan tests not yet written:

TC-MIG-DYN-004 (over-capacity): With DynamicMIG=true on A100-40GB, request 8 x 1g.5gb MIG profiles (max is 7). Assert the 8th claim stays Pending with a capacity error.

TC-MIG-DYN-005 (fragmentation): Allocate several 1g.5gb profiles, then request a 3g.20gb profile. Assert it stays Pending due to fragmentation.

~25 lines each. Require a multi-GPU MIG instance (gpu_8x_a100 at $15.92/hr or gpu_8x_h100_sxm5 at $31.92/hr) to avoid the single-GPU A100 "In use by another client" issue.

Tag: dynmig,multi-gpu
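TC-MIG-DYN-004 could follow the existing claim-per-pod pattern. A sketch, where the spec paths, pod names, and label are placeholders:

```bash
@test "TC-MIG-DYN-004: 8th 1g.5gb claim stays Pending (A100-40GB max is 7)" {
  for i in $(seq 1 8); do
    kubectl apply -f "specs/dynmig-1g5gb-$i.yaml"   # placeholder spec path
  done

  # Give the scheduler time to place the first seven.
  kubectl wait --for=condition=Ready pod -l test=dynmig-over --timeout=120s || true

  # The 8th pod must remain Pending with a capacity error.
  run kubectl get pod dynmig-over-8 -o jsonpath='{.status.phase}'
  [ "$output" = "Pending" ]
}
```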

3.2 Static MIG on Lambda

test_gpu_mig.bats contains three existing tests that use the nvmm helper to exec into the GPU Operator's mig-manager pod; Lambda has no GPU Operator.

What needs to change: Replace the nvmm-based mig_create_1g0_on_node() and mig_ensure_teardown_on_all_nodes() helpers in helpers.sh with direct nvidia-smi mig commands. Options:

  • Run nvidia-smi mig on the host (via SSH from e2e-test.sh, like the MIG pre-cleanup)
  • Run via a privileged pod that has access to /dev/nvidia*
  • Create a tests/bats/lib/lambda/nvmm-direct script that runs nvidia-smi commands on the node

Same "In use" limitation: Single-GPU A100 fails. Use H100/GH200/B200 for single-GPU, or 8x A100 for multi-GPU.

Tag: mig
Effort: Medium — the helper functions need rewriting, but the test logic and spec YAMLs already exist.
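A possible nvmm-free rewrite of the two helpers, assuming the commands run on the node (via SSH or a privileged pod). MIG profile ID 19 corresponds to 1g.5gb on A100-40GB; GPU index 0 is assumed:

```bash
mig_create_1g0_on_node() {
  nvidia-smi -i 0 -mig 1           # enable MIG mode on GPU 0
  nvidia-smi mig -i 0 -cgi 19 -C   # create a 1g.5gb GPU instance + compute instance
}

mig_ensure_teardown_on_all_nodes() {
  nvidia-smi mig -dci || true      # destroy compute instances (ignore "none exist")
  nvidia-smi mig -dgi || true      # destroy GPU instances
  nvidia-smi -mig 0                # disable MIG mode
}
```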

3.3 Single-Node Compute Domain Tests (8x SXM with IMEX)

Three to five tests from test_cd_imex_chan_inject.bats and test_cd_misc.bats that only need IMEX channels on a single node (not multi-node MNNVL fabric).

Status: Investigated on gpu_8x_a100: nvidia-fabricmanager is running, but /dev/nvidia-caps-imex-channels/ doesn't exist. IMEX channel device nodes need to be created manually via mknod or the NVreg_CreateImexChannel0 kernel module parameter.

Next step: On the next 8x H100 SXM5 run, test:

```shell
modprobe nvidia NVreg_CreateImexChannel0=1
ls /dev/nvidia-caps-imex-channels/
```

If IMEX channels appear, the single-node CD tests become feasible.

Tag: compute-domain
Instance: gpu_8x_h100_sxm5 ($31.92/hr) or gpu_8x_a100 ($15.92/hr)


4. Infrastructure Improvements

4.1 Periodic Prow Job

Currently only a presubmit exists (always_run: true, optional: true). Add a periodic job for tests too slow or expensive for presubmit:

```yaml
# ci-dra-driver-nvidia-gpu-e2e-lambda-gpu-nightly
interval: 6h
GPU_TYPE: ""  # or pin to gpu_1x_h100_sxm5 for DynMIG
```

This periodic would run the full tests-gpu-single plus upgrade/downgrade and stress tests. Define a new Make target tests-gpu-single-extended or pass additional env vars.

Repo: kubernetes/test-infra
Effort: One YAML file.
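Expanded into the usual kubernetes/test-infra periodic shape, this might look as follows; everything beyond the job name, interval, and GPU_TYPE is an assumption:

```yaml
periodics:
- name: ci-dra-driver-nvidia-gpu-e2e-lambda-gpu-nightly
  interval: 6h
  decorate: true
  spec:
    containers:
    - image: ...          # same test image the presubmit uses
      command: [make, tests-gpu-single-extended]
      env:
      - name: GPU_TYPE
        value: ""         # any available instance; pin for DynMIG runs
```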

4.2 GPU-Type-Pinned Periodic Jobs

For weekly multi-GPU and MIG-specific runs:

```yaml
# ci-dra-driver-nvidia-gpu-e2e-lambda-gpu-weekly-multigpu
interval: 168h  # weekly
GPU_TYPE: "gpu_8x_h100_sxm5"
```

This would run the full tests-gpu target (which includes static MIG, stress, upgrade/downgrade, and all multi-GPU tests).

Cost: ~$16/run for 30 min on 8x H100.

4.3 B200 Blackwell Validation

Lambda now offers B200 instances (1x at $6.99/hr through 8x at $53.52/hr). These are Blackwell architecture (compute capability 10.0) and the only Lambda instances where compute domains auto-enable.

Action: Run the existing presubmit on gpu_1x_b200_sxm6 to validate:

  • Driver builds and loads on Blackwell
  • Basic GPU allocation works
  • Compute domain controller starts (auto-enabled on B200)
  • Controller metrics are accessible

No new test code needed — just launch with GPU_TYPE=gpu_1x_b200_sxm6.

4.4 Cost-Optimal Instance Selection

| Use Case | Best Instance | $/hr | Why |
|---|---|---|---|
| Presubmit (every PR) | Any available ("") | $1.29-4.29 | Cheapest. Covers 15-25 tests depending on GPU. |
| DynMIG (reliable) | gpu_1x_h100_pcie | $3.29 | Cheapest MIG GPU where DynMIG works reliably. |
| Multi-GPU (cheapest) | gpu_2x_h100_sxm5 | $8.38 | Only need 2 GPUs for most multi-GPU tests. |
| Compute domains | gpu_1x_b200_sxm6 | $6.99 | Cheapest B200. CD auto-enables. |
| Full coverage | gpu_8x_h100_sxm5 | $31.92 | Multi-GPU + MIG + NVSwitch. |
| arm64 | gpu_1x_gh200 | $2.29 | Only arm64 option. |

Note: gpu_2x_h100_sxm5 ($8.38/hr) is better value than gpu_8x_a100 ($15.92/hr) for tests that only need 2+ GPUs.


5. Not Feasible on Lambda

| Feature | Reason | What Would Be Needed |
|---|---|---|
| Multi-node CD workloads (nvbandwidth, failover) | Needs MNNVL fabric across nodes | DGX/HGX cluster |
| CD failover (force-delete across nodes) | Needs multi-node CD | Same |
| CD upgrade/downgrade with running workloads | Needs multi-node CD | Same |
| NVLink fabric error handling (CrashOnNVLinkFabricErrors) | Needs fabric-attached GPUs with errors | Same |
| VFIO passthrough | IOMMU not enabled on Lambda | BIOS/kernel IOMMU support |
| OpenShift 4.21 | Different platform | OCP cluster |
| Real XID fault injection | Needs hardware error simulation | Specialized test hardware |
| Multi-node MIG management | Needs MIG on worker nodes via Operator | Multi-node cluster + GPU Operator |

6. Prioritized Action Items

| # | Item | Effort | New Code? | Blocked? |
|---|---|---|---|---|
| 1 | Add test_gpu_updowngrade.bats to tests-gpu-single | 1 line | No | No |
| 2 | Add test_gpu_stress.bats to tests-gpu-single | 1 line | No | No |
| 3 | NVML health smoke test | ~15 lines | Yes | No |
| 4 | Leader election test | ~20 lines | Yes | No |
| 5 | Create periodic Prow job (nightly) | 1 YAML file | No | No |
| 6 | DynMIG negative tests (TC-MIG-DYN-004/005) | ~50 lines | Yes | Needs multi-GPU MIG instance |
| 7 | Static MIG nvmm replacement | Modify helpers.sh | Partial | Same "In use" A100 limitation |
| 8 | Webhook validation test | ~40 lines + TLS | Yes | TLS scaffolding needed |
| 9 | B200 Blackwell validation run | 0 code | No | Instance availability |
| 10 | Investigate IMEX channels on 8x H100 | 5 seconds on instance | No | Instance availability |
| 11 | Create weekly multi-GPU Prow job | 1 YAML file | No | No |
| 12 | Extract portable test_basics.bats tests | Tag or split file | Minimal | No |
| 13 | Controller metrics test | ~15 lines | Yes | B200 instance |
| 14 | Node reboot recovery (#951) | ~30 lines | Yes | Disruptive, periodic only |

7. Lambda GPU Instance Reference

| Instance | GPUs | Arch | MIG | Multi-GPU | CD Auto-Enable | $/hr |
|---|---|---|---|---|---|---|
| gpu_1x_a10 | 1x A10 24GB | amd64 | No | No | No | $1.29 |
| gpu_1x_a100_sxm4 | 1x A100 40GB | amd64 | Yes* | No | No | $1.99 |
| gpu_1x_gh200 | 1x GH200 96GB | arm64 | Yes | No | No | $2.29 |
| gpu_1x_h100_pcie | 1x H100 80GB | amd64 | Yes | No | No | $3.29 |
| gpu_1x_h100_sxm5 | 1x H100 80GB | amd64 | Yes | No | No | $4.29 |
| gpu_1x_b200_sxm6 | 1x B200 180GB | amd64 | Yes | No | Yes | $6.99 |
| gpu_8x_v100_n | 8x V100 16GB | amd64 | No | Yes | No | $6.32 |
| gpu_2x_h100_sxm5 | 2x H100 80GB | amd64 | Yes | Yes | No | $8.38 |
| gpu_2x_b200_sxm6 | 2x B200 180GB | amd64 | Yes | Yes | Yes | $13.78 |
| gpu_8x_a100 | 8x A100 40GB | amd64 | Yes | Yes | No | $15.92 |
| gpu_4x_h100_sxm5 | 4x H100 80GB | amd64 | Yes | Yes | No | $16.36 |
| gpu_8x_a100_80gb | 8x A100 80GB | amd64 | Yes | Yes | No | $22.32 |
| gpu_4x_b200_sxm6 | 4x B200 180GB | amd64 | Yes | Yes | Yes | $27.16 |
| gpu_8x_h100_sxm5 | 8x H100 80GB | amd64 | Yes | Yes | No | $31.92 |
| gpu_8x_b200_sxm6 | 8x B200 180GB | amd64 | Yes | Yes | Yes | $53.52 |

* DynMIG fails on single-GPU A100 ("In use by another client"). Works on multi-GPU A100 (8x) and all H100/GH200/B200.


8. Untested Driver Features (Complete List)

Features with zero BATS test coverage anywhere in the repo:

| Feature | Gate / Setting | Default | What It Does |
|---|---|---|---|
| NVML Device Health | NVMLDeviceHealthCheck | off | XID error monitoring, DeviceTaint creation |
| VFIO Passthrough | PassthroughSupport | off | GPU passthrough via vfio-pci driver |
| Device Metadata | DeviceMetadata | off | Generates metadata files for prepared devices |
| Webhook | webhook.enabled | off | Admission-time validation of ResourceClaim parameters |
| Leader Election | controller.leaderElection.enabled | off | HA controller with Lease-based leader election |
| Network Policies | *.networkPolicy.enabled | off | Kubernetes NetworkPolicy for driver pods |
| Controller Replicas > 1 | controller.replicas | 1 | Multiple controller instances |
| pprof Profiling | controller.metrics.profilePath | "" | Runtime profiling endpoint |

Features with partial coverage (tested in some context but not comprehensively):

| Feature | What's Tested | What's NOT Tested |
|---|---|---|
| DynamicMIG | Basic allocation, multi-container | Over-capacity rejection, fragmentation blocking |
| ComputeDomainCliques | One explicit test in test_cd_misc.bats | Full lifecycle, edge cases |
| IMEXDaemonsWithDNSNames | Implicitly used | Legacy IP mode never tested |
| CrashOnNVLinkFabricErrors | Default-on but never exercised | No test validates crash vs fallback |