Running DRA GPU Tests on Lambda Cloud (Without Prow)

This guide walks you through running the NVIDIA DRA driver GPU tests on a Lambda Cloud GPU instance, the same way our CI does it, but from your laptop.

Prerequisites

  1. Lambda Cloud API key — sign up at lambdalabs.com, go to Settings > API Keys, create one. Set it:

    mkdir -p ~/.lambda
    echo "YOUR_API_KEY_HERE" > ~/.lambda/api-key
  2. Go (1.22+) — needed to install lambdactl

  3. SSH and rsync — should already be on your machine

  4. Two repo checkouts:

    # The DRA driver repo
    git clone https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu.git
    cd dra-driver-nvidia-gpu
    
    # The test-infra repo (for the shared Lambda CI library)
    git clone https://github.com/kubernetes/test-infra.git /tmp/test-infra

Step 1: Install lambdactl

GOPROXY=direct go install github.com/dims/lambdactl@latest

Verify it works:

lambdactl types

You should see a table of GPU instance types with availability.

Step 2: Create an SSH key and launch a GPU instance

# Create a temporary SSH key
SSH_DIR=$(mktemp -d /tmp/lambda-ssh.XXXXXX)
SSH_KEY="${SSH_DIR}/key"
ssh-keygen -t ed25519 -f "${SSH_KEY}" -N "" -q

# Register it with Lambda Cloud
SSH_KEY_NAME="my-test-$(date +%s)"
SSH_KEY_ID=$(lambdactl --json ssh-keys add "${SSH_KEY_NAME}" "${SSH_KEY}.pub" | jq -r '.id')
echo "SSH key registered: ${SSH_KEY_ID}"

# Launch an instance (accepts any available GPU type)
# This polls every 15 seconds for up to 10 minutes until a GPU is available.
echo "Waiting for a GPU instance..."
LAUNCH_OUTPUT=$(lambdactl --json watch \
  --ssh "${SSH_KEY_NAME}" \
  --name "${SSH_KEY_NAME}" \
  --interval 15 \
  --timeout 600 \
  --wait-ssh)

INSTANCE_IP=$(echo "${LAUNCH_OUTPUT}" | jq -r '.ip')
INSTANCE_ID=$(echo "${LAUNCH_OUTPUT}" | jq -r '.id')
echo "Instance ready: ${INSTANCE_IP} (${INSTANCE_ID})"

Tip: To request a specific GPU type, add --gpu gpu_1x_a10 (cheapest, ~$1.29/hr). Run lambdactl types to see what's available.
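
For example, here is the same watch invocation from above pinned to a specific type:

LAUNCH_OUTPUT=$(lambdactl --json watch \
  --gpu gpu_1x_a10 \
  --ssh "${SSH_KEY_NAME}" \
  --name "${SSH_KEY_NAME}" \
  --interval 15 \
  --timeout 600 \
  --wait-ssh)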

Step 3: Set up SSH helpers

SSH_OPTS="-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR"

# Quick test
ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" "nvidia-smi -L"

You should see one or more GPUs listed.
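
If you'd rather not repeat the flags in every step, a small wrapper helps. This is an optional convenience; run_remote is my own name, not part of any tooling used here:

# Run a command on the Lambda instance with the options defined above
run_remote() {
  ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" "$@"
}

# Same check as above, via the wrapper
run_remote "nvidia-smi -L"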

Step 4: Download Kubernetes binaries

K8S_VERSION=$(curl -sSfL https://dl.k8s.io/release/stable.txt)
echo "Using k8s ${K8S_VERSION}"

mkdir -p /tmp/k8s-bins
for bin in kubeadm kubelet kubectl; do
  curl -sSfL "https://dl.k8s.io/release/${K8S_VERSION}/bin/linux/amd64/${bin}" \
    -o "/tmp/k8s-bins/${bin}"
  chmod +x "/tmp/k8s-bins/${bin}"
done
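
Optionally verify the downloads; dl.k8s.io publishes a .sha256 file alongside each binary:

# Each line fed to sha256sum is "<expected-hash>  <file>"
for bin in kubeadm kubelet kubectl; do
  echo "$(curl -sSfL "https://dl.k8s.io/release/${K8S_VERSION}/bin/linux/amd64/${bin}.sha256")  /tmp/k8s-bins/${bin}" \
    | sha256sum --check
done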

Step 5: Transfer everything to the Lambda instance

# Create the target directory
ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" "mkdir -p /tmp/k8s-bins"

# Transfer k8s binaries
rsync -a -e "ssh ${SSH_OPTS} -i ${SSH_KEY}" \
  /tmp/k8s-bins/ "ubuntu@${INSTANCE_IP}:/tmp/k8s-bins/"

# Transfer the driver repo (exclude large/unnecessary dirs)
rsync -a --exclude=.git --exclude=.claude --exclude=dist --exclude=_output \
  -e "ssh ${SSH_OPTS} -i ${SSH_KEY}" \
  ./ "ubuntu@${INSTANCE_IP}:/tmp/dra-driver-nvidia-gpu/"

Step 6: Set up the Kubernetes cluster

This uses the shared setup script from test-infra. It creates a single-node kubeadm cluster with the NVIDIA containerd runtime, CDI enabled, and Docker installed (needed by the BATS test harness).

ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" \
  "env ENABLE_CDI=true ENABLE_DOCKER=true NODE_LABELS=nvidia.com/gpu.present=true bash -s" \
  < /tmp/test-infra/experiment/lambda/lib/setup-k8s-node.sh

This takes 3-5 minutes. When you see === Lambda node setup complete ===, you're good.
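
Before moving on, it's worth confirming the node is Ready and carries the GPU label passed above (Step 8 assumes a kubeconfig at ~/.kube/config for the ubuntu user on the instance):

ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" \
  "kubectl get nodes --show-labels"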

Step 7: Build the DRA driver image

ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" 'set -euxo pipefail
cd /tmp/dra-driver-nvidia-gpu
docker buildx use default 2>/dev/null || true
make -f deployments/container/Makefile build DOCKER_BUILD_OPTIONS="--load" CI=true
IMAGE_REF=$(make -f deployments/container/Makefile -s print-IMAGE)
docker save "${IMAGE_REF}" | sudo ctr -n k8s.io images import -
echo "Image loaded: ${IMAGE_REF}"
'

This builds the driver from your local checkout and loads it into containerd so Kubernetes can use it. Takes 2-5 minutes depending on GPU instance specs.
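
To confirm the image actually landed in containerd, you can list what containerd sees; the grep pattern here is a guess, so match it against whatever IMAGE_REF was printed:

ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" \
  "sudo ctr -n k8s.io images ls -q | grep -i dra"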

Step 8: Run the tests

# You need GIT_COMMIT_SHORT for the BATS runner image tag.
# If you excluded .git in rsync, compute it locally:
GIT_COMMIT_SHORT=$(git rev-parse --short=8 HEAD)

ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" "set -euxo pipefail
cd /tmp/dra-driver-nvidia-gpu

# Build the BATS test runner image
docker buildx use default 2>/dev/null || true
make -f tests/bats/Makefile runner-image GIT_COMMIT_SHORT=${GIT_COMMIT_SHORT}

# Run the single-GPU test suite
export KUBECONFIG=\$HOME/.kube/config
export CI=true
export TEST_NVIDIA_DRIVER_ROOT=/
export TEST_CHART_LOCAL=true
export SKIP_CLEANUP=true
export DISABLE_COMPUTE_DOMAINS=true
export TEST_FILTER_TAGS='!multi-gpu,!version-specific'
export GIT_COMMIT_SHORT=${GIT_COMMIT_SHORT}

make -f tests/bats/Makefile tests-gpu-single GIT_COMMIT_SHORT=${GIT_COMMIT_SHORT}
"

If everything works, you'll see output like:

1..5
ok 1 GPUs: 1 pod(s), 1 full GPU in 6795ms
ok 2 GPUs: 2 pod(s), 1 full GPU (shared, 1 RC) in 3145ms
ok 3 GPUs: 1 pod(s), 2 cntrs, 1 full GPU (shared, 1 RCT) in 4860ms
ok 4 GPUs: single GPU runs CUDA demo suite (deviceQuery, vectorAdd, bandwidthTest) in 90542ms
ok 5 GPUs: Job with ResourceClaimTemplate allocates GPUs to completions in 7150ms

Step 9: Clean up (important — you're being billed!)

# Terminate the instance
lambdactl stop "${INSTANCE_ID}" --yes

# Remove the SSH key from Lambda
lambdactl ssh-keys rm "${SSH_KEY_ID}"

# Clean up local temp files
rm -rf "${SSH_DIR}" /tmp/k8s-bins

Troubleshooting

"No regions available for gpu_1x_a10"

Lambda GPUs sell out frequently. Either wait (the watch command polls automatically) or try with --gpu "" to accept any available GPU type.
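
A more patient retry, using the same flags as Step 2:

lambdactl --json watch \
  --gpu "" \
  --ssh "${SSH_KEY_NAME}" \
  --name "${SSH_KEY_NAME}" \
  --interval 30 \
  --timeout 1800 \
  --wait-ssh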

helm install times out

The iupgrade_wait function in the BATS harness has a 65-second timeout. If the driver image is being pulled from a registry (not pre-loaded), it may take longer. Make sure Step 7 completed successfully.

"nvidia-smi: command not found" inside test pods

CDI is not enabled in containerd. Make sure you passed ENABLE_CDI=true in Step 6.
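
To check, grep the containerd config on the instance; containerd's CRI plugin exposes an enable_cdi setting:

ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" \
  "sudo grep -ri cdi /etc/containerd/config.toml"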

compute-domains container CrashLoopBackOff

The GPU doesn't have NVLink/NVSwitch support. Set DISABLE_COMPUTE_DOMAINS=true in Step 8 (already included above). This tells the BATS harness to install the chart with compute domains disabled.
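
To confirm whether the GPU reports any NVLink links at all:

ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" \
  "nvidia-smi nvlink --status"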

Tests pass but artifacts are missing

The BATS output directory is at /tmp/dra-driver-nvidia-gpu/k8s-dra-driver-gpu-tests-out-ubuntu/ on the Lambda instance. You can rsync it back:

rsync -a -e "ssh ${SSH_OPTS} -i ${SSH_KEY}" \
  "ubuntu@${INSTANCE_IP}:/tmp/dra-driver-nvidia-gpu/k8s-dra-driver-gpu-tests-out-ubuntu/" \
  ./test-artifacts/

What the tests cover

  • 1 pod, 1 full GPU: allocates a single GPU via a DRA ResourceClaim and runs nvidia-smi -L
  • 2 pods, shared GPU: two pods share one GPU via a single ResourceClaim
  • 2 containers, shared GPU: one pod with two containers sharing a GPU via a ResourceClaimTemplate
  • CUDA demo suite: installs and runs deviceQuery, vectorAdd, and bandwidthTest from the CUDA demo suite
  • Job with RCT: a Kubernetes Job with 2 completions (parallelism=1), each completion getting a GPU via a ResourceClaimTemplate

How this maps to CI

In Prow CI, the same flow is automated by:

  • hack/ci/lambda/e2e-test.sh in the driver repo (orchestrator)
  • experiment/lambda/lib/lambda-common.sh in test-infra (Lambda lifecycle helpers)
  • experiment/lambda/lib/setup-k8s-node.sh in test-infra (cluster setup)

The Prow job pull-dra-driver-nvidia-gpu-e2e-lambda-gpu runs this on every PR.
