This guide walks you through running the NVIDIA DRA driver GPU tests on a Lambda Cloud GPU instance, the same way our CI does it — but from your laptop.
- Lambda Cloud API key — sign up at lambdalabs.com, go to Settings > API Keys, and create one. Set it:

  ```bash
  mkdir -p ~/.lambda
  echo "YOUR_API_KEY_HERE" > ~/.lambda/api-key
  ```

- Go (1.22+) — needed to install `lambdactl`
- SSH and rsync — should already be on your machine
- Two repo checkouts:

  ```bash
  # The DRA driver repo
  git clone https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu.git
  cd dra-driver-nvidia-gpu

  # The test-infra repo (for the shared Lambda CI library)
  git clone https://github.com/kubernetes/test-infra.git /tmp/test-infra
  ```
Install `lambdactl`:

```bash
GOPROXY=direct go install github.com/dims/lambdactl@latest
```

Verify it works:
```bash
lambdactl types
```

You should see a table of GPU instance types with availability.
```bash
# Create a temporary SSH key
SSH_DIR=$(mktemp -d /tmp/lambda-ssh.XXXXXX)
SSH_KEY="${SSH_DIR}/key"
ssh-keygen -t ed25519 -f "${SSH_KEY}" -N "" -q

# Register it with Lambda Cloud
SSH_KEY_NAME="my-test-$(date +%s)"
SSH_KEY_ID=$(lambdactl --json ssh-keys add "${SSH_KEY_NAME}" "${SSH_KEY}.pub" | jq -r '.id')
echo "SSH key registered: ${SSH_KEY_ID}"

# Launch an instance (accepts any available GPU type).
# This polls every 15 seconds for up to 10 minutes until a GPU is available.
echo "Waiting for a GPU instance..."
LAUNCH_OUTPUT=$(lambdactl --json watch \
  --ssh "${SSH_KEY_NAME}" \
  --name "${SSH_KEY_NAME}" \
  --interval 15 \
  --timeout 600 \
  --wait-ssh)
INSTANCE_IP=$(echo "${LAUNCH_OUTPUT}" | jq -r '.ip')
INSTANCE_ID=$(echo "${LAUNCH_OUTPUT}" | jq -r '.id')
echo "Instance ready: ${INSTANCE_IP} (${INSTANCE_ID})"
```

Tip: To request a specific GPU type, add `--gpu gpu_1x_a10` (cheapest, ~$1.29/hr). Run `lambdactl types` to see what's available.
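Pinning the launch to that type just adds one flag to the command above:

```bash
LAUNCH_OUTPUT=$(lambdactl --json watch \
  --ssh "${SSH_KEY_NAME}" \
  --name "${SSH_KEY_NAME}" \
  --gpu gpu_1x_a10 \
  --interval 15 \
  --timeout 600 \
  --wait-ssh)
```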
SSH_OPTS="-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR"
# Quick test
ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" "nvidia-smi -L"You should see one or more GPUs listed.
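The exact line depends on the GPU type you got; on a `gpu_1x_a10` it looks roughly like (UUID elided):

```
GPU 0: NVIDIA A10 (UUID: GPU-...)
```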
```bash
K8S_VERSION=$(curl -sSfL https://dl.k8s.io/release/stable.txt)
echo "Using k8s ${K8S_VERSION}"
mkdir -p /tmp/k8s-bins
for bin in kubeadm kubelet kubectl; do
  curl -sSfL "https://dl.k8s.io/release/${K8S_VERSION}/bin/linux/amd64/${bin}" \
    -o "/tmp/k8s-bins/${bin}"
  chmod +x "/tmp/k8s-bins/${bin}"
done
```

Transfer everything to the instance:

```bash
# Create the target directory
ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" "mkdir -p /tmp/k8s-bins"

# Transfer k8s binaries
rsync -a -e "ssh ${SSH_OPTS} -i ${SSH_KEY}" \
  /tmp/k8s-bins/ "ubuntu@${INSTANCE_IP}:/tmp/k8s-bins/"

# Transfer the driver repo (exclude large/unnecessary dirs)
rsync -a --exclude=.git --exclude=.claude --exclude=dist --exclude=_output \
  -e "ssh ${SSH_OPTS} -i ${SSH_KEY}" \
  ./ "ubuntu@${INSTANCE_IP}:/tmp/dra-driver-nvidia-gpu/"
```

Next, set up the Kubernetes node. This uses the shared setup script from test-infra: it creates a single-node kubeadm cluster with the NVIDIA containerd runtime, CDI enabled, and Docker installed (needed by the BATS test harness).
```bash
ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" \
  "env ENABLE_CDI=true ENABLE_DOCKER=true NODE_LABELS=nvidia.com/gpu.present=true bash -s" \
  < /tmp/test-infra/experiment/lambda/lib/setup-k8s-node.sh
```

This takes 3-5 minutes. When you see `=== Lambda node setup complete ===`, you're good.
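As a quick sanity check (assuming the setup script wrote a kubeconfig to `~/.kube/config` on the instance, which the test step below relies on too), confirm the node is Ready:

```bash
ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" \
  "kubectl get nodes -o wide && kubectl get pods -A"
```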
```bash
ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" 'set -euxo pipefail
cd /tmp/dra-driver-nvidia-gpu
docker buildx use default 2>/dev/null || true
make -f deployments/container/Makefile build DOCKER_BUILD_OPTIONS="--load" CI=true
IMAGE_REF=$(make -f deployments/container/Makefile -s print-IMAGE)
docker save "${IMAGE_REF}" | sudo ctr -n k8s.io images import -
echo "Image loaded: ${IMAGE_REF}"
'
```

This builds the driver from your local checkout and loads it into containerd so Kubernetes can use it. Takes 2-5 minutes depending on GPU instance specs.
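To double-check that the image landed in containerd's `k8s.io` namespace, list it there (the `grep` pattern is a guess; match it against the image name echoed above):

```bash
ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" \
  "sudo ctr -n k8s.io images ls | grep -i dra"
```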
```bash
# You need GIT_COMMIT_SHORT for the BATS runner image tag.
# If you excluded .git in rsync, compute it locally:
GIT_COMMIT_SHORT=$(git rev-parse --short=8 HEAD)

ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" "set -euxo pipefail
cd /tmp/dra-driver-nvidia-gpu

# Build the BATS test runner image
docker buildx use default 2>/dev/null || true
make -f tests/bats/Makefile runner-image GIT_COMMIT_SHORT=${GIT_COMMIT_SHORT}

# Run the single-GPU test suite
export KUBECONFIG=\$HOME/.kube/config
export CI=true
export TEST_NVIDIA_DRIVER_ROOT=/
export TEST_CHART_LOCAL=true
export SKIP_CLEANUP=true
export DISABLE_COMPUTE_DOMAINS=true
export TEST_FILTER_TAGS='!multi-gpu,!version-specific'
export GIT_COMMIT_SHORT=${GIT_COMMIT_SHORT}
make -f tests/bats/Makefile tests-gpu-single GIT_COMMIT_SHORT=${GIT_COMMIT_SHORT}
"
```

If everything works, you'll see output like:
```
1..5
ok 1 GPUs: 1 pod(s), 1 full GPU in 6795ms
ok 2 GPUs: 2 pod(s), 1 full GPU (shared, 1 RC) in 3145ms
ok 3 GPUs: 1 pod(s), 2 cntrs, 1 full GPU (shared, 1 RCT) in 4860ms
ok 4 GPUs: single GPU runs CUDA demo suite (deviceQuery, vectorAdd, bandwidthTest) in 90542ms
ok 5 GPUs: Job with ResourceClaimTemplate allocates GPUs to completions in 7150ms
```
When you're done, clean up:

```bash
# Terminate the instance
lambdactl stop "${INSTANCE_ID}" --yes

# Remove the SSH key from Lambda
lambdactl ssh-keys rm "${SSH_KEY_ID}"

# Clean up local temp files
rm -rf "${SSH_DIR}" /tmp/k8s-bins
```

Lambda GPUs sell out frequently. Either wait (the `watch` command polls automatically) or try with `--gpu ""` to accept any available GPU type.
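If you'd rather poll availability by hand before launching, a plain `watch` loop over the types table works:

```bash
# Re-print the availability table every 60 seconds (Ctrl-C to stop)
watch -n 60 lambdactl types
```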
The `iupgrade_wait` function in the BATS harness has a 65-second timeout. If the driver image is being pulled from a registry (not pre-loaded), it may take longer. Make sure Step 7 completed successfully.
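To see whether the driver pods are stuck pulling instead of running from the pre-loaded image (the `grep` pattern is an assumption; adjust it to the chart's release name):

```bash
ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" \
  "kubectl get pods -A | grep -i dra"
```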
CDI is not enabled in containerd. Make sure you passed `ENABLE_CDI=true` in Step 6.
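To verify on the instance (assuming containerd's config is at the default `/etc/containerd/config.toml`), look for the CRI plugin's `enable_cdi` setting:

```bash
ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" \
  "grep -n cdi /etc/containerd/config.toml"
```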
The GPU doesn't have NVLink/NVSwitch support. Set `DISABLE_COMPUTE_DOMAINS=true` in Step 8 (already included above). This tells the BATS harness to install the chart with compute domains disabled.
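To see what interconnect the instance actually has:

```bash
# Prints the GPU topology matrix; a single-GPU instance has no NVLink peers to show
ssh ${SSH_OPTS} -i "${SSH_KEY}" "ubuntu@${INSTANCE_IP}" "nvidia-smi topo -m"
```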
The BATS output directory is at `/tmp/dra-driver-nvidia-gpu/k8s-dra-driver-gpu-tests-out-ubuntu/` on the Lambda instance. You can rsync it back:

```bash
rsync -a -e "ssh ${SSH_OPTS} -i ${SSH_KEY}" \
  "ubuntu@${INSTANCE_IP}:/tmp/dra-driver-nvidia-gpu/k8s-dra-driver-gpu-tests-out-ubuntu/" \
  ./test-artifacts/
```

| Test | What it does |
|---|---|
| 1 pod, 1 full GPU | Allocates a single GPU via a DRA ResourceClaim, runs `nvidia-smi -L` |
| 2 pods, shared GPU | Two pods share one GPU via a single ResourceClaim |
| 2 containers, shared GPU | One pod with two containers sharing a GPU via ResourceClaimTemplate |
| CUDA demo suite | Installs and runs `deviceQuery`, `vectorAdd`, `bandwidthTest` from the CUDA demo suite |
| Job with RCT | A Kubernetes Job with 2 completions (`parallelism=1`), each getting a GPU via ResourceClaimTemplate |
In Prow CI, the same flow is automated by:
- `hack/ci/lambda/e2e-test.sh` in the driver repo (orchestrator)
- `experiment/lambda/lib/lambda-common.sh` in test-infra (Lambda lifecycle helpers)
- `experiment/lambda/lib/setup-k8s-node.sh` in test-infra (cluster setup)

The Prow job `pull-dra-driver-nvidia-gpu-e2e-lambda-gpu` runs this on every PR.