MIG (Multi-Instance GPU) test run documentation — nvidia-ci on OCP 4.21 with A100 GPUs
Date: 2026-06-05
Branch: main (commit 68954b2)
Cluster: (AWS us-west-2)
| Component | Version |
|---|---|
| OpenShift | 4.21.14 |
| Kubernetes | v1.34.6 |
| RHCOS | 9.6.20260504-0 (Plow) |
| Kernel | 5.14.0-570.112.1.el9_6.x86_64 |
| CRI-O | 1.34.7-2.rhaos4.21 |
| GPU Operator | 25.3.4 (gpu-operator-certified.v25.3.4) |
| NFD Operator | 4.21.0-202605260453 |
| NVIDIA Driver | 580.82.07 |
| CUDA Runtime | 13.0 |
| GPU Model | NVIDIA A100-SXM4-40GB |
| Instance Type | p4d.24xlarge (8x A100) |
| Go | 1.25.5 |
| Ginkgo | 2.28.1 |
The GPU Operator deploy test (tests/nvidiagpu/deploygpu_test.go) was run first to install
NFD and the GPU Operator on a clean cluster. It was initially run with NVIDIAGPU_CLEANUP=true
(default), which removed both operators after the test. They were then deployed again
manually, and the MIG test was run with cleanup disabled.
# First run (with cleanup — operators removed after test):
export KUBECONFIG=/home/jose/.kube/config
export TEST_FEATURES=nvidiagpu
export TEST_LABELS="nvidia-ci,gpu"
export TEST_VERBOSE=true
ginkgo -timeout=24h --keep-going --require-suite -r -vv \
--label-filter="nvidia-ci,gpu" ./tests/nvidiagpu
Result: PASSED (1 passed, 3 skipped). Duration: ~20 minutes.
Since NVIDIAGPU_CLEANUP=true (default) removed the operators, the GPU Operator was
re-deployed manually for Phase 2:
# Create namespaces, OperatorGroups, Subscriptions
oc create namespace openshift-nfd
oc create namespace nvidia-gpu-operator
# ... (OperatorGroups and Subscriptions for NFD and GPU Operator via OLM)
# Create ClusterPolicy from CSV ALM examples
oc get csv -n nvidia-gpu-operator -o jsonpath='{.items[0].metadata.annotations.alm-examples}' \
| python3 -c "..." > /tmp/clusterpolicy.json
oc apply -f /tmp/clusterpolicy.json

export KUBECONFIG=/home/jose/.kube/config
export TEST_FEATURES=mig
export TEST_LABELS="nvidia-ci,mig,single-mig"
export TEST_VERBOSE=true
export NVIDIAGPU_CLEANUP=false
ginkgo -timeout=24h --keep-going --require-suite -r -vv \
--label-filter="nvidia-ci,mig,single-mig" \
./tests/mig -- --single.mig.profile=-1
Result: PASSED (1 passed, 1 skipped). Duration: ~11m36s.
- BeforeAll: Parses CLI parameters, reports OCP version, ensures NFD is installed
- Check MIG capability: Waits for nvidia.com/mig.capable=true label on GPU nodes
- Pull ClusterPolicy: Retrieves existing gpu-cluster-policy
- Configure MIG strategy: Sets nvidia.com/mig.config label with the selected profile, configures single MIG strategy in ClusterPolicy
- Wait for reconfiguration: ClusterPolicy goes notReady → ready (~7 min)
- Query MIG profiles: Runs nvidia-smi mig -lgip inside the driver pod to discover available profiles
- Select profile: --single.mig.profile=-1 means random selection; test selected 2g.10gb
- Deploy gpu-burn workload: Creates namespace, configmap, and gpu-burn pod targeting MIG instances
- Validate: Waits for gpu-burn to complete (300s burn), parses logs for GPU X: OK
- Cleanup: Resets MIG labels to all-disabled, deletes workload resources
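The Validate step above can be approximated by hand when inspecting a gpu-burn pod log. The following is a minimal sketch, not the check that nvidia-ci itself runs. It assumes the log contains per-device result lines of the form "GPU <n>: OK", and that a failing device is reported as "GPU <n>: FAULTY". The function name gpu_burn_all_ok is hypothetical.

```shell
# Hypothetical helper (not part of nvidia-ci): reads a gpu-burn log on stdin.
# Succeeds only if at least one "GPU <n>: OK" line is present and no
# "GPU <n>: FAULTY" line is present (the FAULTY wording is an assumption).
gpu_burn_all_ok() {
  local log ok bad
  log="$(cat)"
  # Count per-device result lines that report OK.
  ok="$(printf '%s\n' "$log" | grep -Ec '^[[:space:]]*GPU [0-9]+: OK[[:space:]]*$')"
  # Count per-device result lines that report FAULTY.
  bad="$(printf '%s\n' "$log" | grep -Ec '^[[:space:]]*GPU [0-9]+: FAULTY[[:space:]]*$')"
  [ "$ok" -gt 0 ] && [ "$bad" -eq 0 ]
}
```

It could be fed from the pod log, for example `oc logs -n <namespace> <gpu-burn-pod> | gpu_burn_all_ok && echo PASS`, where the namespace and pod name are placeholders.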
- Profile: 2g.10gb (randomly selected from A100 available profiles)
- Instances: 3 MIG instances across the GPUs
- Performance: ~4.9 Tflop/s per instance, 0 errors
- Temperatures: 37–40°C
The test framework (inittools) requires the KUBECONFIG environment variable to be set.
It does not fall back to ~/.kube/config like oc/kubectl do.
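Because there is no fallback, a pre-flight check before invoking ginkgo can surface the problem early. This is a sketch only: require_kubeconfig is a hypothetical helper, not part of inittools or nvidia-ci, and it assumes KUBECONFIG holds a single file path rather than a colon-separated list.

```shell
# Hypothetical pre-flight check (not part of inittools or nvidia-ci).
# Assumes KUBECONFIG holds a single path, not a colon-separated list.
require_kubeconfig() {
  if [ -z "${KUBECONFIG:-}" ]; then
    echo "KUBECONFIG is not set; export it before running ginkgo" >&2
    return 1
  fi
  if [ ! -r "$KUBECONFIG" ]; then
    echo "KUBECONFIG points to an unreadable file: $KUBECONFIG" >&2
    return 1
  fi
}
```

Calling `require_kubeconfig || exit 1` at the top of a wrapper script stops the run before the suite starts.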
The MIG test timed out waiting for feature.node.kubernetes.io/pci-10de.present=true on GPU
nodes. The node instead had feature.node.kubernetes.io/pci-0302_10de.present=true.
Root cause: User error when manually deploying the NFD CR. The manual CR used
workerConfig.configData: "", which causes NFD to use its default
deviceLabelFields: [class, vendor] and so produces pci-0302_10de.present (class+vendor format).
The test suite deploys the NFD CR from the CSV's ALM examples, which include the
NVIDIA-recommended config with deviceLabelFields: [vendor] and so produce the expected
pci-10de.present.
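The two label formats can be told apart from the label key alone, which helps when diagnosing this timeout. The sketch below is illustrative: pci_label_format is a hypothetical helper, and it assumes the class and vendor parts are each four lowercase hex digits, as in the two labels quoted above.

```shell
# Hypothetical helper: classify an NFD PCI presence label by its key format.
# Accepts the key with or without the domain prefix and "=value" suffix.
pci_label_format() {
  local key
  key="${1#feature.node.kubernetes.io/}"   # drop the label domain, if present
  key="${key%%=*}"                         # drop "=true", if present
  case "$key" in
    pci-[0-9a-f][0-9a-f][0-9a-f][0-9a-f]_[0-9a-f][0-9a-f][0-9a-f][0-9a-f].present)
      echo "class+vendor" ;;
    pci-[0-9a-f][0-9a-f][0-9a-f][0-9a-f].present)
      echo "vendor" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Label keys to classify can be read from `oc get node <node> --show-labels`. A result of class+vendor for the NVIDIA device suggests that the NFD worker config is using the default deviceLabelFields.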
Lesson: When manually deploying NFD, always include the proper workerConfig:
spec:
  workerConfig:
    configData: |
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "03"
            - "0200"
            - "0207"
          deviceLabelFields:
            - vendor

Or better yet, deploy from the CSV's ALM examples as the test suite does.
Workaround used: Manually added the expected label:
oc label node <node> feature.node.kubernetes.io/pci-10de.present=true

When running the deploy test followed by MIG tests, set NVIDIAGPU_CLEANUP=false to keep
the GPU Operator installed, or deploy manually between phases.
The system-installed Ginkgo (2.29.0) didn't match the vendored version (2.28.1). Fixed by installing from vendor:
go install ./vendor/github.com/onsi/ginkgo/v2/ginkgo
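A version mismatch like this can be detected before a run by comparing the CLI version with the one pinned in go.mod. This is a minimal sketch under two assumptions: that go.mod lists github.com/onsi/ginkgo/v2 on its own require line, and that `ginkgo version` prints a line of the form "Ginkgo Version X.Y.Z". All three function names are hypothetical.

```shell
# Hypothetical helpers for comparing the installed Ginkgo CLI with go.mod.

# Print the Ginkgo version pinned in go.mod (default path: ./go.mod),
# for example "2.28.1" from "github.com/onsi/ginkgo/v2 v2.28.1".
gomod_ginkgo_version() {
  sed -nE 's|^[[:space:]]*github\.com/onsi/ginkgo/v2[[:space:]]+v([0-9]+\.[0-9]+\.[0-9]+).*$|\1|p' \
    "${1:-go.mod}" | head -n 1
}

# Read `ginkgo version` output on stdin and print the bare version number.
# Assumes the output format "Ginkgo Version X.Y.Z".
cli_ginkgo_version() {
  sed -nE 's/^Ginkgo Version ([0-9]+\.[0-9]+\.[0-9]+).*$/\1/p' | head -n 1
}

# Succeed only if both versions were found and are identical.
ginkgo_versions_match() {
  local want have
  want="$(gomod_ginkgo_version "${1:-go.mod}")"
  have="$(ginkgo version 2>/dev/null | cli_ginkgo_version)"
  [ -n "$want" ] && [ "$want" = "$have" ]
}
```

Run from the repository root, `ginkgo_versions_match || echo "ginkgo CLI does not match go.mod" >&2` gives an early warning before the suite is started.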