MIG (Multi-Instance GPU) test run documentation — nvidia-ci on OCP 4.21 with A100 GPUs

MIG (Multi-Instance GPU) Test Run — nvidia-ci

Date: 2026-06-05
Branch: main (commit 68954b2)
Cluster: (AWS us-west-2)

Versions

Component	Version
OpenShift	4.21.14
Kubernetes	v1.34.6
RHCOS	9.6.20260504-0 (Plow)
Kernel	5.14.0-570.112.1.el9_6.x86_64
CRI-O	1.34.7-2.rhaos4.21
GPU Operator	25.3.4 (`gpu-operator-certified.v25.3.4`)
NFD Operator	4.21.0-202605260453
NVIDIA Driver	580.82.07
CUDA Runtime	13.0
GPU Model	NVIDIA A100-SXM4-40GB
Instance Type	p4d.24xlarge (8x A100)
Go	1.25.5
Ginkgo	2.28.1

Test Flow

Phase 1 — Deploy GPU Operator

The GPU Operator deploy test (tests/nvidiagpu/deploygpu_test.go) was run first to install NFD and the GPU Operator on a clean cluster. This first run used the default NVIDIAGPU_CLEANUP=true, so both operators were removed again when the test finished. They therefore had to be deployed a second time, manually, before the MIG tests could run.

# First run (with cleanup — operators removed after test):
export KUBECONFIG=/home/jose/.kube/config
export TEST_FEATURES=nvidiagpu
export TEST_LABELS="nvidia-ci,gpu"
export TEST_VERBOSE=true
ginkgo -timeout=24h --keep-going --require-suite -r -vv \
  --label-filter="nvidia-ci,gpu" ./tests/nvidiagpu

Result: PASSED (1 passed, 3 skipped). Duration: ~20 minutes.

Since NVIDIAGPU_CLEANUP=true (default) removed the operators, the GPU Operator was re-deployed manually for Phase 2:

# Create namespaces, OperatorGroups, Subscriptions
oc create namespace openshift-nfd
oc create namespace nvidia-gpu-operator
# ... (OperatorGroups and Subscriptions for NFD and GPU Operator via OLM)
# Create ClusterPolicy from CSV ALM examples
oc get csv -n nvidia-gpu-operator -o jsonpath='{.items[0].metadata.annotations.alm-examples}' \
  | python3 -c "..." > /tmp/clusterpolicy.json
oc apply -f /tmp/clusterpolicy.json

Phase 2 — Run MIG Tests (single-mig)

export KUBECONFIG=/home/jose/.kube/config
export TEST_FEATURES=mig
export TEST_LABELS="nvidia-ci,mig,single-mig"
export TEST_VERBOSE=true
export NVIDIAGPU_CLEANUP=false
ginkgo -timeout=24h --keep-going --require-suite -r -vv \
  --label-filter="nvidia-ci,mig,single-mig" \
  ./tests/mig -- --single.mig.profile=-1

Result: PASSED (1 passed, 1 skipped). Duration: ~11m36s.

What the single-mig test does

BeforeAll: Parses CLI parameters, reports OCP version, ensures NFD is installed
Check MIG capability: Waits for nvidia.com/mig.capable=true label on GPU nodes
Pull ClusterPolicy: Retrieves existing gpu-cluster-policy
Configure MIG strategy: Sets nvidia.com/mig.config label with the selected profile, configures single MIG strategy in ClusterPolicy
Wait for reconfiguration: ClusterPolicy goes notReady → ready (~7 min)
Query MIG profiles: Runs nvidia-smi mig -lgip inside the driver pod to discover available profiles
Select profile: --single.mig.profile=-1 means random selection; test selected 2g.10gb
Deploy gpu-burn workload: Creates namespace, configmap, and gpu-burn pod targeting MIG instances
Validate: Waits for gpu-burn to complete (300s burn), parses logs for GPU X: OK
Cleanup: Resets MIG labels to all-disabled, deletes workload resources
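
The label and ClusterPolicy steps above can be approximated by hand with oc. The following is a minimal sketch, not the test's actual implementation: DRY_RUN, run, mig_config_value, configure_single_mig and reset_mig are hypothetical names, the all-<profile> config name assumes the GPU Operator's default MIG configuration, and the 15 minute timeout is an assumption based on the roughly 7 minutes observed in this run. By default it only prints the commands.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumption: the default MIG config names have the form all-<profile>.
mig_config_value() {
  echo "all-$1"
}

# Usage: configure_single_mig NODE PROFILE
configure_single_mig() {
  node=$1
  profile=$2
  run oc label node "$node" \
    "nvidia.com/mig.config=$(mig_config_value "$profile")" --overwrite
  run oc patch clusterpolicy gpu-cluster-policy --type merge \
    -p '{"spec":{"mig":{"strategy":"single"}}}'
  # ClusterPolicy reports notReady while the GPUs are reconfigured.
  # A real script should first wait for the state to leave "ready",
  # otherwise this wait can return before reconfiguration has started.
  run oc wait clusterpolicy/gpu-cluster-policy \
    --for=jsonpath='{.status.state}'=ready --timeout=15m
}

# Usage: reset_mig NODE (mirrors the cleanup step)
reset_mig() {
  run oc label node "$1" nvidia.com/mig.config=all-disabled --overwrite
}
```

For example, `configure_single_mig <node> 2g.10gb` prints the three commands for the profile selected in this run, and `DRY_RUN=0 configure_single_mig <node> 2g.10gb` would execute them.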

MIG profile used

Profile: 2g.10gb (randomly selected from A100 available profiles)
Instances: 3 MIG instances across the GPUs
Performance: ~4.9 Tflop/s per instance, 0 errors
Temperatures: 37–40°C
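
The pass criterion (a GPU X: OK line in the gpu-burn logs and no errors) can also be checked by hand against saved pod logs. This is a rough sketch only: gpu_burn_ok is a hypothetical helper, and the FAULTY marker is an assumption about how gpu-burn reports a failing device, since only the OK format is mentioned in this report.

```shell
# Read gpu-burn output on stdin, print a summary, and return success
# only if at least one GPU reported OK and none reported FAULTY.
gpu_burn_ok() {
  log=$(cat)
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  ok=$(printf '%s\n' "$log" | grep -cE '^[[:space:]]*GPU [0-9]+: OK' || true)
  faulty=$(printf '%s\n' "$log" | grep -cE '^[[:space:]]*GPU [0-9]+: FAULTY' || true)
  echo "ok=$ok faulty=$faulty"
  [ "$ok" -gt 0 ] && [ "$faulty" -eq 0 ]
}
```

Typical use would be `oc logs -n <namespace> <gpu-burn-pod> | gpu_burn_ok`, where the namespace and pod name are whatever the test created.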

Issues Encountered

1. `KUBECONFIG` must be set explicitly

The test framework (inittools) requires the KUBECONFIG environment variable to be set. It does not fall back to ~/.kube/config like oc/kubectl do.
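
One way to avoid this is to export the same default that oc and kubectl use before invoking ginkgo. A minimal sketch, assuming a single kubeconfig path rather than a colon-separated list; ensure_kubeconfig is a hypothetical helper name, not part of nvidia-ci.

```shell
# Export KUBECONFIG, keeping an explicit value if one is already set
# and otherwise using the location oc/kubectl default to.
ensure_kubeconfig() {
  : "${KUBECONFIG:=$HOME/.kube/config}"
  export KUBECONFIG
  # Assumes a single file path, not a colon-separated list.
  if [ ! -r "$KUBECONFIG" ]; then
    echo "kubeconfig not readable: $KUBECONFIG" >&2
    return 1
  fi
}
```

Calling `ensure_kubeconfig` at the top of a wrapper script makes the test run fail early with a clear message instead of inside the framework.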

2. NFD label format mismatch (`pci-10de.present`) — user error

The MIG test timed out waiting for feature.node.kubernetes.io/pci-10de.present=true on GPU nodes. The node instead had feature.node.kubernetes.io/pci-0302_10de.present=true.

Root cause: User error when manually deploying the NFD CR. The manual CR used workerConfig.configData: "", which causes NFD to use its default deviceLabelFields: [class, vendor] — producing pci-0302_10de.present (class+vendor format). The test suite deploys the NFD CR from the CSV's ALM examples, which include the NVIDIA-recommended config with deviceLabelFields: [vendor] — producing the expected pci-10de.present.
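
To make the difference concrete, the sketch below builds the label name the way NFD composes it, by joining the selected field values with an underscore. nfd_pci_label is a hypothetical helper written for illustration, and it handles only the two fields discussed here (class and vendor).

```shell
# Usage: nfd_pci_label CLASS VENDOR FIELD...
# Prints the PCI presence label NFD would create for the given
# deviceLabelFields (only "class" and "vendor" are handled here).
nfd_pci_label() {
  class=$1
  vendor=$2
  shift 2
  parts=""
  for field in "$@"; do
    case $field in
      class)  value=$class ;;
      vendor) value=$vendor ;;
      *) echo "unsupported field: $field" >&2; return 1 ;;
    esac
    # Join the selected values with an underscore.
    parts="${parts:+${parts}_}${value}"
  done
  echo "feature.node.kubernetes.io/pci-${parts}.present"
}

# Values from this run: PCI class 0302, vendor 10de (NVIDIA).
nfd_pci_label 0302 10de class vendor   # NFD default fields
nfd_pci_label 0302 10de vendor         # fields from the ALM example
```

The first call prints feature.node.kubernetes.io/pci-0302_10de.present (what the node had), and the second prints feature.node.kubernetes.io/pci-10de.present (what the test waits for).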

Lesson: When manually deploying NFD, always include the proper workerConfig:

spec:
  workerConfig:
    configData: |
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "03"
            - "0200"
            - "0207"
          deviceLabelFields:
            - vendor

Or better yet, deploy from the CSV's ALM examples as the test suite does.

Workaround used: Manually added the expected label:

oc label node <node> feature.node.kubernetes.io/pci-10de.present=true

3. `NVIDIAGPU_CLEANUP=true` (default) removes operators after deploy test

When running the deploy test followed by MIG tests, set NVIDIAGPU_CLEANUP=false to keep the GPU Operator installed, or deploy manually between phases.
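
A possible way to chain both phases without losing the operators is sketched below. It reuses the environment variables and ginkgo flags from this run; DRY_RUN and run_suite are hypothetical names, and by default the script only prints the ginkgo commands it would run.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}

# Usage: run_suite FEATURES LABELS TEST_DIR [extra ginkgo args...]
run_suite() {
  features=$1
  labels=$2
  dir=$3
  shift 3
  export TEST_FEATURES="$features" TEST_LABELS="$labels" TEST_VERBOSE=true
  # Keep NFD and the GPU Operator installed after the suite finishes.
  export NVIDIAGPU_CLEANUP=false
  # Rebuild the argument list as the full ginkgo command line.
  set -- ginkgo -timeout=24h --keep-going --require-suite -r -vv \
    --label-filter="$labels" "$dir" "$@"
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Phase 1 (deploy), then Phase 2 (single-mig) only if Phase 1 succeeded.
run_suite nvidiagpu "nvidia-ci,gpu" ./tests/nvidiagpu &&
  run_suite mig "nvidia-ci,mig,single-mig" ./tests/mig -- --single.mig.profile=-1
```

KUBECONFIG still has to be exported beforehand, as described in issue 1.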

4. Ginkgo version mismatch

The system-installed Ginkgo (2.29.0) didn't match the vendored version (2.28.1). Fixed by installing from vendor:

go install ./vendor/github.com/onsi/ginkgo/v2/ginkgo
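
The mismatch can be caught before a run with a small check like the one below. It is a sketch under two assumptions: that the repository's go.mod lists github.com/onsi/ginkgo/v2 on its own line inside a require block, and that `ginkgo version` prints the version number as the last word of its output. required_ginkgo_version and check_ginkgo are hypothetical helper names.

```shell
# Print the ginkgo version pinned in go.mod, without the leading "v".
# Expects a line such as: github.com/onsi/ginkgo/v2 v2.28.1
required_ginkgo_version() {
  awk '$1 == "github.com/onsi/ginkgo/v2" { sub(/^v/, "", $2); print $2; exit }' \
    "${1:-go.mod}"
}

# Fail if the ginkgo CLI on PATH differs from the pinned version.
check_ginkgo() {
  want=$(required_ginkgo_version "${1:-go.mod}")
  # Assumes output like "Ginkgo Version 2.28.1".
  have=$(ginkgo version | awk '{ print $NF }')
  if [ "$want" != "$have" ]; then
    echo "ginkgo $have on PATH, go.mod pins $want" >&2
    return 1
  fi
}
```

Running `check_ginkgo` from the repository root before the test command would have flagged the 2.29.0 versus 2.28.1 difference seen here.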

josecastillolema/mig-test-run.md