@josecastillolema
Last active June 9, 2026 15:59
MIG (Multi-Instance GPU) test run documentation — nvidia-ci on OCP 4.21 with A100 GPUs

MIG (Multi-Instance GPU) Test Run — nvidia-ci

Date: 2026-06-05
Branch: main (commit 68954b2)
Cluster: (AWS us-west-2)

Versions

Component       Version
OpenShift       4.21.14
Kubernetes      v1.34.6
RHCOS           9.6.20260504-0 (Plow)
Kernel          5.14.0-570.112.1.el9_6.x86_64
CRI-O           1.34.7-2.rhaos4.21
GPU Operator    25.3.4 (gpu-operator-certified.v25.3.4)
NFD Operator    4.21.0-202605260453
NVIDIA Driver   580.82.07
CUDA Runtime    13.0
GPU Model       NVIDIA A100-SXM4-40GB
Instance Type   p4d.24xlarge (8x A100)
Go              1.25.5
Ginkgo          2.28.1

Test Flow

Phase 1 — Deploy GPU Operator

The GPU Operator deploy test (tests/nvidiagpu/deploygpu_test.go) was run first to install NFD and the GPU Operator on a clean cluster. The first run used NVIDIAGPU_CLEANUP=true (the default), so the test removed the operators when it finished. They therefore had to be deployed again, manually, before the MIG tests could run.

# First run (with cleanup — operators removed after test):
export KUBECONFIG=/home/jose/.kube/config
export TEST_FEATURES=nvidiagpu
export TEST_LABELS="nvidia-ci,gpu"
export TEST_VERBOSE=true
ginkgo -timeout=24h --keep-going --require-suite -r -vv \
  --label-filter="nvidia-ci,gpu" ./tests/nvidiagpu

Result: PASSED (1 passed, 3 skipped). Duration: ~20 minutes.

Because the default NVIDIAGPU_CLEANUP=true removed the operators, NFD and the GPU Operator were redeployed manually for Phase 2:

# Create namespaces, OperatorGroups, Subscriptions
oc create namespace openshift-nfd
oc create namespace nvidia-gpu-operator
# ... (OperatorGroups and Subscriptions for NFD and GPU Operator via OLM)
# Create ClusterPolicy from CSV ALM examples
oc get csv -n nvidia-gpu-operator -o jsonpath='{.items[0].metadata.annotations.alm-examples}' \
  | python3 -c "..." > /tmp/clusterpolicy.json
oc apply -f /tmp/clusterpolicy.json

Phase 2 — Run MIG Tests (single-mig)

export KUBECONFIG=/home/jose/.kube/config
export TEST_FEATURES=mig
export TEST_LABELS="nvidia-ci,mig,single-mig"
export TEST_VERBOSE=true
export NVIDIAGPU_CLEANUP=false
ginkgo -timeout=24h --keep-going --require-suite -r -vv \
  --label-filter="nvidia-ci,mig,single-mig" \
  ./tests/mig -- --single.mig.profile=-1

Result: PASSED (1 passed, 1 skipped). Duration: ~11m36s.

What the single-mig test does

  1. BeforeAll: Parses CLI parameters, reports OCP version, ensures NFD is installed
  2. Check MIG capability: Waits for nvidia.com/mig.capable=true label on GPU nodes
  3. Pull ClusterPolicy: Retrieves existing gpu-cluster-policy
  4. Configure MIG strategy: Sets nvidia.com/mig.config label with the selected profile, configures single MIG strategy in ClusterPolicy
  5. Wait for reconfiguration: ClusterPolicy goes notReady → ready (~7 min)
  6. Query MIG profiles: Runs nvidia-smi mig -lgip inside the driver pod to discover available profiles
  7. Select profile: --single.mig.profile=-1 means random selection; test selected 2g.10gb
  8. Deploy gpu-burn workload: Creates namespace, configmap, and gpu-burn pod targeting MIG instances
  9. Validate: Waits for gpu-burn to complete (300s burn), parses logs for GPU X: OK
  10. Cleanup: Resets MIG labels to all-disabled, deletes workload resources
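Steps 6, 7 and 9 above can be approximated by hand with two small shell helpers. This is only a sketch under stated assumptions, not the test suite's own code: the function names are hypothetical, the profile names are assumed to appear as "MIG <n>g.<m>gb" in the nvidia-smi mig -lgip table, and a failing GPU is assumed to be reported by gpu-burn as "GPU <n>: FAULTY".

```shell
# Sketch only: the table and log formats matched here are assumptions,
# not taken from the nvidia-ci sources.

# Read `nvidia-smi mig -lgip` output on stdin and print the distinct
# profile names (for example 2g.10gb), one per line.
list_mig_profiles() {
  grep -oE 'MIG [0-9]+g\.[0-9]+gb' | awk '{ print $2 }' | sort -u
}

# Read gpu-burn logs on stdin. Succeed only if at least one
# "GPU <n>: OK" line is present and no "GPU <n>: FAULTY" line is.
gpu_burn_passed() {
  local logs
  logs="$(cat)"
  if printf '%s\n' "$logs" | grep -qE 'GPU [0-9]+: FAULTY'; then
    return 1
  fi
  printf '%s\n' "$logs" | grep -qE 'GPU [0-9]+: OK'
}
```

A random pick comparable to --single.mig.profile=-1 could then be made by piping list_mig_profiles into shuf -n 1, and the verdict checked by piping the workload pod's logs into gpu_burn_passed.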

MIG profile used

  • Profile: 2g.10gb (randomly selected from A100 available profiles)
  • Instances: 3 MIG instances across the GPUs
  • Performance: ~4.9 Tflop/s per instance, 0 errors
  • Temperatures: 37–40°C

Issues Encountered

1. KUBECONFIG must be set explicitly

The test framework (inittools) requires the KUBECONFIG environment variable to be set. Unlike oc and kubectl, it does not fall back to ~/.kube/config.
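One way to avoid tripping over this is to reproduce the oc/kubectl fallback in the shell before invoking ginkgo. The helper below is a minimal sketch; resolve_kubeconfig is a hypothetical name and is not part of nvidia-ci.

```shell
# Hypothetical helper: print the kubeconfig path to use, preferring an
# already-set KUBECONFIG and otherwise falling back to ~/.kube/config.
resolve_kubeconfig() {
  if [ -n "${KUBECONFIG:-}" ]; then
    printf '%s\n' "$KUBECONFIG"
  else
    printf '%s\n' "$HOME/.kube/config"
  fi
}
```

Running export KUBECONFIG="$(resolve_kubeconfig)" at the top of a wrapper script keeps any explicitly chosen kubeconfig and only fills in the default when none is set.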

2. NFD label format mismatch (pci-10de.present) — user error

The MIG test timed out waiting for feature.node.kubernetes.io/pci-10de.present=true on GPU nodes. The node instead had feature.node.kubernetes.io/pci-0302_10de.present=true.

Root cause: User error when manually deploying the NFD CR. The manual CR used workerConfig.configData: "", which causes NFD to use its default deviceLabelFields: [class, vendor] — producing pci-0302_10de.present (class+vendor format). The test suite deploys the NFD CR from the CSV's ALM examples, which include the NVIDIA-recommended config with deviceLabelFields: [vendor] — producing the expected pci-10de.present.
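The effect of deviceLabelFields on the label name can be illustrated with a small shell function. It is a simplified sketch of the naming pattern described above, not NFD's implementation: nfd_pci_label is a hypothetical helper and it only handles the class and vendor fields.

```shell
# Hypothetical helper: build the NFD PCI "present" label name from a
# comma-separated deviceLabelFields list, a PCI class and a vendor ID.
# Usage: nfd_pci_label FIELDS CLASS VENDOR
nfd_pci_label() {
  local fields="$1" class="$2" vendor="$3" id="" field value
  for field in $(printf '%s' "$fields" | tr ',' ' '); do
    case "$field" in
      class)  value="$class" ;;
      vendor) value="$vendor" ;;
      *) echo "unsupported field: $field" >&2; return 1 ;;
    esac
    # Join the selected values with "_" in the order the fields were given.
    id="${id:+${id}_}${value}"
  done
  printf 'feature.node.kubernetes.io/pci-%s.present\n' "$id"
}
```

With class 0302 and vendor 10de, the field list class,vendor yields the label that actually appeared on the node, while vendor alone yields the label the test waits for.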

Lesson: When manually deploying NFD, always include the proper workerConfig:

spec:
  workerConfig:
    configData: |
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "03"
            - "0200"
            - "0207"
          deviceLabelFields:
            - vendor

Or better yet, deploy from the CSV's ALM examples as the test suite does.

Workaround used: Manually added the expected label:

oc label node <node> feature.node.kubernetes.io/pci-10de.present=true

3. NVIDIAGPU_CLEANUP=true (default) removes operators after deploy test

When running the deploy test followed by the MIG tests, set NVIDIAGPU_CLEANUP=false for the deploy run so that NFD and the GPU Operator stay installed. Otherwise, redeploy them manually between phases.

4. Ginkgo version mismatch

The system-installed Ginkgo CLI (2.29.0) did not match the vendored version (2.28.1). Fixed by installing the CLI from the vendored sources:

go install ./vendor/github.com/onsi/ginkgo/v2/ginkgo
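The mismatch can be detected before a run with a short check. The sketch below rests on two assumptions: the repository's go.mod pins github.com/onsi/ginkgo/v2, and ginkgo version prints a line ending in the version number. All three function names are hypothetical.

```shell
# Sketch only: assumes go.mod pins github.com/onsi/ginkgo/v2 and that
# `ginkgo version` prints a line ending in the version number.

# Read go.mod content on stdin; print the pinned Ginkgo v2 version
# without the leading "v".
pinned_ginkgo_version() {
  awk '{
    for (i = 1; i < NF; i++) {
      if ($i == "github.com/onsi/ginkgo/v2") {
        v = $(i + 1); sub(/^v/, "", v); print v; exit
      }
    }
  }'
}

# Read `ginkgo version` output on stdin; print the last field of the
# first non-empty line.
installed_ginkgo_version() {
  awk 'NF { print $NF; exit }'
}

# Succeed only if both versions are non-empty and identical.
ginkgo_versions_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
```

From the repository root, ginkgo_versions_match "$(pinned_ginkgo_version < go.mod)" "$(ginkgo version | installed_ginkgo_version)" would fail on the 2.29.0 versus 2.28.1 case described here, which is the cue to reinstall from vendor.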