Failed experiment: dma-buf GPU memory support for Soft-RoCE (rxe)
Goal
Enable GPUDirect RDMA on software RoCE (rxe) by implementing reg_user_mr_dmabuf in the rxe kernel driver and the corresponding userspace provider in rdma-core. This would allow ib_write_bw --use_cuda=0 --use_cuda_dmabuf to work with rxe devices — no Mellanox NIC required.
GPU Time-Slicing on OpenShift — manual setup guide
Configure NVIDIA GPU time-slicing to share a single physical GPU across multiple pods. This guide assumes NFD and the GPU Operator are already deployed with a working ClusterPolicy.
Prerequisites
Verify the GPU Operator is ready and a GPU node is available:
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'
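For scripting, the readiness check above can be wrapped in a small helper. This is a minimal sketch, not part of the original guide: the function name `gpu_operator_ready` is hypothetical, and it assumes the policy is named `gpu-cluster-policy` as in the command above and that a healthy operator reports the state `ready`.

```shell
# Hypothetical helper: succeeds only when the ClusterPolicy reports "ready".
# Assumes the policy name gpu-cluster-policy used in the command above.
gpu_operator_ready() {
  state="$(oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}')"
  [ "$state" = "ready" ]
}
```

A possible use is `gpu_operator_ready && echo "GPU Operator is ready"`, so later steps run only once the operator has settled.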
Simulating GPU Hardware Failure on OpenShift with NVIDIA GPU Operator
TL;DR
When a GPU fails in a way that does not produce a critical XID error in the kernel (e.g., XIDs 48, 74, 79, 94, 95), the NVIDIA device plugin does not detect the failure. The node continues advertising nvidia.com/gpu: 1, no taint or cordon is applied, and the scheduler keeps sending GPU pods to the broken node. Those pods start, fail with CUDA errors (invalid device ordinal), and exit — but the node remains eligible for more GPU workloads. This document reproduces the issue end-to-end on an OpenShift 4.21 cluster on AWS using a g4dn.xlarge instance (NVIDIA Tesla T4).
The failure was simulated by removing the GPU from the PCI bus:
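The exact command used is not shown in this excerpt. As a hedged sketch, one common way to detach a PCI device on Linux is the kernel's sysfs `remove` file. The helper name `remove_pci_device` and the `SYSFS_ROOT` override are assumptions added for illustration; the PCI address must be looked up on the node itself (for example with `lspci -D`).

```shell
# Hypothetical helper: detach one PCI device through sysfs.
# $1 is the PCI address in domain:bus:device.function form.
# SYSFS_ROOT is an illustrative override so the path can be redirected for a dry run.
remove_pci_device() {
  addr="$1"
  dev="${SYSFS_ROOT:-/sys}/bus/pci/devices/$addr"
  if [ ! -e "$dev/remove" ]; then
    echo "no such PCI device: $addr" >&2
    return 1
  fi
  # Writing 1 asks the kernel to remove the device from the bus.
  echo 1 > "$dev/remove"
}
```

This needs root on the GPU node. The device stays detached until the bus is rescanned or the node reboots.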
AWS does not publish any Linux AMI (including RHCOS) with UEFI Secure Boot enabled.
The stock RHCOS AMIs ship with legacy-bios boot mode. Since RHCOS uses a unified
BIOS/UEFI partition layout (since OCP 4.3), the same disk image can boot in UEFI mode --
we just need to re-register the AMI with the correct boot mode and a UEFI variable store
containing Secure Boot keys.
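
The re-registration step could look roughly like the sketch below, which uses `aws ec2 register-image` with its `--boot-mode` and `--uefi-data` options. The helper name `register_uefi_ami`, its arguments, and the `/dev/xvda` root device name are assumptions for illustration. It presumes a snapshot of the RHCOS disk and a base64-encoded UEFI variable store file already exist.

```shell
# Hypothetical helper: register an EBS snapshot as an AMI that boots in UEFI mode.
# $1 = AMI name, $2 = snapshot ID of the RHCOS disk,
# $3 = file holding the base64-encoded UEFI variable store (with Secure Boot keys).
# The root device name /dev/xvda is an assumption; match it to the source AMI.
register_uefi_ami() {
  name="$1"
  snapshot="$2"
  uefi_data_file="$3"
  aws ec2 register-image \
    --name "$name" \
    --architecture x86_64 \
    --virtualization-type hvm \
    --ena-support \
    --root-device-name /dev/xvda \
    --block-device-mappings "DeviceName=/dev/xvda,Ebs={SnapshotId=$snapshot}" \
    --boot-mode uefi \
    --uefi-data "$(cat "$uefi_data_file")"
}
```

On success the command prints the ID of the new AMI.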
This only applies to Nitro-based virtualized instance types (e.g. m5, m6i, g4dn).
Bare metal instances do not support UEFI Secure Boot. Note that G4ad instances are