José Castillo Lema josecastillolema

Deploying MicroShift on Ubuntu 24.04 (baseline x86-64 CPU)

This documents deploying MicroShift on an Ubuntu 24.04 LTS VM with a QEMU virtual CPU that only supports baseline x86-64 (no x86-64-v2).

Environment

OS: Ubuntu 24.04.3 LTS (Noble Numbat)
CPU: QEMU Virtual CPU version 2.5+ (4 vCPUs, baseline x86-64 only — no SSSE3/SSE4)
RAM: 8 GB
Disk: 22 GB (17 GB free)
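
Whether a CPU reaches the x86-64-v2 level can be checked from the flags Linux reports in /proc/cpuinfo. The following is a minimal sketch and is not taken from the original guide: the function name supports_x86_64_v2 is hypothetical, and the flag list is the Linux spelling of the x86-64-v2 feature set (pni is how the kernel reports SSE3).

```shell
# Hypothetical helper: succeed if the given space-separated CPU flag list
# contains every flag required by the x86-64-v2 level.
supports_x86_64_v2() {
  flags=" $1 "
  for f in cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3; do
    case "$flags" in
      *" $f "*) ;;                               # flag present, keep checking
      *) echo "missing: $f"; return 1 ;;         # report the first missing flag
    esac
  done
  return 0
}

# Usage on a live Linux system (reads the first "flags" line):
# supports_x86_64_v2 "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```

On a baseline-only virtual CPU like the one described above, the helper would be expected to stop at the first absent flag and return a non-zero status.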

MIG (Multi-Instance GPU) test run documentation — nvidia-ci on OCP 4.21 with A100 GPUs

MIG (Multi-Instance GPU) Test Run — nvidia-ci

Date: 2026-06-05
Branch: main (commit 68954b2)
Cluster: (AWS us-west-2)

Failed Experiment: dma-buf GPU Memory Support for Soft-RoCE (rxe)

Goal

Enable GPUDirect RDMA on software RoCE (rxe) by implementing reg_user_mr_dmabuf in the rxe kernel driver and the corresponding userspace provider in rdma-core. This would allow ib_write_bw --use_cuda=0 --use_cuda_dmabuf to work with rxe devices — no Mellanox NIC required.

What was built

Kernel side (3 files, ~115 lines)

GPU Time-Slicing on OpenShift

Configure NVIDIA GPU time-slicing to share a single physical GPU across multiple pods. This guide assumes NFD and the GPU Operator are already deployed with a working ClusterPolicy.

Prerequisites

Verify the GPU Operator is ready and a GPU node is available:

oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'
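
Both prerequisite checks can be wrapped in small helpers. This is a sketch under assumptions, not the guide's own procedure: the function names are hypothetical, "ready" is assumed to be the healthy value of .status.state, and GPU nodes are assumed to carry the NFD label feature.node.kubernetes.io/pci-10de.present=true.

```shell
# Hypothetical helper: succeed only if the ClusterPolicy reports "ready".
gpu_operator_ready() {
  state=$(oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}')
  [ "$state" = "ready" ]
}

# Hypothetical helper: print how many nodes carry the NVIDIA PCI vendor label.
gpu_node_count() {
  oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true -o name \
    | wc -l | tr -d ' '
}

# Usage:
# gpu_operator_ready && [ "$(gpu_node_count)" -ge 1 ] && echo "prerequisites met"
```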

Deploying the GPU Operator with Precompiled Signed Drivers on OpenShift (AWS Secure Boot)

Prerequisites

An OpenShift cluster on AWS with UEFI Secure Boot enabled and NFD (Node Feature Discovery) deployed. See Enabling UEFI Secure Boot for OCP Workers on AWS for the setup procedure.
GPU worker nodes (e.g., g4dn.xlarge) labeled by NFD with feature.node.kubernetes.io/pci-10de.present=true.

Identifying the Correct Precompiled Driver Image

Precompiled driver images are kernel-specific. First, determine the kernel version running on your RHCOS nodes:
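
One possible way to collect the kernel versions is to read them from the node status. This is an illustrative sketch rather than the command used in the original write-up: the function name is hypothetical, and it assumes oc is logged in to the cluster.

```shell
# Hypothetical helper: print each distinct kernel version reported by the
# cluster's nodes (taken from .status.nodeInfo.kernelVersion).
node_kernel_versions() {
  oc get nodes \
    -o jsonpath='{range .items[*]}{.status.nodeInfo.kernelVersion}{"\n"}{end}' \
    | sort -u
}

# Usage:
# node_kernel_versions
```

More than one line of output would suggest that the nodes run different kernels, in which case a single precompiled image would not fit all of them.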

Simulating GPU Hardware Failure on OpenShift with NVIDIA GPU Operator

TL;DR

When a GPU fails without producing a critical XID error in the kernel log (e.g., XIDs 48, 74, 79, 94, 95), the NVIDIA device plugin does not detect the failure. The node continues advertising nvidia.com/gpu: 1, no taint or cordon is applied, and the scheduler keeps sending GPU pods to the broken node. Those pods start, fail with CUDA errors (invalid device ordinal), and exit, yet the node remains eligible for more GPU workloads. This document reproduces the issue end-to-end on an OpenShift 4.21 cluster on AWS using a g4dn.xlarge instance (NVIDIA Tesla T4).

The failure was simulated by removing the GPU from the PCI bus:

echo 1 > /sys/bus/pci/devices/0000:00:1e.0/remove
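
The PCI address differs between instance types, so it may be safer to look it up than to hard-code it. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it matches any device whose sysfs vendor ID is 0x10de (NVIDIA), which can include non-GPU functions on some hardware.

```shell
# Hypothetical helper: list PCI addresses whose vendor ID is NVIDIA (0x10de).
# The sysfs root is a parameter so the function can be tried on a fake tree.
find_nvidia_pci_devices() {
  root="${1:-/sys/bus/pci/devices}"
  for dev in "$root"/*; do
    [ -r "$dev/vendor" ] || continue
    if [ "$(cat "$dev/vendor")" = "0x10de" ]; then
      basename "$dev"
    fi
  done
}

# Usage on the node (as root):
# for addr in $(find_nvidia_pci_devices); do
#   echo 1 > "/sys/bus/pci/devices/$addr/remove"   # simulate the failure
# done
# echo 1 > /sys/bus/pci/rescan                     # re-enumerate the bus afterwards
```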

Enabling UEFI Secure Boot for OCP Workers on AWS

AWS does not publish any Linux AMI (including RHCOS) with UEFI Secure Boot enabled. The stock RHCOS AMIs ship with legacy-bios boot mode. Because RHCOS has used a unified BIOS/UEFI partition layout since OCP 4.3, the same disk image can also boot in UEFI mode; we just need to re-register the AMI with the correct boot mode and a UEFI variable store containing Secure Boot keys.
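
Before re-registering, the boot mode an AMI was registered with can be inspected through the AWS CLI. This is a rough sketch with hypothetical function names, assuming the CLI is configured for the right region. The re-registration itself would go through aws ec2 register-image with its --boot-mode and --uefi-data options and is not shown here.

```shell
# Hypothetical helper: print the BootMode recorded for an AMI
# (the CLI prints "None" in text output when the field is unset).
ami_boot_mode() {
  aws ec2 describe-images --image-ids "$1" \
    --query 'Images[0].BootMode' --output text
}

# Hypothetical helper: succeed only if the AMI is registered as uefi.
ami_is_uefi() {
  [ "$(ami_boot_mode "$1")" = "uefi" ]
}

# Usage (the AMI ID is a placeholder):
# ami_is_uefi <ami-id> || echo "AMI must be re-registered with UEFI boot mode"
```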

This only applies to Nitro-based virtualized instance types (e.g. m5, m6i, g4dn). Bare metal instances do not support UEFI Secure Boot. Note that G4ad instances are

Name	Age	Blood
Mary Clan	82	A+
Jane Doe	46	AB+
Ted Coe	50	B-
Bill Murray	44	A-
Clint East	52	B+
Spencer Tracy	79	B+
Bill Ted	56	0+
Bob Smith	66	A-
Bill Cosby	81	A+

K8s / OpenShift

One liners

Create a debug pod:

kubectl run debug -it --rm --image=alpine --restart=Never -n <namespace> -- sh

Create a pod: