@josecastillolema
josecastillolema / microshift-deployment.md
Last active June 10, 2026 09:24
Deploying MicroShift on Ubuntu 24.04 with baseline x86-64 CPU (no x86-64-v2)

Deploying MicroShift on Ubuntu 24.04 (baseline x86-64 CPU)

This document describes deploying MicroShift on an Ubuntu 24.04 LTS VM whose QEMU virtual CPU supports only baseline x86-64 (no x86-64-v2).

Environment

  • OS: Ubuntu 24.04.3 LTS (Noble Numbat)
  • CPU: QEMU Virtual CPU version 2.5+ (4 vCPUs, baseline x86-64 only — no SSSE3/SSE4)
  • RAM: 8 GB
  • Disk: 22 GB (17 GB free)
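
To confirm that a VM really is limited to baseline x86-64, the CPU flags can be compared against the x86-64-v2 feature list. The sketch below is illustrative only: the helper name `missing_v2_flags` is hypothetical, and the flag names are the ones Linux prints in `/proc/cpuinfo` (`pni` for SSE3, `cx16` for CMPXCHG16B, `lahf_lm` for LAHF/SAHF).

```shell
# Print the x86-64-v2 flags that are absent from a space-separated flag list.
# Prints an empty line when the full v2 feature set is present.
missing_v2_flags() {
    flags=" $1 "
    missing=""
    for f in cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3; do
        case "$flags" in
            *" $f "*) ;;                      # flag present
            *) missing="$missing $f" ;;       # flag absent
        esac
    done
    echo "${missing# }"
}

# On a Linux x86 host, check the flags of the first CPU listed.
if [ -r /proc/cpuinfo ]; then
    cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2) || true
    missing=$(missing_v2_flags "$cpu_flags")
    if [ -z "$missing" ]; then
        echo "CPU covers x86-64-v2"
    else
        echo "CPU lacks x86-64-v2 flags: $missing"
    fi
fi
```

On a CPU like the one described above, the output should list at least `ssse3`, `sse4_1`, and `sse4_2` as missing.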
@josecastillolema
josecastillolema / mig-test-run.md
Last active June 9, 2026 15:59
MIG (Multi-Instance GPU) test run documentation — nvidia-ci on OCP 4.21 with A100 GPUs

MIG (Multi-Instance GPU) Test Run — nvidia-ci

Date: 2026-06-05
Branch: main (commit 68954b2)
Cluster: (AWS us-west-2)

@josecastillolema
josecastillolema / rxe-dmabuf-experiment.md
Last active June 3, 2026 14:55
Failed experiment: dma-buf GPU memory support for Soft-RoCE (rxe)

Failed Experiment: dma-buf GPU Memory Support for Soft-RoCE (rxe)

Goal

Enable GPUDirect RDMA on software RoCE (rxe) by implementing reg_user_mr_dmabuf in the rxe kernel driver and the corresponding userspace provider in rdma-core. This would allow ib_write_bw --use_cuda=0 --use_cuda_dmabuf to work with rxe devices — no Mellanox NIC required.

What was built

Kernel side (3 files, ~115 lines)

@josecastillolema
josecastillolema / gpu-time-slicing.md
Created June 2, 2026 08:54
GPU Time-Slicing on OpenShift — manual setup guide

GPU Time-Slicing on OpenShift

Configure NVIDIA GPU time-slicing to share a single physical GPU across multiple pods. This guide assumes NFD and the GPU Operator are already deployed with a working ClusterPolicy.

Prerequisites

Verify the GPU Operator is ready and a GPU node is available:

oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'
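
When scripting this check, it may help to poll until the state settles rather than reading it once. The following is a minimal sketch, assuming the ClusterPolicy reports `ready` once all operands are up; the helper name `wait_for_ready` and its retry parameters are hypothetical.

```shell
# Run a command repeatedly until it prints "ready" or the attempts run out.
# Usage: wait_for_ready <attempts> <delay-seconds> <command> [args...]
wait_for_ready() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # A failing command is treated the same as "not ready yet".
        state=$("$@" 2>/dev/null) || state=""
        if [ "$state" = "ready" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (requires a logged-in oc session, so it is left commented out):
# wait_for_ready 30 10 oc get clusterpolicy gpu-cluster-policy \
#     -o jsonpath='{.status.state}' || echo "ClusterPolicy not ready"
```

Passing the command as arguments keeps the helper independent of `oc`, so the same loop can wrap any status query.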
@josecastillolema
josecastillolema / gpu-operator-precompiled-drivers.md
Last active May 27, 2026 12:42
Deploying the GPU Operator with Precompiled Signed Drivers on OpenShift (AWS Secure Boot)

Deploying the GPU Operator with Precompiled Signed Drivers on OpenShift (AWS Secure Boot)

Prerequisites

  • An OpenShift cluster on AWS with UEFI Secure Boot enabled and NFD (Node Feature Discovery) deployed. See Enabling UEFI Secure Boot for OCP Workers on AWS for the setup procedure.
  • GPU worker nodes (e.g., g4dn.xlarge) labeled by NFD with feature.node.kubernetes.io/pci-10de.present=true.

Identifying the Correct Precompiled Driver Image

Precompiled driver images are kernel-specific. First, determine the kernel version running on your RHCOS nodes:

@josecastillolema
josecastillolema / gpu-failure-simulation.md
Last active May 19, 2026 10:13
Simulating GPU Hardware Failure on OpenShift with NVIDIA GPU Operator

Simulating GPU Hardware Failure on OpenShift with NVIDIA GPU Operator

TL;DR

When a GPU fails without producing one of the critical XID errors in the kernel log (e.g., XIDs 48, 74, 79, 94, 95), the NVIDIA device plugin does not detect the failure. The node continues advertising nvidia.com/gpu: 1, no taint or cordon is applied, and the scheduler keeps sending GPU pods to the broken node. Those pods start, fail with CUDA errors (invalid device ordinal), and exit, yet the node remains eligible for more GPU workloads. This document reproduces the issue end-to-end on an OpenShift 4.21 cluster on AWS using a g4dn.xlarge instance (NVIDIA Tesla T4).

The failure was simulated by removing the GPU from the PCI bus:

echo 1 > /sys/bus/pci/devices/0000:00:1e.0/remove
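
After the removal, it can be useful to confirm from the node that the device is really gone from sysfs. The sketch below is an assumption-laden illustration, not part of the original procedure: `pci_device_present` and `SYSFS_PCI_ROOT` are hypothetical names, and the PCI address is the one used in the command above.

```shell
# Sysfs root is overridable so the helper can be tried against a fake tree.
SYSFS_PCI_ROOT=${SYSFS_PCI_ROOT:-/sys/bus/pci}

# Succeeds if the given PCI address (e.g. 0000:00:1e.0) is still enumerated.
pci_device_present() {
    [ -e "$SYSFS_PCI_ROOT/devices/$1" ]
}

gpu_addr=0000:00:1e.0
if pci_device_present "$gpu_addr"; then
    echo "GPU $gpu_addr is on the PCI bus"
else
    # Writing 1 to the bus-level rescan file asks the kernel to re-enumerate
    # devices; whether the driver rebinds cleanly afterwards is not guaranteed.
    echo "GPU $gpu_addr is missing; as root, try: echo 1 > $SYSFS_PCI_ROOT/rescan"
fi
```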

Enabling UEFI Secure Boot for OCP Workers on AWS

AWS does not publish any Linux AMI (including RHCOS) with UEFI Secure Boot enabled. The stock RHCOS AMIs ship with legacy-bios boot mode. Because RHCOS has used a unified BIOS/UEFI partition layout since OCP 4.3, the same disk image can also boot in UEFI mode: we only need to re-register the AMI with the correct boot mode and a UEFI variable store containing Secure Boot keys.

This only applies to Nitro-based virtualized instance types (e.g. m5, m6i, g4dn). Bare metal instances do not support UEFI Secure Boot. Note that G4ad instances are

Name           Age  Blood
Mary Clan      82   A+
Jane Doe       46   AB+
Ted Coe        50   B-
Bill Murray    44   A-
Clint East     52   B+
Spencer Tracy  79   B+
Bill Ted       56   0+
Bob Smith      66   A-
Bill Cosby     81   A+
@josecastillolema
josecastillolema / restart_audio.sh
Last active March 17, 2024 18:09
Tricks desktop
systemctl --user restart wireplumber pipewire pipewire-pulse
@josecastillolema
josecastillolema / tricks.md
Last active February 24, 2026 09:56
Tricks

K8s / OpenShift

One liners

Create a debug pod:

kubectl run debug -it --rm --image=alpine --restart=Never -n <namespace> -- sh

Create a pod: