# ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 8 CUDA Capable device(s)
Device 0: "NVIDIA A100-SXM4-80GB"
CUDA Driver Version / Runtime Version 11.4 / 11.4
CUDA Capability Major/Minor version number: 8.0
./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 8 CUDA Capable device(s)
Device 0: "NVIDIA A100 80GB PCIe"
CUDA Driver Version / Runtime Version 11.4 / 11.4
CUDA Capability Major/Minor version number: 8.0
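A rough cross-check that doesn't require building the samples (not part of the original output; assumes a driver recent enough that nvidia-smi supports these query fields):

# list the GPUs and their compute capability straight from the driver tools
$ nvidia-smi -L
$ nvidia-smi --query-gpu=index,name,compute_cap,memory.total --format=csv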
./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA A100 80GB PCIe, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A100 80GB PCIe, pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA A100 80GB PCIe, pciBusID: 43, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA A100 80GB PCIe, pciBusID: 45, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA A100 80GB PCIe, pciBusID: 84, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA A100 80GB PCIe, pciBusID: 87, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA A100 80GB PCIe, pciBusID: c4, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA A100 80GB PCIe, pciBusID: c6, pciDeviceID: 0, pciDomainID:0
# ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA A100-SXM4-80GB, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A100-SXM4-80GB, pciBusID: f, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA A100-SXM4-80GB, pciBusID: 47, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA A100-SXM4-80GB, pciBusID: 4e, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA A100-SXM4-80GB, pciBusID: 87, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA A100-SXM4-80GB, pciBusID: 90, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA A100-SXM4-80GB, pciBusID: b7, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA A100-SXM4-80GB, pciBusID: bd, pciDeviceID: 0, pciDomainID:0
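To see why the SXM4 and PCIe boxes report very different P2P numbers, it helps to dump the interconnect topology first (not part of the original output; assumes the standard driver tools are installed):

# show how each GPU pair is connected (NVLink NV# vs. PCIe PHB/NODE/SYS)
$ nvidia-smi topo -m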
# git checkout v11.4
Note: switching to 'v11.4'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>
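This checkout is presumably the NVIDIA/cuda-samples repo being pinned to the tag matching the CUDA 11.4 toolkit above; a minimal sketch of building the two utilities used earlier (the repo guess and the sample paths are assumptions, not from the original notes):

$ git clone https://github.com/NVIDIA/cuda-samples.git
$ cd cuda-samples
$ git checkout v11.4
# the sample directory layout differs between tags; adjust the -C paths as needed
$ make -C Samples/deviceQuery
$ make -C Samples/p2pBandwidthLatencyTest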
https://www.microsoft.com/en-us/edge/download?form=MA13FJ
$ sudo dnf install microsoft-edge-stable-111.0.1661.41-1.x86_64.rpm
[sudo] password for gengwg:
Windscribe 5.3 kB/s | 2.9 kB 00:00
Dependencies resolved.
===============================================================================================================================================================================================
Package Architecture Version Repository Size
===============================================================================================================================================================================================
Installing:
gengwg / dnf-reposync.md
Created January 21, 2023 01:06
Using DNF to Download/Sync with Local Repo

Command:

# download the repo to the current directory
$ dnf reposync --repoid=windscribe --download-metadata -p .
Windscribe                                                                                                                                                     4.6 kB/s | 2.9 kB     00:00    
Windscribe                                                                                                                                                     8.2 kB/s |  11 kB     00:01    
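Because --download-metadata also pulls repodata/, the synced directory can be used directly as a local repo. A sketch of pointing dnf at it (the repo id, file name, and path below are assumptions, not from the gist):

# define a repo that points at the local mirror (adjust the baseurl path)
$ sudo tee /etc/yum.repos.d/windscribe-local.repo <<'EOF'
[windscribe-local]
name=Windscribe (local mirror)
baseurl=file:///path/to/windscribe
enabled=1
gpgcheck=0
EOF
$ sudo dnf makecache --disablerepo='*' --enablerepo=windscribe-local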
gengwg / debian_unable_mount_rootfs.md
Created November 25, 2022 00:15
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

Problem

Similar to

Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[    0.667378] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.9.47-1-MANJARO #1
[    0.667435] Hardware name: Acer Aspire E5-575G/Ironman_SK  , BIOS V1.04 04/26/2016
[    0.667493]  ffffc90000c8bde0 ffffffff813151d2 ffff880276a77000 ffffffff8190b950
[    0.667717]  ffffc90000c8be68 ffffffff8117ecd4 ffffffff00000010 ffffc90000c8be78
gengwg / nvml_cgroupv2_fix.md
Last active July 10, 2024 07:10
Fix for jobs that originally see the GPUs fine, but NVML suddenly goes away after a few hours

NOTE: This seems to have fixed our cluster. BUT I do see some people still reporting the same issue on cgroup v2, for example here. So YMMV.

DISCLAIMER: This seems to work in our env; it may not work in others. I'm still not sure what the real root cause(s) are yet. I'm not even 100% sure it fully fixes things in our env - it's been good for 2 weeks. But if it reappears (for example, under certain use cases, high load or something), I'll be doomed.

TLDR

Switching to cgroup v2 seems to have fixed the issue of NVML suddenly going away inside pods.
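A minimal sketch of checking which cgroup version a node is running and moving it to v2 on a grub/systemd based distro (these commands are assumptions about the environment, not steps taken from the gist):

# cgroup2fs means the unified (v2) hierarchy; tmpfs means v1
$ stat -fc %T /sys/fs/cgroup/
# switch the kernel to cgroup v2 on the next boot, then reboot
$ sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
$ sudo reboot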

Problem

# v1.22.9
## build image
gengwg@gengwg-mbp:~$ cd go/src/k8s.io/kubernetes/
gengwg@gengwg-mbp:~/go/src/k8s.io/kubernetes$ git checkout v1.22.9
Updating files: 100% (6336/6336), done.
Previous HEAD position was ad3338546da Release commit for Kubernetes v1.23.6
HEAD is now at 6df4433e288 Release commit for Kubernetes v1.22.9
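After the checkout, the usual next step is a build; a short sketch using the standard upstream Makefile targets (these are not commands taken from the notes):

# build a single component on the host
$ make WHAT=cmd/kubelet
# or build the binaries plus container images, skipping tests
$ make quick-release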