# ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 8 CUDA Capable device(s)
Device 0: "NVIDIA A100-SXM4-80GB"
CUDA Driver Version / Runtime Version 11.4 / 11.4
CUDA Capability Major/Minor version number: 8.0
./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 8 CUDA Capable device(s)
Device 0: "NVIDIA A100 80GB PCIe"
CUDA Driver Version / Runtime Version 11.4 / 11.4
CUDA Capability Major/Minor version number: 8.0
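A rough cross-check that doesn't require building the samples (not part of the original output; assumes a driver recent enough that nvidia-smi supports these query fields):

# list the GPUs and their compute capability straight from the driver tools
$ nvidia-smi -L
$ nvidia-smi --query-gpu=index,name,compute_cap,memory.total --format=csv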
./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA A100 80GB PCIe, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A100 80GB PCIe, pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA A100 80GB PCIe, pciBusID: 43, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA A100 80GB PCIe, pciBusID: 45, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA A100 80GB PCIe, pciBusID: 84, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA A100 80GB PCIe, pciBusID: 87, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA A100 80GB PCIe, pciBusID: c4, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA A100 80GB PCIe, pciBusID: c6, pciDeviceID: 0, pciDomainID:0
# ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA A100-SXM4-80GB, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A100-SXM4-80GB, pciBusID: f, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA A100-SXM4-80GB, pciBusID: 47, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA A100-SXM4-80GB, pciBusID: 4e, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA A100-SXM4-80GB, pciBusID: 87, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA A100-SXM4-80GB, pciBusID: 90, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA A100-SXM4-80GB, pciBusID: b7, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA A100-SXM4-80GB, pciBusID: bd, pciDeviceID: 0, pciDomainID:0
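To see why the SXM4 and PCIe boxes report very different P2P numbers, it helps to dump the interconnect topology first (not part of the original output; assumes the standard driver tools are installed):

# show how each GPU pair is connected (NVLink NV# vs. PCIe PHB/NODE/SYS)
$ nvidia-smi topo -m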
# git checkout v11.4
Note: switching to 'v11.4'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>
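This checkout is presumably the NVIDIA/cuda-samples repo being pinned to the tag matching the CUDA 11.4 toolkit above; a minimal sketch of building the two utilities used earlier (the repo guess and the sample paths are assumptions, not from the original notes):

$ git clone https://github.com/NVIDIA/cuda-samples.git
$ cd cuda-samples
$ git checkout v11.4
# the sample directory layout differs between tags; adjust the -C paths as needed
$ make -C Samples/deviceQuery
$ make -C Samples/p2pBandwidthLatencyTest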
https://www.microsoft.com/en-us/edge/download?form=MA13FJ
$ sudo dnf install microsoft-edge-stable-111.0.1661.41-1.x86_64.rpm
[sudo] password for gengwg:
Windscribe 5.3 kB/s | 2.9 kB 00:00
Dependencies resolved.
===============================================================================================================================================================================================
Package Architecture Version Repository Size
===============================================================================================================================================================================================
Installing:
gengwg / dnf-reposync.md
Created January 21, 2023 01:06
Using DNF to Download/Sync with Local Repo

Command:

# download the repo to the current directory
$ dnf reposync --repoid=windscribe --download-metadata -p .
Windscribe                                                                                                                                                     4.6 kB/s | 2.9 kB     00:00    
Windscribe                                                                                                                                                     8.2 kB/s |  11 kB     00:01    
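Because --download-metadata also pulls repodata/, the synced directory can be used directly as a local repo. A sketch of pointing dnf at it (the repo id, file name, and path below are assumptions, not from the gist):

# define a repo that points at the local mirror (adjust the baseurl path)
$ sudo tee /etc/yum.repos.d/windscribe-local.repo <<'EOF'
[windscribe-local]
name=Windscribe (local mirror)
baseurl=file:///path/to/windscribe
enabled=1
gpgcheck=0
EOF
$ sudo dnf makecache --disablerepo='*' --enablerepo=windscribe-local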
gengwg / debian_unable_mount_rootfs.md
Created November 25, 2022 00:15
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

Problem

Similar to

Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[    0.667378] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.9.47-1-MANJARO #1
[    0.667435] Hardware name: Acer Aspire E5-575G/Ironman_SK  , BIOS V1.04 04/26/2016
[    0.667493]  ffffc90000c8bde0 ffffffff813151d2 ffff880276a77000 ffffffff8190b950
[    0.667717]  ffffc90000c8be68 ffffffff8117ecd4 ffffffff00000010 ffffc90000c8be78
gengwg / nvml_cgroupv2_fix.md
Last active July 10, 2024 07:10
Fix for jobs that originally see the GPUs fine, but NVML suddenly goes away after a few hours

NOTE: This seems to have fixed our cluster. BUT I do see some people still reporting the same issue on cgroup v2, for example here. So YMMV.

DISCLAIMER: This seems to work in our env; it may not work in others. I'm still not sure what the real root cause(s) are yet. I'm not even 100% sure it fully fixes things in our env - it's been good for 2 weeks. But if it reappears (for example, under certain use cases, high load or something), I'll be doomed.

TLDR

Switching to cgroup v2 seems to have fixed the issue of NVML suddenly going away inside pods.
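A minimal sketch of checking which cgroup version a node is running and moving it to v2 on a grub/systemd based distro (these commands are assumptions about the environment, not steps taken from the gist):

# cgroup2fs means the unified (v2) hierarchy; tmpfs means v1
$ stat -fc %T /sys/fs/cgroup/
# switch the kernel to cgroup v2 on the next boot, then reboot
$ sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
$ sudo reboot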

Problem

# v1.22.9
## build image
gengwg@gengwg-mbp:~$ cd go/src/k8s.io/kubernetes/
gengwg@gengwg-mbp:~/go/src/k8s.io/kubernetes$ git checkout v1.22.9
Updating files: 100% (6336/6336), done.
Previous HEAD position was ad3338546da Release commit for Kubernetes v1.23.6
HEAD is now at 6df4433e288 Release commit for Kubernetes v1.22.9
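After the checkout, the usual next step is a build; a short sketch using the standard upstream Makefile targets (these are not commands taken from the notes):

# build a single component on the host
$ make WHAT=cmd/kubelet
# or build the binaries plus container images, skipping tests
$ make quick-release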