```python
import torch, grp, pwd, os, subprocess

devices = []

try:
    print("\n\nChecking ROCM support...")
    result = subprocess.run(['rocminfo'], stdout=subprocess.PIPE)
    cmd_str = result.stdout.decode('utf-8')
    cmd_split = cmd_str.split('Agent ')
    for part in cmd_split:
        item_single = part[0:1]
        item_double = part[0:2]
        if item_single.isnumeric() or item_double.isnumeric():
            new_split = cmd_str.split('Agent ' + item_double)
            device = new_split[1].split('Marketing Name:')[0].replace(' Name: ', '').replace('\n', '').replace(' ', '').split('Uuid:')[0].split('*******')[1]
            devices.append(device)

    if len(devices) > 0:
        print('GOOD: ROCM devices found: ', len(devices))
    else:
        print('BAD: No ROCM devices found.')

    print("Checking PyTorch...")
    x = torch.rand(5, 3)
    has_torch = False
    len_x = len(x)
    if len_x == 5:
        has_torch = True
        for i in x:
            if len(i) == 3:
                has_torch = True
            else:
                has_torch = False
    if has_torch:
        print('GOOD: PyTorch is working fine.')
    else:
        print('BAD: PyTorch is NOT working.')

    print("Checking user groups...")
    user = os.getlogin()
    groups = [g.gr_name for g in grp.getgrall() if user in g.gr_mem]
    gid = pwd.getpwnam(user).pw_gid
    groups.append(grp.getgrgid(gid).gr_name)
    if 'render' in groups and 'video' in groups:
        print('GOOD: The user', user, 'is in RENDER and VIDEO groups.')
    else:
        print('BAD: The user', user, 'is NOT in RENDER and VIDEO groups. This is necessary in order to PyTorch use HIP resources')

    if torch.cuda.is_available():
        print("GOOD: PyTorch ROCM support found.")
        t = torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')
        print('Testing PyTorch ROCM support...')
        if str(t) == "tensor([5, 5, 5], device='cuda:0')":
            print('Everything fine! You can run PyTorch code inside of: ')
            for device in devices:
                print('---> ', device)
    else:
        print("BAD: PyTorch ROCM support NOT found.")
except:
    print('Cannot find rocminfo command information. Unable to determine if AMDGPU drivers with ROCM support were installed.')
```
This script didn't find the rocminfo binary, even though it is installed and working for the current user:
hydrian@balor ~/tmp $ which rocminfo
/usr/bin/rocminfo
hydrian@balor ~/tmp $ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 5 2600X Six-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 5 2600X Six-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3600
BDFID: 0
Internal Node ID: 0
Compute Unit: 12
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 16310472(0xf8e0c8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 16310472(0xf8e0c8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16310472(0xf8e0c8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1032
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 6600 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 2048(0x800) KB
L3: 32768(0x8000) KB
Chip ID: 29695(0x73ff)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2900
BDFID: 3328
Internal Node ID: 1
Compute Unit: 32
SIMDs per CU: 2
Shader Engines: 4
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1032
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
What distro are you running? I just tested again on Ubuntu 22.04 and the rocminfo binary was found.
Checking ROCM support...
GOOD: ROCM devices found: 2
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
GOOD: The user ... is in RENDER and VIDEO groups.
GOOD: PyTorch ROCM support found.
Testing PyTorch ROCM support...
Everything fine! You can run PyTorch code inside of:
---> Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
---> gfx1012
Mint 21.1 (Ubuntu 22.04) rocm works with GPU support on stable-diffusion.
Hi, noob here. My machine says, 'is NOT in RENDER and VIDEO groups.'
But [g.gr_name for g in grp.getgrall()] contains render and video both. Do you suggest any ideas on how I can fix that? Should I just include the user in the groups?
Hello, sure. The user should be added to those groups.
Thanks for the answer.
I tried that and added the user to those two groups. However, the line `t = torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')`
just hangs, and I had to forcefully stop the program. Do you think my GPU is not supported?
Also, brother, I'm getting this output with the script:
(amd) mruserbox@guru-X99:/media/10TB_HHD/_AMD$ python test.py
Checking ROCM support...
BAD: No ROCM devices found.
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
BAD: The user mruserbox is NOT in RENDER and VIDEO groups. This is necessary in order to PyTorch use HIP resources
GOOD: PyTorch ROCM support found.
Testing PyTorch ROCM support...
Everything fine! You can run PyTorch code inside of:
(amd) mruserbox@guru-X99:/media/10TB_HHD/_AMD$ rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 56.0c 13.0W 500Mhz 96Mhz 0% auto 215.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
I already found the problem, brother. I had to add my user to the video and render groups, so I did the following:
sudo usermod -a -G video mruserbox
sudo usermod -a -G render mruserbox
Now I get the following output:
(amd) mruserbox@guru-X99:/media/10TB_HHD/_AMD$ python test.py
Checking ROCM support...
GOOD: ROCM devices found: 2
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
GOOD: The user mruserbox is in RENDER and VIDEO groups.
GOOD: PyTorch ROCM support found.
Cannot find rocminfo command information. Unable to determine if AMDGPU drivers with ROCM support were installed.
Now it's not recognizing rocminfo; going to check why.
(amd) mruserbox@guru-X99:/media/10TB_HHD/_AMD$ python test.py
Checking ROCM support...
GOOD: ROCM devices found: 2
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
GOOD: The user mruserbox is in RENDER and VIDEO groups.
GOOD: PyTorch ROCM support found.
A runtime error occurred: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "/media/10TB_HHD/_AMD/test.py", line 55, in <module>
t = torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Got the error solved! I had to select the right CUDA device for my system. I changed it to device='cuda:1' and now everything passes great!
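For anyone else hitting the wrong-device problem: rather than hard-coding `cuda:1`, you can pick the index by matching the device name PyTorch reports. This is a minimal sketch (the `pick_device` helper is my own, not part of the gist); it falls back to sample names when no GPU build of torch is available:

```python
def pick_device(name_fragment, device_names):
    """Return 'cuda:<i>' for the first device whose name contains
    name_fragment (case-insensitive), or None if nothing matches."""
    for i, name in enumerate(device_names):
        if name_fragment.lower() in name.lower():
            return f'cuda:{i}'
    return None

try:
    import torch
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
except Exception:
    # torch missing or no GPU visible: demonstrate with sample names
    names = ['Intel(R) Xeon(R) CPU E5-2678 v3', 'AMD Radeon RX 6800']

print(pick_device('Radeon', names))
```

On a system like the one above, the AMD card would be selected even if it is not device 0.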
(amd) mruserbox@guru-X99:/media/10TB_HHD/_AMD$ python test.py
Checking ROCM support...
GOOD: ROCM devices found: 2
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
GOOD: The user mruserbox is in RENDER and VIDEO groups.
GOOD: PyTorch ROCM support found.
Testing PyTorch ROCM support...
Everything fine! You can run PyTorch code inside of:
---> Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz
---> gfx1030
I don't know whether ROCm is supported or not: gfx1036, Ryzen 7 7700.
python testrocm.py
Checking ROCM support...
GOOD: ROCM devices found: 1
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
Cannot find rocminfo command information. Unable to determine if AMDGPU drivers with ROCM support were installed.
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 7700 8-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 7700 8-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3800
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 15865032(0xf214c8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 15865032(0xf214c8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 15865032(0xf214c8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*** Done ***
rocm-smi
========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
ERROR: GPU[0] : sclk clock is unsupported
====================================================================================
GPU[0] : get_power_cap, Not supported on the given system
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 47.0c 37.153W None 2600Mhz 0% auto Unsupported 3% 0%
====================================================================================
=============================== End of ROCm SMI Log ================================
Thanks for the script. Actually, I came here because I got a black image in ComfyUI Stable Diffusion image generation and no object detection in Ultralytics. My laptop used to be able to generate images in ComfyUI, so I suspect the problem came from upgrading the AMD driver to v5.6.1. I just downgraded the ROCm driver to v5.5.3 and now everything works.
In both cases I passed the PyTorch test, so I guess it would be great if additional PyTorch tests could be added.
The amdgpu driver and PyTorch will not add your current account to the render and video groups automatically. rocminfo needs access to /dev/kfd and /dev/dri, which are owned by the render and video groups:
crw-rw---- 1 root render 235, 0 Oct. 7 17:56 /dev/kfd
drwxr-xr-x 3 root root 120 Oct. 7 17:56 ./
drwxr-xr-x 20 root root 4260 Oct. 7 17:56 ../
drwxr-xr-x 2 root root 100 Oct. 7 17:56 by-path/
crw-rw----+ 1 root video 226, 0 Oct. 7 17:56 card0
crw-rw----+ 1 root video 226, 1 Oct. 7 17:56 card1
crw-rw----+ 1 root render 226, 128 Oct. 7 17:56 renderD128
You should add your current user to these two groups on the Linux command line, like so:
sudo usermod -aG render,video <your_current_user_name>
Use groups <user_name>, id <user_name>, or cat /etc/group
to double-check. Then reboot
your Linux machine to make sure the changes take effect.
Run this script again.
(Optional) If necessary, add sudo or root privileges:
sudo usermod -aG render,video,sudo,root <your_current_user_name>
To remove user <usr_name> from the root group:
sudo gpasswd -d <usr_name> root
or sudo deluser <usr_name> root
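The permission requirements described above can also be checked from Python without running rocminfo. A minimal sketch (the helper name is made up); it tests read/write access to /dev/kfd and the /dev/dri render nodes for the current user:

```python
import glob
import os

def check_rocm_device_access():
    """Map each ROCm device node to 'ok', 'no access', or 'missing'
    for the current user (read/write access is what HIP needs)."""
    nodes = ['/dev/kfd'] + sorted(glob.glob('/dev/dri/renderD*'))
    results = {}
    for node in nodes:
        if not os.path.exists(node):
            results[node] = 'missing'
        elif os.access(node, os.R_OK | os.W_OK):
            results[node] = 'ok'
        else:
            results[node] = 'no access'
    return results

for node, status in check_rocm_device_access().items():
    print(node, '->', status)
```

If a node reports 'no access', the usermod commands above (plus a re-login) are the fix; 'missing' for /dev/kfd usually means the amdgpu/ROCm kernel driver is not loaded.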
Depending on the distro, the user may need different group memberships. In the Debian distro family, the user needs membership in both the video and render groups.
Also remember, group membership is only applied at login. So even if you add the group to the user and it shows up in the groups output, the user may not have permission to the group's resources until that user logs in again.
If you get errors, use this fixed version:
```python
import torch, grp, pwd, os, subprocess
import getpass

devices = []

try:
    print("\n\nChecking ROCM support...")
    result = subprocess.run(['rocminfo'], stdout=subprocess.PIPE)
    cmd_str = result.stdout.decode('utf-8')
    cmd_split = cmd_str.split('Agent ')
    for part in cmd_split:
        item_single = part[0:1]
        item_double = part[0:2]
        if item_single.isnumeric() or item_double.isnumeric():
            new_split = cmd_str.split('Agent ' + item_double)
            device = new_split[1].split('Marketing Name:')[0].replace(' Name: ', '').replace('\n', '').replace(' ', '').split('Uuid:')[0].split('*******')[1]
            devices.append(device)

    if len(devices) > 0:
        print('GOOD: ROCM devices found: ', len(devices))
    else:
        print('BAD: No ROCM devices found.')

    print("Checking PyTorch...")
    x = torch.rand(5, 3)
    has_torch = False
    len_x = len(x)
    if len_x == 5:
        has_torch = True
        for i in x:
            if len(i) == 3:
                has_torch = True
            else:
                has_torch = False
    if has_torch:
        print('GOOD: PyTorch is working fine.')
    else:
        print('BAD: PyTorch is NOT working.')

    print("Checking user groups...")
    user = getpass.getuser()
    groups = [g.gr_name for g in grp.getgrall() if user in g.gr_mem]
    gid = pwd.getpwnam(user).pw_gid
    groups.append(grp.getgrgid(gid).gr_name)
    if 'render' in groups and 'video' in groups:
        print('GOOD: The user', user, 'is in RENDER and VIDEO groups.')
    else:
        print('BAD: The user', user, 'is NOT in RENDER and VIDEO groups. This is necessary in order to PyTorch use HIP resources')

    if torch.cuda.is_available():
        print("GOOD: PyTorch ROCM support found.")
        t = torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')
        print('Testing PyTorch ROCM support...')
        if str(t) == "tensor([5, 5, 5], device='cuda:0')":
            print('Everything fine! You can run PyTorch code inside of: ')
            for device in devices:
                print('---> ', device)
    else:
        print("BAD: PyTorch ROCM support NOT found.")
except:
    print('Cannot find rocminfo command information. Unable to determine if AMDGPU drivers with ROCM support were installed.')
```
and install the ROCm build of PyTorch:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6
https://pytorch.org/get-started/locally/
(Linux Mint 21)
Checking ROCM support...
GOOD: ROCM devices found: 2
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
GOOD: The user roman is in RENDER and VIDEO groups.
GOOD: PyTorch ROCM support found.
Testing PyTorch ROCM support...
Everything fine! You can run PyTorch code inside of:
---> AMD Ryzen 5 5500U with Radeon Graphics
---> gfx90c
It's also useful to check for HIP explicitly:
`if torch.cuda.is_available() and torch.version.hip:`
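To make that check self-documenting, here is a small sketch (the `backend_kind` function name is my own) that classifies which backend a given torch build reports. It relies on the fact that ROCm wheels set `torch.version.hip` while NVIDIA wheels set `torch.version.cuda`:

```python
def backend_kind(cuda_available, hip_version, cuda_version):
    """Classify the torch build: 'rocm' for HIP wheels, 'cuda' for
    NVIDIA wheels, 'none' when no GPU backend is usable."""
    if cuda_available and hip_version:
        return 'rocm'  # ROCm builds expose HIP through the torch.cuda API
    if cuda_available and cuda_version:
        return 'cuda'
    return 'none'

try:
    import torch
    print(backend_kind(torch.cuda.is_available(),
                       torch.version.hip,
                       torch.version.cuda))
except ImportError:
    print('PyTorch is not installed.')
```

A result of 'none' on an AMD box usually means a CPU-only or CUDA wheel was installed instead of the ROCm one.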
Hi, mine says:
`python3 test.py
Checking ROCM support...
GOOD: ROCM devices found: 2
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
GOOD: The user push is in RENDER and VIDEO groups.
BAD: PyTorch ROCM support NOT found.
`
I don't know how to get rid of this. I am pretty much a noob with GPU-related things. Thanks.
@iampaulidrobo Did you install the ROCm version of PyTorch? You can find the latest installation command on the PyTorch website; scroll down a bit to the installation selector, choose the ROCm compute platform, and it gives you the instruction.
In short, you can uninstall the previous PyTorch and install PyTorch v2.4 using the commands below:
# Uninstall the previous version of PyTorch
pip uninstall torch torchvision torchaudio
# Install PyTorch ROCm version
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
The script ran successfully for me, but then no floating-point operations worked. Turns out I needed to override the GPU model to something else: ROCm/ROCm#2536 (comment)
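For reference, the override mentioned above is applied through the `HSA_OVERRIDE_GFX_VERSION` environment variable, set before torch initializes HIP. This is a sketch only: the `10.3.0` value is just an example for gfx1030-class cards, so check the linked issue for the value (if any) that matches your GPU:

```python
import os

# Must be set before torch is imported, otherwise HIP has already
# initialized. '10.3.0' is only an example value; pick yours.
os.environ.setdefault('HSA_OVERRIDE_GFX_VERSION', '10.3.0')

try:
    import torch  # imported after the env var on purpose
    print('torch imported with override:',
          os.environ['HSA_OVERRIDE_GFX_VERSION'])
except ImportError:
    print('PyTorch is not installed; override set anyway.')
```

Exporting the variable in your shell before launching Python works the same way.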
I'd suggest doing something like:

```python
t = torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')
t_f32 = torch.tensor([5, 5, 5], dtype=torch.float32, device='cuda')
print('Testing PyTorch ROCM support...')
if str(t) == "tensor([5, 5, 5], device='cuda:0')" and str(t_f32) == "tensor([5., 5., 5.], device='cuda:0')":
```

so that you know floating point is also working.
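Building on that idea, an even stronger check is to run an actual float computation on the GPU and compare it against the CPU result. A sketch (the `report` helper name is mine, and the block skips gracefully when no GPU build of torch is present):

```python
def report(ok):
    """Format the outcome of the GPU-vs-CPU comparison."""
    return ('GOOD: GPU float math matches CPU.' if ok
            else 'BAD: GPU float math diverges.')

try:
    import torch
    if torch.cuda.is_available():
        a, b = torch.rand(64, 64), torch.rand(64, 64)
        gpu = (a.cuda() @ b.cuda()).cpu()  # same matmul, on the GPU
        print(report(torch.allclose(a @ b, gpu, atol=1e-4)))
    else:
        print('No GPU available; skipping float test.')
except ImportError:
    print('PyTorch is not installed; skipping float test.')
```

This would have caught the black-image ComfyUI case above, where tensor creation succeeded but float math silently produced garbage.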
Saved me a lot of time, thanks!