see comments |
https://github.com/rsnemmen/OpenCL-examples
hermann@RX-Vega56:~/OpenCL-examples/add_numbers$ ls
add_numbers.c add_numbers.cl Makefile README.md
hermann@RX-Vega56:~/OpenCL-examples/add_numbers$ git diff
diff --git a/Hello_World/hello.c b/Hello_World/hello.c
index 49350af..7fa57c3 100644
--- a/Hello_World/hello.c
+++ b/Hello_World/hello.c
@@ -58,7 +58,7 @@
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
-#include <OpenCL/opencl.h>
+#include <CL/opencl.h>
////////////////////////////////////////////////////////////////////////////////
hermann@RX-Vega56:~/OpenCL-examples/add_numbers$ make
gcc -std=c99 -Wall -DUNIX -g -DDEBUG -m64 -o add_numbers add_numbers.c -lOpenCL
In file included from /usr/include/CL/cl.h:20,
from add_numbers.c:14:
/usr/include/CL/cl_version.h:21:9: note: ‘#pragma message: cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 220 (OpenCL 2.2)’
21 | #pragma message("cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 220 (OpenCL 2.2)")
| ^~~~~~~
add_numbers.c: In function ‘main’:
add_numbers.c:197:4: warning: ‘clCreateCommandQueue’ is deprecated [-Wdeprecated-declarations]
197 | queue = clCreateCommandQueue(context, device, 0, &err);
| ^~~~~
/usr/include/CL/cl.h:1801:1: note: declared here
1801 | clCreateCommandQueue(cl_context context,
| ^~~~~~~~~~~~~~~~~~~~
hermann@RX-Vega56:~/OpenCL-examples/add_numbers$ ./add_numbers
Computed sum = 2016.0.
Check passed.
hermann@RX-Vega56:~/OpenCL-examples/add_numbers$
While the latest AMD ROCm version is currently 6.4.1, it no longer supports AMD Instinct MI50 GPUs.
For example, rocHPL (High Performance Linpack, AMD HPC Accelerators) does not compile with ROCm 6.4.1 for the Instinct MI50:
https://www.mersenneforum.org/node/1070049?p=1080817#post1080817
But ROCm 6.3.3 (released 2/2025) works fine, and the Instinct MI50 (released 11/2018) reaches 4.7196 TFLOPS FP64 with rocHPL (see previous posting). The theoretical FP64 (double) performance is 6.705 TFLOPS (see Instinct MI50 link).
"Introduction to the HIP programming model" from docs:
https://rocm.docs.amd.com/projects/HIP/en/docs-6.3.3/understand/programming_model.html
"ROCm Revisited: Getting Started with HIP" on rocm blogs:
https://rocm.blogs.amd.com/ecosystems-and-partners/rocm-revisited-hip/README.html
rocm examples github repo discussed in that blog:
https://github.com/ROCm/rocm-examples
The discussed hello world example hangs on a Vega56 (with rocm 6.3.1), though:
hermann@RX-Vega56:~/rocm-examples/HIP-Basic/hello_world$ make
/opt/rocm/bin/hipcc -std=c++17 -Wall -Wextra -I ../../Common -o hip_hello_world main.hip
hermann@RX-Vega56:~/rocm-examples/HIP-Basic/hello_world$ ./hip_hello_world
Hello world from host!
Hello world from device or host!
It completes on a Vega64 (with rocm 6.3.3), but the output of the GPU threads is missing:
hermann@Radeons:~/rocm-examples/HIP-Basic/hello_world$ ./hip_hello_world
Hello world from host!
Hello world from device or host!
hermann@Radeons:~/rocm-examples/HIP-Basic/hello_world$
On the Radeon pro vii (with rocm 6.3.3) the output differs from the one discussed in the blog article:
hermann@Radeon-pro-vii:~/rocm-examples/HIP-Basic/hello_world$ ./hip_hello_world
Hello world from host!
Hello world from device or host!
Hello world from device or host!
Hello world from device or host!
Hello world from device or host!
Hello world from device or host!
Hello world from device kernel block 1 thread 0!
Hello world from device kernel block 1 thread 1!
Hello world from device kernel block 0 thread 0!
Hello world from device kernel block 0 thread 1!
hermann@Radeon-pro-vii:~/rocm-examples/HIP-Basic/hello_world$
Instinct MI50 (with rocm 6.3.3) behaves like RX Vega64:
hermann@7950x:~/rocm-examples/HIP-Basic/hello_world$ rocminfo | grep Uuid.*GPU
Uuid: GPU-c49e19417337ece3
Uuid: GPU-f890794172e62691
Uuid: GPU-13c24061732c730c
hermann@7950x:~/rocm-examples/HIP-Basic/hello_world$ export ROCR_VISIBLE_DEVICES="0,GPU-c49e19417337ece3"
hermann@7950x:~/rocm-examples/HIP-Basic/hello_world$ rocminfo | grep Uuid.*GPU
Uuid: GPU-c49e19417337ece3
hermann@7950x:~/rocm-examples/HIP-Basic/hello_world$ ./hip_hello_world
Hello world from host!
Hello world from device or host!
hermann@7950x:~/rocm-examples/HIP-Basic/hello_world$
Wait, the blog gpuHelloWorld example is different from the repo example above:
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$ cat gpuHelloWorld.cpp
#include <hip/hip_runtime.h>
__global__ void gpuHelloWorld(int N)
{
int gid = blockIdx.x * blockDim.x + threadIdx.x;
if (gid < N)
{
printf("Hello world from block %d and thread %d \n", blockIdx.x, gid);
}
}
int main()
{
int gridSize = 4; //Four blocks
int blockSize = 4; //Threads per block
int N = gridSize * blockSize; //Size of problem
gpuHelloWorld<<<gridSize, blockSize>>>(N); //Launch kernel
return 0;
}
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$
It does not work on the Instinct MI50, but on the Radeon pro vii it works as discussed in the blog:
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$ hipcc gpuHelloWorld.cpp -o gpuHelloWorld
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$ ./gpuHelloWorld
Hello world from block 0 and thread 0
Hello world from block 0 and thread 1
Hello world from block 0 and thread 2
Hello world from block 0 and thread 3
Hello world from block 1 and thread 4
Hello world from block 1 and thread 5
Hello world from block 1 and thread 6
Hello world from block 1 and thread 7
Hello world from block 2 and thread 8
Hello world from block 2 and thread 9
Hello world from block 2 and thread 10
Hello world from block 2 and thread 11
Hello world from block 3 and thread 12
Hello world from block 3 and thread 13
Hello world from block 3 and thread 14
Hello world from block 3 and thread 15
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$ ./gpuHelloWorld
Hello world from block 2 and thread 8
Hello world from block 2 and thread 9
Hello world from block 2 and thread 10
Hello world from block 2 and thread 11
Hello world from block 3 and thread 12
Hello world from block 3 and thread 13
Hello world from block 3 and thread 14
Hello world from block 3 and thread 15
Hello world from block 0 and thread 0
Hello world from block 0 and thread 1
Hello world from block 0 and thread 2
Hello world from block 0 and thread 3
Hello world from block 1 and thread 4
Hello world from block 1 and thread 5
Hello world from block 1 and thread 6
Hello world from block 1 and thread 7
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$
Nice, my Radeon pro vii GPU has 3,840 cores and, according to the demo in the first comment (as well as the clinfo output shown), 60 CUs. AMD GPUs (at least those I have shown) have #CUs*64==#cores. So I set gridSize to the number of CUs (60) and blockSize to 64, and the GPU responds with 3,840 messages, from blocks 0..59 and with thread numbers 0..3839 ...
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$ diff gpuHelloWorld.cpp.orig gpuHelloWorld.cpp
14,15c14,15
< int gridSize = 4; //Four blocks
< int blockSize = 4; //Threads per block
---
> int gridSize = 60; //Four blocks
> int blockSize = 64; //Threads per block
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$ hipcc gpuHelloWorld.cpp -o gpuHelloWorld
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$ ./gpuHelloWorld > out
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$ sort -u out | wc --lines
3840
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$ sort -k8 -n out | head -3
Hello world from block 0 and thread 0
Hello world from block 0 and thread 1
Hello world from block 0 and thread 2
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$ sort -k8 -n out | tail -3
Hello world from block 59 and thread 3837
Hello world from block 59 and thread 3838
Hello world from block 59 and thread 3839
hermann@Radeon-pro-vii:~/rocm-examples/blog/gpuHelloWorld$
I looked into the hello_world example source, and the output shown for the Radeon pro vii is correct (the kernel is launched with 2 blocks of 2 threads each):
hermann@Radeon-vii:~/rocm-examples/HIP-Basic/hello_world$ tail -18 main.hip
}
int main()
{
print_hello_host();
print_hello();
// Launch the kernel.
helloworld_kernel<<<dim3(2), // 3D grid specifying number of blocks to launch: (2, 1, 1)
dim3(2), // 3D grid specifying number of threads to launch: (2, 1, 1)
0, // number of bytes of additional shared memory to allocate
hipStreamDefault // stream where the kernel should execute: default stream
>>>();
// Wait on all active streams on the current device.
HIP_CHECK(hipDeviceSynchronize());
}
hermann@Radeon-vii:~/rocm-examples/HIP-Basic/hello_world$
The Radeon pro vii is the only GPU in its PC, so I tried the Radeon vii, which is the only GPU in its PC as well, and it works:
hermann@Radeon-vii:~/rocm-examples/HIP-Basic/hello_world$ ./hip_hello_world
Hello world from host!
Hello world from device or host!
Hello world from device or host!
Hello world from device or host!
Hello world from device or host!
Hello world from device or host!
Hello world from device kernel block 0 thread 0!
Hello world from device kernel block 0 thread 1!
Hello world from device kernel block 1 thread 0!
Hello world from device kernel block 1 thread 1!
hermann@Radeon-vii:~/rocm-examples/HIP-Basic/hello_world$
Not sure yet what causes the problems on the other GPUs (the Vega56 is the only GPU in its PC, the "Radeons" PC has ̶t̶w̶o̶ (now) a single Vega64, and the 7950x PC has three MI50s).
Running that demo on all 3,840 cores of the Radeon vii works as well. In this demo the threads are numbered 0..63 within their block, and not 0..3839 as in the output of the blog demo:
hermann@Radeon-vii:~/rocm-examples/HIP-Basic/hello_world$ git diff
diff --git a/HIP-Basic/hello_world/main.hip b/HIP-Basic/hello_world/main.hip
index 584f4190..44e74211 100644
--- a/HIP-Basic/hello_world/main.hip
+++ b/HIP-Basic/hello_world/main.hip
@@ -67,8 +67,8 @@ int main()
print_hello();
// Launch the kernel.
- helloworld_kernel<<<dim3(2), // 3D grid specifying number of blocks to launch: (2, 1, 1)
- dim3(2), // 3D grid specifying number of threads to launch: (2, 1, 1)
+ helloworld_kernel<<<dim3(60), // 3D grid specifying number of blocks to launch: (2, 1, 1)
+ dim3(64), // 3D grid specifying number of threads to launch: (2, 1, 1)
0, // number of bytes of additional shared memory to allocate
hipStreamDefault // stream where the kernel should execute: default stream
>>>();
hermann@Radeon-vii:~/rocm-examples/HIP-Basic/hello_world$ make
/opt/rocm/bin/hipcc -std=c++17 -Wall -Wextra -I ../../Common -o hip_hello_world main.hip
hermann@Radeon-vii:~/rocm-examples/HIP-Basic/hello_world$ ./hip_hello_world > out
hermann@Radeon-vii:~/rocm-examples/HIP-Basic/hello_world$ grep "device kernel" out | wc --lines
3840
hermann@Radeon-vii:~/rocm-examples/HIP-Basic/hello_world$ grep "device kernel" out | sort -u | wc --lines
3840
hermann@Radeon-vii:~/rocm-examples/HIP-Basic/hello_world$ grep "device kernel" out | sort -k7 -k9 -n| head -3
Hello world from device kernel block 0 thread 0!
Hello world from device kernel block 0 thread 1!
Hello world from device kernel block 0 thread 2!
hermann@Radeon-vii:~/rocm-examples/HIP-Basic/hello_world$ grep "device kernel" out | sort -k7 -k9 -n| tail -3
Hello world from device kernel block 59 thread 61!
Hello world from device kernel block 59 thread 62!
Hello world from device kernel block 59 thread 63!
hermann@Radeon-vii:~/rocm-examples/HIP-Basic/hello_world$
I found the cause of the problems mentioned above, at least for the Instinct MI50 GPUs.
I remembered that on the AMD 7950X CPU PC I had tried to disable the internal graphics in the BIOS, but had not succeeded before moving the PC to the basement. The internal "GPU-XX" seems to cause the problems:
hermann@7950x:~$ !rocm
rocminfo | grep Uuid.*GPU
Uuid: GPU-c49e19417337ece3
Uuid: GPU-f890794172e62691
Uuid: GPU-13c24061732c730c
Uuid: GPU-XX
hermann@7950x:~$
My AMD 7600X CPU PC has four Instinct MI50s, and its internal graphics is disabled in the BIOS (no GPU-XX):
hermann@7600x:~$ rocminfo | grep Uuid
Uuid: CPU-XX
Uuid: GPU-d64a58a17330f0ed
Uuid: GPU-6e56508172dc76b6
Uuid: GPU-6a0e7961732c730d
Uuid: GPU-304c70e172dc768c
hermann@7600x:~$
And here the blog hello world works (not the very first time, but on the two runs after that):
hermann@7600x:~/rocm-examples/blog/gpuHelloWorld$ hipcc gpuHelloWorld.cpp -o gpuHelloWorld
hermann@7600x:~/rocm-examples/blog/gpuHelloWorld$ ./gpuHelloWorld
hermann@7600x:~/rocm-examples/blog/gpuHelloWorld$ ./gpuHelloWorld
Hello world from block 1 and thread 4
Hello world from block 1 and thread 5
Hello world from block 1 and thread 6
Hello world from block 1 and thread 7
Hello world from block 2 and thread 8
Hello world from block 2 and thread 9
Hello world from block 2 and thread 10
Hello world from block 2 and thread 11
Hello world from block 3 and thread 12
Hello world from block 3 and thread 13
Hello world from block 3 and thread 14
Hello world from block 3 and thread 15
Hello world from block 0 and thread 0
Hello world from block 0 and thread 1
Hello world from block 0 and thread 2
Hello world from block 0 and thread 3
hermann@7600x:~/rocm-examples/blog/gpuHelloWorld$ ./gpuHelloWorld
Hello world from block 0 and thread 0
Hello world from block 0 and thread 1
Hello world from block 0 and thread 2
Hello world from block 0 and thread 3
Hello world from block 3 and thread 12
Hello world from block 3 and thread 13
Hello world from block 3 and thread 14
Hello world from block 3 and thread 15
Hello world from block 1 and thread 4
Hello world from block 1 and thread 5
Hello world from block 1 and thread 6
Hello world from block 1 and thread 7
Hello world from block 2 and thread 8
Hello world from block 2 and thread 9
Hello world from block 2 and thread 10
Hello world from block 2 and thread 11
hermann@7600x:~/rocm-examples/blog/gpuHelloWorld$
So the Instinct MI50 GPUs, which are the most important ones to me, now work (I will have to correctly disable the internal graphics on the "7950x" PC).
That is no problem because both PCs (with an 1850W PSU each) are operated headless in the basement.
Next I tried the rocm-examples repo hello world, and it works as well.
It also showed why the first call of gpuHelloWorld above produced no output:
"no ROCm-capable device is detected"
hermann@7600x:~/rocm-examples$ cd HIP-Basic/hello_world
hermann@7600x:~/rocm-examples/HIP-Basic/hello_world$ make
/opt/rocm/bin/hipcc -std=c++17 -Wall -Wextra -I ../../Common -o hip_hello_world main.hip
hermann@7600x:~/rocm-examples/HIP-Basic/hello_world$ ./hip_hello_world
Hello world from host!
Hello world from device or host!
An error encountered: "no ROCm-capable device is detected" at main.hip:77
hermann@7600x:~/rocm-examples/HIP-Basic/hello_world$ ./hip_hello_world
Hello world from host!
Hello world from device or host!
Hello world from device or host!
Hello world from device or host!
Hello world from device or host!
Hello world from device or host!
Hello world from device kernel block 0 thread 0!
Hello world from device kernel block 0 thread 1!
Hello world from device kernel block 1 thread 0!
Hello world from device kernel block 1 thread 1!
hermann@7600x:~/rocm-examples/HIP-Basic/hello_world$
The rocm-examples repo hello world has this at the end of its code:
$ tail -3 ../../HIP-Basic/hello_world/main.hip
// Wait on all active streams on the current device.
HIP_CHECK(hipDeviceSynchronize());
}
$
I added this simple equivalent to the blog hello world:
hermann@7600x:~/rocm-examples/blog/gpuHelloWorld$ tail -4 gpuHelloWorld.cpp
assert(hipDeviceSynchronize() == hipSuccess);
return 0;
}
hermann@7600x:~/rocm-examples/blog/gpuHelloWorld$
https://github.com/michel-meneses/great-opencl-examples