Problem: Vector Addition
In this post we will cover how to solve the simplest CUDA problem: adding two arrays. I'll explain the code step by step.
Here is the initial template that we are given:
#include <cuda_runtime.h>
// Note: d_input1, d_input2, d_output are all device pointers to float32 arrays
extern "C" void solution(const float* d_input1, const float* d_input2, float* d_output, size_t n) {
}
The key thing to understand here is that this is a host function that runs on the CPU, but it takes pointers to device memory: arrays of floats that live on the GPU.
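If you're wondering where those device pointers come from: a test harness typically allocates them on the GPU and copies the input data over before calling solution. Here is a minimal sketch of what such a harness might look like; this is my own illustration, not part of the problem, and it assumes solution is linked in.
#include <cuda_runtime.h>
#include <vector>

extern "C" void solution(const float* d_input1, const float* d_input2, float* d_output, size_t n);

int main() {
    const size_t n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // Allocate the three arrays in GPU (device) memory.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));

    // Copy the inputs from host (CPU) memory to device memory.
    cudaMemcpy(d_a, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The function we are implementing: it runs on the CPU but receives device pointers.
    solution(d_a, d_b, d_c, n);

    // Copy the result back to the host and clean up.
    cudaMemcpy(c.data(), d_c, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}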
We will write a CUDA kernel called vectorAdd. First, I will show you how to call our CUDA kernel:
extern "C" void solution(const float* d_input1, const float* d_input2, float* d_output, size_t n) {
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_input1, d_input2, d_output, n);
}
There are a few important things to learn here:
- (A) Threads and Blocks
- (B) The CUDA kernel call syntax, which uses triple angle brackets: <<< and >>>
- (C) The arithmetic for calculating the number of blocks
Let's focus on each of these.
Adding two arrays is an embarrassingly parallel operation: every element can be computed independently of the others. You might then try using n threads and 1 block for a length-n array. But that launch will fail: it's an invalid configuration. What is a block? For now, just know that a block is a group of threads.
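You can see the failure for yourself by checking the launch error. Here is a sketch of what that check looks like; it assumes the vectorAdd kernel we define later in this post, plus #include <cstdio> for printf.
// Deliberately ask for n threads in a single block. On a T4 this fails
// whenever n > 1024, because it exceeds the per-block thread limit.
vectorAdd<<<1, n>>>(d_input1, d_input2, d_output, n);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    // Prints something like "invalid configuration argument"
    printf("Launch failed: %s\n", cudaGetErrorString(err));
}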
We are using a Tesla T4 GPU, and every block can have a maximum of 1024 threads. So we need to divide the problem up into blocks of at most 1024 threads each.
Here's a quick script to print out the exact limits of your GPU:
#include <cuda_runtime.h>
#include <iostream>
// Note: d_input1, d_input2, d_output are all device pointers to float32 arrays
extern "C" void solution(const float* d_input1, const float* d_input2, float* d_output, size_t n) {
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
std::cout << "Max threads per block: " << prop.maxThreadsPerBlock << std::endl;
std::cout << "Max threads per SM: " << prop.maxThreadsPerMultiProcessor << std::endl;
std::cout << "Warp size: " << prop.warpSize << std::endl;
}
// Output when run on a Tesla T4
//
// Max threads per block: 1024
// Max threads per SM: 1024
// Warp size: 32
So when we invoke our CUDA kernel like this: vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_input1, d_input2, d_output, n); it launches blocksPerGrid blocks, each containing threadsPerBlock threads. If our array length n is a perfect multiple of 1024 (threadsPerBlock), that works out exactly. If not, our very last block will have extra threads that don't need to do anything, and that's normal. For example, with n = 3000 and 1024 threads per block we launch 3 blocks, i.e. 3072 threads, and the last 72 threads simply do nothing.
Here's a question to consider: why do we do int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock; to calculate the number of blocks we need?
Remember that integer division truncates! Imagine threadsPerBlock was 10, and think about what would happen if n were 9, 10, or 11.
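Here is that arithmetic spelled out as a tiny host-side sketch of my own (plain C++, nothing CUDA-specific), comparing plain integer division with the rounded-up version:
#include <cstdio>

int main() {
    const int threadsPerBlock = 10;
    int ns[] = {9, 10, 11};
    for (int n : ns) {
        int truncated = n / threadsPerBlock;                          // 0, 1, 1 -- too few threads for n = 9 and n = 11
        int roundedUp = (n + threadsPerBlock - 1) / threadsPerBlock;  // 1, 1, 2 -- always enough threads
        printf("n=%2d  truncated=%d  rounded up=%d\n", n, truncated, roundedUp);
    }
    return 0;
}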
Hopefully that will demystify this calculation for you!
Now here's how to implement the actual CUDA kernel: the code that gets compiled for and executed on the GPU. It needs the __global__ function qualifier for this.
#include <cuda_runtime.h>
__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
const int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}
extern "C" void solution(const float* d_input1, const float* d_input2, float* d_output, size_t n) {
int threadsPerBlock = 1024;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_input1, d_input2, d_output, n);
}
Now there are two things happening in the vectorAdd function which are very common CUDA programming patterns:
- The array index calculation using block and thread indices (there's a worked example right after this list).
- The bounds check: as mentioned, the array length is not always a perfect multiple of the threads per block, so you always need to guard against out-of-range indices.
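To make the index calculation concrete, here is a small host-side sketch (my own illustration) that evaluates i = blockIdx.x * blockDim.x + threadIdx.x for a few hand-picked thread coordinates, using the blockDim.x of 1024 from solution above:
#include <cstdio>

int main() {
    const int blockDimX = 1024;  // matches threadsPerBlock in solution()
    // Pairs of {blockIdx.x, threadIdx.x} to evaluate.
    int examples[][2] = {{0, 0}, {0, 1023}, {1, 0}, {2, 5}};
    for (auto& e : examples) {
        int i = e[0] * blockDimX + e[1];  // same formula as in vectorAdd
        printf("blockIdx.x=%d threadIdx.x=%4d -> i=%d\n", e[0], e[1], i);
    }
    return 0;
}
Each block covers its own contiguous slice of 1024 elements, and any thread whose index lands at or beyond N is exactly the kind that the bounds check filters out.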
You can now compile this and take a look under the hood: view the PTX (Parallel Thread Execution) assembly, and the even lower-level, device-specific SASS (Streaming Assembler) code, to see what this code actually gets compiled into.
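One way to do that with the standard CUDA toolchain looks roughly like this (a sketch: the file name is a placeholder, and sm_75 is the Tesla T4's compute architecture):
nvcc -arch=sm_75 -ptx solution.cu -o solution.ptx      # emit PTX
nvcc -arch=sm_75 -cubin solution.cu -o solution.cubin  # compile to a device binary
cuobjdump -sass solution.cubin                         # disassemble to SASS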