
Tutorial: Vector Addition

Problem: Vector Addition

In this post we will cover how to solve the simplest CUDA problem: adding two arrays. I'll explain the code step by step.

Step 1 - Understanding the template

Here is the initial template that we are given:

#include <cuda_runtime.h>

// Note: d_input1, d_input2, d_output are all device pointers to float32 arrays
extern "C" void solution(const float* d_input1, const float* d_input2, float* d_output, size_t n) {
}

The key thing to understand here is that this is a host function that runs on the CPU, but it takes pointers to device memory: arrays of floats that live on the GPU.
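For context, here's a minimal sketch of what a caller would do to produce those device pointers. This is illustration only (the grading harness handles it for us), and the host-side names h_a, h_b, h_out are made up:

// Illustrative only: how device pointers like d_input1 come to exist.
extern "C" void solution(const float* d_input1, const float* d_input2, float* d_output, size_t n);

void run_example(const float* h_a, const float* h_b, float* h_out, size_t n) {
    float *d_in1, *d_in2, *d_out;
    cudaMalloc(&d_in1, n * sizeof(float));  // allocate GPU memory
    cudaMalloc(&d_in2, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in1, h_a, n * sizeof(float), cudaMemcpyHostToDevice);  // copy inputs to the GPU
    cudaMemcpy(d_in2, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    solution(d_in1, d_in2, d_out, n);  // our function, operating on device pointers

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy the result back
    cudaFree(d_in1); cudaFree(d_in2); cudaFree(d_out);
}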

Step 2 - Calling out to a CUDA kernel

We will write a CUDA kernel called vectorAdd. First, I will show you how to call our CUDA kernel:

extern "C" void solution(const float* d_input1, const float* d_input2, float* d_output, size_t n) {
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_input1, d_input2, d_output, n);
}

There are a few important things to learn here:

  • (A) Threads and Blocks
  • (B) The CUDA kernel launch syntax, with its triple angle brackets <<< >>>
  • (C) The arithmetic for calculating the number of blocks

Let's focus on each of these.

(A) Understanding Threads and Blocks

Adding two vectors is an embarrassingly parallel operation: every element can be computed independently of the others. You might be tempted to launch 1 block of n threads for a length-n array. But for any sizeable n the launch will fail: it's an invalid configuration. What is a block? For starting out, just know that a block is a group of threads.

We are using a Tesla T4 GPU, and every block can have a maximum of 1024 threads. So we need to divide the problem up into blocks, each with at most 1024 threads.
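As a side note, a kernel launch that fails like this doesn't raise an exception; you have to ask the runtime for the error. A minimal sketch of that check, written inside solution() and deliberately requesting 2048 threads per block to trigger the failure (it assumes the vectorAdd kernel we define in Step 3):

// Deliberately over-sized launch, to demonstrate error checking.
vectorAdd<<<1, 2048>>>(d_input1, d_input2, d_output, n);  // 2048 > the 1024 per-block limit

cudaError_t err = cudaGetLastError();  // retrieves launch errors such as this one
if (err != cudaSuccess) {
    std::cout << "Launch failed: " << cudaGetErrorString(err) << std::endl;
    // on a T4 this typically prints: Launch failed: invalid configuration argument
}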

Here's a quick script to print out the exact limits of your GPU:

#include <cuda_runtime.h>
#include <iostream>

// Note: d_input1, d_input2, d_output are all device pointers to float32 arrays
extern "C" void solution(const float* d_input1, const float* d_input2, float* d_output, size_t n) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    std::cout << "Max threads per block: " << prop.maxThreadsPerBlock << std::endl;
    std::cout << "Max threads per SM: "     << prop.maxThreadsPerMultiProcessor << std::endl;
    std::cout << "Warp size: "              << prop.warpSize << std::endl;
}

// Output when run on a Tesla T4
//
// Max threads per block: 1024
// Max threads per SM: 1024
// Warp size: 32

(B) Understanding CUDA Kernel Syntax

So when we invoke our CUDA kernel as vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_input1, d_input2, d_output, n);, it launches blocksPerGrid blocks, each containing threadsPerBlock threads. If our array length n is an exact multiple of threadsPerBlock, that works out perfectly. If not, our very last block will have extra threads that don't need to do anything: that's normal.
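Concretely, suppose n = 1,000,000 and threadsPerBlock = 256:

// blocksPerGrid = ceil(1000000 / 256) = 3907
// total threads launched = 3907 * 256 = 1,000,192
// so the last block has 192 threads with no array element to process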

(C) Arithmetic

Here's a question to consider: why do we do int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock; to calculate the number of blocks we need?

Remember that integer division truncates! Imagine threadsPerBlock were 10, and think about what would happen if n were 9, 10, or 11.
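Working through those cases:

// (n + threadsPerBlock - 1) / threadsPerBlock, with threadsPerBlock = 10:
// n = 9:  (9 + 9) / 10  = 18 / 10 = 1 block   (plain 9/10  = 0: one block too few!)
// n = 10: (10 + 9) / 10 = 19 / 10 = 1 block   (plain 10/10 = 1: already correct)
// n = 11: (11 + 9) / 10 = 20 / 10 = 2 blocks  (plain 11/10 = 1: one block too few!)

This is the standard "ceiling division" trick: adding threadsPerBlock - 1 before dividing rounds up instead of down.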

Hopefully that will demystify this calculation for you!

Step 3 - Indexing into an array using Blocks and Threads

Now here's how to implement the actual CUDA kernel: the code that is compiled for and executed on the GPU. The function needs the __global__ qualifier, which marks it as a kernel that the host can launch.

#include <cuda_runtime.h>

__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
    // Compute this thread's global index across all blocks.
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads in the last block may land past the end of the array.
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

extern "C" void solution(const float* d_input1, const float* d_input2, float* d_output, size_t n) {
    int threadsPerBlock = 1024;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_input1, d_input2, d_output, n);
}

Now there are two things happening in the vectorAdd function which are very common CUDA programming patterns:

  • The array index calculation using block and thread indices.
  • The bounds check: as mentioned, the array length does not always divide evenly into blocks, so the last block can contain threads whose index is past the end of the array. You always need a bounds check (a common generalization of this pattern is sketched after this list).
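As an aside, a widely used generalization of this index-plus-bounds-check pattern is the grid-stride loop, which lets a fixed-size grid handle arrays of any length. A minimal sketch, not needed for this problem:

// Sketch: grid-stride loop variant of vectorAdd.
// Each thread handles elements i, i + stride, i + 2*stride, ...
__global__ void vectorAddStride(const float* A, const float* B, float* C, int N) {
    const int stride = blockDim.x * gridDim.x;  // total threads in the whole grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += stride) {
        C[i] = A[i] + B[i];
    }
}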

You can now compile this and view the PTX (Parallel Thread Execution) assembly, as well as the even lower-level, device-specific SASS (Streaming Assembler) code, to see under the hood what this code gets compiled into.
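For example, with the CUDA toolkit installed, something along these lines should work (the file name is illustrative; sm_75 is the T4's compute architecture):

nvcc -ptx vector_add.cu -o vector_add.ptx            # emit PTX, the virtual ISA
nvcc -arch=sm_75 -cubin vector_add.cu -o vector_add.cubin
cuobjdump -sass vector_add.cubin                     # disassemble to SASS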
