Writing GPU kernels used to feel like wizardry—arcane knowledge understood by few. But it’s really about unlocking fine-grained control over memory and threads, typically via CUDA (NVIDIA GPUs), Triton (OpenAI’s language for custom kernels), or ROCm (AMD GPUs).
Here’s a quick dive into CUDA, using examples to illustrate key optimization techniques:
- Vectorization
Process several contiguous elements per thread using wide loads and stores (e.g., float4), cutting instruction count and improving memory throughput.
__global__ void vectorize_add(float4 *a, float4 *b, float4 *c, int n4) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n4) {
        float4 va = a[idx];  // one 128-bit load fetches four contiguous floats
        float4 vb = b[idx];
        c[idx] = make_float4(va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w);
    }
}
// Launch with the grid sized for n / 4 float4 elements (assumes n is a multiple of 4)
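A minimal host-side launch sketch under the float4 view; the device pointers d_a, d_b, d_c and the 256-thread block size are illustrative assumptions, not part of the kernel above.
int n4 = n / 4;  // assumes n is a multiple of 4 and the buffers are 16-byte aligned
int threadsPerBlock = 256;
int numBlocks = (n4 + threadsPerBlock - 1) / threadsPerBlock;
vectorize_add<<<numBlocks, threadsPerBlock>>>(
    reinterpret_cast<float4 *>(d_a),
    reinterpret_cast<float4 *>(d_b),
    reinterpret_cast<float4 *>(d_c),
    n4);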
- Parallelization
Divide the input across a grid of thread blocks so that thousands of threads each process their own elements in parallel.
__global__ void parallel_add(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
// Example: Launch with <<<numBlocks, threadsPerBlock>>>
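A sketch of that launch, assuming d_a, d_b, and d_c are device buffers of length n (the 256-thread block size is just a common default):
int threadsPerBlock = 256;
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up so every element gets a thread
parallel_add<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);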
- Loop Tiling
Stage tiles of data in fast shared memory so each value loaded from global memory is reused many times instead of being re-read for every computation.
#define TILE_SIZE 16  // tile edge length; assumes N is a multiple of TILE_SIZE

__global__ void tiled_matrix_mul(float *A, float *B, float *C, int N) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float val = 0.0f;
    for (int i = 0; i < N / TILE_SIZE; ++i) {
        // Each thread loads one element of A and one of B into shared memory
        tileA[threadIdx.y][threadIdx.x] = A[row * N + (i * TILE_SIZE + threadIdx.x)];
        tileB[threadIdx.y][threadIdx.x] = B[(i * TILE_SIZE + threadIdx.y) * N + col];
        __syncthreads();  // wait until the whole tile is loaded
        for (int j = 0; j < TILE_SIZE; ++j) {
            val += tileA[threadIdx.y][j] * tileB[j][threadIdx.x];
        }
        __syncthreads();  // wait before the tile is overwritten in the next iteration
    }
    C[row * N + col] = val;
}
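A possible launch configuration for square N x N matrices, assuming N is a multiple of TILE_SIZE (d_A, d_B, d_C are assumed device buffers):
dim3 block(TILE_SIZE, TILE_SIZE);         // one thread per element of the output tile
dim3 grid(N / TILE_SIZE, N / TILE_SIZE);  // one block per output tile
tiled_matrix_mul<<<grid, block>>>(d_A, d_B, d_C, N);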
- Operator Fusion
Merge multiple operations into a single kernel so intermediate results stay in registers instead of making extra round trips through global memory.
__global__ void fused_add_multiply(float *x, float *y, float *result, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float temp = x[idx] + y[idx];  // intermediate stays in a register
        result[idx] = temp * 2.0f;     // only the final result touches global memory
    }
}
// Instead of separate kernels for addition and multiplication, combine them!
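For contrast, here is a sketch of what the unfused version might look like (the kernel names and the temp buffer are illustrative): the intermediate result makes a full round trip through global memory between the two launches.
__global__ void add_only(float *x, float *y, float *temp, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) temp[idx] = x[idx] + y[idx];  // intermediate written out to global memory
}

__global__ void multiply_only(float *temp, float *result, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) result[idx] = temp[idx] * 2.0f;  // read back again by a second kernel
}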
While techniques like vectorization, parallelization, and tiling can be broadly applied, operator fusion often requires a deep understanding of the specific model or workload. That’s why it’s both powerful and challenging—it’s where engineers earn their stripes.
Custom kernels are the key to squeezing every last drop of performance out of GPUs. With tools like CUDA and Triton, the “dark art” of kernel writing is now more accessible than ever.