# CUDA Cheatsheet for DeepGEMM

## CUDA Fundamentals

| Term | Description |
| --- | --- |
| SM (Streaming Multiprocessor) | Computational unit in NVIDIA GPUs; Hopper has up to 132 SMs |
| Warp | Group of 32 threads that execute in lockstep |
| Thread Block | Group of threads that execute on the same SM and can synchronize |
| Grid | Collection of thread blocks that execute the same kernel |
| Shared Memory | Fast on-chip memory shared by threads in a block; carved out of the same physical storage as the L1 cache |
| L2 Cache | Cache shared by all SMs; larger but slower than shared memory |
| Occupancy | Ratio of active warps to the maximum possible warps on an SM |
| Kernel | GPU function executed by many threads in parallel |
| CUDA Core | Scalar processor for general-purpose operations |
| Tensor Core | Specialized hardware unit for matrix operations |
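
The minimal, self-contained kernel below (illustrative only, not taken from DeepGEMM) ties several of these terms together: a grid of thread blocks, a 256-thread block (8 warps), per-block shared memory, a block-wide barrier, and a global thread index.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each 256-thread block (8 warps) stages a tile of the input in shared memory,
// synchronizes, then every thread squares its element and writes it back.
__global__ void square_kernel(const float* in, float* out, int n) {
    __shared__ float tile[256];                        // per-block shared memory
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index in the grid
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                                   // block-wide barrier
    if (idx < n)
        out[idx] = tile[threadIdx.x] * tile[threadIdx.x];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    int block = 256;                      // threads per block
    int grid  = (n + block - 1) / block;  // blocks in the grid (ceiling division)
    square_kernel<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[3] = %f\n", out[3]);      // expect 9.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```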

## Hopper Architecture (sm_90) Features

| Feature | Description |
| --- | --- |
| FP8 Tensor Cores | Hardware acceleration for FP8 matrix operations in the E4M3 and E5M2 formats |
| TMA (Tensor Memory Accelerator) | Asynchronous data-movement engine for efficiently loading/storing tensors |
| TMA Multicast | Ability to deliver the same data to the shared memory of multiple thread blocks in a cluster |
| TMA Descriptor | Data structure describing the layout and shape of a tensor for TMA operations |
| WGMMA (Warp Group Matrix Multiply-Accumulate) | Tensor core instruction family issued cooperatively by a warp group rather than a single warp |
| Warp Group | Collection of 4 warps (128 threads) that can cooperate on tensor operations |
| Warp Specialization | Technique where different warps perform different roles (compute, load, store) |
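
As a rough sketch of warp groups and warp specialization (not DeepGEMM's actual kernel), the simplified kernel below splits a 256-thread block into two warp groups and gives them different roles. A real Hopper kernel would replace the plain loads with TMA copies, the compute with WGMMA, and `__syncthreads()` with mbarrier-based producer/consumer synchronization; the kernel and variable names here are illustrative.

```cuda
#include <cuda_runtime.h>

// Simplified warp-specialization sketch: warp group 0 stages data into shared
// memory (producer role), warp group 1 consumes it (compute role).
__global__ void __launch_bounds__(256) warp_specialized_scale(const float* in,
                                                              float* out,
                                                              float scale) {
    __shared__ float stage[128];

    const int warp_id       = threadIdx.x / 32;   // 0..7 within the block
    const int warp_group_id = warp_id / 4;        // 0 = producer group, 1 = consumer group
    const int lane_in_group = threadIdx.x % 128;  // 0..127 within the warp group
    const int tile_base     = blockIdx.x * 128;   // this block's tile of 128 elements

    if (warp_group_id == 0) {
        // Producer warp group: stage one tile into shared memory
        // (a real kernel would issue an asynchronous TMA copy here).
        stage[lane_in_group] = in[tile_base + lane_in_group];
    }
    __syncthreads();  // stand-in for the producer/consumer barrier

    if (warp_group_id == 1) {
        // Consumer warp group: read the staged tile and write the result
        // (a real kernel would issue WGMMA instructions here).
        out[tile_base + lane_in_group] = stage[lane_in_group] * scale;
    }
}
```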

## CUTLASS & CuTe Concepts

| Term | Description |
| --- | --- |
| CUTLASS | CUDA Templates for Linear Algebra Subroutines; composable building blocks for GEMM |
| CuTe | Layout and tensor abstraction library within CUTLASS 3.x, used to describe thread and data layouts |
| Collective | Pattern for coordinating a group of threads to perform a complex operation |
| Tile | Partitioning of data for processing by blocks/warps/threads |
| MMA (Matrix Multiply-Accumulate) | Core operation in dense linear algebra |
| Epilogue | Post-processing after the matrix multiplication (scaling, bias addition, activation) |
| Prologue | Pre-processing before the matrix multiplication |
| Threadblock Rasterization | Scheduling order of thread blocks chosen for better cache locality |
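
The plain-CUDA kernel below shows the tile / mainloop / epilogue decomposition these terms describe, without using CUTLASS itself: each block computes one output tile, the mainloop accumulates over K, and the epilogue applies scaling and a bias before the store. The kernel name and tile size are illustrative.

```cuda
#include <cuda_runtime.h>

// Each thread block computes one 16x16 tile of C = alpha * A*B + bias,
// with A (M x K) and B (K x N) stored row-major.
#define TILE 16

__global__ void tiled_gemm_with_epilogue(const float* A, const float* B,
                                         const float* bias, float* C,
                                         int M, int N, int K, float alpha) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Mainloop: march over the K dimension one tile at a time.
    for (int k0 = 0; k0 < K; k0 += TILE) {
        As[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)   // multiply-accumulate over the staged tiles
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    // Epilogue: scale and add bias before writing the output tile.
    if (row < M && col < N)
        C[row * N + col] = alpha * acc + bias[col];
}
```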

## Memory & Optimization Techniques

| Technique | Description |
| --- | --- |
| Double Buffering / Multi-Stage Pipeline | Using multiple shared-memory buffers to overlap computation with data loading |
| Swizzling | Memory-layout transformation that improves access patterns (e.g., avoids shared-memory bank conflicts) |
| Register Reconfiguration | Reallocating register-file usage between different warp groups |
| Persistent Kernel | Long-running kernel that processes multiple tiles without being relaunched |
| Barriers | Synchronization primitives (e.g., NamedBarrier, ClusterBarrier) |
| STMATRIX | PTX `stmatrix` instruction for storing matrix fragments from registers to shared memory |
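
Below is a simplified double-buffering sketch (not DeepGEMM code) that uses plain loads and `__syncthreads()`; a production Hopper kernel would instead issue asynchronous TMA/cp.async copies into the next stage so the load genuinely overlaps the compute. The kernel assumes each block processes `tiles_per_block` tiles of 256 elements, an illustrative shape.

```cuda
#include <cuda_runtime.h>

// Two shared-memory buffers are used in ping-pong fashion: while the current
// buffer is consumed, the next tile is loaded into the other buffer.
__global__ void double_buffered_stencil(const float* in, float* out,
                                        int tiles_per_block) {
    __shared__ float stage[2][256];
    const float* block_in  = in  + (size_t)blockIdx.x * tiles_per_block * 256;
    float*       block_out = out + (size_t)blockIdx.x * tiles_per_block * 256;

    // Prologue: load the first tile into buffer 0.
    stage[0][threadIdx.x] = block_in[threadIdx.x];
    __syncthreads();

    for (int t = 0; t < tiles_per_block; ++t) {
        int cur = t & 1, nxt = cur ^ 1;

        // Prefetch the next tile into the other buffer while this one is consumed.
        if (t + 1 < tiles_per_block)
            stage[nxt][threadIdx.x] = block_in[(t + 1) * 256 + threadIdx.x];

        // Consume the current buffer: average each element with its neighbor,
        // a read pattern that actually needs the data to be in shared memory.
        float left  = stage[cur][threadIdx.x];
        float right = stage[cur][(threadIdx.x + 1) % 256];
        block_out[t * 256 + threadIdx.x] = 0.5f * (left + right);

        __syncthreads();  // next tile fully loaded and current tile fully read
    }
}
```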

## PTX/SASS Level Concepts

| Concept | Description |
| --- | --- |
| PTX | Parallel Thread Execution; the intermediate representation CUDA code is compiled to |
| SASS | The GPU's native assembly language (the final compiled code) |
| FFMA | Single-precision fused multiply-add instruction (a*b + c in one instruction) |
| FFMA Interleaving | Optimization that post-processes the compiled binary to adjust FFMA instruction scheduling |
| Yield and reuse bits | Control bits in SASS instructions that affect warp scheduling and operand-register reuse |
| JIT (Just-In-Time) Compilation | Runtime compilation of kernels specialized to the problem at hand |
| `__forceinline__` | Qualifier that forces a function to be inlined |
| `__launch_bounds__` | Qualifier specifying the maximum threads per block (and optionally the minimum blocks per SM) to guide register allocation |
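
A small example of the two qualifiers from the table (illustrative kernel, not from DeepGEMM):

```cuda
#include <cuda_runtime.h>

// __forceinline__ asks the compiler to inline the helper into every caller;
// __launch_bounds__(128, 2) tells the compiler this kernel is launched with at
// most 128 threads per block and that at least 2 blocks should fit per SM,
// which bounds register usage during compilation.
__device__ __forceinline__ float fma_helper(float a, float b, float c) {
    return fmaf(a, b, c);   // compiles to a single FFMA instruction
}

__global__ void __launch_bounds__(128, 2) axpy_kernel(const float* x,
                                                      float* y,
                                                      float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = fma_helper(alpha, x[i], y[i]);  // y = alpha * x + y
}
```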

## FP8 and Matrix Operations

| Term | Description |
| --- | --- |
| FP8 E4M3 | 8-bit floating-point format with 1 sign bit, 4 exponent bits, and 3 mantissa bits |
| FP8 E5M2 | 8-bit floating-point format with 1 sign bit, 5 exponent bits, and 2 mantissa bits |
| Fine-grained Scaling | Per-channel or per-row scaling factors used to preserve precision with FP8 |
| Promotion | Converting FP8 results to higher precision (BF16/FP32) for accumulation |
| GEMM | General Matrix Multiplication (C = A×B) |
| MMA Instruction | Matrix multiply-accumulate instruction executed by the tensor cores |
| NT Format | Operand layout with a non-transposed first matrix and a transposed second matrix |
| Two-level Accumulation | Using both tensor cores and CUDA cores to accumulate with higher precision |
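
To make the E4M3 layout concrete, here is a host-side decoder for the bit pattern described above (1 sign, 4 exponent, 3 mantissa bits, exponent bias 7, no infinities). On the GPU this conversion happens in hardware; the function and its name are purely illustrative.

```cuda
#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode an FP8 E4M3 byte to float. The all-ones exponent with an all-ones
// mantissa encodes NaN; E4M3 has no infinity, so its largest normal is 448.
float decode_fp8_e4m3(uint8_t bits) {
    int sign     = (bits >> 7) & 0x1;
    int exponent = (bits >> 3) & 0xF;
    int mantissa = bits & 0x7;

    float value;
    if (exponent == 0xF && mantissa == 0x7) {
        value = NAN;                                              // only NaN, no infinity
    } else if (exponent == 0) {
        value = std::ldexp(mantissa / 8.0f, -6);                  // subnormal: 2^-6 * (m/8)
    } else {
        value = std::ldexp(1.0f + mantissa / 8.0f, exponent - 7); // normal: 2^(e-7) * (1 + m/8)
    }
    return sign ? -value : value;
}

int main() {
    printf("0x7E -> %g\n", decode_fp8_e4m3(0x7E));  // max normal: 448
    printf("0x38 -> %g\n", decode_fp8_e4m3(0x38));  // 1.0
    printf("0x01 -> %g\n", decode_fp8_e4m3(0x01));  // smallest subnormal: 2^-9
    return 0;
}
```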

## DeepGEMM-Specific Concepts

| Concept | Description |
| --- | --- |
| Normal GEMM | Standard dense matrix multiplication |
| Grouped Contiguous GEMM | Grouped multiplication where the groups' inputs are concatenated into a single contiguous tensor |
| Grouped Masked GEMM | Grouped multiplication where a mask indicates which parts to compute |
| BLOCK_M/N/K | Thread block tile sizes along the M, N, and K dimensions |
| M Alignment | Alignment requirement for the M dimension (typically 128) |
| TMA Alignment | Memory-alignment requirement for the Tensor Memory Accelerator (16 bytes) |
| Ceil Div | Ceiling-division utility, `(a + b - 1) / b` |
| Unaligned Block Sizes | Non-power-of-2 block sizes chosen for better SM utilization |
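
The ceiling-division utility listed above is simple enough to show in full, together with how such a helper typically maps a problem shape to a grid size. The function name and the BLOCK_M/BLOCK_N values here are illustrative, not DeepGEMM's exact identifiers or chosen defaults.

```cuda
#include <cuda_runtime.h>

// How many BLOCK_M x BLOCK_N tiles are needed to cover an M x N output.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

void launch_shape_example(int M, int N) {
    constexpr int BLOCK_M = 128;   // tile height handled by one thread block
    constexpr int BLOCK_N = 128;   // tile width handled by one thread block

    dim3 grid(ceil_div(N, BLOCK_N), ceil_div(M, BLOCK_M));
    // e.g. M = 300, N = 200 -> grid = (2, 3): 6 blocks, some only partially filled
    (void)grid;  // a real launcher would pass this to <<<grid, block>>>
}
```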

## Key Functions in the DeepGEMM Codebase

| Function | Purpose |
| --- | --- |
| fp8_gemm_kernel | Core GEMM kernel implementation (~300 lines) |
| wgmma | Wrapper around the tensor core WGMMA instructions |
| tma_copy | TMA data-loading helper |
| make_smem_desc | Creates shared-memory descriptors for tensor core operations |
| get_best_configs | Determines optimal configuration parameters for a given problem size |
| interleave_ffma.process | Post-processes the compiled binary to optimize FFMA instruction scheduling |
| set_num_sms / get_num_sms | Set/query the maximum number of SMs to use |

This cheatsheet covers the key concepts needed to understand the DeepGEMM repository's implementation of high-performance FP8 matrix multiplications on NVIDIA Hopper GPUs.
