# CUDA Cheatsheet for DeepGEMM

## CUDA Fundamentals

| Term | Description |
| --- | --- |
| SM (Streaming Multiprocessor) | Computational unit in NVIDIA GPUs; Hopper has up to 132 SMs |
| Warp | Group of 32 threads that execute in lockstep |
| Thread Block | Group of threads that execute on the same SM and can synchronize |
| Grid | Collection of thread blocks that execute the same kernel |
| Shared Memory | Fast on-chip memory shared by threads in a block; carved out of the same physical storage as the L1 cache |
| L2 Cache | Cache shared by all SMs; larger but slower than shared memory |
| Occupancy | Ratio of active warps to the maximum possible warps on an SM |
| Kernel | GPU function executed by many threads in parallel |
| CUDA Core | Scalar processor for general-purpose operations |
| Tensor Core | Specialized hardware unit for matrix operations |
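
The minimal, self-contained kernel below (illustrative only, not taken from DeepGEMM) ties several of these terms together: a grid of thread blocks, a 256-thread block (8 warps), per-block shared memory, a block-wide barrier, and a global thread index.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each 256-thread block (8 warps) stages a tile of the input in shared memory,
// synchronizes, then every thread squares its element and writes it back.
__global__ void square_kernel(const float* in, float* out, int n) {
    __shared__ float tile[256];                        // per-block shared memory
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index in the grid
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                                   // block-wide barrier
    if (idx < n)
        out[idx] = tile[threadIdx.x] * tile[threadIdx.x];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    int block = 256;                      // threads per block
    int grid  = (n + block - 1) / block;  // blocks in the grid (ceiling division)
    square_kernel<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[3] = %f\n", out[3]);      // expect 9.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```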

## Hopper Architecture (sm_90) Features

| Feature | Description |
| --- | --- |
| FP8 Tensor Cores | Hardware acceleration for FP8 matrix operations in the E4M3 and E5M2 formats |
| TMA (Tensor Memory Accelerator) | Asynchronous data-movement engine for efficiently loading/storing tensors |
| TMA Multicast | Ability to deliver the same data to the shared memory of multiple thread blocks in a cluster |
| TMA Descriptor | Data structure describing the layout and shape of a tensor for TMA operations |
| WGMMA (Warp Group Matrix Multiply-Accumulate) | Tensor core instruction family issued cooperatively by a warp group rather than a single warp |
| Warp Group | Collection of 4 warps (128 threads) that can cooperate on tensor operations |
| Warp Specialization | Technique where different warps perform different roles (compute, load, store) |
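
As a rough sketch of warp groups and warp specialization (not DeepGEMM's actual kernel), the simplified kernel below splits a 256-thread block into two warp groups and gives them different roles. A real Hopper kernel would replace the plain loads with TMA copies, the compute with WGMMA, and `__syncthreads()` with mbarrier-based producer/consumer synchronization; the kernel and variable names here are illustrative.

```cuda
#include <cuda_runtime.h>

// Simplified warp-specialization sketch: warp group 0 stages data into shared
// memory (producer role), warp group 1 consumes it (compute role).
__global__ void __launch_bounds__(256) warp_specialized_scale(const float* in,
                                                              float* out,
                                                              float scale) {
    __shared__ float stage[128];

    const int warp_id       = threadIdx.x / 32;   // 0..7 within the block
    const int warp_group_id = warp_id / 4;        // 0 = producer group, 1 = consumer group
    const int lane_in_group = threadIdx.x % 128;  // 0..127 within the warp group
    const int tile_base     = blockIdx.x * 128;   // this block's tile of 128 elements

    if (warp_group_id == 0) {
        // Producer warp group: stage one tile into shared memory
        // (a real kernel would issue an asynchronous TMA copy here).
        stage[lane_in_group] = in[tile_base + lane_in_group];
    }
    __syncthreads();  // stand-in for the producer/consumer barrier

    if (warp_group_id == 1) {
        // Consumer warp group: read the staged tile and write the result
        // (a real kernel would issue WGMMA instructions here).
        out[tile_base + lane_in_group] = stage[lane_in_group] * scale;
    }
}
```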

## CUTLASS & CuTe Concepts

| Term | Description |
| --- | --- |
| CUTLASS | CUDA Templates for Linear Algebra Subroutines; composable building blocks for GEMM |
| CuTe | Layout and tensor abstraction library within CUTLASS 3.x, used to describe thread and data layouts |
| Collective | Pattern for coordinating a group of threads to perform a complex operation |
| Tile | Partitioning of data for processing by blocks/warps/threads |
| MMA (Matrix Multiply-Accumulate) | Core operation in dense linear algebra |
| Epilogue | Post-processing after the matrix multiplication (scaling, bias addition, activation) |
| Prologue | Pre-processing before the matrix multiplication |
| Threadblock Rasterization | Scheduling order of thread blocks chosen for better cache locality |
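
The plain-CUDA kernel below shows the tile / mainloop / epilogue decomposition these terms describe, without using CUTLASS itself: each block computes one output tile, the mainloop accumulates over K, and the epilogue applies scaling and a bias before the store. The kernel name and tile size are illustrative.

```cuda
#include <cuda_runtime.h>

// Each thread block computes one 16x16 tile of C = alpha * A*B + bias,
// with A (M x K) and B (K x N) stored row-major.
#define TILE 16

__global__ void tiled_gemm_with_epilogue(const float* A, const float* B,
                                         const float* bias, float* C,
                                         int M, int N, int K, float alpha) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Mainloop: march over the K dimension one tile at a time.
    for (int k0 = 0; k0 < K; k0 += TILE) {
        As[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)   // multiply-accumulate over the staged tiles
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    // Epilogue: scale and add bias before writing the output tile.
    if (row < M && col < N)
        C[row * N + col] = alpha * acc + bias[col];
}
```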

## Memory & Optimization Techniques

| Technique | Description |
| --- | --- |
| Double Buffering / Multi-Stage Pipeline | Using multiple shared-memory buffers to overlap computation with data loading |
| Swizzling | Memory-layout transformation that improves access patterns (e.g., avoids shared-memory bank conflicts) |
| Register Reconfiguration | Reallocating register-file usage between different warp groups |
| Persistent Kernel | Long-running kernel that processes multiple tiles without being relaunched |
| Barriers | Synchronization primitives (e.g., NamedBarrier, ClusterBarrier) |
| STMATRIX | PTX `stmatrix` instruction for storing matrix fragments from registers to shared memory |
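
Below is a simplified double-buffering sketch (not DeepGEMM code) that uses plain loads and `__syncthreads()`; a production Hopper kernel would instead issue asynchronous TMA/cp.async copies into the next stage so the load genuinely overlaps the compute. The kernel assumes each block processes `tiles_per_block` tiles of 256 elements, an illustrative shape.

```cuda
#include <cuda_runtime.h>

// Two shared-memory buffers are used in ping-pong fashion: while the current
// buffer is consumed, the next tile is loaded into the other buffer.
__global__ void double_buffered_stencil(const float* in, float* out,
                                        int tiles_per_block) {
    __shared__ float stage[2][256];
    const float* block_in  = in  + (size_t)blockIdx.x * tiles_per_block * 256;
    float*       block_out = out + (size_t)blockIdx.x * tiles_per_block * 256;

    // Prologue: load the first tile into buffer 0.
    stage[0][threadIdx.x] = block_in[threadIdx.x];
    __syncthreads();

    for (int t = 0; t < tiles_per_block; ++t) {
        int cur = t & 1, nxt = cur ^ 1;

        // Prefetch the next tile into the other buffer while this one is consumed.
        if (t + 1 < tiles_per_block)
            stage[nxt][threadIdx.x] = block_in[(t + 1) * 256 + threadIdx.x];

        // Consume the current buffer: average each element with its neighbor,
        // a read pattern that actually needs the data to be in shared memory.
        float left  = stage[cur][threadIdx.x];
        float right = stage[cur][(threadIdx.x + 1) % 256];
        block_out[t * 256 + threadIdx.x] = 0.5f * (left + right);

        __syncthreads();  // next tile fully loaded and current tile fully read
    }
}
```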

## PTX/SASS Level Concepts

| Concept | Description |
| --- | --- |
| PTX | Parallel Thread Execution; the intermediate representation CUDA code is compiled to |
| SASS | The GPU's native assembly language (the final compiled code) |
| FFMA | Single-precision fused multiply-add instruction (a*b + c in one instruction) |
| FFMA Interleaving | Optimization that post-processes the compiled binary to adjust FFMA instruction scheduling |
| Yield and reuse bits | Control bits in SASS instructions that affect warp scheduling and operand-register reuse |
| JIT (Just-In-Time) Compilation | Runtime compilation of kernels specialized to the problem at hand |
| `__forceinline__` | Qualifier that forces a function to be inlined |
| `__launch_bounds__` | Qualifier specifying the maximum threads per block (and optionally the minimum blocks per SM) to guide register allocation |
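
A small example of the two qualifiers from the table (illustrative kernel, not from DeepGEMM):

```cuda
#include <cuda_runtime.h>

// __forceinline__ asks the compiler to inline the helper into every caller;
// __launch_bounds__(128, 2) tells the compiler this kernel is launched with at
// most 128 threads per block and that at least 2 blocks should fit per SM,
// which bounds register usage during compilation.
__device__ __forceinline__ float fma_helper(float a, float b, float c) {
    return fmaf(a, b, c);   // compiles to a single FFMA instruction
}

__global__ void __launch_bounds__(128, 2) axpy_kernel(const float* x,
                                                      float* y,
                                                      float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = fma_helper(alpha, x[i], y[i]);  // y = alpha * x + y
}
```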

## FP8 and Matrix Operations

| Term | Description |
| --- | --- |
| FP8 E4M3 | 8-bit floating-point format with 1 sign bit, 4 exponent bits, and 3 mantissa bits |
| FP8 E5M2 | 8-bit floating-point format with 1 sign bit, 5 exponent bits, and 2 mantissa bits |
| Fine-grained Scaling | Per-channel or per-row scaling factors used to preserve precision with FP8 |
| Promotion | Converting FP8 results to higher precision (BF16/FP32) for accumulation |
| GEMM | General Matrix Multiplication (C = A×B) |
| MMA Instruction | Matrix multiply-accumulate instruction executed by the tensor cores |
| NT Format | Operand layout with a non-transposed first matrix and a transposed second matrix |
| Two-level Accumulation | Using both tensor cores and CUDA cores to accumulate with higher precision |
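
To make the E4M3 layout concrete, here is a host-side decoder for the bit pattern described above (1 sign, 4 exponent, 3 mantissa bits, exponent bias 7, no infinities). On the GPU this conversion happens in hardware; the function and its name are purely illustrative.

```cuda
#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode an FP8 E4M3 byte to float. The all-ones exponent with an all-ones
// mantissa encodes NaN; E4M3 has no infinity, so its largest normal is 448.
float decode_fp8_e4m3(uint8_t bits) {
    int sign     = (bits >> 7) & 0x1;
    int exponent = (bits >> 3) & 0xF;
    int mantissa = bits & 0x7;

    float value;
    if (exponent == 0xF && mantissa == 0x7) {
        value = NAN;                                              // only NaN, no infinity
    } else if (exponent == 0) {
        value = std::ldexp(mantissa / 8.0f, -6);                  // subnormal: 2^-6 * (m/8)
    } else {
        value = std::ldexp(1.0f + mantissa / 8.0f, exponent - 7); // normal: 2^(e-7) * (1 + m/8)
    }
    return sign ? -value : value;
}

int main() {
    printf("0x7E -> %g\n", decode_fp8_e4m3(0x7E));  // max normal: 448
    printf("0x38 -> %g\n", decode_fp8_e4m3(0x38));  // 1.0
    printf("0x01 -> %g\n", decode_fp8_e4m3(0x01));  // smallest subnormal: 2^-9
    return 0;
}
```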

## DeepGEMM-Specific Concepts

| Concept | Description |
| --- | --- |
| Normal GEMM | Standard dense matrix multiplication |
| Grouped Contiguous GEMM | Grouped multiplication where the groups' inputs are concatenated into a single contiguous tensor |
| Grouped Masked GEMM | Grouped multiplication where a mask indicates which parts to compute |
| BLOCK_M/N/K | Thread block tile sizes along the M, N, and K dimensions |
| M Alignment | Alignment requirement for the M dimension (typically 128) |
| TMA Alignment | Memory-alignment requirement for the Tensor Memory Accelerator (16 bytes) |
| Ceil Div | Ceiling-division utility, `(a + b - 1) / b` |
| Unaligned Block Sizes | Non-power-of-2 block sizes chosen for better SM utilization |
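
The ceiling-division utility listed above is simple enough to show in full, together with how such a helper typically maps a problem shape to a grid size. The function name and the BLOCK_M/BLOCK_N values here are illustrative, not DeepGEMM's exact identifiers or chosen defaults.

```cuda
#include <cuda_runtime.h>

// How many BLOCK_M x BLOCK_N tiles are needed to cover an M x N output.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

void launch_shape_example(int M, int N) {
    constexpr int BLOCK_M = 128;   // tile height handled by one thread block
    constexpr int BLOCK_N = 128;   // tile width handled by one thread block

    dim3 grid(ceil_div(N, BLOCK_N), ceil_div(M, BLOCK_M));
    // e.g. M = 300, N = 200 -> grid = (2, 3): 6 blocks, some only partially filled
    (void)grid;  // a real launcher would pass this to <<<grid, block>>>
}
```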

## Key Functions in the DeepGEMM Codebase

| Function | Purpose |
| --- | --- |
| fp8_gemm_kernel | Core GEMM kernel implementation (~300 lines) |
| wgmma | Wrapper around the tensor core WGMMA instructions |
| tma_copy | TMA data-loading helper |
| make_smem_desc | Creates shared-memory descriptors for tensor core operations |
| get_best_configs | Determines optimal configuration parameters for a given problem size |
| interleave_ffma.process | Post-processes the compiled binary to optimize FFMA instruction scheduling |
| set_num_sms / get_num_sms | Set/query the maximum number of SMs to use |

This cheatsheet covers the key concepts needed to understand the DeepGEMM repository's implementation of high-performance FP8 matrix multiplications on NVIDIA Hopper GPUs.
