Term | Description |
---|---|
SM (Streaming Multiprocessor) | Computational unit in NVIDIA GPUs; Hopper has up to 132 SMs |
Warp | Group of 32 threads that execute in lockstep |
Thread Block | Group of threads that execute on the same SM and can synchronize |
Grid | Collection of thread blocks that execute the same kernel |
Shared Memory | Fast on-chip memory shared by threads in a block; shares physical storage with the L1 cache |
L2 Cache | Shared by all SMs, larger but slower than shared memory |
Occupancy | Ratio of active warps to maximum possible warps on an SM |
Kernel | GPU function executed by many threads in parallel |
CUDA Core | Scalar processor for general-purpose operations |
Tensor Core | Specialized hardware for matrix operations |
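Occupancy, as defined above, is just a ratio of resident warps to the hardware maximum. A minimal host-side sketch (not the CUDA occupancy API; the 64-warps-per-SM limit is the Hopper resident-thread maximum of 2048 threads):

```cpp
#include <cassert>

// Illustrative occupancy calculation, assuming Hopper's limit of
// 2048 resident threads (64 warps) per SM. Real occupancy is also
// constrained by registers and shared memory per block.
constexpr int kWarpSize = 32;
constexpr int kMaxWarpsPerSM = 64;

// Occupancy when `blocksPerSM` blocks of `threadsPerBlock` threads
// are resident on one SM.
double occupancy(int threadsPerBlock, int blocksPerSM) {
    int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    int activeWarps = warpsPerBlock * blocksPerSM;
    return static_cast<double>(activeWarps) / kMaxWarpsPerSM;
}
```

For example, eight resident 128-thread blocks give 32 active warps, i.e. 50% occupancy.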
Feature | Description |
---|---|
FP8 Tensor Cores | Hardware acceleration for FP8 matrix operations with E4M3 and E5M2 formats |
TMA (Tensor Memory Accelerator) | Asynchronous data movement engine for efficiently loading/storing tensors |
TMA Multicast | Ability to send the same data to multiple locations in shared memory |
TMA Descriptor | Data structure describing layout and shape of tensors for TMA operations |
WGMMA (Warp Group Matrix Multiply-Accumulate) | Hopper tensor core instruction executed cooperatively by a warp group of 4 warps |
Warp Group | Collection of 4 warps (128 threads) that can cooperate on tensor operations |
Warp Specialization | Technique where different warps perform different tasks (compute, load, store) |
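Warp specialization splits a kernel into producer warps (issuing TMA loads) and consumer warps (running WGMMA), handing tiles off through shared-memory barriers. A CPU-side sketch of the same producer/consumer pattern, with std::thread standing in for warps and a mutex/condition-variable pair standing in for the barriers (all names here are illustrative):

```cpp
#include <condition_variable>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// One "producer" thread plays the role of TMA-issuing warps filling a
// staging buffer; one "consumer" thread plays the compute warp group.
struct Stage {
    std::vector<int> buf;
    bool ready = false;
    std::mutex m;
    std::condition_variable cv;
};

long specialized_sum(int num_tiles, int tile_size) {
    Stage s;
    long total = 0;
    std::thread producer([&] {
        for (int t = 0; t < num_tiles; ++t) {
            std::unique_lock<std::mutex> lk(s.m);
            s.cv.wait(lk, [&] { return !s.ready; });  // wait for empty buffer
            s.buf.assign(tile_size, t + 1);           // "TMA load" of one tile
            s.ready = true;
            s.cv.notify_all();
        }
    });
    std::thread consumer([&] {
        for (int t = 0; t < num_tiles; ++t) {
            std::unique_lock<std::mutex> lk(s.m);
            s.cv.wait(lk, [&] { return s.ready; });   // wait for full buffer
            total += std::accumulate(s.buf.begin(), s.buf.end(), 0L);
            s.ready = false;                          // "compute" consumed it
            s.cv.notify_all();
        }
    });
    producer.join();
    consumer.join();
    return total;
}
```

The single staging buffer here corresponds to one pipeline stage; real kernels use several stages so loads and compute overlap.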
Term | Description |
---|---|
CUTLASS | CUDA Templates for Linear Algebra Subroutines - composable building blocks for GEMM |
CuTe | CUDA Tensor Extensions - layouts and abstractions for tensor operations |
Collective | Pattern for coordinating threads to perform a complex operation |
Tile | Partitioning of data for processing by blocks/warps/threads |
MMA (Matrix Multiply-Accumulate) | Core operation in dense linear algebra |
Epilogue | Post-processing after matrix multiplication (scaling, bias addition, activation) |
Prologue | Pre-processing before matrix multiplication |
Threadblock Rasterization | Scheduling pattern of thread blocks for better locality |
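The prologue/mainloop/epilogue structure can be seen even in a naive GEMM. A sketch with a fused epilogue of scaling, bias addition, and ReLU (the function name and the specific epilogue are illustrative choices, not CUTLASS's API):

```cpp
#include <algorithm>
#include <vector>

// Naive GEMM with a fused epilogue: D = relu(alpha * (A x B) + bias).
// A is M x K, B is K x N, bias has N entries. Real CUTLASS kernels tile
// this loop nest across thread blocks, warps, and tensor core MMAs.
std::vector<float> gemm_with_epilogue(
    const std::vector<float>& A, const std::vector<float>& B,
    const std::vector<float>& bias, int M, int N, int K, float alpha) {
    std::vector<float> D(M * N);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.f;                       // mainloop: accumulation
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            D[i * N + j] = std::max(0.f, alpha * acc + bias[j]);  // epilogue
        }
    return D;
}
```

Fusing the epilogue this way avoids a second pass over the output, which is the point of the epilogue abstraction.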
Technique | Description |
---|---|
Double Buffering/Multi-Stage Pipeline | Using multiple buffers to overlap computation and data loading |
Swizzling | Memory layout transformation to improve memory access patterns |
Register Reconfiguration | Reallocating register file usage for different thread groups |
Persistent Kernel | Long-running kernel that processes multiple tiles without kernel relaunch |
Barriers | Synchronization primitives (NamedBarrier, ClusterBarrier) |
STMATRIX | Matrix storage instruction for shared memory |
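Swizzling typically XORs part of the row index into the column index so that threads reading a column of a tile hit distinct shared-memory banks. A minimal sketch (the 8-column tile width is an illustrative choice, not DeepGEMM's actual swizzle pattern):

```cpp
// XOR swizzle sketch: permute the column index of a row-major tile by
// XORing it with the row index. Without the XOR, every element of a
// column maps to the same bank; with it, a column access spreads
// across all 8 "banks" (columns) of this toy tile.
constexpr int kCols = 8;

int swizzled_index(int row, int col) {
    return row * kCols + (col ^ (row % kCols));
}
```

For a fixed `col`, the swizzled column `col ^ row` takes all 8 distinct values as `row` runs over a tile, eliminating the conflict.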
Concept | Description |
---|---|
PTX | Parallel Thread Execution - intermediate representation for CUDA |
SASS | GPU assembly language (final compiled code) |
FFMA | Single-precision fused multiply-add (a*b+c with one rounding, in one instruction) |
FFMA Interleaving | Optimization that rewrites yield/reuse control bits in SASS to interleave FFMA instructions for better scheduling |
yield and reuse bits | Control bits in SASS that affect warp scheduling and register reuse |
JIT (Just-In-Time) Compilation | Runtime compilation for specialized code |
__forceinline__ | Directive to always inline a function |
__launch_bounds__ | Directive specifying max threads per block and min blocks per SM |
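The "one instruction, one rounding" property of FFMA is observable from host code via `std::fmaf`: a plain multiply rounds away low-order bits of the product, and a fused multiply-add can recover exactly that rounding error. A small demonstration (the function name is illustrative):

```cpp
#include <cmath>

// fl(x*x) drops the low bits of the exact product; the fused
// fmaf(x, x, -fl(x*x)) computes the residual with a single rounding.
float fma_rounding_error() {
    float x = 1.0f + std::ldexp(1.0f, -12);  // 1 + 2^-12, exactly representable
    float p = x * x;                         // one rounding: drops the 2^-24 term
    return std::fmaf(x, x, -p);              // recovers it exactly: 2^-24
}
```

This trick (computing a product and its rounding error) is the basis of error-free transformations, and the single-rounding behavior is what FFMA provides in hardware.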
Term | Description |
---|---|
FP8 E4M3 | 8-bit floating point format with 4 exponent bits and 3 mantissa bits |
FP8 E5M2 | 8-bit floating point format with 5 exponent bits and 2 mantissa bits |
Fine-grained Scaling | Per-channel or per-row scaling factors for FP8 precision |
Promotion | Converting FP8 to higher precision (BF16/FP32) for accumulation |
GEMM | General Matrix Multiplication (C = A×B) |
MMA Instruction | Matrix-multiply-accumulate instruction executed by tensor cores |
NT Format | Non-transposed first matrix, transposed second matrix |
Two-level Accumulation | Using both tensor cores and CUDA cores for precise accumulation |
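The E4M3 layout above (1 sign, 4 exponent, 3 mantissa bits, bias 7) is simple enough to decode by hand. A sketch decoder for illustration (NaN handling of the OCP E4M3 variant, which has no infinities, is omitted for brevity):

```cpp
#include <cmath>

// Decode an FP8 E4M3 byte to float.
// Normal:    (-1)^s * 2^(exp-7) * (1 + man/8)
// Subnormal: (-1)^s * 2^-6      * (man/8)
float decode_e4m3(unsigned char v) {
    int sign = (v >> 7) & 1;
    int exp  = (v >> 3) & 0xF;
    int man  = v & 0x7;
    float mag = (exp == 0)
        ? std::ldexp(man / 8.0f, -6)          // subnormal
        : std::ldexp(1.0f + man / 8.0f, exp - 7);  // normal
    return sign ? -mag : mag;
}
```

With only 3 mantissa bits the largest E4M3 value is 448, which is why fine-grained scaling factors and promotion to higher-precision accumulators are needed in practice.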
Concept | Description |
---|---|
Normal GEMM | Standard matrix multiplication |
Grouped Contiguous GEMM | Multiplication where inputs are concatenated into a single tensor |
Grouped Masked GEMM | Multiplication where a mask indicates which parts to compute |
BLOCK_M/N/K | Thread block tile sizes in M, N, K dimensions |
M Alignment | Alignment requirement for dimension M (typically 128) |
TMA Alignment | Memory alignment for Tensor Memory Accelerator (16 bytes) |
Ceil Div | Ceiling division utility ((a + b - 1) / b) |
Unaligned Block Sizes | Non-power-of-2 block sizes for better SM utilization |
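Ceiling division is what turns problem sizes into grid sizes: the number of tiles covering the output determines how many blocks (or persistent-kernel work items) are launched. A sketch (block sizes here are illustrative, not DeepGEMM's tuned configs):

```cpp
// Ceiling division: smallest q with q*b >= a.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Number of BLOCK_M x BLOCK_N tiles needed to cover an M x N output.
constexpr int num_tiles(int m, int n, int block_m, int block_n) {
    return ceil_div(m, block_m) * ceil_div(n, block_n);
}
```

Unaligned (non-power-of-2) block sizes change this tile count, which is how they can divide work more evenly across the available SMs.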
Function | Purpose |
---|---|
fp8_gemm_kernel | Core GEMM implementation (~300 lines) |
wgmma | Wrapper for tensor core instructions |
tma_copy | TMA data loading function |
make_smem_desc | Creates shared memory descriptors for tensor operations |
get_best_configs | Determines optimal parameters for a given problem size |
interleave_ffma.process | Post-processes binary to optimize instruction scheduling |
set_num_sms/get_num_sms | Control/query maximum SMs to use |
This cheatsheet covers the key concepts needed to understand the DeepGEMM repository's implementation of high-performance FP8 matrix multiplications on NVIDIA Hopper GPUs.