CUDA GPU Compute Capability - Compatibility

Note

CUDA Terminology Reference

Term Description
SM "The chip" - more specifically the GPU's Streaming Multiprocessors on which CUDA runs
SM architecture A chip of a specific version with a certain set of capabilities (ex: sm_120, sm_121)
SM family Similar chips, sharing a set of capabilities, and a major version (ex: sm_9x, sm_12x)
Compute Capability Set of features that a chip can run. Each new chip comes with a new compute capability of the same version (ex: 12.1 for sm_121), though older compute capabilities might be compatible with newer SMs (more below)
CUDA device code Your CUDA GPU code - including kernels and device functions - that leverages features of a certain Compute Capability and needs to be compiled to run on an SM
ISA Instruction set architectures, or your code once compiled - two types defined below
Virtual ISA Intermediary ISA for a given compute capability, which still needs to be translated for a specific SM to execute
PTX The one and only virtual ISA format and file
Real ISA Take PTX, translate it for a specific SM, and you get a real ISA that the SM can now execute
cubin The file representation of the real ISA
SASS Streaming Assembler - the pre-Blackwell real ISA type
Offline compilation Mechanism that, given specific compute capabilities, can generate PTX from CUDA device code, cubin from that PTX, and join them in a fatbinary, all on a build system that is not required to host the target SM(s).
NVCC The offline compiler (executable)
Runtime compilation Mechanism that, given a specific compute capability, generates PTX from CUDA device code dynamically at runtime.
NVRTC The runtime compiler (library)
JIT compilation Mechanism that generates the most adequate cubin for a target SM using PTX as input (from NVCC or NVRTC). This is done during application startup (Just In Time) by the CUDA driver if there is no prebuilt cubin compatible with the current SM. JIT incurs a startup cost but can provide extra forward compatibility (more below)
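
To make these terms concrete, here is a minimal sketch of the offline pipeline for a hypothetical kernels.cu targeting compute capability 9.0 (the intermediate steps can also be collapsed into a single nvcc invocation, as in the NVCC section below):

# CUDA device code -> virtual ISA (PTX) for compute capability 9.0
nvcc -arch=compute_90 -ptx kernels.cu -o kernels.ptx
# Virtual ISA (PTX) -> real ISA (cubin) for sm_90
ptxas -arch=sm_90 kernels.ptx -o kernels.cubin
# Or do both at once and join the PTX + cubin in a fatbinary
nvcc -fatbin kernels.cu -o kernels.fatbin \
  -gencode arch=compute_90,code=sm_90 \
  -gencode arch=compute_90,code=compute_90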

Compatibility across SMs

Base compute capabilities have existed since the beginning of CUDA. Architecture-specific and family-specific compute capabilities were introduced in CUDA 12.0 and 12.9 respectively, to enable features not included in base compute capabilities, but with more limited forward compatibility (Reference):

  • Base: <major>.<minor>
    • cubin: compatible to run on sm_<major><y> where y >= minor (i.e. any future sm of the same family)
    • ptx: same as cubin + compatible to build & run on sm_<x><y> where x > major (i.e. any future sm)
  • Family-specific: <major>.<minor>f
    • cubin: compatible to run on sm_<major><y> where y >= minor (i.e. any future sm of the same family)
    • ptx: same as cubin (i.e. any future sm of the same family)
  • Architecture-specific: <major>.<minor>a
    • cubin: compatible to run/build on sm_<major><minor> only (i.e. no other sm)
    • ptx: same as cubin (i.e. no other sm)
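
To know which of the above applies to the machine you are running on, you can query the GPU's compute capability with nvidia-smi (the compute_cap query field assumes a reasonably recent driver):

# Print each GPU's compute capability (ex: 12.0)
nvidia-smi --query-gpu=compute_cap --format=csv,noheader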

Tip

  1. Prebuild cubin for the specific chips you are targeting to take full advantage of their features
    • When available, use a family-specific compute capability for cubin (ex: sm_120f) instead of base (e.g. sm_120), since you'll potentially get extra features with the same forward compatibility
    • Only use architecture-specific for cubin (ex: sm_120a) if you really need the specific features it unlocks on that chip, as you won't get compatibility with future chips of that family
  2. If optimizing the size of your fat binaries is important, research whether a given compute capability version provides features you actually need on that chip, or whether the compatibility from a major version or lower minor version you already build for is sufficient. You can also enable a given compute capability only for the specific CUDA kernels requiring that feature, instead of globally for all CUDA kernels in a project/build (see the sketch after this list). E.g.:
    • 8.9 does not add much to 8.6 apart from fp8 support.
    • 10.1 is deprecated in favor of 11.0 for Thor.
    • 12.1 is the exact same as 12.0: the only difference is the physically integrated CPU+GPU memory of Spark (sm_121) compared to sm_120, for which there are no current kernel optimizations.
    • 3.2 (K1), 5.3 (Nano), 6.2 (TX2), 7.2 (Xavier), 8.7 (Orin), 8.8 (Switch2), 10.1 (Thor < cu13), 11.0 (Thor ≥ cu13) are Jetson/Tegra, so never of use for x86_64, nor needed if your aarch64 builds only support datacenter chips (sbsa) - and vice-versa.
  3. Build ptx - of base compute capability - either:
    • for the latest chip you are targeting - for forward compatibility with future chips with compute capability higher than your highest cubin build
    • and/or for the oldest chip you want to support - for less-performant compatibility with chips older than your lowest cubin build. In that scenario, you could even skip any cubin build if the performance and the JIT compilation cost are acceptable.
  4. Unless you have reasons not to distribute any cubin, there is no reason to retain and distribute ptx for family- or architecture-specific compute capabilities, as they do not offer more compatibility than their cubin and still require JIT compilation.
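
A minimal sketch for tip 2, assuming hypothetical files fp8_kernels.cu (which needs 8.9 for fp8) and common_kernels.cu (fine with 8.6): compile each translation unit with its own -gencode flags, then link the objects together, instead of building every kernel for every capability.

# Only the fp8 kernels need 8.9; everything else stops at 8.6
# (the sm_86 cubin is forward-compatible with sm_89 anyway)
nvcc -c fp8_kernels.cu    -gencode arch=compute_89,code=sm_89 -o fp8_kernels.o
nvcc -c common_kernels.cu -gencode arch=compute_86,code=sm_86 -o common_kernels.o
nvcc fp8_kernels.o common_kernels.o -o app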

Compatibility across CUDA Toolkit versions

Note

This table was generated on 2025-08-09 using cuda-system-utils/get_nvcc_sm_supported_versions.py. It extracts that information by parsing the --help output of each version of nvcc found.

curl -sSL https://raw.githubusercontent.com/agirault/cuda-system-utils/refs/heads/main/scripts/get_nvcc_sm_supported_versions.py | python3 - -s
CUDA Ver → supported SM architectures (sm_ prefix omitted; consecutive CUDA versions with identical support are grouped):
  • 13.0: 75, 80, 86, 87, 88, 89, 90, 90a, 100, 100a, 100f, 103, 103a, 103f, 110, 110a, 110f, 120, 120a, 120f, 121, 121a, 121f
  • 12.9: 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86, 87, 89, 90, 90a, 100, 100a, 100f, 101, 101a, 101f, 103, 103a, 103f, 120, 120a, 120f, 121, 121a, 121f
  • 12.8: 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86, 87, 89, 90, 90a, 100, 100a, 101, 101a, 120, 120a
  • 12.0–12.6: 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86, 87, 89, 90, 90a
  • 11.8: 35, 37, 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86, 87, 89, 90
  • 11.4–11.7: 35, 37, 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86, 87
  • 11.1–11.3: 35, 37, 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86
  • 11.0: 35, 37, 50, 52, 53, 60, 61, 62, 70, 72, 75, 80
  • 10.0–10.2: 30, 32, 35, 37, 50, 52, 53, 60, 61, 62, 70, 72, 75
  • 9.1–9.2: 30, 32, 35, 37, 50, 52, 53, 60, 61, 62, 70, 72
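
You can also ask your local nvcc directly which values it supports:

# Virtual architectures (compute_XX values, for -gencode arch=...)
nvcc --list-gpu-arch
# Real architectures (sm_XX values, for -gencode code=...)
nvcc --list-gpu-code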

How to build CUDA device code

With NVCC

You can tell NVCC what to build and package like so:

-gencode arch=...,code=...
  • arch defines the virtual architecture (PTX version) to build. Its value is compute_<CC> where CC is your compute capability version with no period; architecture- and family-specific versions are allowed (ex: compute_90, compute_100a, compute_100f).
  • code defines the type of object that is embedded in the output:
    1. PTX: you can embed the PTX you built just above by passing the same value (compute_<CC>) to this segment - this skips offline cubin generation, deferring it to JIT compilation at runtime
    2. cubin: you can translate the PTX you built just above to cubin by passing a value in the form of sm_<CC> where CC is your compute capability version with no period - this embeds that cubin, dropping the intermediary PTX from the previous segment
  • You can use the -gencode flag multiple times, once for each binary type and version you want to embed in your fat binary

Example:

# Embed cubin for:
#  - Ampere (sm_86). We skip Ada (sm_89) since we don't need fp8; the 8.6 cubin runs there.
#  - Hopper (sm_90). No need for 9.0a, we don't use CUTLASS's accelerated features:
#    https://docs.nvidia.com/cutlass/media/docs/cpp/functionality.html
#  - Blackwell B200 chips (sm_100f) and other Blackwell chips (sm_120f),
#    enabling family-specific features.
# Also embed PTX with 12.0 base capability for maximum forward compatibility.
nvcc -c kernels.cu \
  -gencode arch=compute_86,code=sm_86 \
  -gencode arch=compute_90,code=sm_90 \
  -gencode arch=compute_100f,code=sm_100f \
  -gencode arch=compute_120f,code=sm_120f \
  -gencode arch=compute_120,code=compute_120
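
To verify what actually got embedded, cuobjdump can list the cubin (ELF) and PTX entries of the resulting object; for the example above you'd expect sm_86, sm_90, sm_100f and sm_120f ELFs plus compute_120 PTX (exact naming may vary by toolkit version):

# List embedded cubin (one per real architecture)
cuobjdump --list-elf kernels.o
# List embedded PTX
cuobjdump --list-ptx kernels.o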

With CMake

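A minimal sketch, not an authoritative recipe: CMake drives nvcc through the CUDA_ARCHITECTURES target property (or the CMAKE_CUDA_ARCHITECTURES variable). Each entry takes an optional -real (cubin only) or -virtual (PTX only) suffix; with no suffix, both are embedded. This mirrors the NVCC example above; note that the a/f-suffixed values assume a recent CMake version (check your version's documentation).

cmake_minimum_required(VERSION 3.24)
project(my_cuda_project LANGUAGES CXX CUDA)

add_library(kernels STATIC kernels.cu)
# cubin for sm_86, sm_90, sm_100f, sm_120f + PTX for compute_120
set_target_properties(kernels PROPERTIES
  CUDA_ARCHITECTURES "86-real;90-real;100f-real;120f-real;120-virtual"
)

The same list can also be passed on the command line without touching the CMakeLists, ex: cmake -DCMAKE_CUDA_ARCHITECTURES="86-real;90-real;120-virtual" ..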
