CUDA GPU Compute Capability - Compatibility

Note

CUDA Terminology Reference

Term Description
SM "The chip" - more specifically the GPU's Streaming Multiprocessors on which CUDA runs
SM architecture A chip of a specific version with a certain set of capabilities (ex: sm_120, sm_121)
SM family Similar chips, sharing a set of capabilities, and a major version (ex: sm_9x, sm_12x)
Compute Capability Set of features that a chip can run. Each new chip comes with a new compute capability of the same version (ex: 12.1 for sm_121), though older compute capabilities might be compatible with newer SMs (more below)
CUDA device code Your CUDA GPU code - including kernels and device functions - that leverages features of a certain Compute Capability and needs to be compiled to run on an SM
ISA Instruction set architectures, or your code once compiled - two types defined below
Virtual ISA Intermediary ISA for a given compute capability, which still needs to be translated for a specific SM to execute
PTX The one and only virtual ISA format and file
Real ISA Take PTX, translate it for a specific SM, and you get a real ISA that the SM can now execute
cubin The file representation of the real ISA
SASS Streaming Assembler - the pre-Blackwell real ISA type
Offline compilation Mechanism that, given specific compute capabilities, can generate PTX from CUDA device code, cubin from that PTX, and join them in a fatbinary, all on a build system that is not required to host the target SM(s).
NVCC The offline compiler (executable)
Runtime compilation Mechanism that, given a specific compute capability, generates PTX from CUDA device code dynamically at runtime.
NVRTC The runtime compiler (library)
JIT compilation Mechanism that generates the most adequate cubin for a target SM using PTX as input (from NVCC or NVRTC). This is done during application startup (Just In Time) by the CUDA driver if there is no prebuilt cubin compatible with the current SM. JIT incurs a startup cost but can provide extra forward compatibility (more below)
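
To make these terms concrete, here is a minimal sketch of the offline pipeline for a hypothetical kernels.cu targeting compute capability 9.0 (the intermediate steps can also be collapsed into a single nvcc invocation, as in the NVCC section below):

# CUDA device code -> virtual ISA (PTX) for compute capability 9.0
nvcc -arch=compute_90 -ptx kernels.cu -o kernels.ptx
# Virtual ISA (PTX) -> real ISA (cubin) for sm_90
ptxas -arch=sm_90 kernels.ptx -o kernels.cubin
# Or do both at once and join the PTX + cubin in a fatbinary
nvcc -fatbin kernels.cu -o kernels.fatbin \
  -gencode arch=compute_90,code=sm_90 \
  -gencode arch=compute_90,code=compute_90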

Compatibility across SMs

Base compute capabilities have existed since the beginning of CUDA. Architecture-specific and family-specific compute capabilities were introduced in CUDA 12.0 and 12.9 respectively, to enable features not included in base compute capabilities, but with more limited forward compatibility (Reference):

  • Base: <major>.<minor>
    • cubin: compatible to run on sm_<major><y> where y >= minor (i.e. any future sm of the same family)
    • ptx: same as cubin + compatible to build & run on sm_<x><y> where x > major (i.e. any future sm)
  • Family-specific: <major>.<minor>f
    • cubin: compatible to run on sm_<major><y> where y >= minor (i.e. any future sm of the same family)
    • ptx: same as cubin (i.e. any future sm of the same family)
  • Architecture-specific: <major>.<minor>a
    • cubin: compatible to run/build on sm_<major><minor> only (i.e. no other sm)
    • ptx: same as cubin (i.e. no other sm)
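
To know which of the above applies to the machine you are running on, you can query the GPU's compute capability with nvidia-smi (the compute_cap query field assumes a reasonably recent driver):

# Print each GPU's compute capability (ex: 12.0)
nvidia-smi --query-gpu=compute_cap --format=csv,noheader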

Tip

  1. Prebuild cubin for the specific chips you are targeting to take full advantage of their features
    • When available, use a family-specific compute capability for cubin (ex: sm_120f) instead of base (e.g. sm_120), since you'll potentially get extra features with the same forward compatibility
    • Only use architecture-specific for cubin (ex: sm_120a) if you really need the specific features it unlocks on that chip, as you won't get compatibility with future chips of that family
  2. If optimizing the size of your fat binaries is important, research whether a given compute capability version provides features you actually need on that chip, or whether the compatibility from a major version or lower minor version you already build for is sufficient. You can also enable a given compute capability only for the specific CUDA kernels requiring that feature, instead of globally for all CUDA kernels in a project/build (see the sketch after this list). E.g.:
    • 8.9 does not add much to 8.6 apart from fp8 support.
    • 10.1 is deprecated in favor of 11.0 for Thor.
    • 12.1 is the exact same as 12.0: the only difference is the physically integrated CPU+GPU memory of Spark (sm_121) compared to sm_120, for which there are no current kernel optimizations.
    • 3.2 (K1), 5.3 (Nano), 6.2 (TX2), 7.2 (Xavier), 8.7 (Orin), 8.8 (Switch2), 10.1 (Thor < cu13), 11.0 (Thor ≥ cu13) are Jetson/Tegra, so never of use for x86_64, nor needed if your aarch64 builds only support datacenter chips (sbsa) - and vice-versa.
  3. Build ptx - of base compute capability - either:
    • for the latest chip you are targeting - for forward compatibility with future chips with compute capability higher than your highest cubin build
    • and/or for the oldest chip you want to support - for less-performant compatibility with chips older than your lowest cubin build. In that scenario, you could even skip any cubin build if the performance and the JIT compilation cost are acceptable.
  4. Unless you have reasons not to distribute any cubin, there is no reason to retain and distribute ptx for family- or architecture-specific compute capabilities, as they do not offer more compatibility than their cubin and still require JIT compilation.
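
A minimal sketch for tip 2, assuming hypothetical files fp8_kernels.cu (which needs 8.9 for fp8) and common_kernels.cu (fine with 8.6): compile each translation unit with its own -gencode flags, then link the objects together, instead of building every kernel for every capability.

# Only the fp8 kernels need 8.9; everything else stops at 8.6
# (the sm_86 cubin is forward-compatible with sm_89 anyway)
nvcc -c fp8_kernels.cu    -gencode arch=compute_89,code=sm_89 -o fp8_kernels.o
nvcc -c common_kernels.cu -gencode arch=compute_86,code=sm_86 -o common_kernels.o
nvcc fp8_kernels.o common_kernels.o -o app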

Compatibility across CUDA Toolkit versions

Note

This table was generated on 2025-08-09 using cuda-system-utils/get_nvcc_sm_supported_versions.py. It extracts that information by parsing the --help output of each version of nvcc found.

curl -sSL https://raw.githubusercontent.com/agirault/cuda-system-utils/refs/heads/main/scripts/get_nvcc_sm_supported_versions.py | python3 - -s
CUDA Ver → supported SM architectures (sm_ prefix omitted; consecutive CUDA versions with identical support are grouped):
  • 13.0: 75, 80, 86, 87, 88, 89, 90, 90a, 100, 100a, 100f, 103, 103a, 103f, 110, 110a, 110f, 120, 120a, 120f, 121, 121a, 121f
  • 12.9: 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86, 87, 89, 90, 90a, 100, 100a, 100f, 101, 101a, 101f, 103, 103a, 103f, 120, 120a, 120f, 121, 121a, 121f
  • 12.8: 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86, 87, 89, 90, 90a, 100, 100a, 101, 101a, 120, 120a
  • 12.0–12.6: 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86, 87, 89, 90, 90a
  • 11.8: 35, 37, 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86, 87, 89, 90
  • 11.4–11.7: 35, 37, 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86, 87
  • 11.1–11.3: 35, 37, 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86
  • 11.0: 35, 37, 50, 52, 53, 60, 61, 62, 70, 72, 75, 80
  • 10.0–10.2: 30, 32, 35, 37, 50, 52, 53, 60, 61, 62, 70, 72, 75
  • 9.1–9.2: 30, 32, 35, 37, 50, 52, 53, 60, 61, 62, 70, 72
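
You can also ask your local nvcc directly which values it supports:

# Virtual architectures (compute_XX values, for -gencode arch=...)
nvcc --list-gpu-arch
# Real architectures (sm_XX values, for -gencode code=...)
nvcc --list-gpu-code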

How to build CUDA device code

With NVCC

You can tell NVCC what to build and package like so:

-gencode arch=...,code=...
  • arch defines the virtual architecture (PTX version) to build. Its value is compute_<CC> where CC is your compute capability version with no period; architecture- and family-specific versions are allowed (ex: compute_90, compute_100a, compute_100f).
  • code defines the type of object that is embedded in the output:
    1. PTX: you can embed the PTX you built just above by passing the same value (compute_<CC>) to this segment - this skips offline cubin generation, deferring it to JIT compilation at runtime
    2. cubin: you can translate the PTX you built just above to cubin by passing a value in the form of sm_<CC> where CC is your compute capability version with no period - this embeds that cubin, dropping the intermediary PTX from the previous segment
  • You can use the -gencode flag multiple times, once for each binary type and version you want to embed in your fat binary

Example:

# Embed cubin for:
#  - Ampere (sm_86). We skip Ada (sm_89) since we don't need fp8; the 8.6 cubin runs there.
#  - Hopper (sm_90). No need for 9.0a, we don't use CUTLASS's accelerated features:
#    https://docs.nvidia.com/cutlass/media/docs/cpp/functionality.html
#  - Blackwell B200 chips (sm_100f) and other Blackwell chips (sm_120f),
#    enabling family-specific features.
# Also embed PTX with 12.0 base capability for maximum forward compatibility.
nvcc -c kernels.cu \
  -gencode arch=compute_86,code=sm_86 \
  -gencode arch=compute_90,code=sm_90 \
  -gencode arch=compute_100f,code=sm_100f \
  -gencode arch=compute_120f,code=sm_120f \
  -gencode arch=compute_120,code=compute_120
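
To verify what actually got embedded, cuobjdump can list the cubin (ELF) and PTX entries of the resulting object; for the example above you'd expect sm_86, sm_90, sm_100f and sm_120f ELFs plus compute_120 PTX (exact naming may vary by toolkit version):

# List embedded cubin (one per real architecture)
cuobjdump --list-elf kernels.o
# List embedded PTX
cuobjdump --list-ptx kernels.o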

With CMake

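A minimal sketch, not an authoritative recipe: CMake drives nvcc through the CUDA_ARCHITECTURES target property (or the CMAKE_CUDA_ARCHITECTURES variable). Each entry takes an optional -real (cubin only) or -virtual (PTX only) suffix; with no suffix, both are embedded. This mirrors the NVCC example above; note that the a/f-suffixed values assume a recent CMake version (check your version's documentation).

cmake_minimum_required(VERSION 3.24)
project(my_cuda_project LANGUAGES CXX CUDA)

add_library(kernels STATIC kernels.cu)
# cubin for sm_86, sm_90, sm_100f, sm_120f + PTX for compute_120
set_target_properties(kernels PROPERTIES
  CUDA_ARCHITECTURES "86-real;90-real;100f-real;120f-real;120-virtual"
)

The same list can also be passed on the command line without touching the CMakeLists, ex: cmake -DCMAKE_CUDA_ARCHITECTURES="86-real;90-real;120-virtual" ..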
