cf. CUDA C++ Programming Guide / §16. Compute Capabilities / §16.2. Features and Technical Specifications
| NVIDIA GPU | deviceQuery output |
|---|---|
| T4 | t4.txt |
| H100 | h100.txt |
| H200 | h200.txt |
| GeForce RTX 3070 | rtx3070.txt |
cf. CUDA C++ Programming Guide / §16. Compute Capabilities / §16.2. Features and Technical Specifications
| NVIDIA GPU | deviceQuery output |
|---|---|
| T4 | t4.txt |
| H100 | h100.txt |
| H200 | h200.txt |
| GeForce RTX 3070 | rtx3070.txt |
| ./deviceQuery Starting... | |
| CUDA Device Query (Runtime API) version (CUDART static linking) | |
| Detected 1 CUDA Capable device(s) | |
| Device 0: "NVIDIA H100 PCIe" | |
| CUDA Driver Version / Runtime Version 12.8 / 12.8 | |
| CUDA Capability Major/Minor version number: 9.0 | |
| Total amount of global memory: 81090 MBytes (85029158912 bytes) | |
| (114) Multiprocessors, (128) CUDA Cores/MP: 14592 CUDA Cores | |
| GPU Max Clock rate: 1755 MHz (1.75 GHz) | |
| Memory Clock rate: 1593 Mhz | |
| Memory Bus Width: 5120-bit | |
| L2 Cache Size: 52428800 bytes | |
| Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) | |
| Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers | |
| Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers | |
| Total amount of constant memory: 65536 bytes | |
| Total amount of shared memory per block: 49152 bytes | |
| Total shared memory per multiprocessor: 233472 bytes | |
| Total number of registers available per block: 65536 | |
| Warp size: 32 | |
| Maximum number of threads per multiprocessor: 2048 | |
| Maximum number of threads per block: 1024 | |
| Max dimension size of a thread block (x,y,z): (1024, 1024, 64) | |
| Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) | |
| Maximum memory pitch: 2147483647 bytes | |
| Texture alignment: 512 bytes | |
| Concurrent copy and kernel execution: Yes with 3 copy engine(s) | |
| Run time limit on kernels: No | |
| Integrated GPU sharing Host Memory: No | |
| Support host page-locked memory mapping: Yes | |
| Alignment requirement for Surfaces: Yes | |
| Device has ECC support: Enabled | |
| Device supports Unified Addressing (UVA): Yes | |
| Device supports Managed Memory: Yes | |
| Device supports Compute Preemption: Yes | |
| Supports Cooperative Kernel Launch: Yes | |
| Supports MultiDevice Co-op Kernel Launch: Yes | |
| Device PCI Domain ID / Bus ID / location ID: 0 / 22 / 0 | |
| Compute Mode: | |
| < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > | |
| deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.8, CUDA Runtime Version = 12.8, NumDevs = 1 | |
| Result = PASS |
| ./deviceQuery Starting... | |
| CUDA Device Query (Runtime API) version (CUDART static linking) | |
| Detected 1 CUDA Capable device(s) | |
| Device 0: "NVIDIA GH200 120GB" | |
| CUDA Driver Version / Runtime Version 12.4 / 12.6 | |
| CUDA Capability Major/Minor version number: 9.0 | |
| Total amount of global memory: 96768 MBytes (101468602368 bytes) | |
| (132) Multiprocessors, (128) CUDA Cores/MP: 16896 CUDA Cores | |
| GPU Max Clock rate: 1980 MHz (1.98 GHz) | |
| Memory Clock rate: 2619 Mhz | |
| Memory Bus Width: 6144-bit | |
| L2 Cache Size: 62914560 bytes | |
| Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) | |
| Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers | |
| Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers | |
| Total amount of constant memory: 65536 bytes | |
| Total amount of shared memory per block: 49152 bytes | |
| Total shared memory per multiprocessor: 233472 bytes | |
| Total number of registers available per block: 65536 | |
| Warp size: 32 | |
| Maximum number of threads per multiprocessor: 2048 | |
| Maximum number of threads per block: 1024 | |
| Max dimension size of a thread block (x,y,z): (1024, 1024, 64) | |
| Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) | |
| Maximum memory pitch: 2147483647 bytes | |
| Texture alignment: 512 bytes | |
| Concurrent copy and kernel execution: Yes with 2 copy engine(s) | |
| Run time limit on kernels: No | |
| Integrated GPU sharing Host Memory: No | |
| Support host page-locked memory mapping: Yes | |
| Alignment requirement for Surfaces: Yes | |
| Device has ECC support: Enabled | |
| Device supports Unified Addressing (UVA): Yes | |
| Device supports Managed Memory: Yes | |
| Device supports Compute Preemption: Yes | |
| Supports Cooperative Kernel Launch: Yes | |
| Supports MultiDevice Co-op Kernel Launch: Yes | |
| Device PCI Domain ID / Bus ID / location ID: 9 / 1 / 0 | |
| Compute Mode: | |
| < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > | |
| deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.4, CUDA Runtime Version = 12.6, NumDevs = 1 | |
| Result = PASS |
| ./deviceQuery Starting... | |
| CUDA Device Query (Runtime API) version (CUDART static linking) | |
| Detected 1 CUDA Capable device(s) | |
| Device 0: "NVIDIA GeForce RTX 3070" | |
| CUDA Driver Version / Runtime Version 12.9 / 12.9 | |
| CUDA Capability Major/Minor version number: 8.6 | |
| Total amount of global memory: 7840 MBytes (8220901376 bytes) | |
| (046) Multiprocessors, (128) CUDA Cores/MP: 5888 CUDA Cores | |
| GPU Max Clock rate: 1755 MHz (1.75 GHz) | |
| Memory Clock rate: 7001 Mhz | |
| Memory Bus Width: 256-bit | |
| L2 Cache Size: 4194304 bytes | |
| Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) | |
| Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers | |
| Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers | |
| Total amount of constant memory: 65536 bytes | |
| Total amount of shared memory per block: 49152 bytes | |
| Total shared memory per multiprocessor: 102400 bytes | |
| Total number of registers available per block: 65536 | |
| Warp size: 32 | |
| Maximum number of threads per multiprocessor: 1536 | |
| Maximum number of threads per block: 1024 | |
| Max dimension size of a thread block (x,y,z): (1024, 1024, 64) | |
| Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) | |
| Maximum memory pitch: 2147483647 bytes | |
| Texture alignment: 512 bytes | |
| Concurrent copy and kernel execution: Yes with 2 copy engine(s) | |
| Run time limit on kernels: No | |
| Integrated GPU sharing Host Memory: No | |
| Support host page-locked memory mapping: Yes | |
| Alignment requirement for Surfaces: Yes | |
| Device has ECC support: Disabled | |
| Device supports Unified Addressing (UVA): Yes | |
| Device supports Managed Memory: Yes | |
| Device supports Compute Preemption: Yes | |
| Supports Cooperative Kernel Launch: Yes | |
| Supports MultiDevice Co-op Kernel Launch: Yes | |
| Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0 | |
| Compute Mode: | |
| < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > | |
| deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.9, CUDA Runtime Version = 12.9, NumDevs = 1 | |
| Result = PASS |
| ./deviceQuery Starting... | |
| CUDA Device Query (Runtime API) version (CUDART static linking) | |
| Detected 1 CUDA Capable device(s) | |
| Device 0: "Tesla T4" | |
| CUDA Driver Version / Runtime Version 12.2 / 11.8 | |
| CUDA Capability Major/Minor version number: 7.5 | |
| Total amount of global memory: 15102 MBytes (15835660288 bytes) | |
| (040) Multiprocessors, (064) CUDA Cores/MP: 2560 CUDA Cores | |
| GPU Max Clock rate: 1590 MHz (1.59 GHz) | |
| Memory Clock rate: 5001 Mhz | |
| Memory Bus Width: 256-bit | |
| L2 Cache Size: 4194304 bytes | |
| Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) | |
| Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers | |
| Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers | |
| Total amount of constant memory: 65536 bytes | |
| Total amount of shared memory per block: 49152 bytes | |
| Total shared memory per multiprocessor: 65536 bytes | |
| Total number of registers available per block: 65536 | |
| Warp size: 32 | |
| Maximum number of threads per multiprocessor: 1024 | |
| Maximum number of threads per block: 1024 | |
| Max dimension size of a thread block (x,y,z): (1024, 1024, 64) | |
| Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) | |
| Maximum memory pitch: 2147483647 bytes | |
| Texture alignment: 512 bytes | |
| Concurrent copy and kernel execution: Yes with 3 copy engine(s) | |
| Run time limit on kernels: No | |
| Integrated GPU sharing Host Memory: No | |
| Support host page-locked memory mapping: Yes | |
| Alignment requirement for Surfaces: Yes | |
| Device has ECC support: Enabled | |
| Device supports Unified Addressing (UVA): Yes | |
| Device supports Managed Memory: Yes | |
| Device supports Compute Preemption: Yes | |
| Supports Cooperative Kernel Launch: Yes | |
| Supports MultiDevice Co-op Kernel Launch: Yes | |
| Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 4 | |
| Compute Mode: | |
| < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > | |
| deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 11.8, NumDevs = 1 | |
| Result = PASS |