Ampere (GA10x GPU):
6144 KB L2 Cache: 12 32-bit memory controllers (384-bit total),
each paired with 512 KB of L2 cache.
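
The L2 and bus-width totals follow directly from the per-controller figures. A minimal sketch of that arithmetic, using only the numbers quoted above (the variable names are illustrative, not from any NVIDIA tool or API):

```python
# Figures copied from the notes above; names are hypothetical.
MEMORY_CONTROLLERS = 12        # number of 32-bit memory controllers
CONTROLLER_WIDTH_BITS = 32     # width of each controller
L2_PER_CONTROLLER_KB = 512     # L2 slice paired with each controller

# Total memory interface width and total L2 capacity.
bus_width_bits = MEMORY_CONTROLLERS * CONTROLLER_WIDTH_BITS
l2_total_kb = MEMORY_CONTROLLERS * L2_PER_CONTROLLER_KB

print(f"memory bus width: {bus_width_bits}-bit")  # 384-bit
print(f"total L2 cache:   {l2_total_kb} KB")      # 6144 KB
```

Because the L2 is attached per controller, a configuration with fewer controllers would presumably have proportionally less L2 under the same pairing.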
Each SM: 128 CUDA Cores, 4 3rd-generation Tensor Cores, a 256 KB Register File, 128 KB of L1/Shared Memory.
Each SM has 4 partitions, each containing a 64 KB Register File, one 3rd-generation Tensor Core, an L0 instruction cache, one warp scheduler, one dispatch unit, and sets of math and other units. The four partitions share a combined 128 KB L1 data cache/shared memory subsystem.
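
The per-SM and per-partition figures can be cross-checked against each other. The sketch below does so with the numbers from these notes; the even split of CUDA cores across partitions and the 4-byte (32-bit) register size are assumptions made for the calculation, and all names are illustrative:

```python
# Per-SM figures from the notes above; names are hypothetical.
PARTITIONS_PER_SM = 4
CUDA_CORES_PER_SM = 128
TENSOR_CORES_PER_SM = 4
REGISTER_FILE_PER_PARTITION_KB = 64

# Assumption: each register is 32 bits (4 bytes) wide.
BYTES_PER_REGISTER = 4

# Assumption: CUDA cores are divided evenly among the partitions.
cuda_cores_per_partition = CUDA_CORES_PER_SM // PARTITIONS_PER_SM
tensor_cores_per_partition = TENSOR_CORES_PER_SM // PARTITIONS_PER_SM

# Four 64 KB partition register files make up the SM's register file.
register_file_per_sm_kb = REGISTER_FILE_PER_PARTITION_KB * PARTITIONS_PER_SM

# Number of 32-bit registers that fit in one partition's register file.
registers_per_partition = (
    REGISTER_FILE_PER_PARTITION_KB * 1024 // BYTES_PER_REGISTER
)

print(cuda_cores_per_partition)    # 32
print(tensor_cores_per_partition)  # 1
print(register_file_per_sm_kb)     # 256
print(registers_per_partition)     # 16384
```

The 256 KB result matches the per-SM register file size listed above, and one Tensor Core per partition matches the partition description.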
Turing and Volta SMs support concurrent execution of FP32 and INT32 operations.