This tool is designed to run simple kernels that are almost exclusively math instructions, in order to test ALU throughput on compatible CUDA devices.
In particular, it was designed to test FP16 throughput on sm_53 and later device architectures, where native FP16 support is present. FP16, FP32 and FP64 kernels are run and measured.
The kernels themselves are simple, executing almost nothing but long chains of dependent FMA instructions, bar some setup at the top of each kernel to compute a thread index, and a small amount of math at the end to consume the result of the FMA chain and write it out to memory.
In the current version, each thread runs a chain of 1024 dependent FMAs, executed as 8 loops of 128. For the FP16 and FP32 kernels, 8192x4096 threads are run, split into blocks sized at the maximum number of threads per block the hardware reports it can handle. For the FP64 kernel, a quarter as many threads are run, mostly just to keep the test runtime down.
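As a rough sketch of that launch-dimension logic (the function and variable names here are illustrative assumptions, not the tool's actual source):

```cuda
#include <cuda_runtime.h>

// Illustrative: derive grid/block dimensions as described above.
void launch_dims(int device, dim3* grid, dim3* block, bool fp64)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // Blocks are sized at the maximum thread count the device reports.
    int threadsPerBlock = prop.maxThreadsPerBlock; // e.g. 1024

    // 8192x4096 threads in total for FP16/FP32; a quarter of that for FP64.
    long long totalThreads = 8192LL * 4096LL;
    if (fp64)
        totalThreads /= 4;

    *block = dim3(threadsPerBlock, 1, 1);
    *grid  = dim3((unsigned int)(totalThreads / threadsPerBlock), 1, 1);
}
```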
Each test uses device intrinsics to make the intended operations explicit to the compiler front-end, giving it the best chance at good codegen.
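For the FP16 path, that means the body of the kernel is little more than back-to-back `__hfma2` calls. A minimal sketch, with an assumed kernel name and seed values (the real source will differ in detail):

```cuda
#include <cuda_fp16.h>

// Illustrative FP16 kernel: 8 loops of 128 dependent vec2 FMAs per thread.
// Each __hfma2 consumes the previous result, so nothing in the chain can
// issue in parallel within a thread.
__global__ void fp16FmaChain(__half2* out, __half2 a, __half2 b)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    __half2 acc = __float2half2_rn((float)tid); // per-thread seed

    for (int i = 0; i < 8; ++i) {
        #pragma unroll
        for (int j = 0; j < 128; ++j) {
            acc = __hfma2(a, acc, b); // dependent FMA chain
        }
    }

    out[tid] = acc; // consume the result so it isn't optimised away
}
```

This needs to be compiled for sm_53 or above, since `__hfma2` isn't available on earlier architectures.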
For the FP32 kernel, the generated PTX is mostly `fma.f32` instructions, which assemble to the `FFMA` hardware instruction.
For the FP64 kernel, the generated PTX is mostly `fma.f64` instructions, which assemble to the `DFMA` hardware instruction.
For the FP16 kernel, the generated PTX is mostly `fma.f16x2` instructions, which assemble to the vec2 `HFMA2` hardware instruction.
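If you want to verify the instruction mix yourself, something like the following works (the file names and architecture flag are placeholders):

```
nvcc -arch=sm_61 -ptx kernels.cu -o kernels.ptx   # look for fma.f16x2 etc. in the PTX
nvcc -arch=sm_61 -cubin kernels.cu -o kernels.cubin
cuobjdump -sass kernels.cubin                     # look for HFMA2/FFMA/DFMA in the SASS
```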
There are no barriers between instructions in any kernel, so the limiting factors are ALU throughput and instruction scheduling. The tests aren't designed to achieve peak throughput, given the crude tuning of kernel dimensions and the loop count; rather, they exist to show the rough relative throughput of each instruction type.
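Timing follows the usual CUDA events pattern; a minimal sketch, assuming a wrapper function that launches one of the kernels:

```cuda
#include <cuda_runtime.h>

// Illustrative: time a kernel launch with CUDA events, returning milliseconds.
// The FMA rate can then be derived as (threads * 1024 FMAs) / elapsed time,
// counting 2 lanes per FMA for the vec2 FP16 case.
float timeKernelMs(void (*launchKernel)(void))
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launchKernel(); // launches the FMA-chain kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```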
On a GeForce GTX 1080, vec2 FP16 throughput is around 1/64th of FP32.