This tool is designed to run simple kernels that are almost exclusively math instructions, in order to test ALU throughput on compatible CUDA devices.
In particular, it was designed to test FP16 throughput on device architectures sm_53 and upwards, where FP16 support is present. FP16, FP32 and FP64 kernels are run and measured.
The kernels themselves are simple and execute almost nothing more than long chains of dependent FMA instructions. The only exceptions are some code at the top of each kernel to set up a thread index, and a small amount of math at the end to consume the result of the FMA chain and write it out to memory.
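As a rough illustration of what "a chain of dependent FMAs" means, here is a minimal CPU-side C++ sketch of the work one thread performs. It is not the tool's kernel code: the function name `fma_chain` and its parameters are hypothetical, and the real kernels run on the device and also include the thread index setup and the final write to memory described above.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical CPU-side model of one thread's work (illustrative names only).
// T can be float or double; standard C++ has no FP16 type, so the FP16 case
// is not modelled here.
template <typename T>
T fma_chain(T seed, T a, T b, int length) {
    assert(length >= 0);
    T x = seed;
    for (int i = 0; i < length; ++i) {
        // Each FMA takes the previous result as an input, so no step can
        // start before the one before it has finished: a dependent chain.
        x = std::fma(x, a, b);
    }
    // Returning the value stands in for consuming the result and writing it
    // out, which keeps the chain from being optimized away.
    return x;
}
```

Calling `fma_chain<float>(0.0f, 2.0f, 1.0f, 10)` computes x = 2x + 1 ten times starting from 0, which gives 1023. Because every step depends on the previous one, a single thread cannot overlap its own FMAs. Keeping the ALUs busy therefore relies on running many threads at once.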
In the current version, each thread runs a chain 1024 FMAs long, executed as 8 loops of 128. For the FP16 and FP32 kernels, 8192x4096 threads are run, split into blocks sized to the maximum number of device threads the hardware reports it can handle. For the FP64 kernels, one quarter of that number of threads is run, mainly to make the test finish faster.
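The numbers above imply a fixed amount of work per run, which can be sketched with a little host-side C++ arithmetic. This is only a sketch under two assumptions not stated above: an FMA is counted as 2 floating-point operations (a common convention), and the block size used in the example is 1024 threads, where the tool instead uses whatever maximum the device reports. The names `Workload` and `describe_workload` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of one benchmark run (illustrative names only).
struct Workload {
    std::uint64_t threads;  // total threads launched
    std::uint64_t blocks;   // blocks needed to cover all threads
    std::uint64_t fmas;     // total FMA instructions across all threads
    std::uint64_t flops;    // assumption: 1 FMA counted as 2 operations
};

Workload describe_workload(std::uint64_t threads,
                           std::uint64_t threads_per_block) {
    assert(threads_per_block > 0);
    // From the description: 8 loops of 128 FMAs per thread.
    const std::uint64_t fmas_per_thread = 8 * 128;
    Workload w;
    w.threads = threads;
    // Round up so a partial final block is still counted.
    w.blocks = (threads + threads_per_block - 1) / threads_per_block;
    w.fmas = threads * fmas_per_thread;
    w.flops = 2 * w.fmas;
    return w;
}
```

With 8192x4096 = 33,554,432 threads and an assumed block size of 1024, `describe_workload(8192ULL * 4096ULL, 1024)` gives 32,768 blocks and 34,359,738,368 FMAs for the FP16 and FP32 kernels. The FP64 kernels, at one quarter of the threads, issue 8,589,934,592 FMAs. Dividing the operation count by the measured kernel time gives a throughput figure.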