Beyond3D CUDA Throughput Thingy

@rys · Created May 31, 2016

This tool runs simple kernels consisting almost exclusively of math instructions, in order to test ALU throughput on compatible CUDA devices.

In particular, it was designed to test FP16 throughput on sm_53 and newer device architectures, where native FP16 support is present. FP16, FP32 and FP64 kernels are run and measured.

The kernels themselves are simple, executing almost nothing but long chains of dependent FMA instructions, bar some setup at the top of each kernel to compute a thread index, and a small amount of math at the end to consume the result of the FMA chain and write it out to memory.

In the current version, each thread runs a chain of 1024 dependent FMAs, executed as 8 loops of 128. For the FP16 and FP32 kernels, 8192x4096 threads are run, split into blocks sized at the maximum number of threads per block the device reports it supports. For the FP64 kernel, one quarter as many threads are run, mainly to keep the test's runtime down.
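As a rough illustration, the shape of the FP32 kernel is sketched below. This is a minimal sketch based on the description above, not the gist's actual source: the kernel name, FMA operands and dead-code-avoidance details are assumptions.

```
// Minimal sketch of the FP32 kernel shape described above; the name,
// operands and exact structure are illustrative assumptions.
__global__ void fp32_fma_chain(float* out, float a, float b)
{
    // Setup at the top of the kernel: compute a per-thread index.
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float acc = (float)tid;

    // 8 loops of 128 dependent FMAs each: a 1024-FMA chain per thread.
    for (int i = 0; i < 8; ++i)
    {
        #pragma unroll
        for (int j = 0; j < 128; ++j)
        {
            acc = fmaf(acc, a, b); // each FMA consumes the previous result
        }
    }

    // Consume the chain's result and write it to memory so the compiler
    // can't eliminate the FMA chain as dead code.
    out[tid] = acc;
}
```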

Each test uses device intrinsics to feed the compiler front-end, giving it the best chance at good code generation.
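For the FP16 path, that means the vec2 intrinsics from cuda_fp16.h. The sketch below assumes the same kernel shape as above; __hfma2 is the intrinsic that maps to the two-wide half-precision FMA on sm_53+ parts, while the kernel name and operands are again illustrative.

```
#include <cuda_fp16.h>

// Hypothetical FP16 variant of the same kernel shape. __hfma2 performs a
// two-wide half-precision FMA, so each call does two FP16 FMAs.
__global__ void fp16x2_fma_chain(__half2* out, __half2 a, __half2 b)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Seed the accumulator from the thread index.
    __half2 acc = __float2half2_rn((float)tid);

    for (int i = 0; i < 8; ++i)
    {
        #pragma unroll
        for (int j = 0; j < 128; ++j)
        {
            acc = __hfma2(acc, a, b); // dependent vec2 FMA chain
        }
    }

    out[tid] = acc;
}
```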

Kernel Details for sm_61 targets

For the FP32 kernel, the generated PTX is mostly fma.f32 instructions. That assembles to the FFMA hardware instruction.

For the FP64 kernel, the generated PTX is mostly fma.f64 instructions. That assembles to the DFMA hardware instruction.

For the FP16 kernel, the generated PTX is mostly fma.f16x2 instructions. That assembles to the vec2 HFMA2 hardware instruction.
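For reference, the vec2 FMA can also be written as inline PTX from CUDA C++ to pin down the exact instruction being measured. This is purely illustrative and not something the gist does; the __hfma2 intrinsic above compiles to the same fma.rn.f16x2 instruction.

```
#include <cuda_fp16.h>

// Illustrative only: emit fma.rn.f16x2 directly. A __half2 occupies one
// 32-bit register, hence the "r" constraints and the bit reinterpretation.
__device__ __half2 fma_f16x2(__half2 acc, __half2 a, __half2 b)
{
    __half2 d;
    asm("fma.rn.f16x2 %0, %1, %2, %3;"
        : "=r"(*reinterpret_cast<unsigned int*>(&d))
        : "r"(*reinterpret_cast<unsigned int*>(&acc)),
          "r"(*reinterpret_cast<unsigned int*>(&a)),
          "r"(*reinterpret_cast<unsigned int*>(&b)));
    return d; // computes acc * a + b across both half lanes
}
```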

Notes

There are no barriers between instructions in any kernel, so the limiting factors are ALU throughput and scheduling. The tests aren't designed to achieve peak throughput, given the crude tuning of kernel dimensions and loop counts. Rather, they exist to show the rough relative throughput of each instruction type.
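A host-side harness in that spirit might time each kernel with CUDA events and convert the known FMA count into a throughput figure. The sketch below assumes the fp32_fma_chain kernel from earlier (compiled in the same translation unit) and a block size of 1024; both are assumptions, and error checking is omitted for brevity.

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Assumed launch shape from the description above: 8192x4096 threads,
    // in blocks of the (assumed) device maximum of 1024 threads.
    const unsigned int threads = 8192u * 4096u;
    const int block = 1024;
    const int grid = threads / block;

    float* out = nullptr;
    cudaMalloc(&out, threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    fp32_fma_chain<<<grid, block>>>(out, 1.0001f, 0.0001f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // 1024 FMAs per thread, 2 FLOPs per FMA.
    double gflops = (double)threads * 1024.0 * 2.0 / (ms * 1e6);
    printf("FP32: %.1f GFLOPS in %.3f ms\n", gflops, ms);

    cudaFree(out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```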

Data

On a GeForce GTX 1080, vec2 FP16 throughput is around 1/6th of FP32.
