@SimonDanisch
Last active November 3, 2017 12:23

PDE

Kuramoto-Sivashinsky algorithm benchmark (original benchmark).

This benchmark is dominated by the cost of the FFT, which leads to worse results for OpenCL with CLFFT than for CUDA with the faster CUFFT. Similarly, the multithreaded backend doesn't improve much over Base, since both use the same FFT implementation. Result of the benchmarked PDE:

[plot: result of the benchmarked PDE]

| device | time (N = 10¹) | speedup vs base | time (N = 10⁷) | speedup vs base |
| --- | --- | --- | --- | --- |
| cuarrays | 0.0013 s | 0.0069x | 1.7 s | 16.4x |
| clarrays gpu | 0.0128 s | 0.0007x | 3.8 s | 7.3x |
| clarrays cpu | 0.0244 s | 0.0004x | 16.8 s | 1.6x |
| gpuarrays threaded | 0.0002 s | 0.0531x | 22.5 s | 1.2x |
| julia base | 8.719e-6 s | 1.0x | 27.7 s | 1.0x |

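For illustration, here is a minimal CPU-only sketch of one pseudo-spectral step for the Kuramoto-Sivashinsky equation, assuming a current Julia setup with FFTW.jl. It is not the gist's benchmark code; the grid size, forward-Euler stepping, and initial condition are made-up, but it shows why every step pays for a forward and an inverse FFT.

```julia
# Minimal sketch of a pseudo-spectral step for u_t = -u*u_x - u_xx - u_xxxx.
# Illustrative only; not the gist's benchmark implementation.
using FFTW

function ks_step(u, k, dt)
    û = fft(u)
    L = k .^ 2 .- k .^ 4                  # linear operator for -u_xx - u_xxxx
    N = -0.5im .* k .* fft(u .^ 2)        # -u*u_x = -½ ∂ₓ(u²), pseudo-spectrally
    real(ifft(û .+ dt .* (L .* û .+ N)))  # forward-Euler update in Fourier space
end

n    = 1024
Ldom = 32π
x    = range(0, Ldom, length = n + 1)[1:end-1]
k    = 2π / Ldom .* [0:n÷2-1; -n÷2:-1]    # angular wavenumbers in FFT order
u    = cos.(x ./ 16) .* (1 .+ sin.(x ./ 16))
u    = ks_step(u, k, 1e-4)                # two FFTs forward, one inverse, per step
```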


Juliaset


| device | time (N = 2¹²) | speedup vs base | time (N = 2²⁴) | speedup vs base |
| --- | --- | --- | --- | --- |
| clarrays gpu | 0.0356 ms | 2.3x | 2.9 ms | 96.5x |
| cuarrays | 0.0099 ms | 8.3x | 4.2 ms | 66.7x |
| clarrays cpu | 0.0495 ms | 1.7x | 31.9 ms | 8.7x |
| gpuarrays threaded | 0.0337 ms | 2.4x | 109.4 ms | 2.5x |
| julia base | 0.0821 ms | 1.0x | 278.8 ms | 1.0x |

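The gist does not describe the kernel here, but a Julia-set benchmark of this kind typically broadcasts an escape-time function over a grid of complex numbers. A hedged sketch, in which the constant `c`, the grid bounds, and the iteration cap are assumptions rather than the gist's parameters:

```julia
# Escape-time kernel for the Julia set z → z² + c, applied element-wise via
# broadcast. Parameters are illustrative, not the gist's benchmark settings.
function juliaset(z0::Complex{Float32}, c::Complex{Float32}, maxiter::Int)
    z = z0
    for i in 1:maxiter
        abs2(z) > 4f0 && return UInt8(i)  # escaped after i iterations
        z = z * z + c
    end
    return UInt8(maxiter)                 # assumed bounded
end

n      = 64                               # 64×64 grid → 2¹² elements
xs     = range(-1.5f0, 1.5f0, length = n)
grid   = [Complex{Float32}(x, y) for y in xs, x in xs]  # or a GPU array of the same values
counts = juliaset.(grid, Complex{Float32}(-0.5f0, 0.75f0), 50)
```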


Blackscholes

Black-Scholes is a nice benchmark for broadcasting performance. It's a moderately heavy calculation per array element, and each element's calculation is completely independent of the others.


| device | time (N = 10¹) | speedup vs base | time (N = 10⁷) | speedup vs base |
| --- | --- | --- | --- | --- |
| arrayfire cu | 0.1 ms | 0.0075x | 2.7 ms | 301.2x |
| cuarrays | 0.0087 ms | 0.1x | 2.8 ms | 288.5x |
| clarrays gpu | 0.0419 ms | 0.0222x | 2.9 ms | 280.5x |
| arrayfire cl | 0.1 ms | 0.0076x | 3.2 ms | 250.7x |
| clarrays cpu | 0.048 ms | 0.0194x | 14.8 ms | 53.9x |
| gpuarrays threaded | 0.0015 ms | 0.6x | 173.8 ms | 4.6x |
| julia base | 0.0009 ms | 1.0x | 800.0 ms | 1.0x |

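As a sketch of the kind of per-element kernel this measures (not the gist's code): a European call price evaluated element-wise via broadcast, which fuses into a single kernel on GPU array backends. The normal CDF is built from SpecialFunctions.erfc, and all inputs below are made-up.

```julia
# Illustrative Black-Scholes call-pricing kernel, broadcast over input arrays.
using SpecialFunctions: erfc

cndf(x) = 0.5f0 * erfc(-x / sqrt(2f0))       # standard normal CDF

function blackscholes(S, K, T, r, σ)
    d1 = (log(S / K) + (r + σ^2 / 2) * T) / (σ * sqrt(T))
    d2 = d1 - σ * sqrt(T)
    S * cndf(d1) - K * exp(-r * T) * cndf(d2) # European call price
end

N = 10^7
S = rand(Float32, N) .* 100f0 .+ 1f0         # spot prices
K = rand(Float32, N) .* 100f0 .+ 1f0         # strikes
T = rand(Float32, N) .+ 0.5f0                # maturities in years
price = blackscholes.(S, K, T, 0.01f0, 0.2f0)  # one fused broadcast
```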


Poincare

Poincaré section of a chaotic neuronal network. The dominance of OpenCL in this benchmark might be due to better use of vector intrinsics in Transpiler.jl, but this needs more investigation. Result of the calculation:

[plot: Poincaré section of the calculation]

| device | time (N = 10³) | speedup vs base | time (N = 10⁹) | speedup vs base |
| --- | --- | --- | --- | --- |
| clarrays gpu | 4.928e-5 s | 0.0827x | 0.1 s | 303.1x |
| clarrays cpu | 5.2626e-5 s | 0.0775x | 1.3 s | 34.1x |
| gpuarrays threaded | 0.0003 s | 0.0126x | 7.3 s | 6.1x |
| julia base | 4.078e-6 s | 1.0x | 44.4 s | 1.0x |

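The gist's neuronal-network model is not reproduced here. As a heavily hedged illustration of the workload class only, the sketch below iterates the well-known kicked-rotor (standard) map for many independent initial conditions with a single broadcast, which is the same "many independent trajectories" pattern a Poincaré-section benchmark exercises.

```julia
# Kicked-rotor (standard) map as a stand-in for the gist's chaotic network;
# every trajectory is independent, so the broadcast maps well to GPU backends.
function kicked_rotor(θ, p, K, nsteps)
    for _ in 1:nsteps
        p = p + K * sin(θ)
        θ = mod2pi(θ + p)
    end
    return θ                               # strobed angle after nsteps kicks
end

n  = 10^6
θ0 = rand(Float32, n) .* Float32(2π)       # random initial angles
p0 = rand(Float32, n) .* Float32(2π)       # random initial momenta
θN = kicked_rotor.(θ0, p0, 0.97f0, 1000)   # one broadcast over all trajectories
```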


Sum

Note that sum is implemented in GPUArrays via the generic mapreduce implementation, which Base.sum calls automatically, so one doesn't even need to define sum in GPUArrays.


| device | time (N = 10¹) | speedup vs base | time (N = 10⁷) | speedup vs base |
| --- | --- | --- | --- | --- |
| arrayfire cl | 0.0106 ms | 0.0005x | 0.4 ms | 3.8x |
| arrayfire cu | 0.0096 ms | 0.0006x | 0.5 ms | 3.4x |
| cuarrays | 0.1 ms | 4.7355e-5x | 0.5 ms | 3.1x |
| clarrays gpu | 0.0088 ms | 0.0006x | 0.6 ms | 2.6x |
| gpuarrays threaded | 0.0114 ms | 0.0005x | 1.4 ms | 1.2x |
| julia base | 5.262e-6 ms | 1.0x | 1.7 ms | 1.0x |
| clarrays cpu | 0.0073 ms | 0.0007x | 13.9 ms | 0.1x |

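A small sketch of the point above, assuming the 2017-era CuArrays.jl backend named in the table (the modern equivalent is CUDA.jl): calling Base.sum on a GPU array falls through to the generic mapreduce, so no GPU-specific sum method is needed.

```julia
using CuArrays                      # 2017-era CUDA backend; today: CUDA.jl

x  = CuArray(rand(Float32, 10^7))   # device array
s1 = sum(x)                         # Base.sum → generic GPU mapreduce
s2 = mapreduce(identity, +, x)      # the same reduction, called explicitly
@assert isapprox(s1, s2)
```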

