test.md

PDE

Kuramoto-Sivashinsky algorithm benchmark (original benchmark).

This benchmark is dominated by the cost of the FFT, so OpenCL with CLFFT performs worse than CUDA with the faster CUFFT. Similarly, the multithreaded backend improves little over Julia Base, since both use the same FFT implementation. Result of the benchmarked PDE:

device	N = 10¹	N = 10⁷
cuarrays	`0.0013 s` `0.0069x`	`1.7 s` `16.4x`
clarrays gpu	`0.0128 s` `0.0007x`	`3.8 s` `7.3x`
clarrays cpu	`0.0244 s` `0.0004x`	`16.8 s` `1.6x`
gpuarrays threaded	`0.0002 s` `0.0531x`	`22.5 s` `1.2x`
julia base	`8.719e-6 s` `1.0x`	`27.7 s` `1.0x`

Juliaset

device	N = 2¹²	N = 2²⁴
clarrays gpu	`0.0356 ms` `2.3x`	`2.9 ms` `96.5x`
cuarrays	`0.0099 ms` `8.3x`	`4.2 ms` `66.7x`
clarrays cpu	`0.0495 ms` `1.7x`	`31.9 ms` `8.7x`
gpuarrays threaded	`0.0337 ms` `2.4x`	`109.4 ms` `2.5x`
julia base	`0.0821 ms` `1.0x`	`278.8 ms` `1.0x`

Blackscholes

Black-Scholes is a nice benchmark for broadcasting performance. It performs a medium-heavy calculation per array element, and each element's calculation is completely independent of the others.
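To make the "independent per-element calculation" concrete, here is a minimal sketch of such a kernel. It is written in Python with only the standard library, purely as an illustration. It is not the benchmark's Julia code, and it assumes the textbook Black-Scholes formula for a European put, which may differ in detail from what the benchmark evaluates. The names `cndf`, `blackscholes_put` and `broadcast_put` are hypothetical.

```python
import math

def cndf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 + 0.5 * math.erf(x / math.sqrt(2.0))

def blackscholes_put(spot, strike, rate, volatility, time):
    """Textbook Black-Scholes price of a European put for a single option."""
    den = volatility * math.sqrt(time)
    d1 = (math.log(spot / strike) + (rate + 0.5 * volatility ** 2) * time) / den
    d2 = d1 - den
    future_value = strike * math.exp(-rate * time)
    call = spot * cndf(d1) - future_value * cndf(d2)
    # Put-call parity converts the call price into the put price.
    return call - spot + future_value

def broadcast_put(spots, strikes, rates, volatilities, times):
    """Apply the scalar kernel element by element; no element depends on another."""
    return [
        blackscholes_put(s, k, r, v, t)
        for s, k, r, v, t in zip(spots, strikes, rates, volatilities, times)
    ]

# Hypothetical inputs: three options that differ only in spot price.
prices = broadcast_put([90.0, 100.0, 110.0], [100.0] * 3, [0.05] * 3, [0.2] * 3, [1.0] * 3)
```

Because every output element depends only on the inputs at the same index, the loop can be split across threads or GPU work items without any synchronization, which is what makes this workload a good probe of broadcasting overhead.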

device	N = 10¹	N = 10⁷
arrayfire cu	`0.1 ms` `0.0075x`	`2.7 ms` `301.2x`
cuarrays	`0.0087 ms` `0.1x`	`2.8 ms` `288.5x`
clarrays gpu	`0.0419 ms` `0.0222x`	`2.9 ms` `280.5x`
arrayfire cl	`0.1 ms` `0.0076x`	`3.2 ms` `250.7x`
clarrays cpu	`0.048 ms` `0.0194x`	`14.8 ms` `53.9x`
gpuarrays threaded	`0.0015 ms` `0.6x`	`173.8 ms` `4.6x`
julia base	`0.0009 ms` `1.0x`	`800.0 ms` `1.0x`

Poincare

Poincare section of a chaotic neuronal network. The dominance of OpenCL in this benchmark might be due to better use of vector intrinsics in Transpiler.jl, but this needs more investigation. Result of calculation:

device	N = 10³	N = 10⁹
clarrays gpu	`4.928e-5 s` `0.0827x`	`0.1 s` `303.1x`
clarrays cpu	`5.2626e-5 s` `0.0775x`	`1.3 s` `34.1x`
gpuarrays threaded	`0.0003 s` `0.0126x`	`7.3 s` `6.1x`
julia base	`4.078e-6 s` `1.0x`	`44.4 s` `1.0x`

Sum

Note that sum in GPUArrays goes through the generic mapreduce implementation, which Base.sum calls automatically, so there is no need to define sum in GPUArrays at all.
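The relationship between sum and mapreduce can be sketched as follows. This is a Python illustration using only the standard library, not the GPUArrays implementation. The helper names `mapreduce` and `generic_sum` are hypothetical and only mirror the idea that a sum is a mapreduce with the identity function as the map and `+` as the reduction.

```python
from functools import reduce
import operator

def mapreduce(f, op, xs, init):
    """Apply f to every element, then fold the results with op, starting from init."""
    return reduce(op, map(f, xs), init)

def generic_sum(xs):
    """sum expressed through mapreduce: identity as the map, + as the reduction."""
    return mapreduce(lambda x: x, operator.add, xs, 0)

values = [1.0, 2.0, 3.0, 4.0]
total = generic_sum(values)
# The same primitive covers other reductions, for example a sum of squares.
sum_of_squares = mapreduce(lambda x: x * x, operator.add, values, 0.0)
```

The design point is that a backend which provides one efficient mapreduce gets sum, and similar reductions, without any backend-specific code.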

device	N = 10¹	N = 10⁷
arrayfire cl	`0.0106 ms` `0.0005x`	`0.4 ms` `3.8x`
arrayfire cu	`0.0096 ms` `0.0006x`	`0.5 ms` `3.4x`
cuarrays	`0.1 ms` `4.7355e-5x`	`0.5 ms` `3.1x`
clarrays gpu	`0.0088 ms` `0.0006x`	`0.6 ms` `2.6x`
gpuarrays threaded	`0.0114 ms` `0.0005x`	`1.4 ms` `1.2x`
julia base	`5.262e-6 ms` `1.0x`	`1.7 ms` `1.0x`
clarrays cpu	`0.0073 ms` `0.0007x`	`13.9 ms` `0.1x`

SimonDanisch/test.md
