Kuramoto-Sivashinsky algorithm benchmark (original benchmark).
This benchmark is dominated by the cost of the FFT, so OpenCL with CLFFT does worse than
the backend using the faster CUFFT.
Similarly, the multithreaded backend improves little over Julia base, since both use the same FFT implementation.
Result of the benchmarked PDE:


| device | N = 10¹ | N = 10⁷ |
| --- | --- | --- |
| cuarrays | 0.0013 s 0.0069x | 1.7 s 16.4x |
| clarrays gpu | 0.0128 s 0.0007x | 3.8 s 7.3x |
| clarrays cpu | 0.0244 s 0.0004x | 16.8 s 1.6x |
| gpuarrays threaded | 0.0002 s 0.0531x | 22.5 s 1.2x |
| julia base | 8.719e-6 s 1.0x | 27.7 s 1.0x |
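
The factor printed next to each timing appears to be the julia base time divided by the device time, so values above 1.0x mean faster than base. That reading is inferred from the numbers, not stated in the text. A minimal Python sketch (the helper name `speedup` is made up for illustration) that recomputes a few N = 10⁷ factors from the table above:

```python
def speedup(base_seconds: float, device_seconds: float) -> float:
    """Return how many times faster the device run is than the baseline."""
    return base_seconds / device_seconds


# Timings for N = 10⁷, copied from the table above (already rounded there).
base = 27.7  # julia base, in seconds
timings = {"cuarrays": 1.7, "clarrays gpu": 3.8, "clarrays cpu": 16.8}

for device, seconds in timings.items():
    print(f"{device}: {speedup(base, seconds):.1f}x")
```

Because the table shows rounded timings, recomputed factors can differ from the printed ones in the last digit.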


| device | N = 2¹² | N = 2²⁴ |
| --- | --- | --- |
| clarrays gpu | 0.0356 ms 2.3x | 2.9 ms 96.5x |
| cuarrays | 0.0099 ms 8.3x | 4.2 ms 66.7x |
| clarrays cpu | 0.0495 ms 1.7x | 31.9 ms 8.7x |
| gpuarrays threaded | 0.0337 ms 2.4x | 109.4 ms 2.5x |
| julia base | 0.0821 ms 1.0x | 278.8 ms 1.0x |

Black-Scholes is a nice benchmark for broadcasting performance.
It is a moderately heavy calculation per array element, and each element's calculation is completely
independent of the others.
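
To make "independent per element" concrete, here is a minimal Python sketch of a Black-Scholes call price applied to a list of spot prices. The benchmark's actual kernel is not shown in this document, so this uses the textbook European call formula as a stand-in. The function names and parameter values are illustrative assumptions.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, volatility, maturity):
    """Textbook European call price; depends only on its own inputs."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * maturity) / (volatility * sqrt_t)
    d2 = d1 - volatility * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


# Each price is computed from one element only, so the loop could run in
# any order or fully in parallel, which is what a broadcast exploits.
spots = [90.0, 100.0, 110.0]
prices = [black_scholes_call(s, 100.0, 0.05, 0.2, 1.0) for s in spots]
print(prices)
```

No element reads another element's result, so the work maps directly onto one GPU thread per array element.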

| device | N = 10¹ | N = 10⁷ |
| --- | --- | --- |
| arrayfire cu | 0.1 ms 0.0075x | 2.7 ms 301.2x |
| cuarrays | 0.0087 ms 0.1x | 2.8 ms 288.5x |
| clarrays gpu | 0.0419 ms 0.0222x | 2.9 ms 280.5x |
| arrayfire cl | 0.1 ms 0.0076x | 3.2 ms 250.7x |
| clarrays cpu | 0.048 ms 0.0194x | 14.8 ms 53.9x |
| gpuarrays threaded | 0.0015 ms 0.6x | 173.8 ms 4.6x |
| julia base | 0.0009 ms 1.0x | 800.0 ms 1.0x |

Poincaré section of a chaotic neuronal network.
OpenCL's lead in this benchmark might be due to better use of vector intrinsics in Transpiler.jl, but this needs
more investigation.
Result of calculation:


| device | N = 10³ | N = 10⁹ |
| --- | --- | --- |
| clarrays gpu | 4.928e-5 s 0.0827x | 0.1 s 303.1x |
| clarrays cpu | 5.2626e-5 s 0.0775x | 1.3 s 34.1x |
| gpuarrays threaded | 0.0003 s 0.0126x | 7.3 s 6.1x |
| julia base | 4.078e-6 s 1.0x | 44.4 s 1.0x |

Note that sum is implemented in GPUArrays via the generic mapreduce implementation, which Base.sum calls automatically, so GPUArrays does not even need to define sum itself.
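
The idea that a sum falls out of a generic mapreduce can be sketched in a few lines. This is a Python illustration of the pattern only, not GPUArrays or Julia Base code. The helpers `mapreduce`, `generic_sum`, and `sum_of_squares` are hypothetical names defined here.

```python
from functools import reduce
import operator


def mapreduce(f, op, xs, init):
    """Apply f to every element, then fold the results with op from init."""
    return reduce(op, map(f, xs), init)


def generic_sum(xs):
    """Sum as a special case: identity map, addition as the reduction."""
    return mapreduce(lambda x: x, operator.add, xs, 0)


def sum_of_squares(xs):
    """Another reduction that reuses the same generic building block."""
    return mapreduce(lambda x: x * x, operator.add, xs, 0)


print(generic_sum([1, 2, 3, 4]))
print(sum_of_squares([1, 2, 3]))
```

Once one fast mapreduce exists for an array type, every reduction expressed through it benefits without a dedicated implementation.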

| device | N = 10¹ | N = 10⁷ |
| --- | --- | --- |
| arrayfire cl | 0.0106 ms 0.0005x | 0.4 ms 3.8x |
| arrayfire cu | 0.0096 ms 0.0006x | 0.5 ms 3.4x |
| cuarrays | 0.1 ms 4.7355e-5x | 0.5 ms 3.1x |
| clarrays gpu | 0.0088 ms 0.0006x | 0.6 ms 2.6x |
| gpuarrays threaded | 0.0114 ms 0.0005x | 1.4 ms 1.2x |
| julia base | 5.262e-6 ms 1.0x | 1.7 ms 1.0x |
| clarrays cpu | 0.0073 ms 0.0007x | 13.9 ms 0.1x |