{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Comparing Stencil Computations on the CPU and GPU\n",
    "=================================================\n",
    "\n",
    "This notebook builds off of a [recent blogpost](https://blog.dask.org/2019/04/09/numba-stencil) that used Numba, Dask, and NumPy to build parallel compiled computations easily.\n",
    "\n",
    "At the end of that post I posed the question of how we could run these computations on the GPU. I was curious both about how much harder it would be to write and about how much faster it would run. This notebook explores that question using [numba.cuda.jit](https://numba.pydata.org/numba-doc/dev/cuda/index.html).\n",
    "\n",
    "We learn that it is, in fact, much faster to use a GPU and that, if you're OK with copy-pasting a bit of code that you might not understand, it's also quite easy for someone who is unfamiliar with GPUs, like me."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stencil computations on CPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import numba\n",
    "\n",
    "@numba.stencil\n",
    "def _smooth(x):\n",
    "    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +\n",
    "            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +\n",
    "            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9\n",
    "\n",
    "@numba.njit\n",
    "def smooth_cpu(x):\n",
    "    return _smooth(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "621 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "x_cpu = np.ones((10000, 10000), dtype='int8')\n",
    "\n",
    "%timeit smooth_cpu(x_cpu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stencil computations on GPU\n",
    "\n",
    "Using the `numba.cuda` module I'm able to get about a 200x speedup with a modest increase in code complexity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numba import cuda\n",
    "\n",
    "@cuda.jit\n",
    "def smooth_gpu(x, out):\n",
    "    i, j = cuda.grid(2)\n",
    "    n, m = x.shape\n",
    "    if 1 <= i < n - 1 and 1 <= j < m - 1:\n",
    "        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +\n",
    "                     x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +\n",
    "                     x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.87 ms ± 90.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "import cupy, math\n",
    "\n",
    "x_gpu = cupy.ones((10000, 10000), dtype='int8')\n",
    "out_gpu = cupy.zeros((10000, 10000), dtype='int8')\n",
    "\n",
    "# I copied the four lines below from the Numba docs\n",
    "threadsperblock = (16, 16)\n",
    "blockspergrid_x = math.ceil(x_gpu.shape[0] / threadsperblock[0])\n",
    "blockspergrid_y = math.ceil(x_gpu.shape[1] / threadsperblock[1])\n",
    "blockspergrid = (blockspergrid_x, blockspergrid_y)\n",
    "\n",
    "%timeit smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Note: the GPU solution here cheats a bit because it pre-allocates the output array*"
   ]
  },
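  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of the fairer comparison (hypothetical and not run here), one could allocate the output inside the timed call. CuPy's default pool allocator should make repeated allocations cheap after the first:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def smooth_gpu_alloc(x):\n",
    "    # Hypothetical variant: allocate the output as part of each call,\n",
    "    # so a timing of this function includes the allocation cost.\n",
    "    out = cupy.zeros(x.shape, dtype=x.dtype)\n",
    "    smooth_gpu[blockspergrid, threadsperblock](x, out)\n",
    "    return out\n",
    "\n",
    "# %timeit smooth_gpu_alloc(x_gpu)"
   ]
  },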
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Final thoughts\n",
    "\n",
    "GPUs are fast at doing local stencil computations. This isn't surprising, but it's nice to see.\n",
    "\n",
    "What is surprisingly pleasant is how easy it is to write CUDA-like array computing code from Python with Numba and CuPy.\n",
    "\n",
    "Actually, that's not entirely true. I still don't fully understand how I should determine the `blockspergrid` and `threadsperblock` parameters for the `smooth_gpu` kernel launch, and I suspect that many novice users share my uncertainty. I am curious whether there is a way to make a decent choice here given information that I do know (like the shape of the grid I'd like to compute over) and other information that can be pulled automatically from the hardware. I would much rather say, for example...\n",
    "\n",
    "```python\n",
    "@cuda.jit\n",
    "def smooth_gpu(x, out):\n",
    "    n, m = x.shape\n",
    "    i, j = cuda.grid[1:n - 1, 1:m - 1]  # <-- define grid here?\n",
    "\n",
    "    out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +\n",
    "                 x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +\n",
    "                 x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9\n",
    "\n",
    "\n",
    "smooth_gpu(x_gpu, out_gpu)\n",
    "```\n",
    "\n",
    "... or something similar (I don't know the technical constraints here well enough to pose solutions with much probability of success)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:cudf]",
   "language": "python",
   "name": "conda-env-cudf-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
> out_gpu = cupy.ones((10000, 10000), dtype='int8')

This should be `cupy.zeros()`, since you're not computing the image borders.
> A memory allocation is likely more expensive than the compute kernel entirely, especially in the case of cudaMalloc

This is why we have pool memory allocators, yes?
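CuPy, for instance, routes device allocations through a pool by default; a minimal sketch to check (assuming a recent CuPy):

```python
import cupy

# CuPy's default allocator is a memory pool, so repeated allocations
# of the same size reuse cached blocks instead of calling cudaMalloc.
pool = cupy.get_default_memory_pool()

out = cupy.zeros((10000, 10000), dtype='int8')
print(pool.used_bytes())   # bytes currently backing live arrays
print(pool.total_bytes())  # bytes held by the pool overall
```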
> You could also try running the CPU kernel in place by passing the input array into the `out` parameter of the stencil decorator: https://numba.pydata.org/numba-doc/dev/user/stencil.html#out

I tried this briefly and wasn't able to get it to work. I also ended up timing the allocation on the CPU side, and it was only 40-50 ms, which is about 10% of the total compute time. I agree, though, that this would be useful to investigate further if someone does a real benchmark here (that is not my intention for this particular notebook).
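For reference, the pattern the docs describe looks something like this (an untested sketch built around the `out` kwarg):

```python
import numpy as np
import numba

@numba.stencil
def _smooth(x):
    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9

@numba.njit
def smooth_cpu(x, out):
    # Per the Numba docs, the stencil call accepts an `out` kwarg,
    # writing results into a pre-allocated array instead of allocating.
    _smooth(x, out=out)

x = np.ones((100, 100), dtype='int8')
out = np.empty_like(x)
smooth_cpu(x, out)
```

One caveat: passing the input array itself as `out` would let later stencil reads see already-smoothed neighbors, which may be why the in-place attempt misbehaved.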
> This may be capturing JIT compilation on each run since it's only 1 loop

I've rerun it several times within the same process (to avoid the JIT compilation) and didn't notice any difference.
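A warm-up call before timing keeps compilation out of the measurement entirely; a minimal sketch:

```python
smooth_cpu(x_cpu)          # first call triggers JIT compilation
%timeit smooth_cpu(x_cpu)  # later calls reuse the cached machine code
```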
This example is great and seems to be everywhere on the internet, but I think there is a bug in using CuPy arrays. For one thing, you should get identical (?) performance feeding NumPy arrays, since the calculations are done on the GPU either way. More importantly, I think that using CuPy arrays causes `timeit` to show only the kernel invocation time; nothing has actually been calculated. Can you please check this again? This is a top Google search result for NumPy GPU stencils. Try to print the output, and the calculation will actually run. I get around 160 ms!
Sadly, the CPU version using parallel computing is still faster, even for big arrays (60 ms). The original stencil function is just slow in Numba. Better to do it manually:

```python
from numba import njit, prange

@njit(parallel=True)
def smooth_cpu(x, out_cpu):
    for i in prange(1, x.shape[0] - 1):
        for j in range(1, x.shape[1] - 1):
            out_cpu[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                             x[i,     j - 1] + x[i,     j] + x[i,     j + 1] +
                             x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9
```
Edit: it seems I was wrong, and it's mostly because of data transfer times, since the CuPy arrays are already on the GPU. I still think it needs a `cuda.synchronize()` for a fair comparison, which increases the running time quite a lot.
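A sketch of what that fairer timing could look like, reusing the notebook's `smooth_gpu`, arrays, and launch configuration:

```python
from numba import cuda

# Warm up once so JIT compilation is excluded from the measurement.
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)
cuda.synchronize()

# Synchronize inside the timed statement so %timeit measures the
# kernel's execution, not just its asynchronous launch.
%timeit smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu); cuda.synchronize()
```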
It's being referenced from the Dask documentation as well...
That's not "a bit". A memory allocation is likely more expensive than the compute kernel entirely, especially in the case of cudaMalloc. If we wanted to be fair, you should allocate a new GPU output array and have the kernel populate that array based on the input array. You could also try running the CPU kernel inplace by passing the input array into the
out
parameter of the stencil decorator: https://numba.pydata.org/numba-doc/dev/user/stencil.html#out.For 1d arrays you can use
.forall(input.size)
to have it handle the threadperblock and blockpergrid sizing under the hood but this doesn't exist for 2d+ arrays unfortunately. The current 16 threads per block seems really low where typically you see 128 or 256 so I'm not sure if this is best practice sans for a minimal documentation example. Maybe someone else can comment on a better threads per block and blocks per grid setting based on the 10k x 10k input array.This may be capturing JIT compilation on each run since it's only 1 loop, would recommend running an equal number of loops between CPU and GPU to be fair with regards to caching the JIT kernel compilation.
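For the 1-D case, `.forall()` usage looks something like this (a minimal sketch with a hypothetical 1-D smoothing kernel):

```python
import numpy as np
from numba import cuda

@cuda.jit
def smooth_1d(x, out):
    i = cuda.grid(1)
    if 1 <= i < x.size - 1:
        out[i] = (x[i - 1] + x[i] + x[i + 1]) // 3

x = np.ones(1_000_000, dtype='int8')
out = np.zeros_like(x)

# forall() picks the threads-per-block and blocks-per-grid
# configuration automatically for a 1-D launch.
smooth_1d.forall(x.size)(x, out)
```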