mrocklin/numba-cuda-stencil.ipynb

Last active April 10, 2025 15:40


pentschev commented Apr 10, 2019

out_gpu = cupy.ones((10000, 10000), dtype='int8')

This should be cupy.zeros(), since you're not computing image borders.
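
To illustrate the point about borders, here is a minimal CPU-only sketch that uses NumPy slicing in place of the notebook's GPU kernel (an assumption, so that it runs without a GPU; `smooth_interior` is a hypothetical helper name, not something from the notebook). The stencil writes only interior cells, so whatever the output buffer was initialized with survives on the edges.

```python
import numpy as np

def smooth_interior(x, out):
    """3x3 mean over interior cells only; the border of `out` is never written."""
    out[1:-1, 1:-1] = (
        x[:-2, :-2] + x[:-2, 1:-1] + x[:-2, 2:]
        + x[1:-1, :-2] + x[1:-1, 1:-1] + x[1:-1, 2:]
        + x[2:, :-2] + x[2:, 1:-1] + x[2:, 2:]
    ) / 9
    return out

x = np.ones((6, 6), dtype="float64")
out_ones = smooth_interior(x, np.ones_like(x))    # buffer pre-filled with ones
out_zeros = smooth_interior(x, np.zeros_like(x))  # buffer pre-filled with zeros

print(out_ones[0])   # border row still holds the initial ones
print(out_zeros[0])  # border row still holds the initial zeros
```

With a ones-filled buffer the untouched border can pass for computed output, whereas a zeros-filled buffer makes it obvious which cells the kernel never wrote.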

mrocklin (Author) commented Apr 11, 2019 (edited)

> A memory allocation is likely more expensive than the compute kernel entirely, especially in the case of cudaMalloc

This is why we have pool memory allocators, yes?

> You could also try running the CPU kernel inplace by passing the input array into the out parameter of the stencil decorator:

I tried this briefly and wasn't able to get it to work. I also ended up timing the allocation on the CPU side and it was only 40-50 ms, which is about 10% of the total compute time. I agree though that this would be useful to investigate further if someone does a real benchmark here (that is not my intention for this particular notebook).

> This may be capturing JIT compilation on each run since it's only 1 loop,

I've rerun it several times within the same process (to avoid the JIT compilation) and didn't notice any difference.
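
A common way to keep JIT compilation out of a measurement is to call the function once before timing it. The following is a rough sketch using only the standard library; `time_after_warmup` is a hypothetical helper, and `FakeJitFunction` merely simulates a slow first call rather than using Numba itself.

```python
import time

def time_after_warmup(fn, *args, repeats=5):
    """Call fn once untimed, then return the best of `repeats` timings in seconds."""
    fn(*args)  # warm-up call: with a jitted function, compilation would happen here
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)

class FakeJitFunction:
    """Stand-in for a jitted function: the first call pays a simulated compile cost."""

    def __init__(self):
        self.compiled = False

    def __call__(self, n):
        if not self.compiled:
            time.sleep(0.2)  # pretend compilation
            self.compiled = True
        return sum(range(n))

fake = FakeJitFunction()
best = time_after_warmup(fake, 10_000)
print(f"best steady-state time: {best:.6f} s")
```

Taking the minimum over several repeats is one design choice among several; it reports the least-disturbed run instead of an average that includes scheduler noise.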

Peacekeep3r commented Jul 23, 2020 (edited)

This example is great and seems to be everywhere on the internet, but I think there is a bug in how it uses CuPy arrays. For one thing, you should get identical (?) performance when feeding NumPy arrays, since the calculations are done on the GPU either way. More importantly, I think that using CuPy arrays causes timeit to show only the kernel invocation time, so nothing has actually been calculated. Can you please check this again? This is a top Google search result for numpy gpu stencils. Try printing the output, and the calculation will actually run. I get around 160 ms!

Sadly, the CPU version using parallel computing is still faster, even for big arrays (60 ms)! The original stencil function is just slow in Numba. Better to do it manually:

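import numpy as np
from numba import njit, prange

@njit(parallel=True)  # njit already implies nopython mode
def smooth_cpu(x, out_cpu):
    # 3x3 mean filter over the interior; border cells of out_cpu are left untouched
    for i in prange(1, x.shape[0] - 1):  # rows are distributed across threads
        for j in range(1, x.shape[1] - 1):
            out_cpu[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                             x[i, j - 1] + x[i, j] + x[i, j + 1] +
                             x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9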

Edit: it seems I was wrong; the difference is mostly due to data transfer time, since the CuPy arrays are already on the GPU. I still think it needs a "cuda.synchronize()" for a fair comparison, which increases the running time quite a lot.
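
The timing concern here comes from GPU kernel launches being asynchronous: the call returns once the work is queued, so a timer stopped right after the call measures only the launch. The sketch below is a CPU-only analogy built on a thread pool, not actual GPU code; `slow_kernel` is a hypothetical stand-in, and `future.result()` plays the role that `cuda.synchronize()` would play in the Numba version.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_kernel():
    """Stand-in for GPU work that takes a while to finish."""
    time.sleep(0.2)
    return "done"

with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    future = pool.submit(slow_kernel)  # returns immediately, like a kernel launch
    launch_time = time.perf_counter() - start

    result = future.result()           # blocks until the work is finished
    total_time = time.perf_counter() - start

print(f"launch only: {launch_time:.4f} s, launch + wait: {total_time:.4f} s")
```

In the notebook's setting, the equivalent would be to call `cuda.synchronize()` before stopping the timer, so that the measured interval covers the computation and not just the launch.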

Karpisek commented Dec 9, 2020


It is being referenced from the Dask documentation as well...
