I got intrigued and went and found a dual scalar / handwritten portable SIMD implementation of the Mandelbrot algorithm: https://pythonspeed.com/articles/optimizing-with-simd/
Thanks for sharing. To avoid memory access errors, a CUDA kernel must still check whether the x- and y-indices are within the array boundaries. For large zoom depths, it is useful to integrate perturbation theory as shown here:
https://rosettacode.org/wiki/Mandelbrot_set#Normal_Map_Effect,_Mercator_Projection_and_Deep_Zoom_Images
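The bounds check matters because the CUDA launch grid is rounded up to whole blocks, so some threads land outside the image. Here is a minimal sketch of that guard in plain Python (the nested loops stand in for the launch grid; a real kernel would be decorated with `@numba.cuda.jit` and get its indices from `cuda.grid(2)` instead):

```python
# Sketch of the guard a 2D CUDA kernel needs when the launch grid is
# rounded up to whole blocks. Pure-Python stand-in: the nested loops
# below emulate the (block, thread) launch grid.
def kernel_body(tx, ty, bx, by, block_dim, out, width, height):
    x = bx * block_dim + tx          # global x index; may exceed width
    y = by * block_dim + ty          # global y index; may exceed height
    if x >= width or y >= height:    # the guard: out-of-range threads do nothing
        return
    out[y][x] = 1                    # only now is the write in bounds

width, height, block_dim = 5, 3, 4   # 5x3 image, 4x4 blocks -> grid rounds up
out = [[0] * width for _ in range(height)]
grid_x = (width + block_dim - 1) // block_dim   # ceil-divide, as in a real launch
grid_y = (height + block_dim - 1) // block_dim
for by in range(grid_y):
    for bx in range(grid_x):
        for ty in range(block_dim):
            for tx in range(block_dim):
                kernel_body(tx, ty, bx, by, block_dim, out, width, height)
```

Without the guard, the threads with `x` in 5..7 or `y` equal to 3 would write past the array edges — exactly the memory access errors mentioned above.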
Some sample programs that use DPEP and Modular instead of CUDA on non-NVIDIA hardware can be found here:
https://github.com/IntelPython/DPEP/tree/main/demos/mandelbrot
https://github.com/modular/modular/tree/main/examples/custom_ops
https://github.com/modular/modular/tree/main/examples/mojo/python-interop
Just wanted to mention that I did similar research (comparing Numba, Taichi, Warp, and JAX across different numbers of pixels, with 200 loops each) at https://github.com/34j/mandelbrot-benchmark. Hope this helps.

Good point... CPython is a little faster when dealing with its own builtin types, too.
After replacing the NumPy data structure with a list of lists (in the hot part of the loop), here's CPython again:
and here's pypy:
Now we get pypy being about 1.7× faster than CPython, which is in the ballpark of what I'd expect.
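For reference, the kind of list-of-lists hot loop being compared might look like the sketch below (illustrative, not the exact benchmark code). Every value here is a boxed Python object, which is the case PyPy's JIT handles much better than CPython's interpreter:

```python
# Pure-Python Mandelbrot escape-time loop over a list of lists --
# all boxed objects, no NumPy, so the interpreter (or PyPy's JIT)
# does all the work.
def mandelbrot_counts(width, height, maxiter=20):
    counts = [[0] * width for _ in range(height)]
    for j in range(height):
        for i in range(width):
            # map pixel (i, j) onto the region [-2, 1] x [-1.5, 1.5]
            c = complex(-2.0 + 3.0 * i / width, -1.5 + 3.0 * j / height)
            z = 0.0 + 0.0j
            n = 0
            while abs(z) <= 2.0 and n < maxiter:
                z = z * z + c
                n += 1
            counts[j][i] = n
    return counts

counts = mandelbrot_counts(8, 8)
```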
Personally, I'm still a lot more swayed by the 200× speedup that you get through other methods. For any numerical work, I'd try to get the operations on numerical data compiled with known types, no boxing, no garbage collector, and all the rest.