I got intrigued and went and found a dual scalar / handwritten portable SIMD implementation of the Mandelbrot algorithm: https://pythonspeed.com/articles/optimizing-with-simd/
Thanks for sharing. To avoid memory access errors, a CUDA kernel must still check whether the x- and y-indices are within the array boundaries. For large zoom depths, it is useful to integrate perturbation theory as shown here:
https://rosettacode.org/wiki/Mandelbrot_set#Normal_Map_Effect,_Mercator_Projection_and_Deep_Zoom_Images
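The bounds check matters because the CUDA launch grid is rounded up to whole blocks, so some threads land outside the image. Here is a minimal sketch of that guard in plain Python (the nested loops stand in for the launch grid; a real kernel would be decorated with `@numba.cuda.jit` and get its indices from `cuda.grid(2)` instead):

```python
# Sketch of the guard a 2D CUDA kernel needs when the launch grid is
# rounded up to whole blocks. Pure-Python stand-in: the nested loops
# below emulate the (block, thread) launch grid.
def kernel_body(tx, ty, bx, by, block_dim, out, width, height):
    x = bx * block_dim + tx          # global x index; may exceed width
    y = by * block_dim + ty          # global y index; may exceed height
    if x >= width or y >= height:    # the guard: out-of-range threads do nothing
        return
    out[y][x] = 1                    # only now is the write in bounds

width, height, block_dim = 5, 3, 4   # 5x3 image, 4x4 blocks -> grid rounds up
out = [[0] * width for _ in range(height)]
grid_x = (width + block_dim - 1) // block_dim   # ceil-divide, as in a real launch
grid_y = (height + block_dim - 1) // block_dim
for by in range(grid_y):
    for bx in range(grid_x):
        for ty in range(block_dim):
            for tx in range(block_dim):
                kernel_body(tx, ty, bx, by, block_dim, out, width, height)
```

Without the guard, the threads with `x` in 5..7 or `y` equal to 3 would write past the array edges — exactly the memory access errors mentioned above.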
Some sample programs that use DPEP and Modular instead of CUDA on non-NVIDIA hardware can be found here:
https://github.com/IntelPython/DPEP/tree/main/demos/mandelbrot
https://github.com/modular/modular/tree/main/examples/custom_ops
https://github.com/modular/modular/tree/main/examples/mojo/python-interop
Just wanted to mention that I did similar research (comparing Numba, Taichi, Warp, and JAX across different numbers of pixels, with 200 loops each) at https://github.com/34j/mandelbrot-benchmark. Hope this helps.

Good point... CPython is a little faster when dealing with its own builtin types, too.
After replacing the NumPy data structure with a list of lists (in the hot part of the loop), here's CPython again:
and here's pypy:
Now we get pypy being about 1.7× faster than CPython, which is in the ballpark of what I'd expect.
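For reference, the kind of list-of-lists hot loop being compared might look like the sketch below (illustrative, not the exact benchmark code). Every value here is a boxed Python object, which is the case PyPy's JIT handles much better than CPython's interpreter:

```python
# Pure-Python Mandelbrot escape-time loop over a list of lists --
# all boxed objects, no NumPy, so the interpreter (or PyPy's JIT)
# does all the work.
def mandelbrot_counts(width, height, maxiter=20):
    counts = [[0] * width for _ in range(height)]
    for j in range(height):
        for i in range(width):
            # map pixel (i, j) onto the region [-2, 1] x [-1.5, 1.5]
            c = complex(-2.0 + 3.0 * i / width, -1.5 + 3.0 * j / height)
            z = 0.0 + 0.0j
            n = 0
            while abs(z) <= 2.0 and n < maxiter:
                z = z * z + c
                n += 1
            counts[j][i] = n
    return counts

counts = mandelbrot_counts(8, 8)
```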
Personally, I'm still a lot more swayed by the 200× speedup that you get through other methods. For any numerical work, I'd try to get the operations on numerical data compiled with known types, no boxing, no garbage collector, and all the rest.