On my 2014 MacBook Pro (2.5 GHz), this vectorized version is about 250 times faster than the original algorithm when used on a large number of triangles. You can also parallelize this code over multiple threads, but due to the overhead of creating a thread pool, this is only really worthwhile when you have a very large number of rays.
I also tried to vectorize over the rays instead of looping over them. I did this by using:
np.reshape(np.array(np.meshgrid(np.arange(0, len(ray_directions), 1), np.arange(0, len(triangles_vertices), 1))), (2, resulting_array_length))
to create all combinations of ray and triangle indices. However, this approach, combined with np.take() to actually build those combined arrays, turned out to be about twice as slow as the code below.
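In sketch form, that looks roughly like the following (the input arrays here are hypothetical stand-ins for the real data, and resulting_array_length is just the number of ray/triangle pairs):

import numpy as np

# hypothetical inputs: one direction per ray, three vertices per triangle
rng = np.random.default_rng(0)
ray_directions = rng.normal(size=(100, 3))
triangles_vertices = rng.normal(size=(200, 3, 3))

n_rays = len(ray_directions)
n_triangles = len(triangles_vertices)
resulting_array_length = n_rays * n_triangles

# all (ray index, triangle index) pairs, shape (2, n_rays * n_triangles)
pairs = np.reshape(
    np.array(np.meshgrid(np.arange(n_rays), np.arange(n_triangles))),
    (2, resulting_array_length),
)

# np.take() then gathers the matching rays and triangles for every pair
paired_directions = np.take(ray_directions, pairs[0], axis=0)     # (n_rays * n_triangles, 3)
paired_triangles = np.take(triangles_vertices, pairs[1], axis=0)  # (n_rays * n_triangles, 3, 3)

Building these large intermediate arrays is presumably what makes this variant slower, despite avoiding the Python loop.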
In case you do want to run it in parallel, you can do it like this:
from joblib import Parallel, delayed
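# one parallel job per ray direction; each q is passed as a one-element
# array (np.array([q])) so the vectorized function still receives a batch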
res = Parallel(n_jobs=-1)(
    delayed(rays_triangles_intersection)(ray_origin, np.array([q]), triangles_vertices)
    for q in ray_directions
)
On my machine, for a large number of rays (20k), this gives about a 2x speed-up over the non-parallel version.
For a full implementation (including parallelization), see the file at the bottom of this page.