Without prange() (single-threaded):
python -mtimeit -s'from test_cydot import a,b,out,cydot' 'cydot.dot(a,b,out)'
10 loops, best of 3: 119 msec per loop
With prange() (number of threads == number of cores):
python -mtimeit -s'from test_cydot import a,b,out,cydot' 'cydot.dot(a,b,out)'
10 loops, best of 3: 69.9 msec per loop
numpy.dot() version for comparison:
python -mtimeit -s'from test_cydot import a,b,out,np' 'np.dot(a,b,out)'
100 loops, best of 3: 9.97 msec per loop
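
To put the three timings side by side, the relative speedups can be worked out with a few lines of Python. This is only an arithmetic sketch over the numbers reported above; the dictionary keys and the `speedup` helper are names made up for illustration and are not part of `test_cydot`.

```python
# Timings copied from the timeit runs above, in milliseconds per loop.
timings_ms = {
    "cydot without prange": 119.0,
    "cydot with prange": 69.9,
    "numpy.dot": 9.97,
}


def speedup(baseline_ms, other_ms):
    """Return how many times faster other_ms is than baseline_ms."""
    return baseline_ms / other_ms


# Use the single-threaded Cython version as the baseline.
baseline = timings_ms["cydot without prange"]
ratios = {name: speedup(baseline, ms) for name, ms in timings_ms.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x relative to the single-threaded version")
```

On these figures the prange() version comes out about 1.7x faster than the single-threaded one, while numpy.dot() is about 12x faster than the single-threaded one and about 7x faster than the prange() version.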