This gist includes benchmarks for matrix multiplication (DGEMM) using numpy in Pyodide.
Currently Pyodide includes Netlib (reference) BLAS via CLAPACK, and it would be useful to replace it with a high-performance BLAS such as BLIS in the future (pyodide#227). This is related to the benchmarks of Blis with WebAssembly done by @ethanhs: https://twitter.com/ethanhs/status/1381500482858803200
We can install the same versions of Python and numpy as those included in Pyodide 0.17.0dev0 with:

```bash
conda create -n pyodide-benchmark -c conda-forge python=3.8.2 numpy=1.17.5
conda activate pyodide-benchmark
```
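Contents of `benchmark.py`:

```python
import numpy as np
from time import time

N = 1000  # matrix size; the table below also reports N=2000 and N=4000
X = np.random.RandomState(0).rand(N, N)  # seeded N x N float64 matrix

t0 = time()
X.dot(X)  # matrix product, dispatched to numpy's BLAS (DGEMM)
print(f'Wall time: {time() - t0:.2f} s')
```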
Now we can switch the BLAS implementation used by numpy, for instance to BLIS with:

```bash
conda install "libblas=*=*blis"
```

See the conda-forge documentation on switching BLAS implementations for more details.
Then run the benchmark as follows, setting `OMP_NUM_THREADS` to control the number of threads explicitly:

```bash
OMP_NUM_THREADS=1 python benchmark.py
```
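To compare several thread counts in one go, each run needs a fresh process, since OpenMP runtimes (and BLAS libraries in general) typically read thread-count variables only once, when they are first loaded. Below is a minimal sketch for native runs only, using `subprocess` with the same benchmark code passed via `python -c`. The helper name `run_with_threads` is hypothetical, and whether a given BLAS honours `OMP_NUM_THREADS` depends on how it was built (BLIS, for instance, also reads `BLIS_NUM_THREADS`).

```python
import os
import subprocess
import sys

# Same computation as benchmark.py, printing only the elapsed time (s).
BENCHMARK = """
import numpy as np
from time import time
N = 1000
X = np.random.RandomState(0).rand(N, N)
t0 = time()
X.dot(X)
print(f'{time() - t0:.4f}')
"""


def run_with_threads(n_threads: int) -> float:
    """Run the benchmark in a new interpreter with OMP_NUM_THREADS set."""
    # The variable must be in the child's environment before numpy (and
    # its BLAS) is imported, hence one fresh process per thread count.
    env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
    result = subprocess.run(
        [sys.executable, "-c", BENCHMARK],
        env=env, capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


if __name__ == "__main__":
    for n_threads in (1, 2, 4):
        print(f"{n_threads} thread(s): {run_with_threads(n_threads):.3f} s")
```

This relies on `subprocess`, so it is not meant for the Pyodide REPL, where only the single-threaded case applies anyway.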
To run the benchmark in Pyodide 0.17.0dev0, copy-paste the contents of `benchmark.py` into the Pyodide REPL.
Because we don't use dedicated timing tools such as the IPython `%timeit` magic or pyperf, the timings, particularly for the faster cases, are not very reliable.
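One way to make the timings less noisy without extra dependencies is to do an untimed warm-up call and then report the best of several repetitions using the standard library `timeit` module, which, being part of the standard library, should also be usable in Pyodide. This is a sketch under those assumptions; the function name `best_matmul_time` and its defaults are illustrative, not part of the original benchmark.

```python
import timeit

import numpy as np


def best_matmul_time(n: int, repeat: int = 5, seed: int = 0) -> float:
    """Best wall time (s) of X.dot(X) for a random n x n float64 matrix."""
    X = np.random.RandomState(seed).rand(n, n)
    X.dot(X)  # untimed warm-up call (first-call overheads, caches)
    # With number=1, each entry of the returned list is the duration of a
    # single X.dot(X) call, measured with timeit.default_timer.
    times = timeit.repeat(lambda: X.dot(X), repeat=repeat, number=1)
    # The minimum is the measurement least perturbed by other activity.
    return min(times)


if __name__ == "__main__":
    print(f"Best of 5, N=1000: {best_matmul_time(1000):.3f} s")
```

For the larger sizes, a smaller `repeat` keeps the total run time manageable, especially in Pyodide where a single N=4000 product already takes minutes.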
| Configuration | Time (s), N=1000 | Time (s), N=2000 | Time (s), N=4000 |
|---|---|---|---|
| native, Netlib BLAS, single-threaded | 0.28 | 4.2 | 36 |
| native, Blis 0.8, single-threaded | 0.04 | 0.30 | 2.7 |
| native, Blis 0.8, 4 threads | 0.02 | 0.16 | 1.7 |
| Pyodide (wasm32) with Firefox 87, Netlib BLAS, single-threaded | 1.3 | 22 | 367 |
| Pyodide (wasm32) with Chrome 89, Netlib BLAS, single-threaded | 1.0 | 18 | 263 |
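To put these times in throughput terms, they can be converted to GFLOP/s. The sketch below assumes the usual DGEMM convention of counting 2N³ floating-point operations for an N×N by N×N product (one multiply and one add per inner-product term); the function name `dgemm_gflops` is illustrative, and the times are copied from the N=4000 column of the table.

```python
def dgemm_gflops(n: int, seconds: float) -> float:
    """Approximate throughput in GFLOP/s of an (n x n) @ (n x n) product.

    Uses the conventional 2 * n**3 operation count (n**3 multiplies plus
    n**3 adds; the exact count n**2 * (2n - 1) differs negligibly for large n).
    """
    return 2 * n**3 / seconds / 1e9


# N=4000 times (s) taken from the table.
TIMES_N4000 = {
    "native, Netlib BLAS, 1 thread": 36,
    "native, Blis 0.8, 1 thread": 2.7,
    "native, Blis 0.8, 4 threads": 1.7,
    "Pyodide, Firefox 87, Netlib BLAS": 367,
    "Pyodide, Chrome 89, Netlib BLAS": 263,
}

if __name__ == "__main__":
    for config, seconds in TIMES_N4000.items():
        print(f"{config:35s} {dgemm_gflops(4000, seconds):6.2f} GFLOP/s")
```

With these numbers, single-threaded native Blis at N=4000 works out to roughly 47 GFLOP/s, against roughly 3.6 GFLOP/s for native Netlib BLAS and about 0.5 GFLOP/s for Pyodide in Chrome.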
So overall, Pyodide with Netlib (reference) BLAS is 4 to 10 times slower than native Netlib BLAS in Firefox, and 3 to 7 times slower in Chrome. Larger matrices appear to produce larger slowdowns.
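The slowdown factors can be checked as ratios of the Pyodide times to the native Netlib BLAS times in the same column. A minimal sketch of that arithmetic, with the values copied from the table (the dictionary and function names are illustrative):

```python
# Times (s) from the table, keyed by matrix size N.
NATIVE_NETLIB = {1000: 0.28, 2000: 4.2, 4000: 36}
PYODIDE_NETLIB = {
    "Firefox 87": {1000: 1.3, 2000: 22, 4000: 367},
    "Chrome 89": {1000: 1.0, 2000: 18, 4000: 263},
}


def slowdowns(wasm_times: dict, native_times: dict) -> dict:
    """Ratio wasm / native for each matrix size present in both mappings."""
    return {n: wasm_times[n] / native_times[n]
            for n in native_times if n in wasm_times}


if __name__ == "__main__":
    for browser, times in PYODIDE_NETLIB.items():
        ratios = slowdowns(times, NATIVE_NETLIB)
        print(browser, {n: round(r, 1) for n, r in ratios.items()})
```

This gives about 4.6, 5.2 and 10.2 for Firefox and 3.6, 4.3 and 7.3 for Chrome at N=1000, 2000 and 4000, consistent with the ranges stated above and with the slowdown growing with N.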
The question is what improvements we could expect in Pyodide by using Blis, given that for now both threading and SIMD support are disabled in Pyodide and Blis doesn't yet have dedicated kernels for WebAssembly.