Benchmarking matrix multiplications with numpy in Pyodide 0.17.0dev0

This gist includes benchmarks for matrix multiplication (DGEMM) using numpy in Pyodide.

Currently Pyodide includes Netlib (reference) BLAS via CLAPACK, and it would be useful to replace it with a high-performance BLAS such as BLIS in the future (pyodide#227). This is related to the benchmarks done by @ethanhs for Blis with WebAssembly: https://twitter.com/ethanhs/status/1381500482858803200

Local setup

We can install the same versions of Python and numpy as those included in Pyodide 0.17.0dev0 with,

conda create -n pyodide-benchmark -c conda-forge python=3.8.2 numpy=1.17.5
conda activate pyodide-benchmark

Benchmark code

benchmark.py

import numpy as np
from time import time

N = 1000
X = np.random.RandomState(0).rand(N, N)
t0 = time()
X.dot(X)
print(f'Wall time: {time() - t0:.2f} s')

Running the benchmark natively

Now we can switch the BLAS implementation used by numpy, for instance with,

conda install "libblas=*=*blis"

see conda-forge documentation for more details.

Then run the benchmarks as,

OMP_NUM_THREADS=1 python benchmark.py

to control the number of threads used explicitly.
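When the environment cannot be set on the command line, the same limit can be applied from inside Python. The snippet below is a minimal sketch, assuming the BLAS build in use reads OMP_NUM_THREADS when it is loaded; whether a given backend honours it depends on how it was compiled. The 200 x 200 size is only an illustrative value.

```python
import os

# Assumption: the BLAS library reads this variable at load time, so it
# has to be set before numpy is imported for the first time.
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np

# Small illustrative product; the actual benchmark uses N = 1000 and larger.
X = np.random.RandomState(0).rand(200, 200)
Y = X.dot(X)
```

Setting the variable after numpy has been imported generally has no effect, which is why the assignment comes first.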

Running the benchmark in Pyodide

To run the benchmark in Pyodide 0.17.0dev0, paste the contents of benchmark.py into the Pyodide REPL.

Results

Because we don't use dedicated timing tools such as the %timeit IPython magic or pyperf, the timings, particularly for the faster cases, are not very reliable.
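For more stable numbers, the measurement could be repeated with the standard-library timeit module. The following is a sketch only and was not used to produce the results below; the helper name time_matmul and the choice of three repetitions are assumptions made for illustration.

```python
import timeit

import numpy as np


def time_matmul(n, repeat=3):
    """Return the best wall time in seconds for an n x n matrix product."""
    X = np.random.RandomState(0).rand(n, n)
    # Each repetition runs the product exactly once; taking the minimum
    # reduces the influence of noise from other processes.
    times = timeit.repeat(lambda: X.dot(X), number=1, repeat=repeat)
    return min(times)
```

Calling time_matmul(1000), time_matmul(2000) and time_matmul(4000) would cover the three sizes in the table, at the cost of running each product several times.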

Configuration	time (s) / N=1000	time (s) / N=2000	time (s) / N=4000
native, Netlib BLAS, single threaded	0.28	4.2	36
native, Blis 0.8, single threaded	0.04	0.30	2.7
native, Blis 0.8, 4 threads	0.02	0.16	1.7
Pyodide (wasm32) with Firefox 87, Netlib BLAS, single threaded	1.3	22	367
Pyodide (wasm32) with Chrome 89, Netlib BLAS, single threaded	1.0	18	263

So overall, Pyodide with Netlib (reference) BLAS is 4 to 10 times slower than native single-threaded Netlib BLAS in Firefox, and 3 to 7 times slower in Chrome. Larger matrices appear to produce larger slowdowns.
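The slowdown factors can be recomputed directly from the table. The sketch below copies the wall times from the table and divides each Pyodide time by the native Netlib time for the same N; the variable and function names are illustrative.

```python
# Wall times in seconds copied from the results table, for N = 1000, 2000, 4000.
native_netlib = [0.28, 4.2, 36]
pyodide_firefox = [1.3, 22, 367]
pyodide_chrome = [1.0, 18, 263]


def slowdowns(wasm_times, native_times):
    """Ratio of Pyodide time to native Netlib time for each matrix size."""
    return [round(w / n, 1) for w, n in zip(wasm_times, native_times)]


firefox_ratios = slowdowns(pyodide_firefox, native_netlib)
chrome_ratios = slowdowns(pyodide_chrome, native_netlib)
print("Firefox 87:", firefox_ratios)
print("Chrome 89:", chrome_ratios)
```

This gives about 4.6, 5.2 and 10.2 for Firefox and 3.6, 4.3 and 7.3 for Chrome, so the ratio grows with N in both browsers.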

The question is what improvements we could expect in Pyodide by using Blis, given that for now both threading and SIMD support are disabled in Pyodide and Blis doesn't yet have dedicated kernels for WebAssembly.