Created January 31, 2021 23:28
Running some benchmarks of BLAS level 3 and LAPACK on Apple M1
import numpy as np

try:
    import tensorflow as tf
except ImportError:
    tf = None

from time import perf_counter


def timeit(func, *args, **kwargs):
    durations = []
    for i in range(10):
        tic = perf_counter()
        func(*args, **kwargs)
        toc = perf_counter()
        durations.append(toc - tic)
    durations.sort()
    return np.mean(durations[:-5])  # average the 5 fastest runs (trim stragglers)


for dtype in (np.float32, np.float64):
    dtype_name = dtype.__name__
    rng = np.random.default_rng(42)
    A = rng.normal(size=(1024, 1024)).astype(dtype)
    B = rng.normal(size=(1024, 1024)).astype(dtype)
    C = np.empty_like(A @ B)

    d = timeit(np.dot, A, B, out=C)
    GFLOP = A.shape[0] * B.shape[1] * (2 * A.shape[1] + 2) / 1e9
    print(f"[{dtype_name}] np.dot: {d * 1e3:.3f} ms, {GFLOP / d:.1f} GFLOP/s")

    d = timeit(np.linalg.svd, A)
    print(f"[{dtype_name}] np.linalg.svd: {d * 1e3:.3f} ms")

    if tf is not None:
        A = tf.constant(A)
        B = tf.constant(B)
        # NB: np.matmul is called on the tf tensors here instead of tf.matmul,
        # see the note below.
        d = timeit(np.matmul, A, B)
        GFLOP = A.shape[0] * B.shape[1] * (2 * A.shape[1] + 2) / 1e9
        print(f"[{dtype_name}] tf.matmul: {d * 1e3:.3f} ms, {GFLOP / d:.1f} GFLOP/s")

        d = timeit(tf.linalg.svd, A)
        print(f"[{dtype_name}] tf.linalg.svd: {d * 1e3:.3f} ms")
Note: there is a problem in the code above: I use np.matmul on the tf tensors instead of tf.matmul. However, if I change it to tf.matmul I sometimes get weird results depending on whether I use constant tensors (backed by NumPy arrays) or GPU-allocated data, where it seems that TensorFlow eager mode is not that eager anymore. So I would just not trust the TF results.
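One way to make the comparison less ambiguous would be to time tf.matmul while forcing the result to be materialized on the host, so that any deferred execution is included in the measurement. This is only a sketch (the helper name is made up and it assumes the TF 2.x eager API); it is not what the numbers below measure:

def tf_matmul_materialized(a, b):
    # Convert the result back to a NumPy array to force the computation to
    # actually complete inside the timed region.
    return tf.matmul(a, b).numpy()

d = timeit(tf_matmul_materialized, A, B)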
Here are the results.
numpy with OpenBLAS from conda-forge
Comment: with the default number of threads (8), the performance is not optimal with OpenBLAS on Apple M1. Limiting OpenBLAS to 4 threads gives better performance.
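For instance, the thread pool can be capped from Python with threadpoolctl (not used in the script above, so this is just a sketch assuming the package is installed):

from threadpoolctl import threadpool_limits

# Cap the BLAS (OpenBLAS) thread pool at 4 threads for the timed call.
with threadpool_limits(limits=4, user_api="blas"):
    d = timeit(np.dot, A, B, out=C)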
numpy with Accelerate for BLAS and netlib for LAPACK
Built via a conda-forge hack that uses the experimental isuruf/vecLibFort.
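As a quick check (independent of this gist), the BLAS/LAPACK libraries a given NumPy build is linked against can be printed with:

import numpy as np

# Prints the blas/lapack build configuration
# (e.g. Accelerate/vecLibFort vs OpenBLAS).
np.show_config()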
Comments:
numpy and tensorflow from the Apple fork of TF
Using the numpy 1.18 wheel from https://github.com/apple/tensorflow_macos (0.1alpha1 release).
Comments:
[float32] np.dot: 2.272 ms, 946.0 GFLOP/s on a machine without a fan !!!
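As a sanity check, this throughput is consistent with the FLOP count used in the script:

gflop = 1024 * 1024 * (2 * 1024 + 2) / 1e9  # ~2.15 GFLOP per 1024x1024 matmul
print(gflop / 2.272e-3)  # ~946 GFLOP/s for the 2.272 ms float32 np.dot run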
Edit: a prior version of these results used OPENBLAS_NUM_THREADS instead of OMP_NUM_THREADS for the OpenBLAS runs. OPENBLAS_NUM_THREADS has no effect on the OpenBLAS shipped by conda-forge and OMP_NUM_THREADS should be used instead.
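To verify which environment variable actually takes effect, the loaded BLAS thread pools can be inspected with threadpoolctl (a sketch, assuming the package is installed):

from threadpoolctl import threadpool_info

# Reports one entry per loaded threadpool (OpenBLAS, OpenMP, ...),
# including the number of threads it is currently configured to use.
for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"])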