See also: https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html
MKL checks at runtime whether it is running on an Intel CPU and takes slow
generic code paths otherwise; overriding that check makes it use its fast
kernels on AMD (Zen) CPUs as well. Compile this function:

int mkl_serv_intel_cpu_true() {
    return 1;
}

into a shared library, saving the source as fakeintel.c:
gcc -shared -fPIC -o libfakeintel.so fakeintel.c
Then set LD_PRELOAD=/path/to/libfakeintel.so (or use a module environment
that sets it for you) before running your MKL code.
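To confirm that the override is actually picked up, one possibility (not part
of the original recipe; a minimal sketch using ctypes to look up the symbol in
the running process) is:

# check_preload.py -- hypothetical sanity check; run it with and without
# LD_PRELOAD=/path/to/libfakeintel.so set.
import ctypes

# CDLL(None) searches the symbols already loaded into this process,
# including anything injected via LD_PRELOAD.
proc = ctypes.CDLL(None)
try:
    print(proc.mkl_serv_intel_cpu_true())  # prints 1 when the shim is preloaded
except AttributeError:
    print("mkl_serv_intel_cpu_true not found -- preload not active")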
Use the Python code provided below to test the performance improvement with a
NumPy build linked against MKL, e.g. installed with conda from the defaults
channel:
conda install -c defaults --override-channels numpy
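The actual test script follows below; as a rough sketch (the file name
bench.py and the single-precision matrix multiplication are assumptions,
chosen to match the N values and the single-precision note at the end), such
a benchmark could look like:

# bench.py -- hypothetical benchmark sketch; the real test script is
# provided below. Times one single-precision square matrix product.
import time
import numpy as np

N = 20000  # also run with N = 30000, as in the tables that follow
a = np.random.random((N, N)).astype(np.float32)
b = np.random.random((N, N)).astype(np.float32)

start = time.time()
np.dot(a, b)  # dispatched to MKL's sgemm in an MKL-linked NumPy build
print(f"{time.time() - start:.2f} seconds")

Running it once as-is and once with the preload in place, e.g.
LD_PRELOAD=/path/to/libfakeintel.so python bench.py, should reproduce the
difference between the two tables below.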
Here are the timings for the test code running normally:
N = 20000          N = 30000
36.43 seconds      132.03 seconds
36.70 seconds      131.61 seconds
37.61 seconds      130.95 seconds
36.16 seconds      137.04 seconds
36.51 seconds      134.37 seconds
37.40 seconds      140.52 seconds
35.40 seconds      129.63 seconds
40.15 seconds      133.25 seconds
40.92 seconds      131.32 seconds
36.65 seconds      134.69 seconds
And these are the timings for the same code with the libfakeintel.so library
above preloaded:
N = 20000          N = 30000
12.93 seconds      39.54 seconds
13.51 seconds      37.51 seconds
12.30 seconds      38.28 seconds
13.21 seconds      40.76 seconds
14.28 seconds      36.63 seconds
12.72 seconds      36.95 seconds
12.62 seconds      40.93 seconds
13.23 seconds      42.98 seconds
12.42 seconds      40.93 seconds
12.44 seconds      39.76 seconds
The improvement is usually around 3x for single-precision operations,
consistent with the timings above.