Based on the HuggingFace repo for performance evaluation; the actual benchmark run script is placed in this repo. How to reproduce the performance numbers:
- Prepare the dataset according to the link.
- Update `GLUE_DIR` in `run_inference.sh` to the actual dataset path.
- Change the env settings if needed; the default setting uses 20 cores.
Inference performance results on Xeon 6148 (2x20 cores), single socket, in single-instance (20 threads) and multi-instance (1 thread per instance) modes:
- MKL: version 2019.4 (`conda install mkl mkl-include`)
- MKLDNN: the version proposed in #21851
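Before benchmarking, it can help to confirm which backends the local PyTorch build actually exposes. This is a minimal check sketch, not part of the benchmark scripts:

```python
import torch

# build configuration (shows the MKL / MKL-DNN versions PyTorch was compiled with)
print(torch.__config__.show())

# runtime availability of each backend
print("MKL available:   ", torch.backends.mkl.is_available())
print("MKLDNN available:", torch.backends.mkldnn.is_available())
```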
single instance (20 threads)
- MKL
>>> ./run_inference.sh
408/408 [00:24<00:00, 16.69it/s]
- MKLDNN
>>> ./run_inference.sh --mkldnn
408/408 [00:18<00:00, 21.95it/s]
multi instance (1 thread per instance)
- MKL
>>> ./run_inference.sh --multi_instances
Average latency per example: 469.058ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 42.64
Total time: 23.453s
- MKLDNN
>>> ./run_inference.sh --multi_instances --mkldnn
Average latency per example: 370.495ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 53.98
Total time: 18.525s
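The multi-instance mode is driven by `run_inference.sh --multi_instances`; the sketch below only illustrates the general pattern it relies on (one worker process per core, each restricted to a single intra-op thread via `torch.set_num_threads(1)`), not the script's actual implementation. Core pinning (e.g. via numactl/taskset) is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp
from time import time

def worker(rank, iters=100):
    # each instance uses exactly one intra-op thread
    torch.set_num_threads(1)
    linear = nn.Linear(768, 768)
    x = torch.randn(128, 768)
    t1 = time()
    for _ in range(iters):
        linear(x)
    t2 = time()
    print("instance %d: %.3f ms/iter" % (rank, (t2 - t1) / iters * 1000))

if __name__ == "__main__":
    num_instances = 20  # e.g. one per physical core on a single socket
    procs = [mp.Process(target=worker, args=(i,)) for i in range(num_instances)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```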
- Skylake has special requirements on the leading dimension of GEMM: when LDA/LDB/LDC is a multiple of 128, it causes a cache flush issue; see the ref.
- The following table compares the performance of BERT (glue/MRPC) GEMMs on MKL and MKLDNN with the original sizes and the padded sizes (+16); a sketch of the padding workaround follows the table.
Table 1: single-socket test results (20 threads)
size(original) | MKL | MKLDNN | size (padded) | MKL | MKLDNN |
---|---|---|---|---|---|
N=128, I=768, O=768 | 818.57 | 417.03 | N=128, I=784, O=784 | 1246.08 | 1282.33 |
N=128, I=768, O=3072 | 1369.88 | 1818.96 | N=128, I=784, O=3088 | 1908.46 | 1931.12 |
N=128, I=3072, O=768 | 676.20 | 1262.61 | N=128, I=3088, O=784 | 1768.28 | 1658.30 |
unit: Gflops
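One way to exploit this in a model is to pad the weights once and pad/slice the activations around each GEMM. The `PaddedLinear` wrapper below is a hypothetical helper written only to illustrate the idea in the table (e.g. padding 768 to 784 features) while keeping the numerical result unchanged:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PaddedLinear(nn.Module):
    """Hypothetical wrapper (illustration only): runs an existing nn.Linear
    with in/out features padded by `pad`, so the GEMM leading dimensions are
    no longer multiples of 128, without changing the numerical result."""
    def __init__(self, linear, pad=16):
        super().__init__()
        self.in_f, self.out_f = linear.in_features, linear.out_features
        self.padded = nn.Linear(self.in_f + pad, self.out_f + pad)
        with torch.no_grad():
            self.padded.weight.zero_()
            self.padded.bias.zero_()
            # copy the original parameters into the top-left block
            self.padded.weight[:self.out_f, :self.in_f].copy_(linear.weight)
            self.padded.bias[:self.out_f].copy_(linear.bias)

    def forward(self, x):
        # zero-pad the input features, run the padded GEMM, drop the extra outputs
        x = F.pad(x, (0, self.padded.in_features - self.in_f))
        return self.padded(x)[..., :self.out_f]

# example: a 768 -> 768 BERT GEMM padded to 784 -> 784
orig = nn.Linear(768, 768)
padded = PaddedLinear(orig, pad=16)
x = torch.randn(128, 768)
print(torch.allclose(orig(x), padded(x), atol=1e-6))
```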
- Use the following scripts to reproduce this result:
run.sh:
```bash
#!/bin/bash
# Bind the benchmark to the first $num_threads cores and the memory of socket 0.
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`
echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl --physcpubind=0-$last_core --membind=0 python $script
```
test_linear.py:
```python
import torch
import torch.nn as nn
from time import time

warmups = 1000
iters = 10000

def test_linear(batch_size, input_channel, output_channel):
    input = torch.randn(batch_size, input_channel)
    linear = nn.Linear(input_channel, output_channel)
    # warm up so the timed loop excludes one-time setup costs
    for i in range(warmups):
        output = linear(input)
    t1 = time()
    for i in range(iters):
        output = linear(input)
    t2 = time()
    tt = (t2 - t1) / iters
    # a Linear layer is one GEMM: 2 * M * K * N flops
    print("### Linear: (%d, %d) => (%d, %d): %f ms, %f Gflops"
          % (batch_size, input_channel, batch_size, output_channel,
             tt * 1000, 2 * batch_size * input_channel * output_channel / tt / 1e9))

# original BERT GEMM sizes
test_linear(128, 768, 768)
test_linear(128, 768, 3072)
test_linear(128, 3072, 768)
# padded sizes (+16)
test_linear(128, 768 + 16, 768 + 16)
test_linear(128, 768 + 16, 3072 + 16)
test_linear(128, 3072 + 16, 768 + 16)
```
To run on a single socket with 20 OMP threads:
```bash
./run.sh 20 test_linear.py
```
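The `test_linear.py` script above exercises the default (MKL) path. To measure the MKLDNN path, one option, assuming the `torch.utils.mkldnn` helper available in this PyTorch build, is to convert the module and input to the MKL-DNN tensor layout before timing:

```python
import torch
import torch.nn as nn
from torch.utils import mkldnn as mkldnn_utils
from time import time

def test_linear_mkldnn(batch_size, input_channel, output_channel,
                       warmups=1000, iters=10000):
    # convert the Linear weights and the input to the MKL-DNN (opaque) layout
    linear = mkldnn_utils.to_mkldnn(nn.Linear(input_channel, output_channel).eval())
    input = torch.randn(batch_size, input_channel).to_mkldnn()
    for _ in range(warmups):
        linear(input)
    t1 = time()
    for _ in range(iters):
        linear(input)
    t2 = time()
    tt = (t2 - t1) / iters
    print("### MKLDNN Linear: (%d, %d) => (%d, %d): %f ms, %f Gflops"
          % (batch_size, input_channel, batch_size, output_channel,
             tt * 1000, 2 * batch_size * input_channel * output_channel / tt / 1e9))

test_linear_mkldnn(128, 768, 768)
```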
TODOs: