Based on the HuggingFace repo for performance evaluation; the actual benchmark run script is placed in this repo. How to reproduce the performance numbers:
- Prepare the dataset according to the link.
- Update `GLUE_DIR` in `run_inference.sh` to the actual dataset path.
- Change the env settings if needed; the default setting uses 20 cores.
Inference performance results on Xeon 6148 (2x20 cores), single socket, in single-instance (20 threads) and multi-instance (1 thread per instance) modes:
- MKL: version 2019.4 (`conda install mkl mkl-include`)
- MKLDNN: the version proposed in #21851
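Before benchmarking, it can help to confirm which backends the local PyTorch build actually exposes. This is a minimal check sketch, not part of the benchmark scripts:

```python
import torch

# build configuration (shows the MKL / MKL-DNN versions PyTorch was compiled with)
print(torch.__config__.show())

# runtime availability of each backend
print("MKL available:   ", torch.backends.mkl.is_available())
print("MKLDNN available:", torch.backends.mkldnn.is_available())
```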
single instance (20 threads)
- MKL
>>> ./run_inference.sh
408/408 [00:24<00:00, 16.69it/s]
- MKLDNN
>>> ./run_inference.sh --mkldnn
408/408 [00:18<00:00, 21.95it/s]
multi instance (1 thread per instance)
- MKL
>>> ./run_inference.sh --multi_instances
Average latency per example: 469.058ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 42.64
Total time: 23.453s
- MKLDNN
>>> ./run_inference.sh --multi_instances --mkldnn
Average latency per example: 370.495ms
Total number of iterations: 1000
Total number of iterations per second (across all threads): 53.98
Total time: 18.525s
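The multi-instance mode is driven by `run_inference.sh --multi_instances`; the sketch below only illustrates the general pattern it relies on (one worker process per core, each restricted to a single intra-op thread via `torch.set_num_threads(1)`), not the script's actual implementation. Core pinning (e.g. via numactl/taskset) is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp
from time import time

def worker(rank, iters=100):
    # each instance uses exactly one intra-op thread
    torch.set_num_threads(1)
    linear = nn.Linear(768, 768)
    x = torch.randn(128, 768)
    t1 = time()
    for _ in range(iters):
        linear(x)
    t2 = time()
    print("instance %d: %.3f ms/iter" % (rank, (t2 - t1) / iters * 1000))

if __name__ == "__main__":
    num_instances = 20  # e.g. one per physical core on a single socket
    procs = [mp.Process(target=worker, args=(i,)) for i in range(num_instances)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```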
- Skylake has special requirements on the leading dimension of GEMM: when LDA/LDB/LDC is a multiple of 128, it causes a cache flush issue; see the ref.
- The following table compares the performance of BERT (glue/MRPC) GEMMs on MKL and MKLDNN with the original sizes and the padded sizes (+16); a sketch of the padding workaround follows the table.
Table 1: single-socket test results (20 threads)
size(original) | MKL | MKLDNN | size (padded) | MKL | MKLDNN |
---|---|---|---|---|---|
N=128, I=768, O=768 | 818.57 | 417.03 | N=128, I=784, O=784 | 1246.08 | 1282.33 |
N=128, I=768, O=3072 | 1369.88 | 1818.96 | N=128, I=784, O=3088 | 1908.46 | 1931.12 |
N=128, I=3072, O=768 | 676.20 | 1262.61 | N=128, I=3088, O=784 | 1768.28 | 1658.30 |
unit: Gflops
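One way to exploit this in a model is to pad the weights once and pad/slice the activations around each GEMM. The `PaddedLinear` wrapper below is a hypothetical helper written only to illustrate the idea in the table (e.g. padding 768 to 784 features) while keeping the numerical result unchanged:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PaddedLinear(nn.Module):
    """Hypothetical wrapper (illustration only): runs an existing nn.Linear
    with in/out features padded by `pad`, so the GEMM leading dimensions are
    no longer multiples of 128, without changing the numerical result."""
    def __init__(self, linear, pad=16):
        super().__init__()
        self.in_f, self.out_f = linear.in_features, linear.out_features
        self.padded = nn.Linear(self.in_f + pad, self.out_f + pad)
        with torch.no_grad():
            self.padded.weight.zero_()
            self.padded.bias.zero_()
            # copy the original parameters into the top-left block
            self.padded.weight[:self.out_f, :self.in_f].copy_(linear.weight)
            self.padded.bias[:self.out_f].copy_(linear.bias)

    def forward(self, x):
        # zero-pad the input features, run the padded GEMM, drop the extra outputs
        x = F.pad(x, (0, self.padded.in_features - self.in_f))
        return self.padded(x)[..., :self.out_f]

# example: a 768 -> 768 BERT GEMM padded to 784 -> 784
orig = nn.Linear(768, 768)
padded = PaddedLinear(orig, pad=16)
x = torch.randn(128, 768)
print(torch.allclose(orig(x), padded(x), atol=1e-6))
```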
- Use the following scripts to reproduce this result:
run.sh:
```bash
#!/bin/bash
# Bind the benchmark to the first $num_threads cores and the memory of socket 0.
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`
echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl --physcpubind=0-$last_core --membind=0 python $script
```
test_linear.py:
```python
import torch
import torch.nn as nn
from time import time

warmups = 1000
iters = 10000

def test_linear(batch_size, input_channel, output_channel):
    input = torch.randn(batch_size, input_channel)
    linear = nn.Linear(input_channel, output_channel)
    # warm up so the timed loop excludes one-time setup costs
    for i in range(warmups):
        output = linear(input)
    t1 = time()
    for i in range(iters):
        output = linear(input)
    t2 = time()
    tt = (t2 - t1) / iters
    # a Linear layer is one GEMM: 2 * M * K * N flops
    print("### Linear: (%d, %d) => (%d, %d): %f ms, %f Gflops"
          % (batch_size, input_channel, batch_size, output_channel,
             tt * 1000, 2 * batch_size * input_channel * output_channel / tt / 1e9))

# original BERT GEMM sizes
test_linear(128, 768, 768)
test_linear(128, 768, 3072)
test_linear(128, 3072, 768)
# padded sizes (+16)
test_linear(128, 768 + 16, 768 + 16)
test_linear(128, 768 + 16, 3072 + 16)
test_linear(128, 3072 + 16, 768 + 16)
```
To run on a single socket with 20 OMP threads:
```bash
./run.sh 20 test_linear.py
```
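The `test_linear.py` script above exercises the default (MKL) path. To measure the MKLDNN path, one option, assuming the `torch.utils.mkldnn` helper available in this PyTorch build, is to convert the module and input to the MKL-DNN tensor layout before timing:

```python
import torch
import torch.nn as nn
from torch.utils import mkldnn as mkldnn_utils
from time import time

def test_linear_mkldnn(batch_size, input_channel, output_channel,
                       warmups=1000, iters=10000):
    # convert the Linear weights and the input to the MKL-DNN (opaque) layout
    linear = mkldnn_utils.to_mkldnn(nn.Linear(input_channel, output_channel).eval())
    input = torch.randn(batch_size, input_channel).to_mkldnn()
    for _ in range(warmups):
        linear(input)
    t1 = time()
    for _ in range(iters):
        linear(input)
    t2 = time()
    tt = (t2 - t1) / iters
    print("### MKLDNN Linear: (%d, %d) => (%d, %d): %f ms, %f Gflops"
          % (batch_size, input_channel, batch_size, output_channel,
             tt * 1000, 2 * batch_size * input_channel * output_channel / tt / 1e9))

test_linear_mkldnn(128, 768, 768)
```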
TODOs: