This gist keeps a record of the MKLDNN RNN integration work in PyTorch and serves as a backup of PR26387; only the inference feature is provided at the moment.
To use MKLDNN RNN in PyTorch:
- convert model to mkldnn
- (optional) convert input and hx/cx to mkldnn
Example: how to enable MKLDNN RNN
```python
import torch
from torch.utils import mkldnn as mkldnn_utils

# replace LSTM with MkldnnLSTM
rnn = torch.nn.LSTM(10, 20)
mkldnn_rnn = mkldnn_utils.to_mkldnn(rnn)

# random input
input = torch.randn(1, 5, 10)
hx = torch.randn(1, 5, 20)
cx = torch.randn(1, 5, 20)

# (optional) convert inputs into mkldnn layout
# The logic here is that
#   a) if input/hx/cx are in mkldnn layout, output/hy/cy will be in mkldnn layout
#   b) if input/hx/cx are in dense layout, output/hy/cy will be in dense layout
# to_mkldnn() is an out-of-place memory copy; from a performance perspective,
# try to avoid doing this on every iteration.
input = input.to_mkldnn()
hx = hx.to_mkldnn()
cx = cx.to_mkldnn()

# evaluation
output, hidden = mkldnn_rnn(input, (hx, cx))
hy, cy = hidden
```
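Since the inputs were converted to mkldnn layout above, the outputs come back in mkldnn layout as well (rule a). A short follow-up, if you need regular dense CPU tensors afterwards, is to convert them back with `to_dense()`:

```python
# convert the mkldnn outputs back to dense layout for use with regular CPU ops
output = output.to_dense()
hy = hy.to_dense()
cy = cy.to_dense()
```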
MKLDNN RNN has some presettings that differ from PyTorch:
```cpp
/* MKLDNN RNN weight format:
 * mkldnn expects 3 tensors for all layers/directions:
 *   weight_ih (ldigo): {num_layers, num_directions, input_size, num_gates, hidden_size}
 *   weight_hh (ldigo): {num_layers, num_directions, hidden_size, num_gates, hidden_size}
 *   bias      (ldgo):  {num_layers, num_directions, num_biases, hidden_size}
 *
 * for LSTM, bias has 4 gates:
 *   bias = bias_ih + bias_hh
 *
 * for GRU, bias has 4 gates:
 *   (PyTorch GRU bias)     (MKLDNN GRU bias)
 *   bias_ih    bias_hh          bias
 *   +-----+    +-----+      +---------+
 *   | rt1 |    | rt2 |      | zt1+zt2 |
 *   |-----|    |-----|      |---------|
 *   | zt1 |    | zt2 |      | rt1+rt2 |
 *   |-----|    |-----|      |---------|
 *   | nt1 |    | nt2 |      |   nt1   |
 *   +-----+    +-----+      |---------|
 *                           |   nt2   |
 *                           +---------+
 *
 * PyTorch RNN weight format:
 *   a list of length num_layers * num_directions:
 *   {
 *     weight_ih_00, weight_hh_00, bias_ih_00, bias_hh_00,  // layer = 0, direction = 0
 *     weight_ih_01, weight_hh_01, bias_ih_01, bias_hh_01,  // layer = 0, direction = 1
 *     ...,          ...,          ...,        ...,
 *     weight_ih_ld, weight_hh_ld, bias_ih_ld, bias_hh_ld   // layer = l, direction = d
 *   }
 *   weight_ih_ld: {num_gates * hidden_size, input_size}
 *   weight_hh_ld: {num_gates * hidden_size, hidden_size}
 *   bias_ih_ld:   {num_gates * hidden_size}
 *   bias_hh_ld:   {num_gates * hidden_size}
 */
```
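To make the mapping concrete, here is a minimal sketch, assuming a single-layer unidirectional LSTM, of how the PyTorch shapes reorder into the ldigo/ldgo layouts above. This only illustrates the shape transform; the real conversion happens in the C++/ideep code of the PR, and gate reordering such as the GRU bias case shown above is not covered.

```python
import torch

input_size, hidden_size, num_gates = 10, 20, 4  # LSTM has 4 gates
rnn = torch.nn.LSTM(input_size, hidden_size)

# PyTorch stores weight_ih_l0 as {num_gates * hidden_size, input_size}
# and weight_hh_l0 as {num_gates * hidden_size, hidden_size}
w_ih = rnn.weight_ih_l0
w_hh = rnn.weight_hh_l0

# reorder to ldigo: {num_layers, num_directions, input_size, num_gates, hidden_size}
w_ih_ldigo = w_ih.view(num_gates, hidden_size, input_size) \
                 .permute(2, 0, 1) \
                 .reshape(1, 1, input_size, num_gates, hidden_size)
w_hh_ldigo = w_hh.view(num_gates, hidden_size, hidden_size) \
                 .permute(2, 0, 1) \
                 .reshape(1, 1, hidden_size, num_gates, hidden_size)

# for LSTM the two PyTorch biases are summed, ldgo:
# {num_layers, num_directions, num_gates, hidden_size}
bias_ldgo = (rnn.bias_ih_l0 + rnn.bias_hh_l0).view(1, 1, num_gates, hidden_size)

print(w_ih_ldigo.shape, w_hh_ldigo.shape, bias_ldgo.shape)
```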
MKLDNN RNN improves LSTM inference performance by up to 5x; use the benchmark to reproduce the results. The benchmark uses `input_size=250` and `hidden_size=200`, and runs on a single socket (20 cores) and on a single core respectively.

For the scenario of `time_step=1` and single-core inference, memory allocation consumes a considerable amount of time (~1/3 of the total); using jemalloc can significantly improve overall performance. Follow the wiki to compile libjemalloc.so. This gives an additional ~30% performance boost, a free lunch.
```bash
### run original
./run_single_batch_inference.sh

### run mkldnn
./run_single_batch_inference.sh --mkldnn

### run original with jemalloc
LD_PRELOAD=/home/mingfeim/packages/jemalloc-5.2.0/lib/libjemalloc.so ./run_single_batch_inference.sh

### run mkldnn with jemalloc
LD_PRELOAD=/home/mingfeim/packages/jemalloc-5.2.0/lib/libjemalloc.so ./run_single_batch_inference.sh --mkldnn
```
Performance results on Xeon 6148 (unit: sentences per second; higher is better):

time_step | cores | original | mkldnn | original (jemalloc) | mkldnn (jemalloc) | mkldnn vs. original (speedup) | jemalloc boost on mkldnn |
---|---|---|---|---|---|---|---|
15 | 20 | 629 | 3184 | 768 | 4114 | 5.06 | 1.29 |
15 | 1 | 807 | 2976 | 900 | 3676 | 3.69 | 1.24 |
1 | 1 | 5100 | 6653 | 5668 | 8418 | 1.30 | 1.27 |
To further improve the performance:
- mkldnn requires `hx`, `cx` to be concatenated into one tensor `src_iter`; the concat inside `ideep` is 3x slower than `at::cat` (see the sketch after this list).
- correspondingly, mkldnn requires `dst_iter` to be split into `hy`, `cy`; the split `at::chunk` is in-place and takes no time, while `ideep::splitter` is a memory copy.
- (done) double check whether `exp` and `tanh` are properly vectorized: from v0.20 on, elementwise ops in RNN are properly vectorized.
- provide in-place conversion between cpu tensor and mkldnn tensor.
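For reference, a tiny Python-level illustration of the hx/cx packing and the dst_iter split mentioned in the first two items. This is purely illustrative of the concat/split cost pattern; the exact mkldnn `src_iter` layout differs, and in the PR this work happens in C++ via `at::cat`/`at::chunk` or the corresponding `ideep` routines.

```python
import torch

num_layers, num_directions, batch, hidden_size = 1, 1, 5, 20
hx = torch.randn(num_layers * num_directions, batch, hidden_size)
cx = torch.randn(num_layers * num_directions, batch, hidden_size)

# mkldnn consumes the hidden and cell state as one packed tensor (src_iter)
src_iter = torch.cat([hx, cx], dim=0)

# the output state (dst_iter) has the same packed shape and is split back;
# chunk returns views, so no memory copy is involved
dst_iter = src_iter  # stand-in for the MKLDNN output state
hy, cy = dst_iter.chunk(2, dim=0)
```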
Hi Mingfei,
I am trying to use MKLDNN to accelerate LSTM inference on a dual-core Xeon server, and got the following error message:
`RuntimeError: mkldnn_linear: weight and bias need to be mkldnn layout`
After much searching and debugging I still have no clue and couldn't find any documentation about this error. Could you please give some guidance?
Thank you very much!
Xiao