This gist keeps a record of the MKLDNN RNN integration work in PyTorch and serves as a backup of PR26387. Only the inference feature is provided at the moment.
To use the MKLDNN RNN in PyTorch:
- convert the model to mkldnn
- (optional) convert the input and hx/cx to mkldnn
Example: how to enable the MKLDNN RNN:

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

# replace LSTM with MkldnnLSTM
rnn = torch.nn.LSTM(10, 20)
mkldnn_rnn = mkldnn_utils.to_mkldnn(rnn)

# random input
input = torch.randn(1, 5, 10)
hx = torch.randn(1, 5, 20)
cx = torch.randn(1, 5, 20)

# (optional) convert the inputs into mkldnn layout.
# The logic here is:
#   a) if input/hx/cx are in mkldnn layout, output/hy/cy will be in mkldnn layout
#   b) if input/hx/cx are in dense layout, output/hy/cy will be in dense layout
# to_mkldnn() is an out-of-place memory copy, so avoid doing this on every
# iteration for performance reasons.
input = input.to_mkldnn()
hx = hx.to_mkldnn()
cx = cx.to_mkldnn()

# evaluation
output, hidden = mkldnn_rnn(input, (hx, cx))
hy, cy = hidden
```

MKLDNN RNN has some quite special presettings that differ from PyTorch:
```cpp
/* MKLDNN RNN weight format:
 *   mkldnn expects 3 tensors for all layers/directions:
 *     weight_ih (ldigo): {num_layers, num_directions, input_size, num_gates, hidden_size}
 *     weight_hh (ldigo): {num_layers, num_directions, hidden_size, num_gates, hidden_size}
 *     bias (ldgo): {num_layers, num_directions, num_biases, hidden_size}
 *
 *   for LSTM, bias has 4 gates:
 *     bias = bias_ih + bias_hh
 *
 *   for GRU, bias has 4 gates:
 *     (PyTorch GRU bias)     (MKLDNN GRU bias)
 *     bias_ih    bias_hh          bias
 *     +-----+    +-----+      +---------+
 *     | rt1 |    | rt2 |      | zt1+zt2 |
 *     |-----|    |-----|      |---------|
 *     | zt1 |    | zt2 |      | rt1+rt2 |
 *     |-----|    |-----|      |---------|
 *     | nt1 |    | nt2 |      |   nt1   |
 *     +-----+    +-----+      |---------|
 *                             |   nt2   |
 *                             +---------+
 *
 * PyTorch RNN weight format:
 *   a list of length num_layers * num_directions:
 *   {
 *     weight_ih_00, weight_hh_00, bias_ih_00, bias_hh_00  // layer = 0, direction = 0
 *     weight_ih_01, weight_hh_01, bias_ih_01, bias_hh_01  // layer = 0, direction = 1
 *     ...,          ...,          ...,        ...,
 *     weight_ih_ld, weight_hh_ld, bias_ih_ld, bias_hh_ld  // layer = l, direction = d
 *   }
 *   weight_ih_ld: {num_gates * hidden_size, input_size}
 *   weight_hh_ld: {num_gates * hidden_size, hidden_size}
 *   bias_ih_ld:   {num_gates * hidden_size}
 *   bias_hh_ld:   {num_gates * hidden_size}
 */
```

MKLDNN RNN improves LSTM inference performance by up to 5x; use the benchmark to reproduce the result. The benchmark uses input_size=250 and hidden_size=200, and runs on a single socket (20 cores) and a single core respectively.
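To illustrate the bias presettings described above, here is a minimal NumPy sketch (not the actual integration code) of how the PyTorch LSTM/GRU biases could be recombined into the MKLDNN layout; the gate orders and variable names follow the diagram above, and the shapes are toy examples:

```python
import numpy as np

hidden_size = 4

# LSTM: MKLDNN keeps a single bias tensor, the elementwise sum of
# PyTorch's two biases (bias = bias_ih + bias_hh), 4 gates each.
bias_ih = np.arange(4 * hidden_size, dtype=np.float32)
bias_hh = np.ones(4 * hidden_size, dtype=np.float32)
lstm_bias = bias_ih + bias_hh

# GRU: PyTorch stores the 3 gate biases in (r, z, n) order; MKLDNN
# expects 4 rows {zt1+zt2, rt1+rt2, nt1, nt2} -- the n-gate biases
# stay separate instead of being summed.
gru_bias_ih = np.random.rand(3, hidden_size).astype(np.float32)  # rt1, zt1, nt1
gru_bias_hh = np.random.rand(3, hidden_size).astype(np.float32)  # rt2, zt2, nt2
r1, z1, n1 = gru_bias_ih
r2, z2, n2 = gru_bias_hh
gru_bias = np.stack([z1 + z2, r1 + r2, n1, n2])  # shape: (4, hidden_size)
```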
For the scenario of time_step=1 and single-core inference, memory allocation consumes a considerable amount of time (~1/3 of the total). Using jemalloc can significantly improve overall performance; follow the wiki to compile libjemalloc.so. This gives an additional ~30% performance boost, a free lunch.
```bash
### run original
./run_single_batch_inference.sh

### run mkldnn
./run_single_batch_inference.sh --mkldnn

### run original with jemalloc
LD_PRELOAD=/home/mingfeim/packages/jemalloc-5.2.0/lib/libjemalloc.so ./run_single_batch_inference.sh

### run mkldnn with jemalloc
LD_PRELOAD=/home/mingfeim/packages/jemalloc-5.2.0/lib/libjemalloc.so ./run_single_batch_inference.sh --mkldnn
```

Performance result on Xeon 6148 (unit: sentences per second; higher is better):
| time_step | cores | original | mkldnn | original (jemalloc) | mkldnn (jemalloc) | mkldnn vs. original | mkldnn jemalloc boost |
|---|---|---|---|---|---|---|---|
| 15 | 20 | 629 | 3184 | 768 | 4114 | 5.06 | 1.29 |
| 15 | 1 | 807 | 2976 | 900 | 3676 | 3.69 | 1.24 |
| 1 | 1 | 5100 | 6653 | 5668 | 8418 | 1.30 | 1.27 |
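The two ratio columns follow directly from the raw throughput numbers; a quick sanity check (values taken from the table above):

```python
# (time_step, cores, original, mkldnn, original_jemalloc, mkldnn_jemalloc)
rows = [
    (15, 20, 629, 3184, 768, 4114),
    (15, 1, 807, 2976, 900, 3676),
    (1, 1, 5100, 6653, 5668, 8418),
]
for _, _, orig, mkl, _, mkl_je in rows:
    speedup = mkl / orig      # "mkldnn vs. original" column
    je_boost = mkl_je / mkl   # "mkldnn jemalloc boost" column
    print(f"{speedup:.2f}  {je_boost:.2f}")
# prints: 5.06  1.29 / 3.69  1.24 / 1.30  1.27
```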
To further improve the performance:
- mkldnn requires `hx`, `cx` to be concatenated into one tensor `src_iter`; the concat inside `ideep` costs ~3x as much as `at::cat`.
- correspondingly, mkldnn requires `dst_iter` to be split into `hy`, `cy`; the split via `at::chunk` is in-place and takes no time, while `ideep::splitter` is a memory copy.
- (done) double check whether `expand`/`tanh` are properly vectorized: from v0.20 on, elementwise ops in RNN are properly vectorized.
- provide in-place conversion between cpu tensor and mkldnn tensor.
Hi Mingfei,
I am trying to use MKLDNN to accelerate LSTM inference on a dual-core Xeon server, and got the following error message:

```
RuntimeError: mkldnn_linear: weight and bias need to be mkldnn layout
```

After much searching and debugging I have no clue and didn't find any documentation about this error. Could you please give some guidance?
Thank you very much!
Xiao