This gist keeps a record of the MKLDNN RNN integration work in PyTorch and serves as a backup of PR26387; only the inference feature is provided at the moment.
To use MKLDNN RNN in PyTorch:
- convert model to mkldnn
- (optional) convert input and hx/cx to mkldnn
Example: how to enable MKLDNN RNN
```python
import torch
from torch.utils import mkldnn as mkldnn_utils

# replace LSTM with MkldnnLSTM
rnn = torch.nn.LSTM(10, 20)
mkldnn_rnn = mkldnn_utils.to_mkldnn(rnn)

# random input
input = torch.randn(1, 5, 10)
hx = torch.randn(1, 5, 20)
cx = torch.randn(1, 5, 20)

# (optional) convert inputs into mkldnn layout
# The logic here is that
#   a) if input/hx/cx are in mkldnn layout, output/hy/cy will be in mkldnn layout
#   b) if input/hx/cx are in dense layout, output/hy/cy will be in dense layout
# to_mkldnn() is an out-of-place memory copy; from a performance perspective,
# try to avoid doing this on every iteration.
input = input.to_mkldnn()
hx = hx.to_mkldnn()
cx = cx.to_mkldnn()

# evaluation
output, hidden = mkldnn_rnn(input, (hx, cx))
hy, cy = hidden
```
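Since the inputs were converted to mkldnn layout above, the outputs come back in mkldnn layout as well (rule a). A short follow-up, if you need regular dense CPU tensors afterwards, is to convert them back with `to_dense()`:

```python
# convert the mkldnn outputs back to dense layout for use with regular CPU ops
output = output.to_dense()
hy = hy.to_dense()
cy = cy.to_dense()
```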
MKLDNN RNN has some presettings that differ from PyTorch:
```cpp
/* MKLDNN RNN weight format:
 * mkldnn expects 3 tensors for all layers/directions:
 *   weight_ih (ldigo): {num_layers, num_directions, input_size, num_gates, hidden_size}
 *   weight_hh (ldigo): {num_layers, num_directions, hidden_size, num_gates, hidden_size}
 *   bias      (ldgo):  {num_layers, num_directions, num_biases, hidden_size}
 *
 * for LSTM, bias has 4 gates:
 *   bias = bias_ih + bias_hh
 *
 * for GRU, bias has 4 gates:
 *   (PyTorch GRU bias)     (MKLDNN GRU bias)
 *   bias_ih    bias_hh          bias
 *   +-----+    +-----+      +---------+
 *   | rt1 |    | rt2 |      | zt1+zt2 |
 *   |-----|    |-----|      |---------|
 *   | zt1 |    | zt2 |      | rt1+rt2 |
 *   |-----|    |-----|      |---------|
 *   | nt1 |    | nt2 |      |   nt1   |
 *   +-----+    +-----+      |---------|
 *                           |   nt2   |
 *                           +---------+
 *
 * PyTorch RNN weight format:
 *   a list of length num_layers * num_directions:
 *   {
 *     weight_ih_00, weight_hh_00, bias_ih_00, bias_hh_00,  // layer = 0, direction = 0
 *     weight_ih_01, weight_hh_01, bias_ih_01, bias_hh_01,  // layer = 0, direction = 1
 *     ...,          ...,          ...,        ...,
 *     weight_ih_ld, weight_hh_ld, bias_ih_ld, bias_hh_ld   // layer = l, direction = d
 *   }
 *   weight_ih_ld: {num_gates * hidden_size, input_size}
 *   weight_hh_ld: {num_gates * hidden_size, hidden_size}
 *   bias_ih_ld:   {num_gates * hidden_size}
 *   bias_hh_ld:   {num_gates * hidden_size}
 */
```
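To make the mapping concrete, here is a minimal sketch, assuming a single-layer unidirectional LSTM, of how the PyTorch shapes reorder into the ldigo/ldgo layouts above. This only illustrates the shape transform; the real conversion happens in the C++/ideep code of the PR, and gate reordering such as the GRU bias case shown above is not covered.

```python
import torch

input_size, hidden_size, num_gates = 10, 20, 4  # LSTM has 4 gates
rnn = torch.nn.LSTM(input_size, hidden_size)

# PyTorch stores weight_ih_l0 as {num_gates * hidden_size, input_size}
# and weight_hh_l0 as {num_gates * hidden_size, hidden_size}
w_ih = rnn.weight_ih_l0
w_hh = rnn.weight_hh_l0

# reorder to ldigo: {num_layers, num_directions, input_size, num_gates, hidden_size}
w_ih_ldigo = w_ih.view(num_gates, hidden_size, input_size) \
                 .permute(2, 0, 1) \
                 .reshape(1, 1, input_size, num_gates, hidden_size)
w_hh_ldigo = w_hh.view(num_gates, hidden_size, hidden_size) \
                 .permute(2, 0, 1) \
                 .reshape(1, 1, hidden_size, num_gates, hidden_size)

# for LSTM the two PyTorch biases are summed, ldgo:
# {num_layers, num_directions, num_gates, hidden_size}
bias_ldgo = (rnn.bias_ih_l0 + rnn.bias_hh_l0).view(1, 1, num_gates, hidden_size)

print(w_ih_ldigo.shape, w_hh_ldigo.shape, bias_ldgo.shape)
```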
MKLDNN RNN improves LSTM inference performance by up to 5x; use the benchmark to reproduce the results. The benchmark uses `input_size=250` and `hidden_size=200`, and runs on a single socket (20 cores) and on a single core respectively.

For the scenario of `time_step=1` and single-core inference, memory allocation consumes a considerable amount of time (~1/3 of the total); using jemalloc can significantly improve overall performance. Follow the wiki to compile libjemalloc.so. This gives an additional ~30% performance boost, a free lunch.
```bash
### run original
./run_single_batch_inference.sh

### run mkldnn
./run_single_batch_inference.sh --mkldnn

### run original with jemalloc
LD_PRELOAD=/home/mingfeim/packages/jemalloc-5.2.0/lib/libjemalloc.so ./run_single_batch_inference.sh

### run mkldnn with jemalloc
LD_PRELOAD=/home/mingfeim/packages/jemalloc-5.2.0/lib/libjemalloc.so ./run_single_batch_inference.sh --mkldnn
```
Performance results on Xeon 6148 (unit: sentences per second; higher is better):

time_step | cores | original | mkldnn | original (jemalloc) | mkldnn (jemalloc) | mkldnn vs. original (speedup) | jemalloc boost on mkldnn |
---|---|---|---|---|---|---|---|
15 | 20 | 629 | 3184 | 768 | 4114 | 5.06 | 1.29 |
15 | 1 | 807 | 2976 | 900 | 3676 | 3.69 | 1.24 |
1 | 1 | 5100 | 6653 | 5668 | 8418 | 1.30 | 1.27 |
To further improve the performance:
- mkldnn requires `hx`, `cx` to be concatenated into one tensor `src_iter`; the concat inside `ideep` is 3x slower than `at::cat` (see the sketch after this list).
- correspondingly, mkldnn requires `dst_iter` to be split into `hy`, `cy`; the split `at::chunk` is in-place and takes no time, while `ideep::splitter` is a memory copy.
- (done) double check whether `exp` and `tanh` are properly vectorized: from v0.20 on, elementwise ops in RNN are properly vectorized.
- provide in-place conversion between cpu tensor and mkldnn tensor.
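For reference, a tiny Python-level illustration of the hx/cx packing and the dst_iter split mentioned in the first two items. This is purely illustrative of the concat/split cost pattern; the exact mkldnn `src_iter` layout differs, and in the PR this work happens in C++ via `at::cat`/`at::chunk` or the corresponding `ideep` routines.

```python
import torch

num_layers, num_directions, batch, hidden_size = 1, 1, 5, 20
hx = torch.randn(num_layers * num_directions, batch, hidden_size)
cx = torch.randn(num_layers * num_directions, batch, hidden_size)

# mkldnn consumes the hidden and cell state as one packed tensor (src_iter)
src_iter = torch.cat([hx, cx], dim=0)

# the output state (dst_iter) has the same packed shape and is split back;
# chunk returns views, so no memory copy is involved
dst_iter = src_iter  # stand-in for the MKLDNN output state
hy, cy = dst_iter.chunk(2, dim=0)
```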
Hi Mingfei,
I am trying to use MKLDNN to accelerate LSTM inference on a dual-core Xeon server, and got the following error message:
`RuntimeError: mkldnn_linear: weight and bias need to be mkldnn layout`
After much searching and debugging I still have no clue and couldn't find any documentation about this error. Could you please give some guidance?
Thank you very much!
Xiao