- pytorch mkldnn integration prototype design
- mkldnn conv integration
- conv3d parallelization: vol2col, col2vol
- LSTM optimization non-fused: tanh/sigmoid parallelization
- create MKLDNN conda channel
- MKLDNN tensor type
- create lib/THMKL?
The purpose is to further improve PyTorch CPU performance on both the imperative and JIT paths.
MKLDNN requires reordering memory from `plain` layout to `blocked` layout to achieve optimal performance on CPU, e.g. from `nchw` to `nChw16c`, etc. At the moment, MKLDNN operators in PyTorch reuse the plain-layout CPU tensor, which means each MKLDNN operator takes three steps to finish the computation:
input_reorder(plain_layout, blocked_layout)
mkldnn_computation()
output_reorder(blocked_layout, plain_layout)
These reorders take about 50% of the time on a typical ImageNet topology, e.g. ResNet50. Also, MKLDNN chooses a different `blocked` format according to the input configuration of `Convolution`; since `nn.Conv2d` always outputs in `plain` layout, subsequent layers (`BatchNorm`, `Pooling`) can only execute on `plain` layout, and this is the slow path for MKLDNN. With these problems solved, CNN models would see a 3~4x speedup vs. current performance.
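To make the reorder cost concrete, here is a minimal sketch (layer and input sizes are illustrative) contrasting the current per-operator flow with keeping the tensor in blocked layout across the operator boundary, using the existing `Tensor.to_mkldnn()` / `to_dense()` conversions and a float32 module in eval mode:

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

conv = torch.nn.Conv2d(3, 64, kernel_size=3).eval()
x = torch.randn(1, 3, 224, 224)

# Current path: internally, each MKLDNN op does
#   input_reorder(plain -> blocked); compute; output_reorder(blocked -> plain)
y_plain = conv(x)

# Preferred path: reorder once on entry, stay in blocked layout,
# and only convert back to plain layout at the very end.
mkldnn_conv = mkldnn_utils.to_mkldnn(conv)
y = mkldnn_conv(x.to_mkldnn()).to_dense()
```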
Hotspot analysis:
/opt/intel/vtune_amplifier/bin64/amplxe-cl -collect hotspots -knob analyze-openmp=true -knob sampling-interval=10 --resume-after 5 -d 20 \
-- /home/mingfeim/pytorch/unit_tests/run.sh
/opt/intel/vtune_amplifier/bin64/amplxe-cl -archive -r $1
Interpreting VTune log function names, e.g.:
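As a hedged illustration (the mangled symbol below is made up for the example, not taken from an actual log), C++ symbols in the VTune output can be demangled with `c++filt`:

```
$ echo '_ZN2at6native10avg_pool2dERKNS_6TensorE' | c++filt
at::native::avg_pool2d(at::Tensor const&)
```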
Backup for PR19736, topk() performance optimization on CPU.
Suppose the input tensor has shape `[N, C]`; measure the performance of `input.topk(K, sorted=Sorted)` for the following scenarios:
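Whatever the scenario, a minimal standalone timing sketch looks like this (N, C, K below are illustrative, not the sizes used in the PR):

```python
import time
import torch

N, C, K = 8, 1000000, 100        # illustrative sizes only
x = torch.randn(N, C)

start = time.time()
for _ in range(100):
    x.topk(K, sorted=True)       # top-K values and indices per row
print("avg per call: %.6f s" % ((time.time() - start) / 100))
```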
nn.Embedding() (ref: TensorFlow)
Performance evaluation is based on the huggingface repo; the actual benchmark run script is placed in that repo. How to reproduce the performance: set GLUE_DIR to the actual dataset path in run_inference.sh.
Inference performance results on Xeon 6148 (2x20 cores), single socket and single thread.
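A sketch of the reproduction steps (the dataset path below is a placeholder; run_inference.sh is the script from the benchmark repo mentioned above):

```
### point GLUE_DIR at the actual dataset path inside run_inference.sh
GLUE_DIR=/path/to/glue_data
./run_inference.sh
```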
This gist keeps a record of the MKLDNN RNN integration work in PyTorch and serves as a backup of PR26387; only the inference feature is provided at the moment.
To use the MKLDNN RNN in PyTorch:
Example: how to enable the MKLDNN RNN
import torch
from torch.utils import mkldnn as mkldnn_utils
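Completing the snippet above, a minimal sketch of the intended usage, assuming an inference-only LSTM converted through torch.utils.mkldnn (module and input sizes are illustrative):

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

# only inference is supported, so put the module in eval mode first
model = torch.nn.LSTM(input_size=32, hidden_size=64, num_layers=1).eval()
model = mkldnn_utils.to_mkldnn(model)   # reorder weights into MKLDNN layout

x = torch.randn(5, 1, 32)               # (seq_len, batch, input_size)
with torch.no_grad():
    output, (h_n, c_n) = model(x)
```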
This file serves as a BKM (best known methods) to get better performance on CPU for PyTorch, mostly focusing on inference or deployment. A Chinese version is available here.
Right now, on the PyTorch CPU path, you may choose among 3 types of memory format.
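In current PyTorch these are the default contiguous (NCHW) layout, channels last (NHWC), and the opaque MKLDNN blocked layout; a quick sketch of all three:

```python
import torch

x = torch.randn(32, 3, 224, 224)                 # default contiguous (NCHW)

nhwc = x.to(memory_format=torch.channels_last)   # channels last (NHWC)
print(nhwc.is_contiguous(memory_format=torch.channels_last))  # True

blocked = x.to_mkldnn()                          # opaque MKLDNN blocked layout
print(blocked.layout)                            # torch._mkldnn
```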
PyTorch can be installed via different channels: conda, pip, docker, source code, etc.
By default, mkl and mkl-dnn are enabled; but this might not always be true, so it is still useful to learn how to check this by yourself:
### check where your torch is installed
python -c 'import torch; print(torch.__path__)'
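You can also query the build directly (these are standard PyTorch APIs):

```
### check whether MKL and MKL-DNN are available in this build
python -c 'import torch; print(torch.backends.mkl.is_available())'
python -c 'import torch; print(torch.backends.mkldnn.is_available())'
### print the full build configuration, including BLAS/MKL-DNN details
python -c 'import torch; print(torch.__config__.show())'
```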
Trace of #30806, a torch.cat() performance regression.
benchmark_all_test result, command line:
python -m benchmark_all_test --operators cat --tag_filter all
(pytorch-mingfei) [mingfeim@mlt-skx090 operator_benchmark]$ python -m benchmark_all_test --operators cat --tag_filter all
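For reference, a minimal standalone timing sketch for torch.cat (input sizes are illustrative and unrelated to the benchmark suite's configs):

```python
import time
import torch

tensors = [torch.randn(256, 512) for _ in range(10)]  # illustrative inputs

start = time.time()
for _ in range(1000):
    torch.cat(tensors, dim=0)                         # concat along dim 0
print("avg per call: %.6f s" % ((time.time() - start) / 1000))
```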