- PyTorch MKLDNN integration prototype design
- MKLDNN conv integration
- conv3d parallelization: vol2col, col2vol
- LSTM optimization (non-fused): tanh/sigmoid parallelization
- Create MKLDNN conda channel
- MKLDNN tensor type
- create lib/THMKL?
The purpose is to further improve PyTorch CPU performance on both the imperative and JIT paths.
MKLDNN requires reordering memory from the plain layout to a blocked layout to achieve optimal performance on CPU, e.g. from nchw to nChw16c, etc. At this moment in PyTorch, MKLDNN operators reuse the CPU tensor, which means each MKLDNN operator takes three steps to finish the computation:
input_reorder(plain_layout, blocked_layout)
mkldnn_computation()
output_reorder(blocked_layout, plain_layout)

These reorders take about 50% of the time on a typical ImageNet topology, e.g. ResNet50. Also, MKLDNN chooses a different blocked format according to the input configuration of the Convolution; since nn.Conv2d always outputs in plain layout, subsequent layers (BatchNorm, Pooling) can only execute on the plain layout, and this is the slow path for MKLDNN. With these problems solved, the CNN models would have a 3~4x speedup versus current performance.
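The plain-to-blocked reorder can be mimicked in NumPy to show what nChw16c actually means; this is an illustrative sketch, not MKLDNN's implementation (the block size of 16 matches the AVX-512 register width for fp32):

```python
import numpy as np

# Mimic MKL-DNN's nchw -> nChw16c reorder: split C into blocks of 16 and
# move the block to the innermost dimension, so each 16-channel group is
# contiguous and can feed a vectorized kernel directly.
N, C, H, W, B = 2, 32, 4, 4, 16
x = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)

blocked = x.reshape(N, C // B, B, H, W).transpose(0, 1, 3, 4, 2)  # nChw16c
restored = blocked.transpose(0, 1, 4, 2, 3).reshape(N, C, H, W)   # back to nchw

assert blocked.shape == (N, C // B, H, W, B)
assert (restored == x).all()
```

The round trip above is exactly the input_reorder/output_reorder pair that brackets each MKLDNN computation when tensors stay in plain layout.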
Hotspot analysis:
/opt/intel/vtune_amplifier/bin64/amplxe-cl -collect hotspots -knob analyze-openmp=true -knob sampling-interval=10 --resume-after 5 -d 20 \
-- /home/mingfeim/pytorch/unit_tests/run.sh
/opt/intel/vtune_amplifier/bin64/amplxe-cl -archive -r $1

Interpret vtune log function names, e.g.:
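The function names in the vtune log are Itanium-ABI mangled C++ symbols; they can be demangled with c++filt (the symbol below is a made-up example, not taken from an actual log):

```shell
# Demangle a C++ symbol as it would appear in the vtune hotspot list.
echo '_ZN2at6native10conv2d_cpuEv' | c++filt
# -> at::native::conv2d_cpu()
```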
Backup for PR19736: topk() performance optimization on CPU.
Suppose the input tensor has shape [N, C]; measure the performance of input.topk(K, sorted=Sorted) for the following scenarios:
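As a reference for what input.topk(K, sorted=Sorted) computes per row, here is a hedged NumPy sketch: np.argpartition selects the top k in O(C) per row instead of a full O(C log C) sort, which is the algorithmic idea behind a fast CPU topk (this is an illustration, not PyTorch's actual kernel):

```python
import numpy as np

def topk(x, k, sorted=True):
    # Partial selection along the last axis: argpartition guarantees the
    # first k positions hold the k largest entries (in arbitrary order).
    idx = np.argpartition(-x, k - 1, axis=-1)[..., :k]
    val = np.take_along_axis(x, idx, axis=-1)
    if sorted:
        # Only sort the k selected entries, not the full row of C.
        order = np.argsort(-val, axis=-1)
        idx = np.take_along_axis(idx, order, axis=-1)
        val = np.take_along_axis(val, order, axis=-1)
    return val, idx

x = np.random.randn(4, 1000)
vals, inds = topk(x, 5)
```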
nn.Embedding()
ref: TensorFlow
The performance evaluation is based on the huggingface repo; the actual benchmark run script is placed in the repo. How to reproduce the performance:

Set GLUE_DIR to the actual dataset path in run_inference.sh.

Inference performance results on Xeon 6148 (2x20 cores), single socket and single thread.
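A single-socket, single-thread run is normally pinned explicitly; a hypothetical invocation might look like the following (the node id and affinity settings are assumptions, adjust for your machine):

```shell
# Hypothetical pinning for a single-socket, single-thread run on Xeon 6148.
export OMP_NUM_THREADS=1
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl --cpunodebind=0 --membind=0 bash run_inference.sh
```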
This gist keeps a record of the MKLDNN RNN integration work in PyTorch and serves as a backup of PR26387; only the inference feature is provided at the moment.
To use MKLDNN RNN in PyTorch:
example: how to enable mkl-dnn RNN
import torch
from torch.utils import mkldnn as mkldnn_utils

This file serves as a BKM to get better performance on CPU for PyTorch, mostly focusing on inference or deployment. A Chinese version is available here.
Right now, on the PyTorch CPU path, you may choose from three types of memory formats.
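A hedged sketch of those formats, assuming they refer to the default contiguous NCHW layout, channels last (NHWC), and the MKL-DNN blocked layout:

```python
import torch

x = torch.randn(1, 3, 8, 8)                     # 1. plain contiguous NCHW

nhwc = x.to(memory_format=torch.channels_last)  # 2. channels last: same shape,
print(nhwc.stride())                            #    NHWC strides (192, 1, 24, 3)

if torch.backends.mkldnn.is_available():        # 3. opaque MKL-DNN blocked layout
    y = x.to_mkldnn()          # reorder plain -> blocked
    z = y.to_dense()           # reorder blocked -> plain
    assert torch.equal(x, z)   # round trip preserves values
```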
PyTorch can be installed via different channels: conda, pip, docker, source code...
By default, mkl and mkl-dnn are enabled; but this might not always be true, so it is still useful to learn how to check this by yourself:
### check where your torch is installed
python -c 'import torch; print(torch.__path__)'
### check build options (look for USE_MKL / USE_MKLDNN in the output)
python -c 'import torch; print(torch.__config__.show())'

Trace of #30806: torch.cat() performance regression.
benchmark_all_test result, command line:
python -m benchmark_all_test --operators cat --tag_filter all
(pytorch-mingfei) [mingfeim@mlt-skx090 operator_benchmark]$ python -m benchmark_all_test --operators cat --tag_filter all
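A minimal standalone timing sketch for torch.cat is below; the shapes and iteration count are made up for illustration, not those from the regression report:

```python
import timeit

import torch

# Time torch.cat on a list of 2-D tensors along dim 0.
xs = [torch.randn(128, 256) for _ in range(8)]
t = timeit.timeit(lambda: torch.cat(xs, dim=0), number=1000)
print(f"torch.cat: {t / 1000 * 1e6:.1f} us per call")
```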