mingfeima / rnn_perf_optimization.md

Last active April 20, 2026 05:55

MKLDNN RNN integration in PyTorch

This gist keeps a record of the MKLDNN RNN integration work in PyTorch and serves as a backup of PR26387. Only the inference feature is provided at the moment.

To use MKLDNN RNN in PyTorch:

convert model to mkldnn
(optional) convert input and hx/cx to mkldnn

example: how to enable mkl-dnn RNN

import torch
from torch.utils import mkldnn as mkldnn_utils

mingfeima / bert_optimization.md

Last active July 8, 2022 06:13

BERT Optimization

benchmark

Performance evaluation is based on the huggingface repo; the actual benchmark run script is placed at repo. To reproduce the performance numbers:

prepare the dataset according to link.
update GLUE_DIR to the actual dataset path in run_inference.sh.
change the env settings; the default setting uses 20 cores.

MKL v.s. MKLDNN

Inference performance results on Xeon 6148 (2x20 cores), single socket and single thread.

mingfeima / embedding_optimization.md

Created May 27, 2019 01:28

Recommendation

nn.Embedding()

ref: TensorFlow

mingfeima / topk.md

Last active July 2, 2019 02:43

topk_optimization_backups

backups for PR19736 of topk() performance optimization on CPU.

description

Suppose the input tensor has shape [N, C]; measure the performance of input.topk(K, sorted=Sorted) for the following scenarios:

C = 10000, 40000, 320000
K = 10, 50, 100, C/10, C/2, C-5
Test with 20 threads and 1 thread
Test with Sorted=True and Sorted=False
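
The scenario list above forms a small test matrix. A minimal sketch of enumerating it in plain Python follows. The helper names (k_values, scenarios) and the dictionary keys are hypothetical and not taken from PR19736, and the integer division for C/10 and C/2 is an assumption. Each entry would then drive one timed input.topk(K, sorted=Sorted) call on an [N, C] tensor; N is not specified in the description and is left out here.

```python
from itertools import product

# Values copied from the scenario description.
C_VALUES = [10000, 40000, 320000]
THREADS = [20, 1]
SORTED = [True, False]


def k_values(c):
    """Return the K values listed in the description for a given C.

    Integer division for C/10 and C/2 is an assumption.
    """
    return [10, 50, 100, c // 10, c // 2, c - 5]


def scenarios():
    """Yield one dict per (C, K, threads, sorted) combination."""
    for c, threads, is_sorted in product(C_VALUES, THREADS, SORTED):
        for k in k_values(c):
            yield {"C": c, "K": k, "threads": threads, "sorted": is_sorted}


grid = list(scenarios())
```

The grid holds 3 x 6 x 2 x 2 = 72 configurations, and every K stays below its C, so each call is a valid topk request.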

mingfeima / [BKM] VTune.md

Last active May 22, 2019 01:42

vtune tips

Hotspot analysis:

/opt/intel/vtune_amplifier/bin64/amplxe-cl -collect hotspots -knob analyze-openmp=true -knob sampling-interval=10 --resume-after 5 -d 20 \
  -- /home/mingfeim/pytorch/unit_tests/run.sh
/opt/intel/vtune_amplifier/bin64/amplxe-cl -archive -r $1

Interpret vtune log function names: e.g.

mingfeima / mkldnn_integration_plan.md

Last active May 15, 2020 20:03

mkldnn integration plan, RFC draft

MKL-DNN Integration Plan

The purpose is to further improve PyTorch CPU performance on both the imperative path and the jit path. MKLDNN requires reordering memory from plain layout to blocked layout to achieve optimal performance on CPU, e.g. from nchw to nChw16c. At the moment in PyTorch, MKLDNN operators reuse the CPU tensor, which means each MKLDNN operator takes three steps to finish the computation:

input_reorder(plain_layout, blocked_layout)
mkldnn_computation()
output_reorder(blocked_layout, plain_layout)

These reorders take about 50% of the time on a typical ImageNet topology, e.g. ResNet50. Also, MKLDNN chooses different blocked formats according to the input config of Convolution. Since nn.Conv2d always outputs in plain layout, subsequent layers (BatchNorm, Pooling) only execute on plain layout, which is the slow path for MKLDNN. With these problems solved, CNN models would have a 3~4x speedup v.s. current performance.
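
To make the reorder cost concrete, here is a minimal pure-Python sketch of what a plain-to-blocked reorder does to element offsets. It assumes nChw16c stores data as [N, C/16, H, W, 16] and that C is a multiple of 16 (no channel padding is modeled). The function names are hypothetical and are not MKLDNN or PyTorch APIs.

```python
BLOCK = 16  # channel block size in nChw16c


def nchw_offset(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) in plain nchw layout."""
    return ((n * C + c) * H + h) * W + w


def nchw16c_offset(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) in blocked nChw16c layout.

    Assumes C is a multiple of 16; padding is not handled in this sketch.
    """
    assert C % BLOCK == 0
    cb, ci = divmod(c, BLOCK)  # channel block index, index inside the block
    return (((n * (C // BLOCK) + cb) * H + h) * W + w) * BLOCK + ci


def reorder(src, N, C, H, W, src_offset, dst_offset):
    """Copy every element from one layout to the other."""
    dst = [0] * len(src)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    dst[dst_offset(n, c, h, w, C, H, W)] = \
                        src[src_offset(n, c, h, w, C, H, W)]
    return dst


N, C, H, W = 1, 32, 2, 2
plain = list(range(N * C * H * W))
# Step 1: input_reorder(plain_layout, blocked_layout)
blocked = reorder(plain, N, C, H, W, nchw_offset, nchw16c_offset)
# Step 3: output_reorder(blocked_layout, plain_layout)
restored = reorder(blocked, N, C, H, W, nchw16c_offset, nchw_offset)
```

Both reorders touch every element once, so an operator that reorders on entry and exit pays two full passes over the tensor in addition to the computation itself. Keeping tensors in blocked layout between operators removes those passes.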

mingfeima / pytorch_perf_optimization_cpu.md

Last active December 29, 2017 01:51

PyTorch Performance Optimization on CPU

pytorch mkldnn integration prototype design

mkldnn conv integration
conv3d parallelization: vol2col, col2vol
LSTM optimization non-fused: tanh/sigmoid parallelization

Create MKLDNN conda channel
MKLDNN tensor type

create lib/THMKL?

Ma Mingfei mingfeima
