The purpose is to further improve PyTorch CPU performance on both imperative path and jit path.
MKLDNN requires to reorder memory from plain layout to blocked layout to achieve optimal performance on CPU, e.g. from nchw to nChw16c, etc. At this moment on PyTorch, MKLDNN operators reuse CPU tensor, which means for each MKLDNN operator, it takes three steps to finish the computation:
input_reorder(plain_layout, blocked_layout)
mkldnn_computation()
output_reorder(blocked_layout, plain_layout)
These reorders takes about 50% of time on a typical ImageNet topology, e.g. ResNet50. Also MKLDNN chose different blocked format according to different input config from Convolution, with nn.Conv2d always output in plain layout, subsequent layers (BatchNorm, Pooling) would only execute on plain layout and this is the slow path for MKLDNN. With these problems solved, the CNN models would have 3~4x speedup v.s. current performance.