_(Internal Training Material)_

Usually the first step in performance optimization is profiling, e.g. to identify the performance hotspots of a workload. This gist covers the basics of performance profiling on PyTorch; you will learn:

* How to find the bottleneck operator?
* How to trace the source file of a particular operator?
* How to identify threading issues? [(oversubscription)](https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/thread-oversubscription.html)
* How to tell whether a specific operator is running efficiently or not?

This tutorial takes one of my recent projects - [pssp-transformer](https://github.com/mingfeima/pssp) - as an example to guide you through the path of PyTorch CPU performance optimization. Focus will be on Part 1 & Part 2.

### PyTorch Autograd Profiler

PyTorch has a built-in [profiler](https://pytorch.org/docs/stable/autograd.html#profiler) in the autograd module, a.k.a. the PyTorch autograd profiler. The usage is fairly simple: you can tell the `torch.autograd` engine to keep a record of the execution time of each operator in the following way:

```python
with torch.autograd.profiler.profile() as prof:
    output = model(input)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```

Code snippet is [here](https://github.com/mingfeima/pssp/blob/master/pssp-transformer/main.py#L184). `torch.autograd.profiler` will record any PyTorch operator (including external operators registered in PyTorch as extensions, e.g. `_ROIAlign` from detectron2) but **not** operators foreign to PyTorch, such as **numpy**. For CUDA profiling, you need to provide the argument `use_cuda=True`.

After running the profiler, you will get a basic understanding of which operator(s) are hotspots. For `pssp-transformer`, you will get something like:

```bash
---------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                      Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
---------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
bmm                                       14.91%            35.895s          30.11%           72.485s          3.634ms          19944
div                                       12.09%            29.100s          24.52%           59.034s          1.076ms          54846
copy_                                     11.07%            26.636s          13.41%           32.290s          231.292us        139608
mm                                        10.76%            25.891s          21.65%           52.112s          1.254ms          41550
mul                                       8.92%             21.477s          17.93%           43.157s          1.770ms          24376
bernoulli_                                7.35%             17.687s          14.72%           35.421s          3.996ms          8864
native_layer_norm_backward                6.69%             16.103s          13.39%           32.241s          5.820ms          5540
_softmax_backward_data                    3.80%             9.145s           7.64%            18.379s          5.529ms          3324
_softmax                                  3.42%             8.228s           6.85%            16.491s          4.961ms          3324
masked_fill_                              2.64%             6.362s           5.26%            12.671s          1.906ms          6648
mkldnn_convolution_backward_weights       2.52%             6.061s           2.53%            6.096s           2.751ms          2216
mkldnn_convolution                        2.49%             5.999s           5.00%            12.034s          2.715ms          4432
mkldnn_convolution_backward_input         1.96%             4.728s           1.98%            4.757s           2.147ms          2216
embedding_dense_backward                  1.64%             3.953s           3.30%            7.949s           7.174ms          1108
add_                                      1.32%             3.169s           2.69%            6.465s           40.992us         157716
div_                                      1.31%             3.152s           4.00%            9.628s           724.096us        13296
sum                                       1.18%             2.847s           2.44%            5.863s           407.041us        14404
add                                       1.13%             2.724s           2.31%            5.565s           295.455us        18836
empty                                     0.50%             1.214s           0.59%            1.414s           7.353us          192240
mul_                                      0.45%             1.084s           1.08%            2.597s           25.472us         101936
```

The table itself is quite self-explanatory. At first glance, `bmm()` seems to be the one to blame for the _bad_ CPU performance.
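Depending on your PyTorch version, the autograd profiler can also record the input shapes each operator was called with, which gives a per-shape breakdown of the table above. A minimal sketch, assuming `record_shapes` and `group_by_input_shape` are available in your build:

```python
# record input shapes alongside timings (assumes a PyTorch version that supports record_shapes)
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    output = model(input)

# group the summary by (operator, input shapes) instead of by operator only
print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))
```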
But actually, a few more factors still need to be taken into account:

* **different input configs for each operator**: the workload might call `bmm()` multiple times; basically we want to know: where is it called, and with what input tensor sizes? It is possible that a 4K-by-4K `bmm` runs at high efficiency while a 4K-by-13 one does the opposite.
* **operator efficiency**: e.g. is this operator itself running with low efficiency?

For the first question, we need to find where the operator is used in the workload - for `bmm()`, a simple `grep` will do the job really well. But what about overloaded operators such as `div`, which is just a plain divide (`/`) in the Python code? Here we need a Python-level profiling tool, e.g. **cProfile**, as shown in Part 2.

For the second question, we need **vtune** to collect performance metrics such as CPI, bandwidth, etc. Also, domain knowledge is a strong plus: since I have optimized `torch.bmm()` before, I know there is basically nothing more we can do about `bmm()` itself. `native_layer_norm_backward`, on the other hand, is a sequential kernel, so there is room for improvement.

### cProfile Profiler

[cProfile](https://docs.python.org/3/library/profile.html) is Python's built-in profiler, which means anything in Python will be recorded. Usage:

```bash
python -m cProfile -o output.pstats <your_script.py> arg1 arg2 …
```

Once you get the `output.pstats` file, you can use a very cool tool to convert the result into a human-readable image - [gprof2dot](https://github.com/jrfonseca/gprof2dot):

```bash
pip install graphviz
pip install gprof2dot
gprof2dot -f pstats output.pstats | dot -Tpng -o output.png
```

On `pssp-transformer`, you will get a call tree as shown in **z1.png**, where you can trace each hotspot back to its source code. That makes it very easy to find the input configs.

### VTune

[vtune](https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html) is the ultimate tool for CPU performance profiling, similar to **nvprof** on GPU. You can launch vtune with a GUI as well as with scripts; I have listed my usual collection script in another [gist](https://gist.github.com/mingfeima/032d625e8230ca8e208f597074d38ec9). With the `openmp` threading metrics collected, it is very straightforward to see how many active threads are running at any moment, which makes it easy to identify oversubscription issues.

Usually when I decide to optimize a particular operator for a given input size, I write a micro benchmark to feed into vtune. vtune provides a lot of useful information; this is a huge topic which is beyond the scope of this gist, and it also requires a lot of personal expertise. Most basically, vtune can collect performance metrics at the instruction level, and usually CPI (cycles per instruction) should be below 0.5 or 1. An abnormally high CPI may result from: a) slow instructions (e.g. integer div); b) function calls on the critical path that are not inlined; c) memory access; d) construction/destruction of large objects (e.g. TensorImpl - you should not see this frequently in PyTorch, it is very expensive); etc.

### Closing

As for the workload `pssp-transformer`, after going through the profiling steps listed above, the hotspots are:

* [ScaledDotProductAttention](https://github.com/mingfeima/pssp/blob/master/pssp-transformer/transformer/Modules.py#L16) - once you check the input and weight tensor sizes, it is easy to see that the memory footprint is too large and the consecutive operators (`bmm`, `div`, `softmax`) flush the cache every time. Solution will be to fuse these operators together in a cache-friendly manner (a micro-benchmark sketch of this pattern follows the list).
* [LayerNorm](https://github.com/mingfeima/pssp/blob/master/pssp-transformer/transformer/SubLayers.py#L27) backward is a sequential kernel. Solution is [Optimize LayerNorm performance on CPU both forward and backward](https://github.com/pytorch/pytorch/pull/35750).
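To make the first hotspot concrete, below is a minimal micro-benchmark sketch of the `bmm` → `div` → `softmax` → `bmm` chain, in the spirit of the "write a micro benchmark and feed it into vtune" workflow above. The tensor shapes are hypothetical placeholders, not the actual `pssp-transformer` configuration - substitute the shapes you found via cProfile before profiling:

```python
import torch

def scaled_dot_product_attention(q, k, v, temperature):
    # the unfused hotspot pattern: every step materializes a large
    # intermediate, which keeps flushing the cache
    attn = torch.bmm(q, k.transpose(1, 2))   # (B, L, L)
    attn = attn / temperature                # div
    attn = torch.softmax(attn, dim=-1)       # softmax
    return torch.bmm(attn, v)                # (B, L, D)

# hypothetical sizes: batch (n_head * batch_size), sequence length, head dim
B, L, D = 64, 512, 64
q, k, v = (torch.randn(B, L, D) for _ in range(3))

# warm up first so one-time costs do not pollute the profile
for _ in range(10):
    scaled_dot_product_attention(q, k, v, D ** 0.5)

with torch.autograd.profiler.profile() as prof:
    for _ in range(100):
        scaled_dot_product_attention(q, k, v, D ** 0.5)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```

Running this standalone script under vtune (or the autograd profiler, as above) at your real input sizes shows whether the pattern is memory-bound, which is what motivates fusing the chain.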