_(Internal Training Material)_

Usually the first step in performance optimization is profiling, e.g. to identify the performance hotspots of a workload. This gist covers the basics of performance profiling on PyTorch; you will learn:

* How to find the bottleneck operator?
* How to trace the source file of a particular operator?
* How to identify threading issues? [(oversubscription)](https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/thread-oversubscription.html)
* How to tell whether a specific operator is running efficiently or not?

This tutorial takes one of my recent projects - [pssp-transformer](https://github.com/mingfeima/pssp) - as an example to guide you through the path of PyTorch CPU performance optimization. Focus will be on Part 1 & Part 2.

### PyTorch Autograd Profiler

PyTorch has a built-in [profiler](https://pytorch.org/docs/stable/autograd.html#profiler) in the autograd module, a.k.a. the PyTorch autograd profiler. The usage is fairly simple: you can tell the `torch.autograd` engine to keep a record of the execution time of each operator in the following way:

```python
with torch.autograd.profiler.profile() as prof:
    output = model(input)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```

Code snippet is [here](https://github.com/mingfeima/pssp/blob/master/pssp-transformer/main.py#L184). `torch.autograd.profiler` will record any PyTorch operator (including external operators registered in PyTorch as extensions, e.g. `_ROIAlign` from detectron2) but **not** operators foreign to PyTorch, such as **numpy**. For CUDA profiling, you need to provide the argument `use_cuda=True`.

After running the profiler, you will get a basic understanding of which operator(s) are hotspots. For `pssp-transformer`, you will get something like:

```bash
---------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                      Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
---------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
bmm                                       14.91%            35.895s          30.11%           72.485s          3.634ms          19944
div                                       12.09%            29.100s          24.52%           59.034s          1.076ms          54846
copy_                                     11.07%            26.636s          13.41%           32.290s          231.292us        139608
mm                                        10.76%            25.891s          21.65%           52.112s          1.254ms          41550
mul                                       8.92%             21.477s          17.93%           43.157s          1.770ms          24376
bernoulli_                                7.35%             17.687s          14.72%           35.421s          3.996ms          8864
native_layer_norm_backward                6.69%             16.103s          13.39%           32.241s          5.820ms          5540
_softmax_backward_data                    3.80%             9.145s           7.64%            18.379s          5.529ms          3324
_softmax                                  3.42%             8.228s           6.85%            16.491s          4.961ms          3324
masked_fill_                              2.64%             6.362s           5.26%            12.671s          1.906ms          6648
mkldnn_convolution_backward_weights       2.52%             6.061s           2.53%            6.096s           2.751ms          2216
mkldnn_convolution                        2.49%             5.999s           5.00%            12.034s          2.715ms          4432
mkldnn_convolution_backward_input         1.96%             4.728s           1.98%            4.757s           2.147ms          2216
embedding_dense_backward                  1.64%             3.953s           3.30%            7.949s           7.174ms          1108
add_                                      1.32%             3.169s           2.69%            6.465s           40.992us         157716
div_                                      1.31%             3.152s           4.00%            9.628s           724.096us        13296
sum                                       1.18%             2.847s           2.44%            5.863s           407.041us        14404
add                                       1.13%             2.724s           2.31%            5.565s           295.455us        18836
empty                                     0.50%             1.214s           0.59%            1.414s           7.353us          192240
mul_                                      0.45%             1.084s           1.08%            2.597s           25.472us         101936
```

The table itself is quite self-explanatory. At first glance, `bmm()` seems to be the one to blame for the _bad_ CPU performance.
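Depending on your PyTorch version, the autograd profiler can also record the input shapes each operator was called with, which gives a per-shape breakdown of the table above. A minimal sketch, assuming `record_shapes` and `group_by_input_shape` are available in your build:

```python
# record input shapes alongside timings (assumes a PyTorch version that supports record_shapes)
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    output = model(input)

# group the summary by (operator, input shapes) instead of by operator only
print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))
```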
But actually, a few more factors still need to be taken into account:

* **different input configs for each operator**: the workload might call `bmm()` multiple times; basically we want to know: where is it called, and with what input tensor sizes? It is possible that a 4K-by-4K `bmm` runs at high efficiency while a 4K-by-13 one does the opposite.
* **operator efficiency**: e.g. is this operator itself running with low efficiency?

For the first question, we need to find where the operator is used in the workload - for `bmm()`, a simple `grep` will do the job really well. But what about overloaded operators such as `div`, which is just a plain divide (`/`) in the Python code? Here we need a Python-level profiling tool, e.g. **cProfile**, as shown in Part 2.

For the second question, we need **vtune** to collect performance metrics such as CPI, bandwidth, etc. Also, domain knowledge is a strong plus: since I have optimized `torch.bmm()` before, I know there is basically nothing more we can do about `bmm()` itself. `native_layer_norm_backward`, on the other hand, is a sequential kernel, so there is room for improvement.

### cProfile Profiler

[cProfile](https://docs.python.org/3/library/profile.html) is Python's built-in profiler, which means anything in Python will be recorded. Usage:

```bash
python -m cProfile -o output.pstats <your_script.py> arg1 arg2 …
```

Once you get the `output.pstats` file, you can use a very cool tool to convert the result into a human-readable image - [gprof2dot](https://github.com/jrfonseca/gprof2dot):

```bash
pip install graphviz
pip install gprof2dot
gprof2dot -f pstats output.pstats | dot -Tpng -o output.png
```

On `pssp-transformer`, you will get a call tree as shown in **z1.png**, where you can trace each hotspot back to its source code. That makes it very easy to find the input configs.

### VTune

[vtune](https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html) is the ultimate tool for CPU performance profiling, similar to **nvprof** on GPU. You can launch vtune with a GUI as well as with scripts; I have listed my usual collection script in another [gist](https://gist.github.com/mingfeima/032d625e8230ca8e208f597074d38ec9). With the `openmp` threading metrics collected, it is very straightforward to see how many active threads are running at any moment, which makes it easy to identify oversubscription issues.

Usually when I decide to optimize a particular operator for a given input size, I write a micro benchmark to feed into vtune. vtune provides a lot of useful information; this is a huge topic which is beyond the scope of this gist, and it also requires a lot of personal expertise. Most basically, vtune can collect performance metrics at the instruction level, and usually CPI (cycles per instruction) should be below 0.5 or 1. An abnormally high CPI may result from: a) slow instructions (e.g. integer div); b) function calls on the critical path that are not inlined; c) memory access; d) construction/destruction of large objects (e.g. TensorImpl - you should not see this frequently in PyTorch, it is very expensive); etc.

### Closing

As for the workload `pssp-transformer`, after going through the profiling steps listed above, the hotspots are:

* [ScaledDotProductAttention](https://github.com/mingfeima/pssp/blob/master/pssp-transformer/transformer/Modules.py#L16) - once you check the input and weight tensor sizes, it is easy to see that the memory footprint is too large and the consecutive operators (`bmm`, `div`, `softmax`) flush the cache every time. Solution will be to fuse these operators together in a cache-friendly manner (a micro-benchmark sketch of this pattern follows the list).
* [LayerNorm](https://github.com/mingfeima/pssp/blob/master/pssp-transformer/transformer/SubLayers.py#L27) backward is a sequential kernel. Solution is [Optimize LayerNorm performance on CPU both forward and backward](https://github.com/pytorch/pytorch/pull/35750).
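To make the first hotspot concrete, below is a minimal micro-benchmark sketch of the `bmm` → `div` → `softmax` → `bmm` chain, in the spirit of the "write a micro benchmark and feed it into vtune" workflow above. The tensor shapes are hypothetical placeholders, not the actual `pssp-transformer` configuration - substitute the shapes you found via cProfile before profiling:

```python
import torch

def scaled_dot_product_attention(q, k, v, temperature):
    # the unfused hotspot pattern: every step materializes a large
    # intermediate, which keeps flushing the cache
    attn = torch.bmm(q, k.transpose(1, 2))   # (B, L, L)
    attn = attn / temperature                # div
    attn = torch.softmax(attn, dim=-1)       # softmax
    return torch.bmm(attn, v)                # (B, L, D)

# hypothetical sizes: batch (n_head * batch_size), sequence length, head dim
B, L, D = 64, 512, 64
q, k, v = (torch.randn(B, L, D) for _ in range(3))

# warm up first so one-time costs do not pollute the profile
for _ in range(10):
    scaled_dot_product_attention(q, k, v, D ** 0.5)

with torch.autograd.profiler.profile() as prof:
    for _ in range(100):
        scaled_dot_product_attention(q, k, v, D ** 0.5)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```

Running this standalone script under vtune (or the autograd profiler, as above) at your real input sizes shows whether the pattern is memory-bound, which is what motivates fusing the chain.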