(Internal Training Material)
Usually the first step in performance optimization is profiling, i.e. identifying the performance hotspots of a workload. This gist covers the basics of performance profiling on PyTorch and answers the following questions:
- How to find the bottleneck operator?
- How to trace the source file of a particular operator?
- How to identify threading issues (e.g. oversubscription)?
- How to tell whether a specific operator is running efficiently or not?
This tutorial takes one of my recent projects, pssp-transformer, as an example to guide you through the process of PyTorch CPU performance optimization. The focus will be on Part 1 and Part 2.
PyTorch has a built-in profiler in the autograd module, a.k.a. the PyTorch autograd profiler. Usage is fairly simple: you can tell the torch.autograd engine to keep a record of the execution time of each operator as follows:
import torch

with torch.autograd.profiler.profile() as prof:
    output = model(input)   # model and input come from your own workload
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
As shown in the snippet above, torch.autograd.profiler will record any PyTorch operator (including external operators registered in PyTorch as extensions, e.g. _ROIAlign from detectron2), but not operators foreign to PyTorch such as numpy functions. For CUDA profiling, you need to pass the argument use_cuda=True.
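For example, a minimal sketch of CUDA profiling with the same API (model and input are again placeholders from your own workload):

```python
# Minimal sketch, assuming a CUDA-capable build; model/input are placeholders.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    output = model(input)
# Sorting by CUDA time surfaces the GPU-side hotspots instead of the CPU ones.
print(prof.key_averages().table(sort_by="cuda_time_total"))
```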
After running the profiler, you will get a basic understanding of which operator(s) are the hotspots. For pssp-transformer, you will get something like:
--------------------------------------- --------------- --------------- --------------- --------------- --------------- ---------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg Number of Calls
--------------------------------------- --------------- --------------- --------------- --------------- --------------- ---------------
bmm 14.91% 35.895s 30.11% 72.485s 3.634ms 19944
div 12.09% 29.100s 24.52% 59.034s 1.076ms 54846
copy_ 11.07% 26.636s 13.41% 32.290s 231.292us 139608
mm 10.76% 25.891s 21.65% 52.112s 1.254ms 41550
mul 8.92% 21.477s 17.93% 43.157s 1.770ms 24376
bernoulli_ 7.35% 17.687s 14.72% 35.421s 3.996ms 8864
native_layer_norm_backward 6.69% 16.103s 13.39% 32.241s 5.820ms 5540
_softmax_backward_data 3.80% 9.145s 7.64% 18.379s 5.529ms 3324
_softmax 3.42% 8.228s 6.85% 16.491s 4.961ms 3324
masked_fill_ 2.64% 6.362s 5.26% 12.671s 1.906ms 6648
mkldnn_convolution_backward_weights 2.52% 6.061s 2.53% 6.096s 2.751ms 2216
mkldnn_convolution 2.49% 5.999s 5.00% 12.034s 2.715ms 4432
mkldnn_convolution_backward_input 1.96% 4.728s 1.98% 4.757s 2.147ms 2216
embedding_dense_backward 1.64% 3.953s 3.30% 7.949s 7.174ms 1108
add_ 1.32% 3.169s 2.69% 6.465s 40.992us 157716
div_ 1.31% 3.152s 4.00% 9.628s 724.096us 13296
sum 1.18% 2.847s 2.44% 5.863s 407.041us 14404
add 1.13% 2.724s 2.31% 5.565s 295.455us 18836
empty 0.50% 1.214s 0.59% 1.414s 7.353us 192240
mul_ 0.45% 1.084s 1.08% 2.597s 25.472us 101936
The table itself is quite self-explanatory. At first glance, bmm() seems to be the one to blame for poor CPU performance. But a few more factors still need to be taken into account:
- Different input configs for each operator: the workload might call bmm() multiple times, so we want to know where it is called and with what input tensor sizes, since a 4K-by-4K matrix multiply may run at high efficiency while a 4K-by-13 one does the opposite (see the micro-benchmark sketch after this list).
- Operator efficiency: whether a given operator is running efficiently in the first place.
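To make the input-config point concrete, here is a minimal timing sketch (the shapes are illustrative, not taken from pssp-transformer) showing that the same operator can behave very differently depending on input sizes:

```python
import time
import torch

def bench_bmm(b, m, k, n, iters=20):
    x = torch.randn(b, m, k)
    y = torch.randn(b, k, n)
    torch.bmm(x, y)                       # warm-up
    start = time.time()
    for _ in range(iters):
        torch.bmm(x, y)
    return (time.time() - start) / iters

# Square shapes usually run close to peak GEMM efficiency...
print("square:", bench_bmm(8, 1024, 1024, 1024))
# ...while tall-skinny shapes are typically memory bound and far less efficient.
print("skinny:", bench_bmm(8, 1024, 1024, 13))
```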
For the first question, we need to find where an operator is used in the workload. For bmm(), a simple grep will do the job well. But what about overloaded operators such as div, which appear as a plain divide (/) in the Python code? Here we need a Python-level profiling tool such as cProfile, as shown in Part 2.
For the second question, we need vtune to collect performance metrics such as CPI, bandwidth, etc. Domain knowledge is also a strong plus: since I have optimized torch.bmm() before, I know there is basically nothing more we can do about bmm(), whereas native_layer_norm_backward is a sequential kernel and therefore a candidate for improvement.
cProfile is Python's built-in profiler, which means anything that runs in Python will be recorded. Usage:
python -m cProfile -o output.pstats <your_script.py> arg1 arg2 …
Once you have the output.pstats file, you can use a very handy tool, gprof2dot, to convert the result into a human-readable image:
pip install graphviz
pip install gprof2dot
gprof2dot -f pstats output.pstats | dot -Tpng -o output.png
On pssp-transformer, you will get a call tree as shown in z1.png, from which you can trace back to the source code. This makes it very easy to find the input configs.
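If profiling the whole script is too noisy, a minimal sketch like the one below restricts cProfile to a region of interest (train_one_epoch is a hypothetical function standing in for your workload's training loop):

```python
import cProfile
import pstats

prof = cProfile.Profile()
prof.enable()
train_one_epoch(model, loader)           # hypothetical: your workload's training loop
prof.disable()
prof.dump_stats("output.pstats")         # same file format that gprof2dot consumes
pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
```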
vtune is the ultimate tool for CPU performance profiling, similar to nvprof on GPU. You can launch vtune from a GUI as well as from scripts; I have listed my usual collection script in another gist.
With OpenMP threading metrics collected, it is very straightforward to see how many threads are active at any moment, and thus to identify oversubscription issues.
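Outside of vtune, a quick sanity check against oversubscription is to make sure the intra-op thread pool matches the number of physical cores. A minimal sketch (the core count of 20 is just an example):

```python
import torch

# Intra-op parallelism: one OpenMP thread per physical core is usually the sweet spot.
print(torch.get_num_threads())
torch.set_num_threads(20)                # assumption: a 20-core socket

# Equivalent environment-variable approach (Intel OpenMP shown as an example):
#   OMP_NUM_THREADS=20 KMP_AFFINITY=granularity=fine,compact,1,0 python your_script.py
```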
Usually, when I decide to optimize a particular operator for a given input size, I write a micro benchmark and feed it into vtune. vtune provides a lot of useful information; this is a huge topic that is beyond the scope of this gist and also requires a fair amount of personal expertise.
Simplest of all, vtune can collect performance metrics at the instruction level. Usually CPI (cycles per instruction) should be below 0.5 or 1. Abnormally high CPI may result from: a) slow instructions (e.g. integer div); b) function calls on the critical path that are not inlined; c) memory access stalls; d) construction/destruction of large objects (e.g. TensorImpl; you should not see this frequently in PyTorch since it is very expensive), etc.
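As a sketch, the micro benchmark I feed into vtune is nothing more than the suspect operator looped on the exact shape reported by profiling, long enough for vtune to collect stable samples (the shape below is a placeholder, not the real pssp-transformer config):

```python
import torch

# Placeholder shape: substitute the sizes you found via grep/cProfile.
x = torch.randn(64, 512, 64)
y = torch.randn(64, 64, 512)

# Loop long enough (tens of seconds) for vtune to gather stable instruction-level samples.
for _ in range(5000):
    torch.bmm(x, y)
```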
As for the workload pssp-transformer, after going through the profiling steps listed above, the hotspots are:
- ScaledDotProductAttention: once you check the input and weight tensor sizes, it is easy to see that the memory footprint is too large, so the consecutive operators (bmm, div, softmax) flush the cache every time. The solution is to fuse these operators together in a cache-friendly manner (a sketch of the unfused sequence is shown after this list).
- LayerNorm backward is a sequential kernel. The solution is to optimize LayerNorm performance on CPU, both forward and backward.
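For reference, a simplified sketch of the unfused sequence inside ScaledDotProductAttention (names and shapes are illustrative, not the exact pssp-transformer code); each of the three operators makes a separate pass over the large scores tensor, which is what flushes the cache:

```python
import torch

def scaled_dot_product_attention(q, k, v, scale):
    scores = torch.bmm(q, k.transpose(1, 2))    # pass 1: materializes a large intermediate
    scores = scores / scale                      # pass 2 (div): re-reads the whole tensor
    attn = torch.softmax(scores, dim=-1)         # pass 3 (softmax): re-reads it again
    return torch.bmm(attn, v)
```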