(Internal Training Material)
Usually the first step in performance optimization is profiling, i.e. identifying the performance hotspots of a workload. This gist covers the basics of performance profiling on PyTorch and answers the following questions:
- How to find the bottleneck operator?
- How to trace the source file of a particular operator?
- How to identify threading issues (e.g. oversubscription)?
- How to tell whether a specific operator is running efficiently or not?
This tutorial takes one of my recent projects, pssp-transformer, as an example to guide you through the process of PyTorch CPU performance optimization. The focus will be on Part 1 and Part 2.
PyTorch has a built-in profiler in the autograd module, a.k.a. the PyTorch autograd profiler. Usage is fairly simple: you can tell the torch.autograd engine to keep a record of the execution time of each operator as follows:
import torch

with torch.autograd.profiler.profile() as prof:
    output = model(input)   # model and input come from your own workload
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
As shown in the snippet above, torch.autograd.profiler will record any PyTorch operator (including external operators registered in PyTorch as extensions, e.g. _ROIAlign from detectron2), but not operators foreign to PyTorch such as numpy functions. For CUDA profiling, you need to pass the argument use_cuda=True.
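For example, a minimal sketch of CUDA profiling with the same API (model and input are again placeholders from your own workload):

```python
# Minimal sketch, assuming a CUDA-capable build; model/input are placeholders.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    output = model(input)
# Sorting by CUDA time surfaces the GPU-side hotspots instead of the CPU ones.
print(prof.key_averages().table(sort_by="cuda_time_total"))
```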
After running the profiler, you will get a basic understanding of which operator(s) are the hotspots. For pssp-transformer, you will get something like:
--------------------------------------- --------------- --------------- --------------- --------------- --------------- ---------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg Number of Calls
--------------------------------------- --------------- --------------- --------------- --------------- --------------- ---------------
bmm 14.91% 35.895s 30.11% 72.485s 3.634ms 19944
div 12.09% 29.100s 24.52% 59.034s 1.076ms 54846
copy_ 11.07% 26.636s 13.41% 32.290s 231.292us 139608
mm 10.76% 25.891s 21.65% 52.112s 1.254ms 41550
mul 8.92% 21.477s 17.93% 43.157s 1.770ms 24376
bernoulli_ 7.35% 17.687s 14.72% 35.421s 3.996ms 8864
native_layer_norm_backward 6.69% 16.103s 13.39% 32.241s 5.820ms 5540
_softmax_backward_data 3.80% 9.145s 7.64% 18.379s 5.529ms 3324
_softmax 3.42% 8.228s 6.85% 16.491s 4.961ms 3324
masked_fill_ 2.64% 6.362s 5.26% 12.671s 1.906ms 6648
mkldnn_convolution_backward_weights 2.52% 6.061s 2.53% 6.096s 2.751ms 2216
mkldnn_convolution 2.49% 5.999s 5.00% 12.034s 2.715ms 4432
mkldnn_convolution_backward_input 1.96% 4.728s 1.98% 4.757s 2.147ms 2216
embedding_dense_backward 1.64% 3.953s 3.30% 7.949s 7.174ms 1108
add_ 1.32% 3.169s 2.69% 6.465s 40.992us 157716
div_ 1.31% 3.152s 4.00% 9.628s 724.096us 13296
sum 1.18% 2.847s 2.44% 5.863s 407.041us 14404
add 1.13% 2.724s 2.31% 5.565s 295.455us 18836
empty 0.50% 1.214s 0.59% 1.414s 7.353us 192240
mul_ 0.45% 1.084s 1.08% 2.597s 25.472us 101936
The table itself is quite self-explanatory. At first glance, bmm() seems to be the one to blame for poor CPU performance. But a few more factors still need to be taken into account:
- Different input configs for each operator: the workload might call bmm() multiple times, so we want to know where it is called and with what input tensor sizes, since a 4K-by-4K matrix multiply may run at high efficiency while a 4K-by-13 one does the opposite (see the micro-benchmark sketch after this list).
- Operator efficiency: whether a given operator is running efficiently in the first place.
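To make the input-config point concrete, here is a minimal timing sketch (the shapes are illustrative, not taken from pssp-transformer) showing that the same operator can behave very differently depending on input sizes:

```python
import time
import torch

def bench_bmm(b, m, k, n, iters=20):
    x = torch.randn(b, m, k)
    y = torch.randn(b, k, n)
    torch.bmm(x, y)                       # warm-up
    start = time.time()
    for _ in range(iters):
        torch.bmm(x, y)
    return (time.time() - start) / iters

# Square shapes usually run close to peak GEMM efficiency...
print("square:", bench_bmm(8, 1024, 1024, 1024))
# ...while tall-skinny shapes are typically memory bound and far less efficient.
print("skinny:", bench_bmm(8, 1024, 1024, 13))
```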
For the first question, we need to find where an operator is used in the workload. For bmm(), a simple grep will do the job well. But what about overloaded operators such as div, which appear as a plain divide (/) in the Python code? Here we need a Python-level profiling tool such as cProfile, as shown in Part 2.
For the second question, we need vtune to collect performance metrics such as CPI, bandwidth, etc. Domain knowledge is also a strong plus: since I have optimized torch.bmm() before, I know there is basically nothing more we can do about bmm(), whereas native_layer_norm_backward is a sequential kernel and therefore a candidate for improvement.
cProfile is Python's built-in profiler, which means anything that runs in Python will be recorded. Usage:
python -m cProfile -o output.pstats <your_script.py> arg1 arg2 …
Once you have the output.pstats file, you can use a very handy tool, gprof2dot, to convert the result into a human-readable image:
pip install graphviz
pip install gprof2dot
gprof2dot -f pstats output.pstats | dot -Tpng -o output.png
On pssp-transformer, you will get a call tree as shown in z1.png, from which you can trace back to the source code. This makes it very easy to find the input configs.
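If profiling the whole script is too noisy, a minimal sketch like the one below restricts cProfile to a region of interest (train_one_epoch is a hypothetical function standing in for your workload's training loop):

```python
import cProfile
import pstats

prof = cProfile.Profile()
prof.enable()
train_one_epoch(model, loader)           # hypothetical: your workload's training loop
prof.disable()
prof.dump_stats("output.pstats")         # same file format that gprof2dot consumes
pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
```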
vtune is the ultimate tool for CPU performance profiling, similar to nvprof on GPU. You can launch vtune from a GUI as well as from scripts; I have listed my usual collection script in another gist.
With OpenMP threading metrics collected, it is very straightforward to see how many threads are active at any moment, and thus to identify oversubscription issues.
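Outside of vtune, a quick sanity check against oversubscription is to make sure the intra-op thread pool matches the number of physical cores. A minimal sketch (the core count of 20 is just an example):

```python
import torch

# Intra-op parallelism: one OpenMP thread per physical core is usually the sweet spot.
print(torch.get_num_threads())
torch.set_num_threads(20)                # assumption: a 20-core socket

# Equivalent environment-variable approach (Intel OpenMP shown as an example):
#   OMP_NUM_THREADS=20 KMP_AFFINITY=granularity=fine,compact,1,0 python your_script.py
```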
Usually, when I decide to optimize a particular operator for a given input size, I write a micro benchmark and feed it into vtune. vtune provides a lot of useful information; this is a huge topic that is beyond the scope of this gist and also requires a fair amount of personal expertise.
Simplest of all, vtune can collect performance metrics at the instruction level. Usually CPI (cycles per instruction) should be below 0.5 or 1. Abnormally high CPI may result from: a) slow instructions (e.g. integer div); b) function calls on the critical path that are not inlined; c) memory access stalls; d) construction/destruction of large objects (e.g. TensorImpl; you should not see this frequently in PyTorch since it is very expensive), etc.
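As a sketch, the micro benchmark I feed into vtune is nothing more than the suspect operator looped on the exact shape reported by profiling, long enough for vtune to collect stable samples (the shape below is a placeholder, not the real pssp-transformer config):

```python
import torch

# Placeholder shape: substitute the sizes you found via grep/cProfile.
x = torch.randn(64, 512, 64)
y = torch.randn(64, 64, 512)

# Loop long enough (tens of seconds) for vtune to gather stable instruction-level samples.
for _ in range(5000):
    torch.bmm(x, y)
```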
As for the workload pssp-transformer, after going through the profiling steps listed above, the hotspots are:
- ScaledDotProductAttention: once you check the input and weight tensor sizes, it is easy to see that the memory footprint is too large, so the consecutive operators (bmm, div, softmax) flush the cache every time. The solution is to fuse these operators together in a cache-friendly manner (a sketch of the unfused sequence is shown after this list).
- LayerNorm backward is a sequential kernel. The solution is to optimize LayerNorm performance on CPU, both forward and backward.
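For reference, a simplified sketch of the unfused sequence inside ScaledDotProductAttention (names and shapes are illustrative, not the exact pssp-transformer code); each of the three operators makes a separate pass over the large scores tensor, which is what flushes the cache:

```python
import torch

def scaled_dot_product_attention(q, k, v, scale):
    scores = torch.bmm(q, k.transpose(1, 2))    # pass 1: materializes a large intermediate
    scores = scores / scale                      # pass 2 (div): re-reads the whole tensor
    attn = torch.softmax(scores, dim=-1)         # pass 3 (softmax): re-reads it again
    return torch.bmm(attn, v)
```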