8 . . TEXT ·__mm_add_epi32(SB),0,$0
9 640ms 640ms VMOVDQU x+0(FP), Y0
10 5.62s 5.62s VMOVDQU y+32(FP), Y1
11 4.81s 4.81s VPADDD Y1, Y0, Y0
12 1.16s 1.16s VMOVDQU Y0, q+64(FP)
13 1.30s 1.30s VZEROUPPER
14 . . RET
The time pprof attributes to a specific instruction is often not a good indicator of how long that instruction actually takes, because modern processors execute out of order. What pprof measures is instead the time the CPU is stuck on one instruction, unable to progress to the next one because all the resources it needs are occupied. As soon as an appropriate execution unit is free, the CPU can proceed to the next instruction.
As I said earlier, if the whole loop you call this function in can be written in assembly, all these data moves can be eliminated and your code is likely going to be a lot faster. Writing an assembly function to wrap a single instruction like this is pretty pointless.
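To make the difference in shape concrete, here is a rough sketch in plain Go rather than assembly. The names `addOne`, `addAll`, and `lanes` are made up for illustration and are not from the gist. `addOne` has the shape of the wrapper in the listing: both operands are copied in and the result is copied out on every call. `addAll` has the shape a whole-loop routine would take, where the loop lives inside the function and the data stays in place. The inner loop stands in for what would be one `VPADDD` in a real assembly version.

```go
package main

import "fmt"

// lanes is the number of int32 values that fit in one 256-bit YMM register.
const lanes = 8

// addOne (hypothetical) mirrors the single-instruction wrapper: each call
// receives both operands by value and returns the result by value, so the
// data is moved in and out for just one vector add.
func addOne(x, y [lanes]int32) (q [lanes]int32) {
	for i := range q {
		q[i] = x[i] + y[i]
	}
	return q
}

// addAll (hypothetical) keeps the whole loop in one function. It adds x and y
// element-wise into dst, up to the length of the shortest slice. An assembly
// version of this function would handle each block of `lanes` elements with
// one load per operand, one VPADDD, and one store.
func addAll(dst, x, y []int32) {
	n := len(dst)
	if len(x) < n {
		n = len(x)
	}
	if len(y) < n {
		n = len(y)
	}

	i := 0
	// Full blocks of `lanes` elements: the part that maps onto vector adds.
	for ; i+lanes <= n; i += lanes {
		for j := 0; j < lanes; j++ {
			dst[i+j] = x[i+j] + y[i+j]
		}
	}
	// Remaining elements that do not fill a whole block.
	for ; i < n; i++ {
		dst[i] = x[i] + y[i]
	}
}

func main() {
	x := []int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	y := []int32{10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
	dst := make([]int32, len(x))
	addAll(dst, x, y)
	fmt.Println(dst)
}
```

This only shows the structure. Whether an assembly `addAll` actually beats what the Go compiler generates for the plain loop is something you would have to benchmark.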
I see. I'll post another thread.
Thanks again for your sincere advice. I'll try to learn assembly, which is so useful for improving performance.
Why is retrieving the second parameter (L10) so much slower than retrieving the first one (L9)?