8 . . TEXT ·__mm_add_epi32(SB),0,$0
9 640ms 640ms VMOVDQU x+0(FP), Y0
10 5.62s 5.62s VMOVDQU y+32(FP), Y1
11 4.81s 4.81s VPADDD Y1, Y0, Y0
12 1.16s 1.16s VMOVDQU Y0, q+64(FP)
13 1.30s 1.30s VZEROUPPER
14 . . RET
Created August 4, 2020 12:10
I see. I'll post another thread.
Thanks again for your sincere advice. I'll try to learn assembly, which is so useful for improving performance.
The time pprof attributes to a specific instruction is often not a good indication of how long that instruction actually takes, because modern processors execute instructions out of order. What pprof measures is the time the CPU is stuck on an instruction, unable to progress to the next one because all its resources are occupied. As soon as an appropriate execution unit is free, the CPU can proceed to the next instruction.
As I said earlier, if the whole loop you call this function from can be written in assembly, all these data moves can be eliminated and your code is likely to be a lot faster. Writing an assembly function to wrap a single instruction like this is pretty pointless.
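To make the difference in API shape concrete, here is a rough sketch in plain Go rather than assembly. The name `addEpi32` is a pure-Go stand-in for the single-instruction wrapper profiled above, and `addSlices` is a hypothetical whole-loop entry point that does not appear in the original code. In a real implementation the body of `addSlices` would be the assembly loop; the Go loop here only shows where the loop would live, not how fast it would be.

```go
package main

import "fmt"

// lanes is the number of int32 values that fit in one 256-bit YMM register.
const lanes = 8

// addEpi32 mirrors the shape of the single-instruction wrapper: both
// operands are copied in as arguments and the result is copied out on
// every call. (Pure Go stand-in, not the assembly version.)
func addEpi32(x, y [lanes]int32) (q [lanes]int32) {
	for i := range q {
		q[i] = x[i] + y[i]
	}
	return q
}

// addSlices is a hypothetical whole-loop entry point. The caller crosses
// the function boundary once for the entire input, so an assembly version
// could load from x and y and store to dst directly inside its loop,
// without moving each vector through the argument frame.
func addSlices(dst, x, y []int32) {
	// Process only as many elements as all three slices can hold.
	n := len(dst)
	if len(x) < n {
		n = len(x)
	}
	if len(y) < n {
		n = len(y)
	}
	for i := 0; i < n; i++ {
		dst[i] = x[i] + y[i]
	}
}

func main() {
	x := []int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	y := []int32{10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
	dst := make([]int32, len(x))
	addSlices(dst, x, y)
	fmt.Println(dst) // prints [11 22 33 44 55 66 77 88 99 110]
}
```

With the first shape, every call pays for two loads from the argument frame and one store back to it, which is where the profile above shows most of the time. With the second shape, that per-call traffic disappears because the data never leaves the loop.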