shenwei356/__mm_add_epi32.md

Created August 4, 2020 12:10


https://stackoverflow.com/questions/63242918/golang-assembly-implement-of-mm-add-epi32/


  8            .          .           TEXT ·__mm_add_epi32(SB),0,$0 
  9        640ms      640ms               VMOVDQU x+0(FP), Y0 
 10        5.62s      5.62s               VMOVDQU y+32(FP), Y1 
 11        4.81s      4.81s               VPADDD  Y1, Y0, Y0 
 12        1.16s      1.16s               VMOVDQU Y0, q+64(FP) 
 13        1.30s      1.30s               VZEROUPPER 
 14            .          .               RET

clausecker commented Aug 4, 2020

The time a profile attributes to a specific instruction is often not indicative of how long that instruction actually takes, because modern processors execute instructions out of order. What pprof measures is instead the time the CPU is stuck on one instruction, unable to progress to the next one because all its resources are occupied. As soon as an appropriate execution unit is free, the CPU can proceed to the next instruction.

As I said earlier, if the whole loop you call this function in can be written in assembly, all these data moves can be eliminated and your code is likely going to be a lot faster. Writing an assembly function to wrap a single instruction like this is pretty pointless.
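To make the contrast concrete, here is a minimal pure-Go sketch of the two shapes. It assumes the wrapper's Go declaration is `func(x, y [8]int32) (q [8]int32)`, which is only inferred from the 32-byte frame offsets and the `VPADDD` instruction in the listing. `add8` and `addInt32s` are hypothetical names, and neither function is the assembly itself: they only show which interface an assembly routine would sit behind.

```go
package main

import "fmt"

// add8 is a pure-Go stand-in for the per-call wrapper (assumed signature).
// Both arguments and the result are copied by value on every call, which
// corresponds to the VMOVDQU loads and store around the single VPADDD.
func add8(x, y [8]int32) (q [8]int32) {
	for i := range q {
		q[i] = x[i] + y[i] // wraps on overflow, like VPADDD
	}
	return q
}

// addInt32s is the loop-level shape (hypothetical): the whole loop lives in
// one function, so an assembly version could keep data in registers and
// stream through memory instead of paying call and copy overhead per vector.
func addInt32s(dst, x, y []int32) {
	n := len(dst)
	if len(x) < n {
		n = len(x)
	}
	if len(y) < n {
		n = len(y)
	}
	for i := 0; i < n; i++ {
		dst[i] = x[i] + y[i]
	}
}

func main() {
	x := []int32{1, 2, 3, 4, 5, 6, 7, 8, 9}
	y := []int32{10, 20, 30, 40, 50, 60, 70, 80, 90}
	dst := make([]int32, len(x))
	addInt32s(dst, x, y)
	fmt.Println(dst)
}
```

An assembly implementation would replace the body of `addInt32s`, not `add8`, so the function-call boundary is crossed once per slice rather than once per eight elements.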

Author

shenwei356 commented Aug 4, 2020

I see. I'll post another thread.

Thanks again for your sincere advice. I'll try to learn assembly, which is so useful for improving performance.