A simple C program using the GCC/Clang vector extension:
#include <stdint.h>
typedef int32_t v4si __attribute__ ((vector_size (16)));
int main() {
v4si a = { 1, 2, 3, 4 };
v4si b = { 1, 2, 3, 4 };
v4si c = a + b;
}
The typedef syntax is described in the GCC documentation on vector extensions.
We are essentially creating a vector type of 16 bytes (128 bits) with base type int32_t, i.e. conceptually an int32_t[4].
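Beyond whole-vector arithmetic, the extension also allows subscripting individual lanes and other element-wise operators. A minimal sketch of what GCC and Clang accept (the values are illustrative):
#include <stdint.h>
#include <stdio.h>

typedef int32_t v4si __attribute__ ((vector_size (16)));

int main() {
    v4si a = { 1, 2, 3, 4 };
    v4si b = { 10, 20, 30, 40 };
    v4si sum  = a + b;   /* element-wise addition */
    v4si prod = a * b;   /* element-wise multiplication */
    for (int i = 0; i < 4; i++)   /* lanes are subscriptable */
        printf("%d %d\n", sum[i], prod[i]);
    return 0;
}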
To study the disassembly of the original program:
cc -S -g -fverbose-asm test.c
will produce test.s. Look for the main function:
# test.c:6: v4si a = { 1, 2, 3, 4 };
.loc 1 6 0
movdqa .LC0(%rip), %xmm0 #, tmp89
movaps %xmm0, -48(%rbp) # tmp89, a
# test.c:7: v4si b = { 1, 2, 3, 4 };
.loc 1 7 0
movdqa .LC0(%rip), %xmm0 #, tmp90
movaps %xmm0, -32(%rbp) # tmp90, b
# test.c:8: v4si c = a + b;
.loc 1 8 0
movdqa -48(%rbp), %xmm0 # a, tmp92
paddd -32(%rbp), %xmm0 # b, tmp91
movaps %xmm0, -16(%rbp) # tmp91, c
movl $0, %eax #, _4
We notice the use of the movdqa and paddd instructions, which are SSE2. Clang and GCC produce equivalent code here.
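For comparison, here is a sketch of the same addition written explicitly with SSE2 intrinsics (not what either compiler emitted, but _mm_add_epi32 is exactly the paddd seen above):
#include <emmintrin.h>

int main() {
    /* _mm_set_epi32 lists lanes from the highest down */
    __m128i a = _mm_set_epi32(4, 3, 2, 1);
    __m128i b = _mm_set_epi32(4, 3, 2, 1);
    __m128i c = _mm_add_epi32(a, b);   /* paddd */
    (void)c;
    return 0;
}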
If we increase the size of the vector to 256 bits (and extend the initializers of the variables accordingly):
#include <stdint.h>
typedef int v4si __attribute__ ((vector_size (32)));
int main() {
v4si a = { 1, 2, 3, 4, 5, 6, 7, 8 };
v4si b = { 1, 2, 3, 4, 5, 6, 7, 8 };
v4si c = a + b;
}
With GCC, the code degenerates into a string of 'standard' scalar instructions:
# test.c:6: v4si a = { 1, 2, 3, 4, 5, 6, 7, 8 };
.loc 1 6 0
movl $1, -112(%rbp) #, a
movl $2, -108(%rbp) #, a
movl $3, -104(%rbp) #, a
movl $4, -100(%rbp) #, a
movl $5, -96(%rbp) #, a
movl $6, -92(%rbp) #, a
movl $7, -88(%rbp) #, a
movl $8, -84(%rbp) #, a
But GCC seems to cheat here, because the result is precomputed and stored in the executable's data section:
.LC0:
.long 2
.long 4
.long 6
.long 8
.align 16
.LC1:
.long 10
.long 12
.long 14
.long 16
And then loaded with SSE2 instructions:
# test.c:8: v4si c = a + b;
.loc 1 8 0
movdqa .LC0(%rip), %xmm0 #, tmp89
movaps %xmm0, -48(%rbp) # tmp89, c
movdqa .LC1(%rip), %xmm0 #, tmp90
movaps %xmm0, -32(%rbp) # tmp90, c
One has to assume that if GCC were not able to play such tricks and actually had to perform the computation, it would probably behave badly.
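One way to test that assumption is to make the inputs opaque so the addition cannot be constant-folded. A minimal sketch, assuming an empty inline-asm memory barrier is enough to hide the values from the optimizer:
#include <stdint.h>

typedef int v4si __attribute__ ((vector_size (32)));

int main() {
    v4si a = { 1, 2, 3, 4, 5, 6, 7, 8 };
    v4si b = { 1, 2, 3, 4, 5, 6, 7, 8 };
    /* An empty asm with "+m" tells the compiler the memory may
       have changed, so a + b must be computed at run time. */
    __asm__ volatile ("" : "+m" (a), "+m" (b));
    v4si c = a + b;
    __asm__ volatile ("" :: "m" (c));  /* keep the result alive */
    return 0;
}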
Clang does not cheat but seems smarter: it handles each 256-bit vector as two 128-bit halves, using pairs of SSE2 instructions:
movaps .LCPI0_0(%rip), %xmm0 # xmm0 = [5,6,7,8]
movaps %xmm0, 80(%rsp)
movaps .LCPI0_1(%rip), %xmm1 # xmm1 = [1,2,3,4]
movaps %xmm1, 64(%rsp)
Now, if we activate AVX instructions using the -mavx switch with GCC:
# test.c:6: v4si a = { 1, 2, 3, 4, 5, 6, 7, 8 };
.loc 1 6 0
vmovdqa .LC0(%rip), %ymm0 #, tmp89
vmovdqa %ymm0, -112(%rbp) # tmp89, a
# test.c:7: v4si b = { 1, 2, 3, 4, 5, 6, 7, 8 };
.loc 1 7 0
vmovdqa .LC0(%rip), %ymm0 #, tmp90
vmovdqa %ymm0, -80(%rbp) # tmp90, b
# test.c:8: v4si c = a + b;
.loc 1 8 0
vmovdqa .LC1(%rip), %xmm0 #, tmp91
vmovaps %xmm0, -48(%rbp) # tmp91, c
vmovdqa .LC2(%rip), %xmm0 #, tmp92
vmovaps %xmm0, -32(%rbp) # tmp92, c
movl $0, %eax #, _4
GCC now uses AVX instructions, although the addition itself is once again precomputed: c is simply loaded from the constants .LC1 and .LC2.
Clang also accepts the -mavx switch and generates code using AVX instructions, but it looks quite different:
.loc 1 6 8 prologue_end # test.c:6:8
vmovaps .LCPI0_0(%rip), %ymm0 # ymm0 = [1,2,3,4,5,6,7,8]
vmovaps %ymm0, 64(%rsp)
.loc 1 7 8 # test.c:7:8
vmovaps %ymm0, 32(%rsp)
.loc 1 8 12 # test.c:8:12
vmovaps 64(%rsp), %ymm0
.loc 1 8 16 is_stmt 0 # test.c:8:16
vmovaps 32(%rsp), %ymm1
.loc 1 8 14 # test.c:8:14
vextractf128 $1, %ymm1, %xmm2
vextractf128 $1, %ymm0, %xmm3
vpaddd %xmm2, %xmm3, %xmm2
vmovaps %xmm1, %xmm3
vmovaps %xmm0, %xmm4
vpaddd %xmm3, %xmm4, %xmm3
# implicit-def: %ymm0
vmovaps %xmm3, %xmm0
vinsertf128 $1, %xmm2, %ymm0, %ymm0
.loc 1 8 8 # test.c:8:8
vmovdqa %ymm0, (%rsp)
Way more instructions, but still AVX. Why does Clang produce more of them? Partly because GCC is cheating again: if you look at the data definitions in the GCC output, you will see that it precomputes the result. There is also a hardware reason: plain AVX offers no 256-bit integer addition (vpaddd on ymm registers requires AVX2), so Clang splits each vector into 128-bit halves with vextractf128, adds them with vpaddd, and reassembles the result with vinsertf128.
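The same half-splitting pattern can be written by hand with AVX intrinsics; a sketch compiled with -mavx only (the helper name add_epi32_avx1 is made up for illustration):
#include <immintrin.h>

/* 256-bit integer add on an AVX1-only target: split into two
   128-bit halves, add with paddd, and reassemble. */
static __m256i add_epi32_avx1(__m256i a, __m256i b) {
    __m128i lo = _mm_add_epi32(_mm256_castsi256_si128(a),
                               _mm256_castsi256_si128(b));
    __m128i hi = _mm_add_epi32(_mm256_extractf128_si256(a, 1),
                               _mm256_extractf128_si256(b, 1));
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

int main() {
    __m256i a = _mm256_set_epi32(8, 7, 6, 5, 4, 3, 2, 1);
    __m256i c = add_epi32_avx1(a, a);
    (void)c;
    return 0;
}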
Let's expand the vector further to 512 bits:
typedef int v4si __attribute__ ((vector_size (64)));
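The initializers grow accordingly; judging from the source-line comments in the assembly below, main now fills 16 lanes:
int main() {
    v4si a = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
    v4si b = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
    v4si c = a + b;
}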
GCC regresses to a string of movl instructions, as before, whereas Clang is still smart enough to use two AVX instructions.
Activating -mavx512f leads to an interesting situation: both compilers use AVX-512 instructions, but differently.
GCC yields:
# test.c:6: v4si a = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
.loc 1 6 0
vmovdqa64 .LC0(%rip), %zmm0 #, tmp89
vmovdqa64 %zmm0, -240(%rbp) # tmp89, a
# test.c:7: v4si b = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
.loc 1 7 0
vmovdqa64 .LC0(%rip), %zmm0 #, tmp90
vmovdqa64 %zmm0, -176(%rbp) # tmp90, b
# test.c:8: v4si c = a + b;
.loc 1 8 0
vmovdqa64 -240(%rbp), %zmm0 # a, tmp92
vpaddd -176(%rbp), %zmm0, %zmm0 # b, tmp92, tmp91
vmovdqa64 %zmm0, -112(%rbp) # tmp91, c
Whereas Clang yields:
.loc 1 6 8 prologue_end # test.c:6:8
vmovdqa32 .LCPI0_0(%rip), %zmm0 # zmm0 = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
vmovdqa32 %zmm0, 128(%rsp)
.loc 1 7 8 # test.c:7:8
vmovdqa32 %zmm0, 64(%rsp)
.loc 1 8 12 # test.c:8:12
vmovdqa64 128(%rsp), %zmm0
.loc 1 8 14 is_stmt 0 # test.c:8:14
vpaddd 64(%rsp), %zmm0, %zmm0
.loc 1 8 8 # test.c:8:8
vmovdqa64 %zmm0, (%rsp)
Not sure how this impacts performance, though; note that vmovdqa32 and vmovdqa64 differ only in the element size used for write masking, so without a mask they move the same 512 bits.
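For comparison, a sketch of the same addition written with AVX-512F intrinsics (not compiler output; _mm512_add_epi32 is the vpaddd on zmm registers seen above):
#include <immintrin.h>

int main() {
    /* _mm512_set_epi32 lists lanes from the highest down */
    __m512i a = _mm512_set_epi32(16, 15, 14, 13, 12, 11, 10, 9,
                                  8,  7,  6,  5,  4,  3,  2, 1);
    __m512i b = a;
    __m512i c = _mm512_add_epi32(a, b);   /* vpaddd %zmm, %zmm, %zmm */
    (void)c;
    return 0;
}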
Note 1:
GCC version: 7.5.0-3ubuntu1~18.04
Clang version: 6.0.0-1ubuntu2
Note 2:
To ease analysis, use the Compiler Explorer at https://godbolt.org/