128 bits and SSE2

A simple C program using the GCC/Clang vector extension:

#include <stdint.h>

typedef int32_t v4si __attribute__ ((vector_size (16)));

int main() {
  v4si a = { 1, 2, 3, 4 };
  v4si b = { 1, 2, 3, 4 };
  v4si c = a + b;
}

The typedef syntax is described in the GCC vector extensions documentation. We are essentially creating a 16-byte (128-bit) vector type with base type int32_t, conceptually an int32_t[4].
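
As a side note (not part of the original program above), the extension also lets you subscript individual vector elements like an array, which makes it easy to inspect the result:

#include <stdio.h>
#include <stdint.h>

typedef int32_t v4si __attribute__ ((vector_size (16)));

int main() {
  v4si a = { 1, 2, 3, 4 };
  v4si b = { 1, 2, 3, 4 };
  v4si c = a + b;
  /* Vector elements can be read (and written) with the usual [] syntax. */
  for (int i = 0; i < 4; ++i)
    printf("%d\n", c[i]);  /* prints 2, 4, 6, 8 */
  return 0;
}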

To study the assembly generated from this code:

cc -S -g -fverbose-asm test.c

will produce test.s. Look for the main function:

# test.c:6:   v4si a = { 1, 2, 3, 4 };
        .loc 1 6 0
        movdqa  .LC0(%rip), %xmm0       #, tmp89
        movaps  %xmm0, -48(%rbp)        # tmp89, a
# test.c:7:   v4si b = { 1, 2, 3, 4 };
        .loc 1 7 0
        movdqa  .LC0(%rip), %xmm0       #, tmp90
        movaps  %xmm0, -32(%rbp)        # tmp90, b
# test.c:8:   v4si c = a + b;
        .loc 1 8 0
        movdqa  -48(%rbp), %xmm0        # a, tmp92
        paddd   -32(%rbp), %xmm0        # b, tmp91
        movaps  %xmm0, -16(%rbp)        # tmp91, c
        movl    $0, %eax        #, _4

We notice the use of the movdqa and paddd instructions, which are SSE2. Clang and GCC produce equivalent code here.
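
For comparison, here is a sketch (my addition, not from the gist) of the same computation written directly with SSE2 intrinsics; the paddd emitted by the compiler corresponds to _mm_add_epi32:

#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

int main() {
  __m128i a = _mm_setr_epi32(1, 2, 3, 4);
  __m128i b = _mm_setr_epi32(1, 2, 3, 4);
  __m128i c = _mm_add_epi32(a, b);       /* paddd */
  int32_t out[4];
  _mm_storeu_si128((__m128i *)out, c);   /* store the four results */
  printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
  return 0;
}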

256 bits and AVX

If we increase the size of the vector (and the initialization of the variables) to 256 bits:

#include <stdint.h>

typedef int v4si __attribute__ ((vector_size (32)));

int main() {
  v4si a = { 1, 2, 3, 4, 5, 6, 7, 8 };
  v4si b = { 1, 2, 3, 4, 5, 6, 7, 8 };
  v4si c = a + b;
}

With GCC, the binary degenerates into using 'standard' scalar instructions:

# test.c:6:   v4si a = { 1, 2, 3, 4, 5, 6, 7, 8 };
        .loc 1 6 0
        movl    $1, -112(%rbp)  #, a
        movl    $2, -108(%rbp)  #, a
        movl    $3, -104(%rbp)  #, a
        movl    $4, -100(%rbp)  #, a
        movl    $5, -96(%rbp)   #, a
        movl    $6, -92(%rbp)   #, a
        movl    $7, -88(%rbp)   #, a
        movl    $8, -84(%rbp)   #, a

But GCC seems to cheat here: the result is pre-computed and stored in the executable's data section:

.LC0:
        .long   2
        .long   4
        .long   6
        .long   8
        .align 16
.LC1:
        .long   10
        .long   12
        .long   14
        .long   16

And then loaded with SSE2 instructions:

# test.c:8:   v4si c = a + b;
        .loc 1 8 0
        movdqa  .LC0(%rip), %xmm0       #, tmp89
        movaps  %xmm0, -48(%rbp)        # tmp89, c
        movdqa  .LC1(%rip), %xmm0       #, tmp90
        movaps  %xmm0, -32(%rbp)        # tmp90, c

One has to assume that if GCC were not able to play this trick and actually had to perform the computation, it would probably behave badly.
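
One way to see the actual vector code (a sketch of my own, not from the gist) is to make the inputs depend on a run-time value, so that the compiler cannot fold the addition into a constant:

#include <stdint.h>

typedef int32_t v8si __attribute__ ((vector_size (32)));

int main(int argc, char **argv) {
  (void)argv;
  int32_t x = argc;                      /* unknown at compile time */
  v8si a = { x, 2, 3, 4, 5, 6, 7, 8 };
  v8si b = { 1, 2, 3, 4, 5, 6, 7, x };
  v8si c = a + b;
  return c[0] + c[7];                    /* use the result so it is kept */
}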

Clang does not cheat, but is smart enough to fall back on pairs of 128-bit SSE2 instructions:

        movaps  .LCPI0_0(%rip), %xmm0   # xmm0 = [5,6,7,8]
        movaps  %xmm0, 80(%rsp)
        movaps  .LCPI0_1(%rip), %xmm1   # xmm1 = [1,2,3,4]
        movaps  %xmm1, 64(%rsp)

Now, if we activate AVX instructions using the -mavx switch with GCC:

# test.c:6:   v4si a = { 1, 2, 3, 4, 5, 6, 7, 8 };
        .loc 1 6 0
        vmovdqa .LC0(%rip), %ymm0       #, tmp89
        vmovdqa %ymm0, -112(%rbp)       # tmp89, a
# test.c:7:   v4si b = { 1, 2, 3, 4, 5, 6, 7, 8 };
        .loc 1 7 0
        vmovdqa .LC0(%rip), %ymm0       #, tmp90
        vmovdqa %ymm0, -80(%rbp)        # tmp90, b
# test.c:8:   v4si c = a + b;
        .loc 1 8 0
        vmovdqa .LC1(%rip), %xmm0       #, tmp91
        vmovaps %xmm0, -48(%rbp)        # tmp91, c
        vmovdqa .LC2(%rip), %xmm0       #, tmp92
        vmovaps %xmm0, -32(%rbp)        # tmp92, c
        movl    $0, %eax        #, _4

GCC is now using AVX instructions.

Clang also accepts the -mavx switch and generates AVX code, but its output looks quite different:

        .loc    1 6 8 prologue_end      # test.c:6:8
        vmovaps .LCPI0_0(%rip), %ymm0   # ymm0 = [1,2,3,4,5,6,7,8]
        vmovaps %ymm0, 64(%rsp)
        .loc    1 7 8                   # test.c:7:8
        vmovaps %ymm0, 32(%rsp)
        .loc    1 8 12                  # test.c:8:12
        vmovaps 64(%rsp), %ymm0
        .loc    1 8 16 is_stmt 0        # test.c:8:16
        vmovaps 32(%rsp), %ymm1
        .loc    1 8 14                  # test.c:8:14
        vextractf128    $1, %ymm1, %xmm2
        vextractf128    $1, %ymm0, %xmm3
        vpaddd  %xmm2, %xmm3, %xmm2
        vmovaps %xmm1, %xmm3
        vmovaps %xmm0, %xmm4
        vpaddd  %xmm3, %xmm4, %xmm3
                                        # implicit-def: %ymm0
        vmovaps %xmm3, %xmm0
        vinsertf128     $1, %xmm2, %ymm0, %ymm0
        .loc    1 8 8                   # test.c:8:8
        vmovdqa %ymm0, (%rsp)

Way more instructions, but still AVX. Why does Clang produce more instructions? Again, GCC is cheating: if you look further down in the GCC output, at the data definitions, you will see that GCC precomputes the result. There is also a second reason: plain AVX does not provide 256-bit integer additions (that requires AVX2), so Clang has to split each vector into two 128-bit halves, add them with vpaddd, and reassemble the result with vinsertf128.
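
To make that split concrete, here is a sketch of my own (using AVX intrinsics rather than the vector extension) of what Clang's sequence corresponds to: extract the two 128-bit halves, add each half with paddd, and recombine the 256-bit result:

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>  /* AVX intrinsics; compile with -mavx */

/* Add two vectors of eight 32-bit integers without AVX2: split into
 * 128-bit halves, add each half with SSE2's paddd, then recombine. */
static __m256i add_8x32_avx_only(__m256i a, __m256i b) {
  __m128i a_lo = _mm256_castsi256_si128(a);       /* lower 128 bits */
  __m128i a_hi = _mm256_extractf128_si256(a, 1);  /* upper 128 bits */
  __m128i b_lo = _mm256_castsi256_si128(b);
  __m128i b_hi = _mm256_extractf128_si256(b, 1);
  __m128i lo = _mm_add_epi32(a_lo, b_lo);         /* vpaddd %xmm */
  __m128i hi = _mm_add_epi32(a_hi, b_hi);         /* vpaddd %xmm */
  return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

int main() {
  __m256i a = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
  __m256i c = add_8x32_avx_only(a, a);
  int32_t out[8];
  _mm256_storeu_si256((__m256i *)out, c);
  printf("%d ... %d\n", out[0], out[7]);          /* 2 ... 16 */
  return 0;
}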

512 bits and AVX512

Let's expand the vector further to 512 bits:

typedef int v4si __attribute__ ((vector_size (64)));
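
The rest of the program is unchanged, apart from the 16-element initializers (shown here for completeness; they match the initializers visible in the assembly comments below):

#include <stdint.h>

typedef int v4si __attribute__ ((vector_size (64)));

int main() {
  v4si a = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
  v4si b = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
  v4si c = a + b;
}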

GCC regresses to a string of movl instructions, as before, whereas Clang is still smart enough to use two AVX instructions. Activating the -mavx512f switch leads to an interesting situation: both compilers use AVX-512 instructions, but differently. GCC yields:

# test.c:6:   v4si a = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
        .loc 1 6 0
        vmovdqa64       .LC0(%rip), %zmm0       #, tmp89
        vmovdqa64       %zmm0, -240(%rbp)       # tmp89, a
# test.c:7:   v4si b = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
        .loc 1 7 0
        vmovdqa64       .LC0(%rip), %zmm0       #, tmp90
        vmovdqa64       %zmm0, -176(%rbp)       # tmp90, b
# test.c:8:   v4si c = a + b;
        .loc 1 8 0
        vmovdqa64       -240(%rbp), %zmm0       # a, tmp92
        vpaddd  -176(%rbp), %zmm0, %zmm0        # b, tmp92, tmp91
        vmovdqa64       %zmm0, -112(%rbp)       # tmp91, c

Whereas Clang yields:

        .loc    1 6 8 prologue_end      # test.c:6:8
        vmovdqa32       .LCPI0_0(%rip), %zmm0 # zmm0 = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
        vmovdqa32       %zmm0, 128(%rsp)
        .loc    1 7 8                   # test.c:7:8
        vmovdqa32       %zmm0, 64(%rsp)
        .loc    1 8 12                  # test.c:8:12
        vmovdqa64       128(%rsp), %zmm0
        .loc    1 8 14 is_stmt 0        # test.c:8:14
        vpaddd  64(%rsp), %zmm0, %zmm0
        .loc    1 8 8                   # test.c:8:8
        vmovdqa64       %zmm0, (%rsp)

Not sure how this impacts performance though.
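
For reference, here is the same 512-bit addition written with AVX-512F intrinsics (a sketch of my own, not from the gist); the vpaddd on zmm registers that both compilers emit corresponds to _mm512_add_epi32:

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>  /* AVX-512F intrinsics; compile with -mavx512f */

int main() {
  __m512i a = _mm512_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8,
                                9, 10, 11, 12, 13, 14, 15, 16);
  __m512i b = a;
  __m512i c = _mm512_add_epi32(a, b);     /* vpaddd %zmm */
  int32_t out[16];
  _mm512_storeu_si512(out, c);            /* store the 16 results */
  printf("%d ... %d\n", out[0], out[15]); /* 2 ... 32 */
  return 0;
}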

Note 1:

GCC version: 7.5.0-3ubuntu1~18.04

Clang version: 6.0.0-1ubuntu2

Note 2:

To make the analysis easier, use https://godbolt.org/
