
@MangaD
Created March 14, 2025 07:59

Comprehensive Guide to SIMD in C++

CC0

Disclaimer: ChatGPT generated document.

1. Introduction to SIMD

What is SIMD?

SIMD (Single Instruction, Multiple Data) is a parallel computing model where one instruction operates on multiple data elements simultaneously. It enables CPUs to process vectors of data in parallel, significantly improving performance for tasks like image processing, numerical computing, physics simulations, and cryptography.

How SIMD Works

Instead of processing one value at a time (scalar execution), SIMD allows processing multiple values in parallel:

  • Scalar Execution (Non-SIMD)

    for (int i = 0; i < 4; i++)
        result[i] = a[i] + b[i];  // four separate scalar additions, one per iteration
  • SIMD Execution

    • With 128-bit SSE registers, the CPU executes one instruction that processes four 32-bit floats at once.
    • With 256-bit AVX registers, a single instruction can handle eight 32-bit floats at once.

2. SIMD Support in Modern CPUs

Most modern processors support SIMD through vector instruction sets:

| SIMD Extension | Bit Width | Register Type | CPUs Supporting |
| --- | --- | --- | --- |
| SSE (Streaming SIMD Extensions) | 128-bit | `__m128` | Intel Pentium III+ (1999), AMD Athlon+ |
| SSE2, SSE3, SSSE3, SSE4 | 128-bit | `__m128d`, `__m128i` | Intel Pentium 4+, AMD K8+ |
| AVX (Advanced Vector Extensions) | 256-bit | `__m256` | Intel Sandy Bridge+ (2011), AMD Bulldozer+ |
| AVX2 | 256-bit | `__m256i` | Intel Haswell+ (2013), AMD Excavator+ |
| AVX-512 | 512-bit | `__m512` | Intel Skylake-X+ (2017), Xeon, AMD Zen 4 |

Checking SIMD Support

Use __builtin_cpu_supports() (GCC/Clang) or the __cpuid intrinsic (MSVC) to detect SIMD capabilities at runtime. Note that the example below uses the compile-time macro __AVX2__, which only tells you what instructions the compiler was allowed to emit, not what the CPU actually running the program supports.

#include <iostream>

int main() {
    #ifdef __AVX2__
        std::cout << "AVX2 supported!\n";
    #else
        std::cout << "AVX2 not supported.\n";
    #endif
}

3. Writing SIMD-Friendly C++ Code

Modern C++ compilers can auto-vectorize loops, but writing SIMD-aware code improves performance.

3.1. Enable Auto-Vectorization

Most modern compilers automatically vectorize loops if they detect SIMD opportunities.

βœ… Compiler Flags to Enable SIMD Optimization:

| Compiler | SIMD Optimization Flags |
| --- | --- |
| GCC/Clang | `-O2 -march=native` or `-O3 -march=native` |
| MSVC | `/O2 /arch:AVX2` |

Example:

g++ -O3 -march=native simd_example.cpp -o simd_example

πŸ’‘ -march=native enables the highest SIMD level supported by the CPU you compile on. Note that the resulting binary may not run on older CPUs that lack those instructions.


3.2. Writing SIMD-Friendly Loops

Compilers can auto-vectorize simple, independent loops.

βœ… Good: Simple Loop (Auto-Vectorizable)

#include <vector>
#include <iostream>

void multiply(std::vector<float>& data, float multiplier) {
    for (size_t i = 0; i < data.size(); i++)
        data[i] *= multiplier;
}

❌ Bad: Loop with Dependencies (Hard to Vectorize)

void sum_dependent(std::vector<float>& data) {
    for (size_t i = 1; i < data.size(); i++)  // Data dependency: data[i] depends on data[i-1]
        data[i] += data[i-1];
}

πŸ’‘ Avoid data dependencies in loops to allow auto-vectorization!


3.3. Use Compiler Hints (#pragma omp simd)

Use OpenMP SIMD pragmas to tell the compiler that a loop is safe to vectorize.

#include <vector>
#include <iostream>

void multiply(std::vector<float>& data, float multiplier) {
    #pragma omp simd
    for (size_t i = 0; i < data.size(); i++)
        data[i] *= multiplier;
}

πŸ’‘ The pragma asserts that the loop has no hidden dependencies, encouraging the compiler to emit SIMD code even when its own analysis is inconclusive. It requires an OpenMP-aware build flag (e.g. -fopenmp-simd on GCC/Clang); otherwise the pragma is ignored and the loop runs as ordinary scalar code.


3.4. Using Intrinsics for Manual SIMD

For fine-tuned control, use intrinsics (#include <immintrin.h>).

Example: Using AVX2 for Vector Addition

#include <immintrin.h>  // AVX2 header
#include <iostream>

void add_avx2(float* a, float* b, float* result, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {             // Process 8 floats per iteration
        __m256 va = _mm256_loadu_ps(&a[i]);  // Load 8 floats (unaligned load)
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vr = _mm256_add_ps(va, vb);   // Vectorized addition
        _mm256_storeu_ps(&result[i], vr);    // Store result
    }
    for (; i < n; ++i)                       // Scalar tail when n is not a multiple of 8
        result[i] = a[i] + b[i];
}

βœ… Processes 8 elements per iteration instead of 1! ❌ Not portable: requires AVX2 hardware and compiling with -mavx2 (or /arch:AVX2 on MSVC).


4. Alternative: Using std::valarray and SIMD Libraries

If you don't want to write intrinsics, SIMD-friendly alternatives exist.

4.1. std::valarray (Vectorization-Friendly)

The std::valarray class was designed with vectorized math in mind: its element-wise operations are frequently auto-vectorized by compilers, although the C++ standard does not guarantee SIMD code generation.

#include <iostream>
#include <valarray>

int main() {
    std::valarray<float> a = {1, 2, 3, 4, 5, 6, 7, 8};
    std::valarray<float> b = {10, 20, 30, 40, 50, 60, 70, 80};
    std::valarray<float> c = a + b;  // element-wise add, typically auto-vectorized

    for (auto x : c) std::cout << x << " ";  // 11, 22, 33, ...
}

πŸ’‘ Simple and portable; commonly auto-vectorized, though not guaranteed.


4.2. Libraries for SIMD

Instead of writing intrinsics, you can use SIMD abstraction libraries such as Eigen, xsimd, or Highway:

Example using Eigen:

#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::Vector4f a(1, 2, 3, 4);
    Eigen::Vector4f b(5, 6, 7, 8);
    Eigen::Vector4f c = a + b;  // SIMD-optimized

    std::cout << c.transpose() << "\n";  // Output: 6 8 10 12
}

πŸ’‘ Simplifies SIMD programming without intrinsics.


5. Summary: Writing Better SIMD Code

| Best Practice | Why? |
| --- | --- |
| Enable auto-vectorization (`-O3 -march=native`) | Lets the compiler optimize loops for SIMD. |
| Write SIMD-friendly loops (no dependencies) | Allows efficient auto-vectorization. |
| Use OpenMP `#pragma omp simd` | Declares loops safe to vectorize. |
| Use SIMD libraries (Eigen, xsimd) | Abstracts away low-level intrinsics. |
| Use intrinsics (`<immintrin.h>`) | Provides fine-grained SIMD control when needed. |

πŸš€ SIMD can give a 2-10x speedup for numerical code! πŸš€

@dusturb

dusturb commented Feb 4, 2026

This is an awesome summary! I knew nothing about SIMD and could follow along nonetheless, well done.

@soerlemans

Great explanation thank you
