libass, dav1d, FFmpeg, x264, and perhaps other projects I haven't noticed all contain a large amount of handwritten assembly optimization, and they all use the same pattern to build their assembly optimization infrastructure. This article introduces that infrastructure in three stages: coding, runtime, and test and benchmark.
Most people don't need this infrastructure, but they can learn some good engineering practices from it. Even when doing mundane work, it's good to understand how the pyramid is built.
- Coding stage
Under FFmpeg's libavutil, there are the following source files:
libavutil/aarch64/asm.S
libavutil/arm/asm.S
libavutil/riscv/asm.S
libavutil/x86/x86inc.asm
x264 is similar:
common/aarch64/asm.S
common/arm/asm.S
common/loongarch/loongson_asm.S
common/x86/x86inc.asm
Compared with FFmpeg: one fewer (no RISC-V), one more (Loongson).
dav1d merged arm/asm.S and aarch64/asm.S into a single file:
src/arm/asm.S
src/ext/x86/x86inc.asm
src/loongarch/loongson_asm.S
src/riscv/asm.S
Did these projects copy-paste from each other? Yes, they did: files with the same name have the same or overlapping authors. So which one is the original?
- For x86, x264 is the original, and other projects copied from x264.
- For arm, FFmpeg is the original, and the dav1d project rewrote it with significant changes.
- For RISC-V, FFmpeg and dav1d are written independently.
- For Loongson, the developers submitted one set of code to multiple projects simultaneously, so every copy is an original.
What is the purpose of asm.S? asm.S helps developers solve the following problems:
- How to define a function
- How to define constants
- How to handle multiple extension architectures to reduce code duplication
- And others like PIC addressing methods
Take the macro for defining a function in arm/asm.S as an example:
.macro function name, export=0, align=2
    .macro endfunc
ELF     .size   \name, . - \name
FUNC    .endfunc
        .purgem endfunc
    .endm
        .text
        .align  \align
    .if \export
        .global EXTERN_ASM\name
ELF     .type   EXTERN_ASM\name, %function
FUNC    .func   EXTERN_ASM\name
EXTERN_ASM\name:
        AARCH64_VALID_CALL_TARGET
    .else
ELF     .type   \name, %function
FUNC    .func   \name
\name:
    .endif
.endm
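As a usage sketch, a hypothetical exported AArch64 function written against these macros would look like the fragment below (the function name and body are invented for illustration; real code lives in the per-codec dsp assembly files):

```asm
// Hypothetical leaf function: returns a + b (arguments in w0/w1 per AAPCS64).
function ff_add_i32_example, export=1
        add     w0, w0, w1
        ret
endfunc
```

The macro pair takes care of the section, alignment, symbol type and size, the platform symbol prefix, and (on AArch64 with BTI) marking the entry point as a valid call target.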
- Two macros are defined here: function marks the start of a function, and endfunc marks its end.
- ELF distinguishes ELF targets (the executable format used by Linux and the BSDs) from non-ELF targets. ELF itself is defined as:
#ifdef __ELF__
# define ELF
#else
# define ELF #
#endif
That is, on non-ELF targets ELF expands to "#", the assembler's comment character, which comments out ".size \name, . - \name". FUNC is handled the same way as ELF.
- export controls whether the function is static or extern
- There is also an EXTERN_ASM\name, used to add a prefix to function symbols
Let's look at EXTERN_ASM in detail. We know C++ has name mangling (and demangling) issues. C does not require name mangling, since it lacks overloading, but C symbols can still be decorated. A typical example is Windows:
int _cdecl f (int x) { return 0; }
int _stdcall g (int y) { return 0; }
int _fastcall h (int z) { return 0; }
The symbols compiled for x86 32-bit are:
_f
_g@4
@h@4
The examples above are from Wikipedia. A simpler example on macOS:
static void foo() { }
int main() { foo(); }
Symbols generated by compilation:
0000000100000000 T __mh_execute_header
0000000100003f9c t _foo
0000000100003f84 T _main
In other words, without EXTERN_ASM to smooth over the per-platform decoration of C symbol names, the code you write will fail to resolve symbols at link time or produce errors at compile time.
In short, asm.S smooths over system and platform differences, providing the most basic facilities for assembly programming. If you are learning assembly programming, a good first step is to copy an asm.S file.
- Runtime
Some developers work in relatively homogeneous runtime environments and hardcode CPU architecture assumptions into their projects, for example assuming the CPU always supports AVX2. Once the code reaches client environments and meets devices from over a decade ago, a binary built with hardcoded CPU assumptions simply won't run.
When doing assembly optimization, multiple versions may be implemented for different CPU instruction-set extensions to achieve the highest performance on each CPU. For example, the motion compensation in FFmpeg's H.265 decoder has one version implemented with ARM NEON instructions and another with the dotprod extension, while the x86 acceleration has SSE4 and AVX2 versions.
Programs optimized with assembly need to avoid crashing while extracting maximum performance at runtime, which requires:
- Detecting CPU capabilities.
- Selecting the appropriate assembly-accelerated version based on the instruction-set extensions the CPU supports; in other words, assembly optimization needs runtime dynamic dispatch.
2.1 CPU detection
FFmpeg's libavutil/cpu.c and libavutil/$(arch)/cpu.c implement runtime CPU detection. There are many methods for CPU detection, such as the getauxval() API on Linux, but there is no universal one; the implementation is a patchwork across systems and CPU architectures.
nihui implemented an almost universal CPU detection using a trial-and-error method, see https://github.com/nihui/ruapu. If executing an instruction from the extension in question raises no exception, the extension is supported; catching an illegal-instruction signal means it is not. This method has its own limitations: installing signal handlers can interfere with debugging, and vendor-specific extensions may conflict, where the same machine code encodes different instructions on different vendors' chips. Yes, this magical thing can happen on RISC-V. Therefore, ruapu works well as a small testing tool, but if detection is embedded into a project as a library, a more conservative method is recommended. For projects already using FFmpeg, libavutil/cpu.h is sufficient.
2.2 dynamic dispatch
Functions that need assembly acceleration in FFmpeg are placed in a dsp.h header file, such as libavcodec/vvc/dsp.h, which lists the functions that need acceleration for H.266 decoding.
VVCDSPContext is divided according to the features of different decoding modules.
typedef struct VVCDSPContext {
    VVCInterDSPContext inter;
    VVCIntraDSPContext intra;
    VVCItxDSPContext itx;
    VVCLMCSDSPContext lmcs;
    VVCLFDSPContext lf;
    VVCSAODSPContext sao;
    VVCALFDSPContext alf;
} VVCDSPContext;
VVCInterDSPContext is a large set of APIs that need acceleration during the inter-frame motion compensation stage:
typedef struct VVCInterDSPContext {
    void (*put[2 /* luma, chroma */][7 /* log2(width) - 1 */][2 /* int, frac */][2 /* int, frac */])(
        int16_t *dst, const uint8_t *src, ptrdiff_t src_stride, int height,
        const int8_t *hf, const int8_t *vf, int width);

    void (*put_uni[2 /* luma, chroma */][7 /* log2(width) - 1 */][2 /* int, frac */][2 /* int, frac */])(
        uint8_t *dst, ptrdiff_t dst_stride, const uint8_t *src, ptrdiff_t src_stride, int height,
        const int8_t *hf, const int8_t *vf, int width);

    void (*put_uni_w[2 /* luma, chroma */][7 /* log2(width) - 1 */][2 /* int, frac */][2 /* int, frac */])(
        uint8_t *dst, ptrdiff_t dst_stride, const uint8_t *src, ptrdiff_t src_stride, int height,
        int denom, int wx, int ox, const int8_t *hf, const int8_t *vf, int width);

    /* ... remaining members omitted */
When assigning the dsp function pointers, the following order is used:
- First, assign a set of C function implementations
- If the CPU supports older extension instruction set acceleration, such as mmx, assign the function pointers to this implementation
- If the CPU supports newer extension instruction set acceleration, such as SSE4, assign the function pointers to this implementation.
- If the CPU supports newer or the latest extension instruction sets, assign them in sequence, so that each newer assignment overwrites the previous one.
For example:
if (EXTERNAL_MMX(cpu_flags)) {
    if (chroma_format_idc <= 1) {
    } else {
        c->h264_idct_add8 = ff_h264_idct_add8_422_8_mmx;
    }
}
if (EXTERNAL_MMXEXT(cpu_flags)) {
    c->h264_idct8_dc_add = ff_h264_idct8_dc_add_8_mmxext;
    c->weight_h264_pixels_tab[2] = ff_h264_weight_4_mmxext;
    c->biweight_h264_pixels_tab[2] = ff_h264_biweight_4_mmxext;
}
if (EXTERNAL_SSE2(cpu_flags)) {
    c->h264_idct8_add = ff_h264_idct8_add_8_sse2;
    c->h264_idct_add16 = ff_h264_idct_add16_8_sse2;
    // 5000 words omitted
    // ...
}
if (EXTERNAL_SSSE3(cpu_flags)) {
    c->biweight_h264_pixels_tab[0] = ff_h264_biweight_16_ssse3;
    c->biweight_h264_pixels_tab[1] = ff_h264_biweight_8_ssse3;
}
if (EXTERNAL_AVX(cpu_flags)) {
    c->h264_v_loop_filter_luma = ff_deblock_v_luma_8_avx;
    c->h264_h_loop_filter_luma = ff_deblock_h_luma_8_avx;
    c->h264_v_loop_filter_luma_intra = ff_deblock_v_luma_intra_8_avx;
    c->h264_h_loop_filter_luma_intra = ff_deblock_h_luma_intra_8_avx;
#if ARCH_X86_64
    c->h264_h_loop_filter_luma_mbaff = ff_deblock_h_luma_mbaff_8_avx;
#endif
    // 10000 words omitted
    // ...
}
This approach ensures that even in the worst case, when none of the desired extensions are available, the fallback C implementation is in place and no pointer is left unassigned. And because newer assignments overwrite older ones, the dispatch ultimately selects the fastest assembly implementation the CPU supports.
- Test (test and benchmark)
General programs may not require function-level test coverage, but assembly code must be tested at the function level: fixing a bug in C code is easy, but if assembly code is not thoroughly tested during development, locating and fixing bugs later is extremely difficult.
Function-level testing of assembly serves a second purpose: verifying and quantifying the speedup. Writing assembly is not just about being correct; it is about being fast. You are not done the moment you type ":wq", "ZZ", or ":x"; it is a process of repeated benchmarking and speed tuning.
checkasm is built for these two purposes:
- test the correctness of assembly code
- benchmark the speed of assembly code
checkasm was initially developed by x264 and later ported to FFmpeg and other projects. Earlier, we discussed how dynamic dispatch is implemented based on cpu detection information. If you look at libavutil/cpu.h, besides querying CPU capabilities, there is also a forced setting of CPU flags:
/**
* Return the flags which specify extensions supported by the CPU.
* The returned value is affected by av_force_cpu_flags() if that was used
* before. So av_get_cpu_flags() can easily be used in an application to
* detect the enabled cpu flags.
*/
int av_get_cpu_flags(void);
/**
* Disables cpu detection and forces the specified flags.
* -1 is a special case that disables forcing of specific flags.
*/
void av_force_cpu_flags(int flags);
If you want to check the speed of H.265 decoding with all assembly acceleration disabled, you don't need to modify the code and recompile; just use this ability to force the CPU flags:
ffmpeg -cpuflags 0 -threads 1 -i ~/Movies/4min-265.mp4 -an -t 60 -f null - -benchmark
frame= 1440 fps=102 q=-0.0 Lsize=N/A time=00:01:00.00 bitrate=N/A speed=4.25x
As a comparison:
ffmpeg -threads 1 -i ~/Movies/4min-265.mp4 -an -t 60 -f null - -benchmark
frame= 1440 fps=258 q=-0.0 Lsize=N/A time=00:01:00.00 bitrate=N/A speed=10.7x
The purpose of this ability is for testing:
- Set different CPU flags at runtime, and dynamic dispatch will bind to different code implementations. The most basic is the C version.
- Use the same parameters to call the functions implemented with different CPU flags. If they return the same results as the C version, it indicates the program logic is correct. Otherwise, it indicates a defect in the assembly acceleration code.
- One exception is floating-point tests, where the C version and the assembly-accelerated version may not produce bit-identical results.
- Run the loop many times (1024 by default) and compare the "time consumption" of the C version against the different assembly versions to quantify the speedup. The core code:
check_cpu_flag(NULL, 0);
for (i = 0; cpus[i].flag; i++)
    check_cpu_flag(cpus[i].name, cpus[i].flag);
check_cpu_flag runs all tests for the corresponding CPU flag:
/* Perform tests and benchmarks for the specified cpu flag if supported by the host */
static void check_cpu_flag(const char *name, int flag)
{
    int old_cpu_flag = state.cpu_flag;

    flag |= old_cpu_flag;
    av_force_cpu_flags(-1);
    state.cpu_flag = flag & av_get_cpu_flags();
    av_force_cpu_flags(state.cpu_flag);

    if (!flag || state.cpu_flag != old_cpu_flag) {
        int i;

        state.cpu_flag_name = name;
        for (i = 0; tests[i].func; i++) {
            if (state.test_pattern && wildstrcmp(tests[i].name, state.test_pattern))
                continue;
            state.current_test_name = tests[i].name;
            tests[i].func();
        }
    }
}
The idea is simple, but there is one crucial detail: how do you measure the "time consumption" of a function call?
In fact, checkasm avoids wall-clock time (unless nothing else works on the platform) and uses CPU cycles instead. Variations in system load, scheduling interference, and dynamic frequency scaling all make wall-clock runtimes fluctuate; the functions being accelerated are tiny, and timing them with clock_gettime introduces so much noise that the results become unreliable. Applications cannot always read the cycle counter directly: on Linux, the kernel exposes it through the perf_event_open system call, and macOS has the private kperf framework. The absolute values measured are not important; what matters is the ratio between the assembly-accelerated version and the C version, i.e., the speedup.
There is another detail in benchmarking: when submitting patches to FFmpeg, the baseline is generally built with compiler auto-vectorization disabled, so the comparison is between a naive C version and the hand-optimized version. Different compilers vectorize with varying success, and enabling auto-vectorization introduces more uncertainty, making results hard to reproduce and compare across setups. As a consequence, the speedups shown in the FFmpeg git log may be higher than what you would measure against an auto-vectorized build.
An example of running the checkasm benchmark:
apply_bdof_8_8x16_c: 7315.2 ( 1.00x)
apply_bdof_8_8x16_neon: 1876.8 ( 3.90x)
apply_bdof_8_16x8_c: 7170.5 ( 1.00x)
apply_bdof_8_16x8_neon: 1752.8 ( 4.09x)
apply_bdof_8_16x16_c: 14695.2 ( 1.00x)
apply_bdof_8_16x16_neon: 3490.5 ( 4.21x)
apply_bdof_10_8x16_c: 7371.5 ( 1.00x)
apply_bdof_10_8x16_neon: 1863.8 ( 3.96x)
apply_bdof_10_16x8_c: 7172.0 ( 1.00x)
apply_bdof_10_16x8_neon: 1766.0 ( 4.06x)
apply_bdof_10_16x16_c: 14551.5 ( 1.00x)
apply_bdof_10_16x16_neon: 3576.0 ( 4.07x)
apply_bdof_12_8x16_c: 7236.5 ( 1.00x)
apply_bdof_12_8x16_neon: 1863.8 ( 3.88x)
apply_bdof_12_16x8_c: 7316.5 ( 1.00x)
apply_bdof_12_16x8_neon: 1758.8 ( 4.16x)
apply_bdof_12_16x16_c: 14691.2 ( 1.00x)
apply_bdof_12_16x16_neon: 3480.5 ( 4.22x)
- The _c suffix is the C version
- The _neon suffix is the ARM neon accelerated version
- The second-to-last column is the relative "time consumption", measured in CPU cycles
- The last column is the speedup of the assembly version relative to the C version, generally around 4x here
PS: There is still a lot of optimization space in this version. I'm working on it.