Meng, Hengyu (airMeng)

airMeng / Debugging Xbyak via GDB.md
Last active March 4, 2025 03:01

The oneDNN team suggests using SDE to dump the JITted code, like the following:

You can dump the JITted kernel via the following C++ code:

#include <cstdio>

void dump(const void *code, size_t code_size)
{
    FILE *file = fopen("dump.bin", "wb+");
    if (file) {
        size_t unused = fwrite(code, code_size, 1, file);
        (void)unused;  // ignore the result; this is debug-only code
        fclose(file);
    }
}
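
A hedged usage sketch (my addition, not part of the gist preview): with an Xbyak generator, getCode() and getSize() give the buffer to pass to dump(); the resulting dump.bin can then be disassembled offline, for example with objdump -D -b binary -m i386:x86-64 dump.bin.

// assuming `code` is an instance of an Xbyak::CodeGenerator subclass
dump(code.getCode(), code.getSize());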
airMeng / Xbyak Learning Note.md
Last active December 25, 2023 06:57

Let's start with a naive case: the following Code defines a JIT function that takes pointers and integers as input and stores the sum into the address the fourth pointer points to.

#include <xbyak/xbyak_util.h>

struct Code : public Xbyak::CodeGenerator {
    Code()
    {
        // xbyak also provides advanced usage like StackFrame
        // see xbyak/sample/sf_test.cpp for how to use other parameters
        // Xbyak::util::StackFrame sf(this, 4);
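
The preview cuts off here. As a hedged reconstruction (my own sketch, not the gist's actual code, assuming the System V x64 calling convention and that all four arguments are int pointers), the kernel body and a caller might look like this:

#include <xbyak/xbyak.h>

struct SumCode : public Xbyak::CodeGenerator {   // hypothetical name
    SumCode()
    {
        // arguments arrive in rdi, rsi, rdx, rcx on System V x64
        mov(eax, ptr[rdi]);   // load *a
        add(eax, ptr[rsi]);   // add *b
        add(eax, ptr[rdx]);   // add *c
        mov(ptr[rcx], eax);   // *out = *a + *b + *c
        ret();
    }
};

int main()
{
    SumCode code;
    auto f = code.getCode<void (*)(const int*, const int*, const int*, int*)>();
    int a = 1, b = 2, c = 3, out = 0;
    f(&a, &b, &c, &out);      // expect out == 6
    return out == 6 ? 0 : 1;
}

With the commented-out Xbyak::util::StackFrame (from xbyak_util.h), the arguments would be reached portably via sf.p[0] .. sf.p[3] instead of hard-coded System V registers; see xbyak/sample/sf_test.cpp.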
airMeng / Sparse pattern for AMX.md
Last active December 25, 2023 06:57

As we all know, the AMX ISA introduces tdpbf16ps, which multiplies a 16x32 matrix by a 32x16 matrix as follows:

FOR m := 0 TO dst.rows - 1
	tmp := dst.row[m]
	FOR k := 0 TO (a.colsb / 4) - 1                                                         // colsb => bytes per col, in BF16 case k = [0, 16)
		FOR n := 0 TO (dst.colsb / 4) - 1                                               // colsb => bytes per col, in BF16 case n = [0, 16)
			tmp.fp32[n] += FP32(a.row[m].bf16[2*k+0]) * FP32(b.row[k].bf16[2*n+0])
			tmp.fp32[n] += FP32(a.row[m].bf16[2*k+1]) * FP32(b.row[k].bf16[2*n+1])
		ENDFOR
	ENDFOR
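
To make the semantics concrete, here is a hedged plain-C++ reference (my addition, not from the gist) of what one tdpbf16ps computes: A is a 16x32 BF16 tile, B is 16 rows of 32 BF16 values re-laid out VNNI-style so that row k holds, for each output column n, the K elements 2k and 2k+1, and C is a 16x16 FP32 tile that is accumulated into.

#include <cstdint>
#include <cstring>

// Convert a BF16 value (stored as uint16_t) to FP32 by placing it in the
// high 16 bits of the float's bit pattern.
static float bf16_to_fp32(uint16_t x)
{
    uint32_t bits = static_cast<uint32_t>(x) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Scalar reference of one tdpbf16ps, mirroring the pseudocode above.
void tdpbf16ps_ref(const uint16_t A[16][32], const uint16_t B[16][32], float C[16][16])
{
    for (int m = 0; m < 16; ++m)
        for (int k = 0; k < 16; ++k)
            for (int n = 0; n < 16; ++n) {
                C[m][n] += bf16_to_fp32(A[m][2 * k + 0]) * bf16_to_fp32(B[k][2 * n + 0]);
                C[m][n] += bf16_to_fp32(A[m][2 * k + 1]) * bf16_to_fp32(B[k][2 * n + 1]);
            }
}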
airMeng / Sparsity pattern collection.md
Last active December 25, 2023 06:57

We will enable the following sparsity patterns, currently introduced in different ways, for the NLP toolkits.

The first is the so-called 4x1 pattern, which I have described in detail in this gist.

The second, targeting AMX, is the so-called x16 pattern, which is described here.

The above sparse patterns can be produced by the current INC pruning.
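
As a hedged illustration (my addition; the exact conventions live in the linked gists), a check for the 4x1 pattern, interpreted here as blocks of 4 consecutive rows in one column being zero or nonzero together, could look like:

#include <cstddef>

// Hypothetical check: a rows x cols matrix (row-major) follows the 4x1 pattern
// if, within every 4x1 block (4 consecutive rows of one column), the elements
// are either all zero or all nonzero. This is my interpretation of "4x1".
bool is_4x1_sparse(const float* w, std::size_t rows, std::size_t cols)
{
    if (rows % 4 != 0) return false;
    for (std::size_t r = 0; r < rows; r += 4)
        for (std::size_t c = 0; c < cols; ++c) {
            int nonzero = 0;
            for (std::size_t i = 0; i < 4; ++i)
                nonzero += (w[(r + i) * cols + c] != 0.0f);
            if (nonzero != 0 && nonzero != 4) return false;  // mixed block breaks the pattern
        }
    return true;
}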

airMeng / Sparse Pattern for VNNI.md
Last active June 28, 2022 08:55

As we all know, sparse patterns must align with the target ISA, especially its GEMM instructions. VNNI introduces the following GEMM operation:

Description: Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate signed 16-bit results. Sum these 4 results with the corresponding 32-bit integer in src, and store the packed 32-bit results in dst.

Operation:

FOR j := 0 to 15
	tmp1.word := Signed(ZeroExtend16(a.byte[4*j]) * SignExtend16(b.byte[4*j]))
	tmp2.word := Signed(ZeroExtend16(a.byte[4*j+1]) * SignExtend16(b.byte[4*j+1]))
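
The preview cuts off mid-pseudocode; in the Intel intrinsics guide the loop continues with tmp3/tmp4 for bytes 4*j+2 and 4*j+3 and then accumulates all four products into dst.dword[j]. A hedged scalar C++ reference of the (non-saturating) vpdpbusd semantics (my addition, not from the gist):

#include <cstdint>

// Scalar reference of vpdpbusd over a 512-bit vector: for each of the 16 dwords,
// multiply 4 unsigned bytes of a with the corresponding 4 signed bytes of b and
// accumulate the 4 products into the matching 32-bit integer of src.
void vpdpbusd_ref(const uint8_t a[64], const int8_t b[64],
                  const int32_t src[16], int32_t dst[16])
{
    for (int j = 0; j < 16; ++j) {
        int32_t acc = src[j];
        for (int i = 0; i < 4; ++i)
            acc += static_cast<int32_t>(a[4 * j + i]) * static_cast<int32_t>(b[4 * j + i]);
        dst[j] = acc;
    }
}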