Meng, Hengyu (airMeng)

airMeng / Debugging Xbyak via GDB.md
Last active March 4, 2025 03:01

The oneDNN team suggests using SDE to dump the JITted code, like the following:

You can dump the JITted kernel via the following C++ code:

#include <cstdio>

void dump(const void *code, size_t code_size)
{
    FILE *file = fopen("dump.bin", "wb+");
    if (file) {
        size_t unused = fwrite(code, code_size, 1, file);
        (void)unused;  // ignore the result; this is debug-only code
        fclose(file);
    }
}
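
A hedged usage sketch (my addition, not part of the gist preview): with an Xbyak generator, getCode() and getSize() give the buffer to pass to dump(); the resulting dump.bin can then be disassembled offline, for example with objdump -D -b binary -m i386:x86-64 dump.bin.

// assuming `code` is an instance of an Xbyak::CodeGenerator subclass
dump(code.getCode(), code.getSize());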
airMeng / Xbyak Learning Note.md
Last active December 25, 2023 06:57

Let's start with a naive case: the following Code defines a JIT function that takes pointers and integers as input and stores the sum into the address the fourth pointer points to.

#include <xbyak/xbyak_util.h>

struct Code : public Xbyak::CodeGenerator {
    Code()
    {
        // xbyak also provides advanced usage like StackFrame
        // see xbyak/sample/sf_test.cpp for how to use other parameters
        // Xbyak::util::StackFrame sf(this, 4);
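
The preview cuts off here. As a hedged reconstruction (my own sketch, not the gist's actual code, assuming the System V x64 calling convention and that all four arguments are int pointers), the kernel body and a caller might look like this:

#include <xbyak/xbyak.h>

struct SumCode : public Xbyak::CodeGenerator {   // hypothetical name
    SumCode()
    {
        // arguments arrive in rdi, rsi, rdx, rcx on System V x64
        mov(eax, ptr[rdi]);   // load *a
        add(eax, ptr[rsi]);   // add *b
        add(eax, ptr[rdx]);   // add *c
        mov(ptr[rcx], eax);   // *out = *a + *b + *c
        ret();
    }
};

int main()
{
    SumCode code;
    auto f = code.getCode<void (*)(const int*, const int*, const int*, int*)>();
    int a = 1, b = 2, c = 3, out = 0;
    f(&a, &b, &c, &out);      // expect out == 6
    return out == 6 ? 0 : 1;
}

With the commented-out Xbyak::util::StackFrame (from xbyak_util.h), the arguments would be reached portably via sf.p[0] .. sf.p[3] instead of hard-coded System V registers; see xbyak/sample/sf_test.cpp.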
airMeng / Sparse pattern for AMX.md
Last active December 25, 2023 06:57

As we all know, the AMX ISA introduces tdpbf16ps, which multiplies a 16x32 matrix by a 32x16 matrix as follows:

FOR m := 0 TO dst.rows - 1
	tmp := dst.row[m]
	FOR k := 0 TO (a.colsb / 4) - 1                                                         // colsb => bytes per col, in BF16 case k = [0, 16)
		FOR n := 0 TO (dst.colsb / 4) - 1                                               // colsb => bytes per col, in BF16 case n = [0, 16)
			tmp.fp32[n] += FP32(a.row[m].bf16[2*k+0]) * FP32(b.row[k].bf16[2*n+0])
			tmp.fp32[n] += FP32(a.row[m].bf16[2*k+1]) * FP32(b.row[k].bf16[2*n+1])
		ENDFOR
	ENDFOR
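
To make the semantics concrete, here is a hedged plain-C++ reference (my addition, not from the gist) of what one tdpbf16ps computes: A is a 16x32 BF16 tile, B is 16 rows of 32 BF16 values re-laid out VNNI-style so that row k holds, for each output column n, the K elements 2k and 2k+1, and C is a 16x16 FP32 tile that is accumulated into.

#include <cstdint>
#include <cstring>

// Convert a BF16 value (stored as uint16_t) to FP32 by placing it in the
// high 16 bits of the float's bit pattern.
static float bf16_to_fp32(uint16_t x)
{
    uint32_t bits = static_cast<uint32_t>(x) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Scalar reference of one tdpbf16ps, mirroring the pseudocode above.
void tdpbf16ps_ref(const uint16_t A[16][32], const uint16_t B[16][32], float C[16][16])
{
    for (int m = 0; m < 16; ++m)
        for (int k = 0; k < 16; ++k)
            for (int n = 0; n < 16; ++n) {
                C[m][n] += bf16_to_fp32(A[m][2 * k + 0]) * bf16_to_fp32(B[k][2 * n + 0]);
                C[m][n] += bf16_to_fp32(A[m][2 * k + 1]) * bf16_to_fp32(B[k][2 * n + 1]);
            }
}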
airMeng / Sparsity pattern collection.md
Last active December 25, 2023 06:57

We will enable the following sparsity patterns, currently introduced in different ways, for the NLP toolkits.

The first is the so-called 4x1 pattern, which I have described in detail in this gist.

The second, targeting AMX, is the so-called x16 pattern, which is described here.

The above sparse patterns can be produced by the current INC pruning.
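
As a hedged illustration (my addition; the exact conventions live in the linked gists), a check for the 4x1 pattern, interpreted here as blocks of 4 consecutive rows in one column being zero or nonzero together, could look like:

#include <cstddef>

// Hypothetical check: a rows x cols matrix (row-major) follows the 4x1 pattern
// if, within every 4x1 block (4 consecutive rows of one column), the elements
// are either all zero or all nonzero. This is my interpretation of "4x1".
bool is_4x1_sparse(const float* w, std::size_t rows, std::size_t cols)
{
    if (rows % 4 != 0) return false;
    for (std::size_t r = 0; r < rows; r += 4)
        for (std::size_t c = 0; c < cols; ++c) {
            int nonzero = 0;
            for (std::size_t i = 0; i < 4; ++i)
                nonzero += (w[(r + i) * cols + c] != 0.0f);
            if (nonzero != 0 && nonzero != 4) return false;  // mixed block breaks the pattern
        }
    return true;
}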

airMeng / Sparse Pattern for VNNI.md
Last active June 28, 2022 08:55

As we all know, sparse patterns must align with the target ISA, especially its GEMM instructions. VNNI introduces the following GEMM operation:

Description: Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate signed 16-bit results. Sum these 4 results with the corresponding 32-bit integer in src, and store the packed 32-bit results in dst.

Operation:

FOR j := 0 to 15
	tmp1.word := Signed(ZeroExtend16(a.byte[4*j]) * SignExtend16(b.byte[4*j]))
	tmp2.word := Signed(ZeroExtend16(a.byte[4*j+1]) * SignExtend16(b.byte[4*j+1]))
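
The preview cuts off mid-pseudocode; in the Intel intrinsics guide the loop continues with tmp3/tmp4 for bytes 4*j+2 and 4*j+3 and then accumulates all four products into dst.dword[j]. A hedged scalar C++ reference of the (non-saturating) vpdpbusd semantics (my addition, not from the gist):

#include <cstdint>

// Scalar reference of vpdpbusd over a 512-bit vector: for each of the 16 dwords,
// multiply 4 unsigned bytes of a with the corresponding 4 signed bytes of b and
// accumulate the 4 products into the matching 32-bit integer of src.
void vpdpbusd_ref(const uint8_t a[64], const int8_t b[64],
                  const int32_t src[16], int32_t dst[16])
{
    for (int j = 0; j < 16; ++j) {
        int32_t acc = src[j];
        for (int i = 0; i < 4; ++i)
            acc += static_cast<int32_t>(a[4 * j + i]) * static_cast<int32_t>(b[4 * j + i]);
        dst[j] = acc;
    }
}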