Skip to content

Instantly share code, notes, and snippets.

@x0nu11byt3
x0nu11byt3 / elf_format_cheatsheet.md
Created February 27, 2021 05:26
ELF Format Cheatsheet

ELF Format Cheatsheet

Introduction

Executable and Linkable Format (ELF), is the default binary format on Linux-based systems.

ELF

Compilation

@animetosho
animetosho / gf2p8affineqb-articles.md
Last active May 12, 2025 14:29
A list of articles documenting uses of the GF2P8AFFINE instruction

Unexpected Uses for the Galois Field Affine Transformation Instruction

Intel added the Galois Field instruction set (GFNI) extensions to their Sunny Cove and Tremont cores. What’s particularly interesting is that GFNI is the only new SIMD extension that came with SSE and VEX/AVX encodings (in addition to EVEX/AVX512), to allow it to be supported on all future Intel cores, including those which don’t support AVX512 (such as the Atom line, as well as Celeron/Pentium branded “big” cores).

I suspect GFNI was aimed at accelerating SM4 encryption, however, one of the instructions can be used for many other purposes. The extension includes three instructions, but of particular interest here is the Affine Transformation (GF2P8AFFINEQB), aka bit-matrix multiply, instruction.

There have been various articles which discuss out-of-band

@animetosho
animetosho / galois-field-affine-uses.md
Last active May 12, 2025 11:25
A list of “out-of-band” uses for the GF2P8AFFINEQB instruction I haven’t seen documented elsewhere

Count Leading/Trailing Zero Bits (Byte-wise)

Counting the trailing zero bit count (TZCNT) can be done by isolating the lowest bit, then depositing this into the appropriate locations for the count. The leading zero bit count (LZCNT) can be done by reversing bits, then computing the TZCNT.

__m128i _mm_tzcnt_epi8(__m128i a) {
	// isolate lowest bit
	a = _mm_andnot_si128(_mm_add_epi8(a, _mm_set1_epi8(0xff)), a);
	// convert lowest bit to index
@vegard
vegard / quantize-tikz.cc
Created April 15, 2020 11:34
Float to byte quantisation
#if 0
(g++-9 $0 || g++ $0) && \
./a.out > output.tex && \
pdflatex output && \
exec convert -density 400 -flatten output.pdf -resize 25% output.png
exit 1
#endif
#include <cmath>
#include <cstdio>
// x64 encoding
enum Reg {
RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI,
R8, R9, R10, R11, R12, R13, R14, R15,
};
enum XmmReg {
XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7,
XMM8, XMM9, XMM10, XMM11, XMM12, XMM13, XMM14, XMM15,

The trick to designing transpose algorithms for both small and large problems is to recognize their simple recursive structure.

For a matrix A, let's denote its transpose by T(A) as a shorthand. First, suppose A is a 2x2 matrix:

    [A00 A01]
A = [A10 A11]

Then we have:

@pkhuong
pkhuong / signal-on-instruction-count.c
Created February 18, 2020 02:21
Trigger a unix signal based on the HW instruction counter PMC
#define RUN_ME /*
exec cc -g -ggdb -O2 -W -Wall -std=c99 $0 -o "$(basename $0 .c)"
*/
/*
* Copyright 2020 Paul Khuong
* SPDX-License-Identifier: BSD-2-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
@pkhuong
pkhuong / libbts.c
Last active October 18, 2022 09:01
minimal BTS tracing wrapper for linux perf
#define RUN_ME /*
exec cc -O2 -W -Wall -std=c99 -shared $0 -o "$(basename $0 .c).so" -fPIC
*/
/*
* Copyright 2019 Paul Khuong
* SPDX-License-Identifier: BSD-2-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
@travisdowns
travisdowns / cache-counters-rant.md
Created October 13, 2019 16:46
Discussion of x86 L1D related cache counters

The counters that are the easiest to understand and the best for making ratios that are internally consistent (i.e., always fall in the range 0.0 to 1.0) are the mem_load_retired events, e.g., mem_load_retired.l1_hit and mem_load_retired.l1_miss.

These count at the instruction level, i.e., the universe of retired instructions. For example, could make a reasonable hit ratio from mem_load_retired.l1_hit / mem_inst_retired.all_loads and it will be sane (never indicate a hit rate more than 100%, for example).

That one isn't perfect though, in that it may not reflect the true costs of cache misses and the behavior of the program for at least the following reasons:

  • It appplies only to loads and can't catch misses imposed by stores (AFAICT there is no event that counts store misses).
  • It only counts loads that retire - a lot of the load activity in your process may be due to loads on a speculative path that never retire. Loads on a speculative path may bring in data that is never used, causing misses and d