dougallj’s gists

dougallj / draw-patterns.c

Last active January 22, 2022 13:11

All distinct 4x4 patterns

	#define STB_IMAGE_WRITE_IMPLEMENTATION
	#include "stb_image_write.h"

	#define WIDTH_IN_BLOCKS 29
	#define HEIGHT_IN_BLOCKS 28

	#define PADDING 4

	#define BLOCK_WIDTH (4 * 4)
	#define BLOCK_HEIGHT (4 * 4)

dougallj / asm.s

Created January 3, 2018 08:55

x86-64 Speculative Execution Harness

dougallj / gist:9211fd24c3759f7f340dede28929c659

Last active June 5, 2024 04:21 — forked from zwegner/gist:6841688fa897aef64a11016967c36f2d

Ternary logic multiplication (0, 1, unknown)

	N_BITS = 8
	MASK = (1 << N_BITS) - 1

	class Ternary:
	    def __init__(self, ones, unknowns):
	        self.ones = ones & MASK
	        self.unknowns = unknowns & MASK
	        assert (self.ones & self.unknowns) == 0, (bin(self.ones), bin(self.unknowns))

	    def __add__(self, other):

dougallj / aarch64_amx.py

Last active August 12, 2025 08:47

	# IDA (disassembler) and Hex-Rays (decompiler) plugin for Apple AMX
	#
	# WIP research. (This was edited to add more info after someone posted it to
	# Hacker News. Click "Revisions" to see full changes.)
	#
	# Copyright (c) 2020 dougallj


	# Based on Python port of VMX intrinsics plugin:
	# Copyright (c) 2019 w4kfu - Synacktiv

dougallj / a-readme.txt

Last active November 6, 2023 19:20

	Raw data. These were dumped from iPhones/iPads using wall-timers, not
	perf-counters. They contain some likely issues and inconsistencies that
	haven't been fully investigated. Mostly correct, but it's worth
	double-checking anything odd. (For example, "TBL (two register table)"
	can have better throughput than is listed sometimes, as can some other
	three-operand SIMD things iirc.)

	The goal is to find the fastest rate at which an instruction can run. If
	there are multiple rows with the same label, the "correct" value is the
	minimum. For example:

dougallj / 0-readme.txt

Last active February 21, 2023 09:01

AArch64 ternary logic optimisation tables (VPTERNLOGD -> ARM SIMD/Crypto/SVE/Scalar)

	Usage: evaluate a ternary bitwise function with the values a=0xf0, b=0xcc, c=0xaa.
	On Intel you can pass the result directly to VPTERNLOGD. On A64, look up the value
	in the following tables to find a short, equivalent sequence of operations.

	Entries selected for throughput, not latency (though generally they seem to be
	optimal for both).

	I've only used a couple of entries and found them to be correct. Sorry if there are
	errors. Note that SVE changed the operand order to bsl (why???), so that's svbsl.
	Generally names are a mix between the opcodes and what I found readable (mostly
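As a rough illustration of the usage note above, here is a minimal Python sketch of how the lookup value can be computed. The helper names `ternlog_imm` and `ternlog_eval` are hypothetical and do not come from the gist; the only things taken from the readme are the constants a=0xf0, b=0xcc, c=0xaa. The reference model assumes the usual convention that the a bit is the most significant bit of the truth-table index.

```python
# Constants from the readme. Bit i of each constant is that input's value
# in row i of the truth table (a is index bit 2, b is bit 1, c is bit 0).
A, B, C = 0xF0, 0xCC, 0xAA


def ternlog_imm(f):
    """Hypothetical helper: 8-bit truth table of a bitwise function f(a, b, c)."""
    return f(A, B, C) & 0xFF


def ternlog_eval(imm, a, b, c, width=32):
    """Hypothetical reference model: apply truth table imm to each bit lane."""
    result = 0
    for i in range(width):
        # Build the 3-bit row index from bit i of each input.
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        result |= ((imm >> idx) & 1) << i
    return result


# A few common three-input functions and their table values.
bitselect = ternlog_imm(lambda a, b, c: (a & b) | (~a & c))
xor3 = ternlog_imm(lambda a, b, c: a ^ b ^ c)
majority = ternlog_imm(lambda a, b, c: (a & b) | (a & c) | (b & c))

print(hex(bitselect), hex(xor3), hex(majority))
```

The resulting byte is the value to look up in the tables (or, on hardware with VPTERNLOGD, the immediate). `ternlog_eval` is only there to check that a table entry behaves like the original expression on sample inputs.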

dougallj / 0-readme.txt

Last active February 26, 2025 08:11

Intel/AGX ternary logic optimisation tables (VPTERNLOGD -> AGX/x86/BMI/SSE2)

	Usage: evaluate a ternary bitwise function with the values a=0xf0, b=0xcc, c=0xaa.
	On AVX-512 you can pass the result directly to VPTERNLOGD. On other platforms,
	look up the value in the following tables to find a short, equivalent sequence of
	operations.

	For A64/SVE/Neon see https://gist.github.com/dougallj/10c3ffdbd07229db2cc8b0430d7ccd39

	The tables here are:
	* agx: "not" and all binary operations (as used in Apple GPUs, but possibly useful elsewhere):

Dougall Johnson dougallj