@dougallj
dougallj / draw-patterns.c
Last active January 22, 2022 13:11
All distinct 4x4 patterns
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"
#define WIDTH_IN_BLOCKS 29
#define HEIGHT_IN_BLOCKS 28
#define PADDING 4
#define BLOCK_WIDTH (4 * 4)
#define BLOCK_HEIGHT (4 * 4)
dougallj / asm.s
Created January 3, 2018 08:55
x86-64 Speculative Execution Harness
global _time_load
global _cache_flush
global _run_attempt
extern _bools
extern _values
extern _pointers
section .text
dougallj / gist:9211fd24c3759f7f340dede28929c659
Last active June 5, 2024 04:21 — forked from zwegner/gist:6841688fa897aef64a11016967c36f2d
Ternary logic multiplication (0, 1, unknown)
N_BITS = 8
MASK = (1 << N_BITS) - 1
class Ternary:
    def __init__(self, ones, unknowns):
        self.ones = ones & MASK
        self.unknowns = unknowns & MASK
        assert (self.ones & self.unknowns) == 0, (bin(self.ones), bin(self.unknowns))
    def __add__(self, other):
# IDA (disassembler) and Hex-Rays (decompiler) plugin for Apple AMX
#
# WIP research. (This was edited to add more info after someone posted it to
# Hacker News. Click "Revisions" to see full changes.)
#
# Copyright (c) 2020 dougallj
# Based on Python port of VMX intrinsics plugin:
# Copyright (c) 2019 w4kfu - Synacktiv
Raw data. These were dumped from iPhones/iPads using wall-clock timers, not
performance counters. They likely contain some issues and inconsistencies that
haven't been fully investigated. The data is mostly correct, but it's worth
double-checking anything odd. (For example, "TBL (two register table)"
can have better throughput than is listed sometimes, as can some other
three-operand SIMD things iirc.)
The goal is to find the fastest rate at which an instruction can run. If
there are multiple rows with the same label, the "correct" value is the
minimum. For example:
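
As a rough illustration of the minimum rule (not the author's own example), the sketch below collapses repeated labels to their smallest measurement. The labels and numbers are made-up placeholders, and it assumes each row stores a cost (such as cycles per instruction) where smaller means faster.

```python
# Hypothetical rows: (label, measured cost). These values are invented for
# illustration and are not taken from the raw data.
rows = [
    ("ADD (example)", 0.34),
    ("ADD (example)", 0.17),
    ("MUL (example)", 1.02),
    ("MUL (example)", 0.50),
]


def best_per_label(rows):
    """Keep the minimum measurement seen for each label."""
    best = {}
    for label, value in rows:
        if label not in best or value < best[label]:
            best[label] = value
    return best


result = best_per_label(rows)
for label, value in sorted(result.items()):
    print(label, value)
```

Taking the minimum treats the slower rows as measurement noise or a less favourable test setup, which matches the stated goal of finding the fastest achievable rate.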
dougallj / 0-readme.txt
Last active February 21, 2023 09:01
AArch64 ternary logic optimisation tables (VPTERNLOGD -> ARM SIMD/Crypto/SVE/Scalar)
Usage: evaluate a ternary bitwise function with the values a=0xf0, b=0xcc, c=0xaa.
On Intel you can pass the result directly to VPTERNLOGD. On A64, look up the value
in the following tables to find a short, equivalent sequence of operations.
Entries selected for throughput, not latency (though generally they seem to be
optimal for both).
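
A minimal Python sketch of the "evaluate with a=0xf0, b=0xcc, c=0xaa" step. The helper name `ternlog_index` and the three sample functions are illustrative choices, not part of the tables.

```python
# The three constants from the usage note. Between them, their bits enumerate
# all eight combinations of (a, b, c), so applying a bitwise function to them
# yields that function's 8-bit truth table.
A, B, C = 0xF0, 0xCC, 0xAA


def ternlog_index(f):
    """Return the 8-bit table index for a three-input bitwise function.

    `f` is a hypothetical callback taking (a, b, c) as Python ints.
    The result is masked to 8 bits because `~` yields negative ints in Python.
    """
    return f(A, B, C) & 0xFF


bitselect = ternlog_index(lambda a, b, c: (a & b) | (~a & c))
majority = ternlog_index(lambda a, b, c: (a & b) | (a & c) | (b & c))
xor3 = ternlog_index(lambda a, b, c: a ^ b ^ c)

print(hex(bitselect), hex(majority), hex(xor3))  # 0xca 0xe8 0x96
```

The printed value is the number to look up in the tables (or to pass as the immediate on hardware that has VPTERNLOGD).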
I've only used a couple of entries and found them to be correct. Sorry if there are
errors. Note that SVE changed the operand order to bsl (why???), so that's svbsl.
Generally names are a mix between the opcodes and what I found readable (mostly
dougallj / 0-readme.txt
Last active November 6, 2023 19:20
Intel/AGX ternary logic optimisation tables (VPTERNLOGD -> AGX/x86/BMI/SSE2)
Usage: evaluate a ternary bitwise function with the values a=0xf0, b=0xcc, c=0xaa.
On AVX-512 you can pass the result directly to VPTERNLOGD. On other platforms,
look up the value in the following tables to find a short, equivalent sequence of
operations.
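
One way to sanity-check a table entry is to confirm that a candidate sequence of operations reproduces the index on all eight single-bit inputs. The sketch below does this in Python. The helper names and the XOR/AND/XOR candidate are illustrative assumptions, not entries copied from the tables.

```python
from itertools import product


def truth_table_bit(index, a, b, c):
    """Bit of the 8-bit table selected by single-bit inputs a, b, c.

    With a=0xf0, b=0xcc, c=0xaa, bit position i of the index corresponds to
    a = bit 2 of i, b = bit 1 of i, c = bit 0 of i.
    """
    return (index >> ((a << 2) | (b << 1) | c)) & 1


def matches(index, f):
    """True if the bitwise function `f` agrees with `index` on all 8 inputs."""
    return all(
        truth_table_bit(index, a, b, c) == (f(a, b, c) & 1)
        for a, b, c in product((0, 1), repeat=3)
    )


def candidate(a, b, c):
    """Hypothetical three-operation form of bit-select (a ? b : c)."""
    return c ^ (a & (b ^ c))


ok = matches(0xCA, candidate)
print(ok)  # True
```

Because a bitwise function acts on every bit position independently, checking the eight single-bit cases is enough to cover all inputs of any width.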
For A64/SVE/Neon see https://gist.github.com/dougallj/10c3ffdbd07229db2cc8b0430d7ccd39
The tables here are:
* agx: "not" and all binary operations (as used in Apple GPUs, but possibly useful elsewhere):