Skip to content

Instantly share code, notes, and snippets.

@grahamking
grahamking / runperf
Last active November 7, 2023 21:17
Bash script to run a benchmark under decent conditons.
#!/bin/bash
#
# Usage: runperf ./my-benchmark-binary
#
# Script to run a benchmark / performance test in decent conditions. Based on:
# - https://www.llvm.org/docs/Benchmarking.html
# - "Performance Analysis and Tuning on Modern CPU" by Denis Bakhvalov, Appendix A.
# - https://github.com/andikleen/pmu-tools
#
# Note that this doesn't do any actual benchmarking, your binary must be able to do that all by itself.
@mlliarm
mlliarm / gists.md
Last active November 8, 2022 02:50
Gists
@acutmore
acutmore / README.md
Last active February 27, 2025 19:55
Emulating a 4-Bit Virtual Machine in (TypeScript\JavaScript) (just Types no Script)

A compile-time 4-Bit Virtual Machine implemented in TypeScript's type system. Capable of running a sample 'FizzBuzz' program.

Syntax emits zero JavaScript.

type RESULT = VM<
  [
    ["push", N_1],         // 1
    ["push", False],       // 2
 ["peek", _], // 3

Foreward

This document was originally written several years ago. At the time I was working as an execution core verification engineer at Arm. The following points are coloured heavily by working in and around the execution cores of various processors. Apply a pinch of salt; points contain varying degrees of opinion.

It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling.

Mostly based upon the RISC-V ISA spec v2.0. Some updates have been made for v2.2

Original Foreword: Some Opinion

The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing encoding, etc. This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions which impacts code density both in terms of size and

@nadavrot
nadavrot / Matrix.md
Last active April 16, 2025 20:31
Efficient matrix multiplication

High-Performance Matrix Multiplication

This is a short post that explains how to write a high-performance matrix multiplication program on modern processors. In this tutorial I will use a single core of the Skylake-client CPU with AVX2, but the principles in this post also apply to other processors with different instruction sets (such as AVX512).

Intro

Matrix multiplication is a mathematical operation that defines the product of

@jboner
jboner / latency.txt
Last active April 17, 2025 13:47
Latency Numbers Every Programmer Should Know
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD