pretentious7

Gists

In reverse chronological order from gist creation:

2022-04-18: Multiple keyboard layouts on Raspberry Pi OS 64-bit, Debian Bullseye.
2022-02-28: YouTube from the Linux CLI.
2022-02-26: How to use RSS to get updates from Twitter hashtags.
2022-02-25: Factoring 10...01.
2021-12-23: An oldtimer's APL Code (tradfn v., dfn v.).
2021-12-22: Testing SymPy.

A compile-time 4-Bit Virtual Machine implemented in TypeScript's type system. Capable of running a sample 'FizzBuzz' program.

Syntax emits zero JavaScript.

type RESULT = VM<
  [
    ["push", N_1],         // 1
    ["push", False],       // 2
    ["peek", _],           // 3

Foreword

This document was originally written several years ago. At the time I was working as an execution core verification engineer at Arm. The following points are coloured heavily by working in and around the execution cores of various processors. Apply a pinch of salt; points contain varying degrees of opinion.

It is still my opinion that RISC-V could be much better designed, though I will also say that if I were building a 32- or 64-bit CPU today, I'd likely implement the architecture to benefit from the existing tooling.

This document is mostly based upon the RISC-V ISA spec v2.0. Some updates have been made for v2.2.

Original Foreword: Some Opinion

The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing encoding, etc. This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions which impacts code density both in terms of size and

High-Performance Matrix Multiplication

This is a short post that explains how to write a high-performance matrix multiplication program on modern processors. In this tutorial I will use a single core of the Skylake-client CPU with AVX2, but the principles in this post also apply to other processors with different instruction sets (such as AVX512).
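
For reference, the operation being optimized can be written as a plain triple loop. The sketch below is only a correctness baseline, not the post's optimized kernel: it is written in Python for brevity, and the name `matmul_naive` and the list-of-rows layout are assumptions made for illustration. It also assumes both inputs are non-empty.

```python
def matmul_naive(a, b):
    """Multiply a (m x k) by b (k x n); both are non-empty lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    if any(len(row) != n for row in b):
        raise ValueError("rows of b have inconsistent lengths")

    # Result matrix, m x n, initialized to zero.
    c = [[0.0] * n for _ in range(m)]

    # i-p-j loop order: the innermost loop walks one row of b and one row
    # of c contiguously, which is the access pattern an optimized kernel
    # would also try to preserve.
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):
                c[i][j] += a_ip * b[p][j]
    return c


if __name__ == "__main__":
    print(matmul_naive([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

A baseline like this is mainly useful for checking the output of a faster implementation on small inputs.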

Intro

Matrix multiplication is a mathematical operation that defines the product of

	#!/bin/bash
	#
	# Usage: runperf ./my-benchmark-binary
	#
	# Script to run a benchmark / performance test in decent conditions. Based on:
	# - https://www.llvm.org/docs/Benchmarking.html
	# - "Performance Analysis and Tuning on Modern CPUs" by Denis Bakhvalov, Appendix A.
	# - https://github.com/andikleen/pmu-tools
	#
	# Note that this doesn't do any actual benchmarking; your binary must be able to do that all by itself.

	Latency Comparison Numbers (~2012)
	----------------------------------
	L1 cache reference                           0.5 ns
	Branch mispredict                            5   ns
	L2 cache reference                           7   ns                      14x L1 cache
	Mutex lock/unlock                           25   ns
	Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
	Compress 1K bytes with Zippy             3,000   ns        3 us
	Send 1K bytes over 1 Gbps network       10,000   ns       10 us
	Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD