JasonCC


High-Performance Matrix Multiplication

This short post explains how to write a high-performance matrix multiplication program for modern processors. In this tutorial I use a single core of a Skylake-client CPU with AVX2, but the same principles apply to processors with other instruction sets (such as AVX-512).

Intro

Matrix multiplication is a mathematical operation that defines the product of two matrices.
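
As a point of reference, the product C = A * B of an M x K matrix A and a K x N matrix B is the M x N matrix with entries C[i][j] = sum over k of A[i][k] * B[k][j]. Below is a minimal sketch of the textbook triple loop in C, the kind of unoptimized baseline that a tuned kernel is usually measured against. The function name `matmul_naive`, the row-major layout, and the `float` element type are assumptions made for illustration; they are not taken from the post.

```c
#include <stddef.h>

/* Naive matrix multiply: C (M x N) = A (M x K) * B (K x N).
 * All matrices are assumed to be dense, row-major, and non-overlapping. */
void matmul_naive(size_t M, size_t N, size_t K,
                  const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            /* Dot product of row i of A with column j of B. */
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}
```

Note that the inner loop walks B with a stride of N elements, so for large matrices it touches a new cache line on almost every iteration; that access pattern is one reason this version tends to run far below what the hardware can deliver.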

Program Analysis Resources

(draft; work in progress)

	Latency Comparison Numbers (~2012)
	----------------------------------
	L1 cache reference                       0.5 ns
	Branch mispredict                        5   ns
	L2 cache reference                       7   ns             14x L1 cache
	Mutex lock/unlock                       25   ns
	Main memory reference                  100   ns             20x L2 cache, 200x L1 cache
	Compress 1K bytes with Zippy         3,000   ns     3 us
	Send 1K bytes over 1 Gbps network   10,000   ns    10 us
	Read 4K randomly from SSD*         150,000   ns   150 us    ~1GB/sec SSD

	#define _CRT_SECURE_NO_DEPRECATE

	#include <stdio.h>
	#include <string.h>
	#include <Windows.h>

	// This allocates a "magic ring buffer" that is mapped twice, with the two
	// copies being contiguous in (virtual) memory. The advantage of this is
	// that this allows any function that expects data to be contiguous in
	// memory to read from (or write to) such a buffer. It also means that