(draft; work in progress)
See also:
- Compilers
- Program analysis:
- Dynamic analysis - instrumentation, translation, sanitizers
(draft; work in progress)
See also:
This is a short post that explains how to write a high-performance matrix multiplication program on modern processors. In this tutorial I will use a single core of the Skylake-client CPU with AVX2, but the principles in this post also apply to other processors with different instruction sets (such as AVX512).
Matrix multiplication is a mathematical operation that defines the product of
#define _CRT_SECURE_NO_DEPRECATE | |
#include <stdio.h> | |
#include <string.h> | |
#include <Windows.h> | |
// This allocates a "magic ring buffer" that is mapped twice, with the two | |
// copies being contiguous in (virtual) memory. The advantage of this is | |
// that this allows any function that expects data to be contiguous in | |
// memory to read from (or write to) such a buffer. It also means that |
Latency Comparison Numbers (~2012) | |
---------------------------------- | |
L1 cache reference 0.5 ns | |
Branch mispredict 5 ns | |
L2 cache reference 7 ns 14x L1 cache | |
Mutex lock/unlock 25 ns | |
Main memory reference 100 ns 20x L2 cache, 200x L1 cache | |
Compress 1K bytes with Zippy 3,000 ns 3 us | |
Send 1K bytes over 1 Gbps network 10,000 ns 10 us | |
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD |