(draft; work in progress)
See also:
- Compilers
- Program analysis:
- Dynamic analysis - instrumentation, translation, sanitizers
(draft; work in progress)
See also:
This is a short post that explains how to write a high-performance matrix multiplication program on modern processors. In this tutorial I will use a single core of the Skylake-client CPU with AVX2, but the principles in this post also apply to other processors with different instruction sets (such as AVX512).
Matrix multiplication is a mathematical operation that defines the product of
| #define _CRT_SECURE_NO_DEPRECATE | |
| #include <stdio.h> | |
| #include <string.h> | |
| #include <Windows.h> | |
| // This allocates a "magic ring buffer" that is mapped twice, with the two | |
| // copies being contiguous in (virtual) memory. The advantage of this is | |
| // that this allows any function that expects data to be contiguous in | |
| // memory to read from (or write to) such a buffer. It also means that |
| Latency Comparison Numbers (~2012) | |
| ---------------------------------- | |
| L1 cache reference 0.5 ns | |
| Branch mispredict 5 ns | |
| L2 cache reference 7 ns 14x L1 cache | |
| Mutex lock/unlock 25 ns | |
| Main memory reference 100 ns 20x L2 cache, 200x L1 cache | |
| Compress 1K bytes with Zippy 3,000 ns 3 us | |
| Send 1K bytes over 1 Gbps network 10,000 ns 10 us | |
| Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD |