by Angel Leon. March 17, 2015;
Last update on December 14, 2023
Updated on February 27, 2023
Updated August 29, 2019.
This is a short post that explains how to write a high-performance matrix multiplication program on modern processors. In this tutorial I will use a single core of a Skylake-client CPU with AVX2, but the principles in this post also apply to processors with other instruction sets (such as AVX-512).
Matrix multiplication is a mathematical operation that defines the product of
| SDK = xcrun -sdk macosx | |
| all: compute.metallib compute | |
| compute.metallib: Compute.metal | |
| # Metal intermediate representation (.air) | |
| $(SDK) metal -c -Wall -Wextra -std=osx-metal2.0 -o /tmp/Compute.air $^ | |
| # Metal library (.metallib) | |
| $(SDK) metallib -o $@ /tmp/Compute.air |
| import React, { Component } from 'react'; | |
| import styled from 'styled-components'; | |
| const Figure = styled.figure` | |
| height: 0; | |
| margin: 0; | |
| background-color: #efefef; | |
| position: relative; | |
| padding-bottom: ${props => props.ratio}%; | |
| `; |
In this article, we will see how to use CRTP, std::variant and std::visit to improve the performance of our code.
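As a rough illustration of how these three pieces can fit together, here is a minimal C++17 sketch. It is not taken from the article itself: the names `Shape`, `Square`, `Rect`, `AnyShape` and `total_area` are hypothetical and chosen only for this example. The CRTP base resolves `area()` at compile time, and `std::variant` with `std::visit` stands in for a container of base-class pointers with virtual calls.

```cpp
#include <cassert>
#include <variant>
#include <vector>

// CRTP base: the derived type is a template parameter, so area() is
// resolved at compile time instead of through a vtable.
template <typename Derived>
struct Shape {
    double area() const {
        return static_cast<const Derived&>(*this).area_impl();
    }
};

// Hypothetical shape types used only for this sketch.
struct Square : Shape<Square> {
    double side;
    explicit Square(double s) : side(s) { assert(s >= 0.0); }
    double area_impl() const { return side * side; }
};

struct Rect : Shape<Rect> {
    double w, h;
    Rect(double w_, double h_) : w(w_), h(h_) {
        assert(w_ >= 0.0 && h_ >= 0.0);
    }
    double area_impl() const { return w * h; }
};

// A closed set of alternatives stored by value, with no heap allocation
// and no virtual functions.
using AnyShape = std::variant<Square, Rect>;

// std::visit dispatches on the alternative currently held by each variant;
// the generic lambda is instantiated once per shape type.
double total_area(const std::vector<AnyShape>& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes) {
        sum += std::visit([](const auto& shape) { return shape.area(); }, s);
    }
    return sum;
}
```

Calling `total_area` on a vector holding `Square{2.0}` and `Rect{3.0, 4.0}` sums the two areas without any virtual dispatch. Whether this is actually faster than a virtual-function design depends on the compiler and the workload, so it is worth measuring rather than assuming.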
| .root { | |
| display: block; | |
| position: relative; | |
| } | |
| .lqip { | |
| image-rendering: pixelated; | |
| width: 100%; | |
| opacity: 1; | |
| transition: opacity 50ms 100ms ease-out; |
| mlir-opt matmult.mlir -convert-linalg-to-loops -lower-affine -convert-scf-to-cf -convert-linalg-to-llvm -convert-memref-to-llvm -convert-func-to-llvm -reconcile-unrealized-casts > out.mlir | |
| mlir-cpu-runner out.mlir -O3 -e main -entry-point-result=void --shared-libs=libmlir_runner_utils.dylib |