Starting point: In IREE/Linalg,
- There ultimately won't be any packing ops (because anything resembling a packing op will get fused into the producer).
- But there will still be a packed layout (because a matmul performs O(N^3) accesses to O(N^2) data, so it is an optimization to materialize the ultimate matmul lhs/rhs operands as buffers, and efficiency considerations mean that the layout of these buffers will have some block structure).
So a priori, at some stage of lowering just before codegen, our matmuls have as lhs/rhs inputs some materialized buffers in some possibly nontrivial layouts.
I started thinking about how much we can restrict layouts. If you look at classic GEMM papers like the BLIS papers, it can scare you because they make it sound like you need N nested levels of blocks, and some of that may need to reflect the CPU cache hierarchy.
And so in particular that makes it sound like the introduction of packed layouts is adding new degrees of freedom, new dimensions to the search space in the search for optimal matmul lowerings to CPU.
It turns out (and this will be the main influx of design ideas from ruy into this project) that we should stick to a very simple, restricted form of layout, with only 1 level of block subdivision, and that level is entirely dictated by the ISA (SIMD instructions and shape of the SIMD register space), so it's not even a "degree of freedom" at all.
For example, when lowering a i8 matmul to ARM-dotprod, the packed layout must be: one level of tiling by 8x4 blocks.
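To make that concrete, here is a minimal NumPy sketch of a single level of tiling into row-major 8x4 blocks (the block shape is taken from the ARM-dotprod example above; `pack_8x4` is a hypothetical name, and padding of ragged edges is omitted):

```python
import numpy as np

def pack_8x4(m: np.ndarray) -> np.ndarray:
    """Pack a (rows x cols) i8 matrix into row-major 8x4 blocks.

    Illustrative sketch only: real packing code would also handle
    padding when rows/cols are not multiples of the block shape.
    """
    rows, cols = m.shape
    assert rows % 8 == 0 and cols % 4 == 0  # padding omitted for brevity
    # Reshape to (block-row, row-in-block, block-col, col-in-block), then
    # reorder so blocks are laid out row-major, with each block itself
    # row-major: result[i, j] is the 8x4 block at block coordinates (i, j).
    return (m.reshape(rows // 8, 8, cols // 4, 4)
             .transpose(0, 2, 1, 3)
             .copy())

a = np.arange(32, dtype=np.int8).reshape(8, 4)
# A single 8x4 matrix is exactly one block, so packing leaves it unchanged.
assert np.array_equal(pack_8x4(a).reshape(8, 4), a)
```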
This happens to generalize to all ARM (and almost certainly x86) CPU uarchs from the past 5 years and into the next 3 years.
Another point that I would like to make is that (again unlike the tradition embodied by the BLIS paper) we should be using essentially the same packed layouts for LHS and RHS (with only a minor variation possible in the case where we need to narrow a kernel by 2x in one dimension to fit in register space). How is that possible? With a normal matmul, the inner loop traverses the LHS along rows and the RHS along columns, so there is a fundamental asymmetry. The solution is to implement not exactly matmul, but "MM^T": a matmul op where the RHS is transposed, so that we are now traversing both LHS and RHS along rows.
This "MM^T" op now lends itself to the simplest possible description of the generality of packed layouts that it needs to be fed with:
- Both LHS and RHS operands of "MM^T" should be in a layout with 1 level of block subdivision; the blocks are row-major and of a power-of-two rectangular shape fixed by the ISA; the outer layout is also row-major.
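As a sketch of why this is symmetric, here is an illustrative NumPy loop nest computing "MM^T" one block at a time; the block shapes `M0`, `N0`, `K0` are placeholders for whatever the ISA fixes, not values prescribed by the text above:

```python
import numpy as np

# Illustrative block shapes: LHS blocks are M0 x K0, RHS blocks are N0 x K0.
M0, N0, K0 = 8, 8, 4

def mmt(lhs: np.ndarray, rhs: np.ndarray) -> np.ndarray:
    """Compute lhs @ rhs.T by traversing both operands along rows."""
    M, K = lhs.shape
    N, K2 = rhs.shape
    assert K == K2  # shared contraction dimension
    acc = np.zeros((M, N), dtype=np.int32)
    for i in range(0, M, M0):
        for j in range(0, N, N0):
            for k in range(0, K, K0):
                # Both blocks are read along rows of their operand -- the
                # symmetry that a plain matmul (rows of LHS, columns of
                # RHS) lacks.
                a = lhs[i:i+M0, k:k+K0].astype(np.int32)
                b = rhs[j:j+N0, k:k+K0].astype(np.int32)
                acc[i:i+M0, j:j+N0] += a @ b.T
    return acc

lhs = np.random.randint(-128, 128, (16, 8), dtype=np.int8)
rhs = np.random.randint(-128, 128, (16, 8), dtype=np.int8)
assert np.array_equal(mmt(lhs, rhs),
                      lhs.astype(np.int32) @ rhs.T.astype(np.int32))
```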
So at this point I'm thinking that we should have something like a "linalg.mmt" op that would be a lowering path for linalg.matmul. Unlike linalg.matmul, it should be an internal IR thing, not part of the "API".
Another way in which this might be useful: in the quantized case with zero points, the need to pass pre-computed sums to the matmul kernel was awkward to fit into linalg.matmul. Instead, we could let the "mmt" op take such extra parameters.
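For context on why pre-computed sums are all the kernel needs in the zero-point case: the quantized product expands into the raw integer mmt plus per-row-sum corrections. A small NumPy check of that identity (all names and values below are illustrative):

```python
import numpy as np

# With zero points za, zb, expanding the quantized product gives
#   sum_k (a_ik - za)(b_jk - zb)
#     = sum_k a_ik b_jk  -  zb * rowsum(A)_i  -  za * rowsum(B)_j  +  K*za*zb
# so the kernel only needs the raw mmt plus per-row sums computed once,
# outside the inner loop.
rng = np.random.default_rng(0)
K = 8
A = rng.integers(0, 256, (4, K)).astype(np.int32)
B = rng.integers(0, 256, (5, K)).astype(np.int32)
za, zb = 3, 7  # illustrative zero points

reference = (A - za) @ (B - zb).T
corrected = (A @ B.T
             - zb * A.sum(axis=1, keepdims=True)   # pre-computed LHS row sums
             - za * B.sum(axis=1)                  # pre-computed RHS row sums
             + K * za * zb)
assert np.array_equal(reference, corrected)
```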
WDYT?