Here is a matmul with two ops, producer_lhs and producer_rhs, fused into it. The producers have a cost.
They could be just reading constant data (e.g. weights of a conv op) or they could be more expensive math
(e.g. a math-function activation of the preceding layer). Either way, they have non-negligible cost
(even reading constant data has the cost of memory accesses).
for (int i = 0; i < M; i++) {
  for (int j = 0; j < N; j++) {
    for (int k = 0; k < K; k++) {
      result[i][j] += producer_lhs(i, k) * producer_rhs(k, j);
    }
  }
}
Claim: to perform this N^3 work on N^2 data efficiently, we need:
- the output of producer_lhs and producer_rhs to be materialized as a plain buffer as large as the source matrices.
- the loop nest to be transformed into a traversal that is suitably local in both i and j.
- structuring the loop nest to have the nicest scanline traversal of one lhs/rhs side results in a worst-case traversal of the opposite side.
- example: the above loop nest has its outermost loop over i, which is nicest for the lhs: each row is accessed in only one iteration of the outer loop, so there is no need to materialize the entire producer_lhs output buffer at once. But that causes the entire rhs to be fully retraversed M times.
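To make the example above concrete, here is a minimal C sketch of the i-outermost loop nest with call counters on the producers. The producer bodies, the counters, and the name matmul_lhs_row_buffered are all made up for illustration. It only shows the asymmetry: a K-element row buffer is enough to evaluate each lhs element once, while each rhs element is re-evaluated M times unless the whole rhs is materialized.

```c
/* Hypothetical producers, stand-ins for whatever gets fused into the matmul.
   Each call bumps a counter so that re-evaluation is visible. */
static long lhs_calls = 0;
static long rhs_calls = 0;

static float producer_lhs(int i, int k) { lhs_calls++; return (float)(i + k); }
static float producer_rhs(int k, int j) { rhs_calls++; return (float)(k - j); }

/* i-outermost matmul. One row of the lhs is materialized per outer iteration
   (lhs_row must hold K floats), while the rhs producer is re-evaluated inside
   the loop nest. result is row-major M x N and must be zero-initialized. */
static void matmul_lhs_row_buffered(int M, int N, int K,
                                    float *result, float *lhs_row) {
  for (int i = 0; i < M; i++) {
    /* Each lhs element is produced exactly once: M*K calls in total. */
    for (int k = 0; k < K; k++) {
      lhs_row[k] = producer_lhs(i, k);
    }
    for (int j = 0; j < N; j++) {
      for (int k = 0; k < K; k++) {
        /* The full K x N rhs is re-produced for every i: M*K*N calls. */
        result[i * N + j] += lhs_row[k] * producer_rhs(k, j);
      }
    }
  }
}
```

With M=3, N=4, K=5 this makes 15 calls to producer_lhs (each element once) and 60 calls to producer_rhs (each of the 20 elements 3 times). Swapping the roles of i and j just moves the problem to the other side.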
 
Conclusions:
- while the packing op may no longer exist as a discrete op during execution, the packed matrices will have to exist in memory at runtime (possibly as constant data): the whole matrix, not just a block at a time.
agree?
Thanks a lot for the explanation!
I understand (now - thanks) that packing just gets fused into the preceding op, so there's nothing special about it.
What's special is that it is consumed by a matmul.
If instead it were consumed by an elementwise op, we would never think of materializing its output in a buffer.
But because it's consumed by a matmul (even if it isn't consumed by anything other than this single matmul), which makes repeated accesses, it becomes worth materializing its (entire) output in a (large) buffer.
How does the compiler know that? Is there some cost model? (Technically this could be seen from the actual accesses made by the matmul, but I suppose that information only becomes available at a late stage of lowering, and we need it earlier).
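To make the question concrete, here is a toy sketch of what "seeing it from the accesses" could look like. This is not any compiler's actual cost model: the OperandAccess struct, the function names, and the two cost parameters are all assumptions for illustration. The idea is just that a matmul reads each lhs element N times and each rhs element M times, while an elementwise consumer reads each element once.

```c
/* Hypothetical summary of how a consumer reads one operand. */
typedef struct {
  long reads;        /* total element reads issued by the consumer */
  long unique_elems; /* number of distinct elements in the operand */
} OperandAccess;

static OperandAccess matmul_lhs_access(long M, long N, long K) {
  return (OperandAccess){M * N * K, M * K};
}
static OperandAccess matmul_rhs_access(long M, long N, long K) {
  return (OperandAccess){M * N * K, K * N};
}
static OperandAccess elementwise_access(long n) {
  return (OperandAccess){n, n};
}

/* Average number of times each element is read. */
static double reuse_factor(OperandAccess a) {
  return (double)a.reads / (double)a.unique_elems;
}

/* Assumed toy rule: compare re-running the producer on every read against
   producing each element once, storing it, and reading it back from a buffer.
   producer_cost and buffer_cost are abstract per-element costs. */
static int worth_materializing(OperandAccess a, double producer_cost,
                               double buffer_cost) {
  double fused = (double)a.reads * producer_cost;
  double materialized = (double)a.unique_elems * (producer_cost + buffer_cost)
                      + (double)a.reads * buffer_cost;
  return fused > materialized;
}
```

Under this rule an elementwise consumer (reuse factor 1) never justifies a buffer, whereas a 64x64x64 matmul with a producer 4x as expensive as a buffer access does. Whether a real compiler has this information early enough in lowering is exactly the open question above.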