Here is a matmul with two ops, producer_lhs and producer_rhs, fused into it. The producers have a cost:
they could be just reading constant data (e.g. the weights of a conv op), or they could be more expensive
math (e.g. the activation function of the preceding layer). Either way, their cost is non-negligible
(even reading constant data has the cost of memory accesses).
for (int i = 0; i < M; i++) {
  for (int j = 0; j < N; j++) {
    for (int k = 0; k < K; k++) {
      result[i][j] += producer_lhs(i, k) * producer_rhs(k, j);
    }
  }
}
Claim: to perform this O(N^3) work on O(N^2) data efficiently (taking M, N, K all of order N), we need:
- the outputs of producer_lhs and producer_rhs to be materialized as plain buffers as large as the source matrices.
- the loop nest to be transformed into a traversal that is suitably local in both i and j.
- note that structuring the loop nest for the nicest scanline traversal of one side (lhs or rhs) results in a worst-case traversal of the opposite side.
- example: the above loop nest has its outermost loop over i, which is nicest for the LHS - each row is accessed in only one iteration of the outer loop, so there is no need to materialize the entire producer_lhs output buffer at once. But that causes the entire RHS to be fully retraversed M times.
 
Conclusions:
- while the packing op may not exist anymore as a discrete op during execution, the packed matrices will have to exist in memory at runtime (possibly as constant data): the whole matrix, not just one block at a time.
agree?
Thanks Benoit and Ben for starting this...
The problem I am trying to address here is the possibility of avoiding recomputing packing. Consider the following IR:
Now assume we have two (hypothetical, not implemented yet) ops:
linalg.pack, which does the packing for the operands, and linalg.packed_matmul, which operates on packed tensors. A transformation matmul -> packed_matmul will look like the following. This can be naturally tiled on the outermost dim of both lhs and rhs, and will then look like the following:
As you can see, distributing these loops leads to recomputing
packed_lhs and packed_rhs for each workgroup. One possibility to amortize the quadratic packing cost is to lift the packing loops out:
This can now be distributed into workgroups as two 1-D grid dispatches for packing plus one 2-D dispatch for the computation, assuming dispatches can be executed according to a tile dependency:
(i, j, t) depends on (i, 0:N-1, t-1). If this isn't the preferred way of scheduling workloads, we need explicit synchronization instead. Otherwise we fall back to the version where packing is fused with its consumer, which leads to redundant compute, which is what we have now.