README.md

Here is a matmul with two ops, producer_lhs and producer_rhs, fused into it. The producers have a cost: they could be just reading constant data (e.g. the weights of a conv op), or they could be doing more expensive math (e.g. a math-function activation of the preceding layer). Either way, the cost is non-negligible (even reading constant data costs memory accesses).

for (int i = 0; i < M; i++) {
  for (int j = 0; j < N; j++) {
    for (int k = 0; k < K; k++) {
      result[i][j] += producer_lhs(i, k) * producer_rhs(k, j);
    }
  }
}

Claim: to perform this N^3 work on N^2 data efficiently, we need:

1. the output of producer_lhs and producer_rhs to be materialized as plain buffers as large as the source matrices.
2. the loop nest to be transformed into a traversal that is suitably local in both i and j.
   - structuring the loop nest to have the nicest scanline traversal of one side (lhs or rhs) results in a worst-case traversal of the opposite side.
   - example: the above loop nest has its outermost loop over i, which is nicest for lhs: each row is accessed in only one iteration of the outer loop, so there is no need to materialize the entire producer_lhs output buffer at once. But that causes the entire rhs to be fully retraversed M times.
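To make the first point concrete, here is a minimal C sketch that counts producer evaluations with and without materialization. The producer bodies, the `producer_calls` counter, and the function names `matmul_fused` and `matmul_materialized` are all hypothetical stand-ins for illustration; real producers would be whatever ops got fused into the matmul.

```c
#include <stddef.h>

/* Counts every producer evaluation (hypothetical instrumentation). */
long producer_calls = 0;

/* Hypothetical producers: any per-element computation would do. */
float producer_lhs(int i, int k) {
  producer_calls++;
  return (float)(i + 2 * k);
}

float producer_rhs(int k, int j) {
  producer_calls++;
  return (float)(k - j);
}

/* Fused form: both producers are re-evaluated in the innermost loop,
   so each runs M*N*K times. result is row-major M x N. */
void matmul_fused(int M, int N, int K, float *result) {
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
      float acc = 0.0f;
      for (int k = 0; k < K; k++) {
        acc += producer_lhs(i, k) * producer_rhs(k, j);
      }
      result[(size_t)i * N + j] = acc;
    }
  }
}

/* Materialized form: each producer element is evaluated exactly once
   into a plain buffer (lhs is M x K, rhs is K x N, both row-major),
   and the matmul then only reads those buffers. */
void matmul_materialized(int M, int N, int K,
                         float *lhs, float *rhs, float *result) {
  for (int i = 0; i < M; i++)
    for (int k = 0; k < K; k++)
      lhs[(size_t)i * K + k] = producer_lhs(i, k);
  for (int k = 0; k < K; k++)
    for (int j = 0; j < N; j++)
      rhs[(size_t)k * N + j] = producer_rhs(k, j);

  for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
      float acc = 0.0f;
      for (int k = 0; k < K; k++) {
        acc += lhs[(size_t)i * K + k] * rhs[(size_t)k * N + j];
      }
      result[(size_t)i * N + j] = acc;
    }
  }
}
```

With M=3, N=5, K=4, the fused form makes 2*M*N*K = 120 producer calls, while the materialized form makes M*K + K*N = 32. This sketch only addresses the first point; the materialized loop nest is still the naive one and still retraverses the whole rhs buffer M times.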

Conclusions:

while the packing op may no longer exist as a discrete op during execution, the packed matrices will have to exist in memory at runtime (possibly as constant data): the whole matrix, not just one block at a time.

agree?

workgroup.dispatch {
  %packed_lhs = (a sequence of subtensor + pad ops operating on lhs tile i) : tensor<?x4x128xf32>
  %lhs_view = subtensor %packed_lhs(%workgroup_id_y, %c0, %c0) [4, 128] .... : tensor<?x4x128xf32> to tensor<4x128xf32>
  %packed_rhs = (a sequence of subtensor + pad ops operating on rhs tile j) : tensor<?x128x4xf32>
  %rhs_view = subtensor %packed_rhs(%workgroup_id_x, %c0, %c0) [128, 4] .... : tensor<128x4xf32>
  %tile_result = linalg.matmul(%lhs_view, %rhs_view) : (tensor<4x128xf32>, tensor<128x4xf32>) -> tensor<4x4xf32>
  // .. insert tile_result in dst_tile(i, j)
}

workgroup.dispatch {
  // A sequence of subtensor + pad ops ...
} -> (tensor<?x128xf32>) -> tensor<?x4x128xf32>
workgroup.dispatch {
  // A sequence of subtensor + pad ops ...
} -> (tensor<128x?xf32>) -> tensor<?x128x4xf32>
workgroup.dispatch {
  %lhs_view = subtensor %packed_lhs(%workgroup_id_x, %c0, %c0) [4, 128] .... : tensor<?x4x128xf32> to tensor<4x128xf32>
  %rhs_view = subtensor %packed_rhs(%workgroup_id_y, %c0, %c0) [128, 4] .... : tensor<128x4xf32>
  %tile_result = linalg.matmul(%lhs_view, %rhs_view) : (tensor<4x128xf32>, tensor<128x4xf32>) -> tensor<4x4xf32>
  // .. insert tile_result in dst_tile(i, j)
}
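For readers more comfortable with loops than with dispatch regions, the second form (pack both operands up front, then compute one result tile per tile pair) might look roughly like this in plain C. This is only a sketch under stated assumptions: `TILE`, `pack_lhs`, `pack_rhs` and `matmul_packed` are hypothetical names, the tile size 4 is borrowed from the 4x128 and 128x4 shapes above, K is left as a parameter rather than fixed at 128, and zero padding stands in for the pad ops.

```c
#include <stdlib.h>

enum { TILE = 4 }; /* assumed tile size, matching the 4x128 / 128x4 shapes */

/* lhs is row-major M x K. Returns a zero-padded buffer laid out as
   [tiles_m][TILE][K], or NULL on allocation failure. Caller frees. */
float *pack_lhs(const float *lhs, int M, int K) {
  int tiles_m = (M + TILE - 1) / TILE;
  float *packed = calloc((size_t)tiles_m * TILE * K, sizeof *packed);
  if (packed == NULL) return NULL;
  for (int i = 0; i < M; i++)
    for (int k = 0; k < K; k++)
      packed[((size_t)(i / TILE) * TILE + i % TILE) * K + k] =
          lhs[(size_t)i * K + k];
  return packed;
}

/* rhs is row-major K x N. Returns a zero-padded buffer laid out as
   [tiles_n][K][TILE], or NULL on allocation failure. Caller frees. */
float *pack_rhs(const float *rhs, int K, int N) {
  int tiles_n = (N + TILE - 1) / TILE;
  float *packed = calloc((size_t)tiles_n * K * TILE, sizeof *packed);
  if (packed == NULL) return NULL;
  for (int k = 0; k < K; k++)
    for (int j = 0; j < N; j++)
      packed[((size_t)(j / TILE) * K + k) * TILE + j % TILE] =
          rhs[(size_t)k * N + j];
  return packed;
}

/* Each (ti, tj) iteration plays the role of one workgroup: it reads one
   contiguous lhs tile and one contiguous rhs tile and writes one
   TILE x TILE block of the row-major M x N result. */
void matmul_packed(const float *packed_lhs, const float *packed_rhs,
                   int M, int N, int K, float *result) {
  int tiles_m = (M + TILE - 1) / TILE;
  int tiles_n = (N + TILE - 1) / TILE;
  for (int ti = 0; ti < tiles_m; ti++) {
    for (int tj = 0; tj < tiles_n; tj++) {
      const float *lhs_view = packed_lhs + (size_t)ti * TILE * K; /* TILE x K */
      const float *rhs_view = packed_rhs + (size_t)tj * K * TILE; /* K x TILE */
      for (int ii = 0; ii < TILE; ii++) {
        for (int jj = 0; jj < TILE; jj++) {
          int i = ti * TILE + ii;
          int j = tj * TILE + jj;
          if (i >= M || j >= N) continue; /* skip the padded region */
          float acc = 0.0f;
          for (int k = 0; k < K; k++)
            acc += lhs_view[(size_t)ii * K + k] * rhs_view[(size_t)k * TILE + jj];
          result[(size_t)i * N + j] = acc;
        }
      }
    }
  }
}
```

Note that both packed buffers are allocated for the whole matrix before any tile is computed, which is the point made in the conclusion above: every lhs tile is reused across all rhs tiles and vice versa, so neither side can be produced one block at a time without recomputation.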

bjacob/README.md

asaadaldien commented Feb 24, 2021
