@bjacob
Last active February 24, 2021 03:36

Here is a matmul with two ops, producer_lhs and producer_rhs, fused into it. The producers have a cost. They could be just reading constant data (e.g. the weights of a conv op) or they could be more expensive math (e.g. the math-function activation of the preceding layer). Either way, they have non-negligible cost (even reading constant data has the cost of memory accesses).

// Naive matmul with the producers fused in: each producer element is
// re-evaluated on every use inside the O(M*N*K) loop nest.
for (int i = 0; i < M; i++) {
  for (int j = 0; j < N; j++) {
    for (int k = 0; k < K; k++) {
      result[i][j] += producer_lhs(i, k) * producer_rhs(k, j);
    }
  }
}

Claim: to perform this N^3 work on N^2 data efficiently, we need:

  • the outputs of producer_lhs and producer_rhs to be materialized as plain buffers as large as the source matrices.
  • the loop nest to be transformed into a traversal that is suitably local in both i and j.
    • structuring the loop nest for the nicest scanline traversal of one side (lhs or rhs) results in a worst-case traversal of the opposite side.
    • example: the above loop nest has the outermost loop over i, which is nicest for the lhs - each row is accessed in only one iteration of the outer loop, so there is no need to materialize the entire producer_lhs output buffer at once. But that causes the entire rhs to be fully re-traversed M times (see the sketch after this list).
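
As a minimal sketch of that claim in plain C (the tile size TILE, the float element type, and the assumption that TILE divides M and N are all mine, not from the gist): materialize the producer outputs once into plain buffers, then traverse the output in blocks so each pass is local in both i and j.

#include <stdlib.h>

// Hypothetical element producers, standing in for the fused ops above.
float producer_lhs(int i, int k);
float producer_rhs(int k, int j);

enum { TILE = 4 };  // assumed tile size; M and N assumed divisible by TILE

void matmul_materialized(int M, int N, int K, float *result /* MxN, zeroed */) {
  // Materialize the producer outputs once: O(M*K + K*N) producer calls
  // instead of re-evaluating them inside the O(M*N*K) loop nest.
  float *lhs = malloc(sizeof(float) * M * K);
  float *rhs = malloc(sizeof(float) * K * N);
  for (int i = 0; i < M; i++)
    for (int k = 0; k < K; k++)
      lhs[i * K + k] = producer_lhs(i, k);
  for (int k = 0; k < K; k++)
    for (int j = 0; j < N; j++)
      rhs[k * N + j] = producer_rhs(k, j);

  // Blocked traversal: each (i0, j0) tile touches only TILE rows of lhs and
  // TILE columns of rhs, so the traversal is local in both i and j.
  for (int i0 = 0; i0 < M; i0 += TILE)
    for (int j0 = 0; j0 < N; j0 += TILE)
      for (int k = 0; k < K; k++)
        for (int i = i0; i < i0 + TILE; i++)
          for (int j = j0; j < j0 + TILE; j++)
            result[i * N + j] += lhs[i * K + k] * rhs[k * N + j];

  free(lhs);
  free(rhs);
}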

Conclusions:

  • while the packing op may no longer exist as a discrete op during execution, the packed matrices will have to exist in memory at runtime (possibly as constant data) - the whole matrix, not just a block at a time.

agree?

@asaadaldien commented Feb 24, 2021

Thanks Benoit and Ben for starting this...

The problem I am trying to address here is the possibility of avoiding recomputing packing. Consider the following IR:

%0 = linalg.matmul(%lhs, %rhs) : (tensor<?x128xf32>, tensor<128x?xf32>) -> tensor<?x?xf32>

Now assume we have two (hypothetical, not implemented yet) ops: a linalg.pack that does the packing of the operands, and a linalg.packed_matmul that operates on packed tensors. A transformation matmul -> packed_matmul would look like the following:

%0 = linalg.pack(%lhs, %c0) : tensor<?x128xf32> -> tensor<?x4x128xf32> 
%1 = linalg.pack(%rhs, %c1) : tensor<128x?xf32> -> tensor<?x128x4xf32>
%2 = linalg.packed_matmul(%0, %1) : (tensor<?x4x128xf32>, tensor<?x128x4xf32>) -> tensor<?x?xf32>
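
For intuition only, here is a rough C sketch of what such a pack could do to the lhs, reading the tensor<?x128xf32> -> tensor<?x4x128xf32> shape change as grouping rows into contiguous panels of 4, zero-padding a trailing partial panel. This is just one plausible reading of the shapes above, not a definition of the hypothetical linalg.pack; the rhs side would be the analogous column-panel layout giving tensor<?x128x4xf32>.

#include <string.h>

// Assumed row-panel packing of an M x 128 lhs into ceil(M/4) panels of
// shape 4x128, zero-padding the trailing partial panel. `packed` has
// shape [num_panels][4][128], matching tensor<?x4x128xf32>.
void pack_lhs(const float *lhs, int M, float *packed) {
  const int K = 128;
  int num_panels = (M + 3) / 4;
  memset(packed, 0, sizeof(float) * num_panels * 4 * K);
  for (int i = 0; i < M; i++) {
    int panel = i / 4, row = i % 4;
    memcpy(&packed[(panel * 4 + row) * K], &lhs[i * K], sizeof(float) * K);
  }
}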

This can be naturally tiled on the outermost dim of both lhs and rhs, and it will look like the following:

// Now the tiled packed op loops will look like the following
scf.for %i = %c0 to %lhs_tiles_size step %c1 {
    %packed_lhs = (a sequence of subtensor + pad ops operating on lhs tile i) : tensor<?x4x128xf32>
    %lhs_view = subtensor %packed_lhs(%i, %c0, %c0) [4, 128] .... : tensor<?x4x128xf32> to tensor<4x128xf32>
    scf.for %j = %c0 to %rhs_tiles_size step %c1 {
        %packed_rhs = (a sequence of subtensor + pad ops operating on rhs tile j) : tensor<?x128x4xf32>
        %rhs_view = subtensor %packed_rhs(%j, %c0, %c0) [128, 4] .... : tensor<?x128x4xf32> to tensor<128x4xf32>
        %tile_result = linalg.matmul(%lhs_view, %rhs_view) : (tensor<4x128xf32>, tensor<128x4xf32>) -> tensor<4x4xf32>
        // .. insert %tile_result into dst_tile(i, j)
    }
}

As you can see, distributing these loops leads to recomputing %packed_lhs and %packed_rhs in every workgroup.
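
To put rough numbers on that (the example sizes M = N = 1024 are my assumption; the 4-wide tiles come from the IR above): with packing fused into each workgroup, the number of per-tile packing ops scales with the number of (i, j) workgroups, whereas hoisting packs each tile exactly once.

#include <stdio.h>

// Back-of-the-envelope count of per-tile packing ops for an M x 128 by
// 128 x N matmul with 4x4 destination tiles (assumed example sizes).
int main(void) {
  long M = 1024, N = 1024;                 // assumed sizes, for illustration
  long fused = (M / 4) * (N / 4) * 2;      // each (i, j) workgroup packs its lhs and rhs tile
  long hoisted = (M / 4) + (N / 4);        // each tile packed exactly once
  printf("fused: %ld tile packs, hoisted: %ld tile packs\n", fused, hoisted);
  return 0;
}

With these example sizes that is 131072 tile packs versus 512.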

One possibility for amortizing the quadratic packing cost is to hoist the packing loops:

%packed_lhs = scf.for %i = %c0 to %lhs_tiles_size step %c1 (%lhs) {
  // A sequence of subtensor + pad ops ...
} -> (tensor<?x128xf32>) -> tensor<?x4x128xf32>

%packed_rhs = scf.for %j = %c0 to %rhs_tiles_size step %c1 (%rhs) {
  // A sequence of subtensor + pad ops ...
} -> (tensor<128x?xf32>) -> tensor<?x128x4xf32>

scf.for %i = %c0 to %lhs_tiles_size step %c1 {
    %lhs_view = subtensor %packed_lhs(%i, %c0, %c0) [4, 128] .... : tensor<?x4x128xf32> to tensor<4x128xf32>
    scf.for %j = %c0 to %rhs_tiles_size step %c1 {
        %rhs_view = subtensor %packed_rhs(%j, %c0, %c0) [128, 4] .... : tensor<?x128x4xf32> to tensor<128x4xf32>
        %tile_result = linalg.matmul(%lhs_view, %rhs_view) : (tensor<4x128xf32>, tensor<128x4xf32>) -> tensor<4x4xf32>
        // .. insert %tile_result into dst_tile(i, j)
    }
}

This can now be distributed into workgroups as 2 x 1-D grid dispatches for the packing plus a 2-D dispatch for the computation, assuming dispatches can be executed according to a tile dependency where (i, j, t) depends on (i, 0:N-1, t-1). If this isn't the preferred way of scheduling workloads, we have to have explicit synchronization. Otherwise we have to do the version where packing is fused with its consumer, which leads to redundant compute - which is what we have now.

@benvanik

benoit: I think that's what ntv was showing today - his rules for how to decide the "schedule" (I'm using that term loosely - not sure what he's calling it) of ops and how they are ordered are the mechanism. I've only seen it flying by, but iree-org/iree#4890 is the code, and specifically the -linalg-tensor-codegen-strategy="anchor-func=matmul anchor-op=linalg.matmul tile-sizes=256,256,256" stuff is the kind of information being specified. Now, how that translates to non-command-line code I don't know, but that's where to start chatting with him :) I think the sandbox he is building there is intended as the playground for experimenting with the strategies, which can then be migrated into the upstream linalg code.

ahmed: that's precisely what I'm saying not to worry about :) By construction you should (practically and maybe actually) never end up with an scf.for that contains the packing and matmul together as in your 2nd IR snippet, as the packing operations should be root ops that are merged with the producers of their inputs and not with the consumer matmul. What you show in the last snippet is what you should end up with by way of linalg transformations prior to tiling/distribution in my examples. That is to say, once you hit tile/distribute you are too late to start making global changes to the program: global changes are for prior to tile/distribute. There's no knowledge you get after tile/distribute that you don't have prior to it (that you are performing a matmul, the shapes/sizes of the matmul, and the producers of its inputs and consumer of its outputs, etc etc). Tile/distribute is a lossy operation and that's a good thing for many reasons: it just means you need to be doing those kinds of transformations prior to crossing that (conceptual and physical) boundary.

@asaadaldien

Ben: Sorry, I was trying to be less confusing (and failed :D) by writing the tiled version with a serial loop, without distribution.

After distribution, every loop index here becomes a workgroup id:

workgroup.dispatch {
    %packed_lhs = (a sequence of subtensor + pad ops operating on lhs tile i) : tensor<?x4x128xf32>
    %lhs_view = subtensor %packed_lhs(%workgroup_id_y, %c0, %c0) [4, 128] .... : tensor<?x4x128xf32> to tensor<4x128xf32>
    %packed_rhs = (a sequence of subtensor + pad ops operating on rhs tile j) : tensor<?x128x4xf32>
    %rhs_view = subtensor %packed_rhs(%workgroup_id_x, %c0, %c0) [128, 4] .... : tensor<?x128x4xf32> to tensor<128x4xf32>
    %tile_result = linalg.matmul(%lhs_view, %rhs_view) : (tensor<4x128xf32>, tensor<128x4xf32>) -> tensor<4x4xf32>
    // .. insert %tile_result into dst_tile(i, j)
}

The second version with hoisted packing will look like:

workgroup.dispatch {
  // A sequence of subtensor + pad ops ...
} -> (tensor<?x128xf32>) -> tensor<?x4x128xf32>

workgroup.dispatch {
  // A sequence of subtensor + pad ops ...
} -> (tensor<128x?xf32>) -> tensor<?x128x4xf32>

workgroup.dispatch {
    %lhs_view = subtensor %packed_lhs(%workgroup_id_y, %c0, %c0) [4, 128] .... : tensor<?x4x128xf32> to tensor<4x128xf32>
    %rhs_view = subtensor %packed_rhs(%workgroup_id_x, %c0, %c0) [128, 4] .... : tensor<?x128x4xf32> to tensor<128x4xf32>
    %tile_result = linalg.matmul(%lhs_view, %rhs_view) : (tensor<4x128xf32>, tensor<128x4xf32>) -> tensor<4x4xf32>
    // .. insert %tile_result into dst_tile(i, j)
}

The second version is indeed something we can do now, but at the cost of the matmul waiting for both the lhs and rhs packing.

Is that the cost you were referring to when you said not to worry about it?
