
@bjacob
Last active February 24, 2021 03:36

Here is a matmul with two ops, producer_lhs and producer_rhs, fused into it. The producers have a cost: they could be just reading constant data (e.g. the weights of a conv op) or they could be more expensive math (e.g. the math-function activation of a preceding layer). Either way, they have non-negligible cost (even reading constant data has the cost of memory accesses).

for (int i = 0; i < M; i++) {
  for (int j = 0; j < N; j++) {
    for (int k = 0; k < K; k++) {
      result[i][j] += producer_lhs(i, k) * producer_rhs(k, j);
    }
  }
}

Claim: to perform this O(N^3) work on O(N^2) data efficiently, we need:

  • the outputs of producer_lhs and producer_rhs to be materialized as plain buffers as large as the source matrices.
  • the loop nest to be transformed into a traversal that is suitably local in both i and j.
    • structuring the loop nest for the nicest scanline traversal of one side (lhs or rhs) results in a worst-case traversal of the opposite side.
    • example: the above loop nest has the outermost loop over i, which is nicest for lhs - each row is accessed in only one iteration of the outer loop, so there is no need to materialize the entire producer_lhs output at once. But that causes the entire rhs to be fully re-traversed M times. (A tiled sketch of a traversal that is local in both i and j follows below.)
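
A minimal sketch of what such a traversal could look like (the tile sizes TI/TJ/TK, the materialized lhs_buf/rhs_buf buffers, and the assumption that they divide M/N/K evenly are all illustrative additions, not part of the original claim):

// lhs_buf / rhs_buf hold the materialized outputs of producer_lhs /
// producer_rhs; TI/TJ/TK are illustrative tile sizes dividing M/N/K.
for (int i0 = 0; i0 < M; i0 += TI) {
  for (int j0 = 0; j0 < N; j0 += TJ) {
    for (int k0 = 0; k0 < K; k0 += TK) {
      // Only a TIxTK block of lhs_buf and a TKxTJ block of rhs_buf are
      // touched here, so neither side is re-traversed in full on every
      // iteration of the outermost loop.
      for (int i = i0; i < i0 + TI; i++) {
        for (int j = j0; j < j0 + TJ; j++) {
          for (int k = k0; k < k0 + TK; k++) {
            result[i][j] += lhs_buf[i][k] * rhs_buf[k][j];
          }
        }
      }
    }
  }
}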

Conclusions:

  • while the packing op may not exist anymore as a discrete op during execution, the packed matrices will have to exist in memory at runtime (possibly as constant data) - the whole matrix, not just one block at a time.

agree?

@benvanik

benvanik commented Feb 23, 2021

There's an important distinction: it's not that the "packed matrices must exist in memory" but that the "inputs to a dispatch must exist in memory, and for matmuls those inputs should have a specific layout." There's no such thing as a packed matrix buffer here; instead there are just input buffers to this matrix multiply that happen to be packed. It's almost never helpful to think of this in isolation; instead, for any operation, think of what produces its inputs (and also what consumes its outputs). If this were a simple set of expressions in C like int t = a * b; int u = t + c; you wouldn't treat it any differently as a compiler than int u = (a * b) + c;, and you wouldn't make any assertion about (or even mention) the registers that must be allocated until much later in the process, as you also need to factor in what produces a/b/c and what consumes u. It's the same here at the level we are talking about in linalg: you must think of all loopy operations acting on tensors - some of which may be matrix multiplies - together as a whole. That's the superpower of linalg :)

The loop above is already a bit lower than the level of detail needed to reason about data layout efficiently with respect to packing, though. Instead if you thought of this as:

%lhs = linalg.foo %input0
%rhs = linalg.bar %input1
%ret = linalg.matrix_multiply %lhs, %rhs : ...

Then what you want to express is that the lhs and rhs must be packed:

%lhs = linalg.foo
%rhs = linalg.bar
%lhs_packed = linalg.pack %lhs
%rhs_packed = linalg.pack %rhs
%ret = linalg.matrix_multiply %lhs_packed, %rhs_packed : ...

(there may be a packing named op, or just a transformation that inserts layout changes, etc)

Normal compiler transformations will then be able to work on those, like CSE. Say you used the rhs twice:

%lhs0 = linalg.foo %input0
%lhs1 = linalg.foo %input1
%rhs = linalg.bar
%lhs0_packed = linalg.pack %lhs0
%rhs_packed = linalg.pack %rhs  // CSE'd
%ret0 = linalg.matrix_multiply %lhs0_packed, %rhs_packed : ...
%lhs1_packed = linalg.pack %lhs1
%ret1 = linalg.matrix_multiply %lhs1_packed, %rhs_packed : ...

But so too can linalg-specific transformations operate on this and sink the packing into the producer (or, more correctly, use the packing as the root into which linalg.foo and linalg.bar are pulled):

%lhs_packed = linalg.foo_and_pack %input0
%rhs_packed = linalg.bar_and_pack %input1
%ret = linalg.matrix_multiply %lhs_packed, %rhs_packed : ...

(it's unlikely that there is a foo_and_pack named op, but instead that there's a linalg.generic/etc with the packing applied as index mapping changes/additional operations in the loop body)

Now there are no intermediates that are ever unpacked, and thus later on there are no buffers required that were not already required to begin with to hold the lhs and rhs. There's no need to worry about rhs(k, j) being reused inside of the i nest, as you are just loading the input and always had to do that anyway, regardless of whether it is packed or not - it's just bytes to fetch like any others. Conceptually and literally this is like folding the ruy prepacking into the producer and then only having a code path in the matmul to use prepacked inputs.
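
To make the "just bytes to fetch" point concrete, here is a rough C sketch (my own illustration, not ruy's actual API nor IREE codegen output; the 4-wide panels and the divisibility of M/N by 4 are assumptions) of a matmul kernel that consumes inputs already stored in a packed layout. The kernel only loads and multiply-accumulates; it never needs an unpacked intermediate:

// Hypothetical packed layouts, for illustration only:
//   packed_lhs: per 4-row tile, K panels of 4 contiguous lhs values
//   packed_rhs: per 4-col tile, K panels of 4 contiguous rhs values
void matmul_prepacked(const float *packed_lhs, const float *packed_rhs,
                      float *result, int M, int N, int K) {
  for (int i0 = 0; i0 < M; i0 += 4) {
    const float *lhs_panels = packed_lhs + (i0 / 4) * K * 4;
    for (int j0 = 0; j0 < N; j0 += 4) {
      const float *rhs_panels = packed_rhs + (j0 / 4) * K * 4;
      float acc[4][4] = {{0}};
      for (int k = 0; k < K; k++) {
        // Contiguous loads: the producers already wrote their outputs in
        // this layout, so the kernel doesn't care how they got packed.
        for (int i = 0; i < 4; i++)
          for (int j = 0; j < 4; j++)
            acc[i][j] += lhs_panels[k * 4 + i] * rhs_panels[k * 4 + j];
      }
      for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
          result[(i0 + i) * N + (j0 + j)] = acc[i][j];
    }
  }
}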

All of this is to say that packing is just another operation to perform on data, linalg is exceedingly good at manipulating operations on data, and which operations are performed and in what order is the stuff ntv@ was mentioning and what actually matters. For example, the fact that here the packing should be treated as a root that feeds into a consumer matmul is a very important distinction from the case where the matmul is the root that the packs are folded into. That's the expert knowledge that then gets combined with searchable parameters (what is the form of the packing, tile sizes, etc).

In the case above both lhs and rhs are fully dynamic, but you can see how when treating them as just operations on data anything you've done in things like toco can now be performed too. For example:

%const = constant ...
%lhs_packed = linalg.foo_and_pack %const
%rhs_packed = linalg.bar_and_pack %input1
%ret = linalg.matrix_multiply %lhs_packed, %rhs_packed : ...

Results after constant folding:

%const_packed = constant (packing folded)...
%rhs_packed = linalg.bar_and_pack %input1
%ret = linalg.matrix_multiply %const_packed, %rhs_packed : ...

The whole-program transformation also works (this is effectively ruy prepacking):

%t0 = linalg.foo
flow.variable.store %t0, @variable
...
%t1 = flow.variable.load @variable
%lhs_packed = linalg.pack %t1
%rhs_packed = linalg.bar_and_pack %input1
%ret = linalg.matrix_multiply %lhs_packed, %rhs_packed : ...

->

%t0 = linalg.foo
%t0_packed = linalg.pack %t0
flow.variable.store %t0_packed, @variable
...
%t1 = flow.variable.load @variable
%rhs_packed = linalg.bar_and_pack %input1
%ret = linalg.matrix_multiply %t1, %rhs_packed : ...

->

%t0_packed = linalg.foo_and_pack
flow.variable.store %t0_packed, @variable
...
%t1 = flow.variable.load @variable
%rhs_packed = linalg.bar_and_pack %input1
%ret = linalg.matrix_multiply %t1, %rhs_packed : ...

And now again there are no intermediates used solely for packing - there are just inputs and outputs, and sometimes the layout of those inputs and outputs is in a certain form.

Now, if the input to be packed comes from outside the program, a dedicated packing dispatch may be required. I'm fine with that: if there is literally nothing else you are doing to your raw program-provided input besides passing it directly to a matmul, you probably don't need to (and should not) be using IREE at all, but ruy as a library. We of course still work and will do the packing, but won't lose sleep if there are some cache misses, as again IREE is not a BLAS library and inefficiencies around the external ABI are something for the application to fix (if they care). The transient allocations will get reused too when no longer live - and if there's no other work that could reuse them, things fall into the too-simple-for-IREE-to-care-about category.

So if it helps I'll say it on record: yes, there is a case where packing can absolutely not be fused with any other operation, and as such there is an additional transient allocation required to store that packed intermediate input to the matmul, but it's vanishingly unlikely to occur in anything but microbenchmarks and 🤷 :) Think of it like how exported C functions use the cdecl calling convention even if internal functions can use better ones that more aggressively use registers/etc -- if the bottleneck of the application is the infrequent cdecl call vs the entire compute complexity of the work being done, then the entire application architecture is at fault, not the C compiler emitting code in accordance with the calling convention.

All that said, the majority of cases can be handled by the above strategies and those are the interesting ones to generalize and tackle (and poke holes in/experiment with/etc). The statement here gives us the freedom to eliminate the case where producers of inputs are entirely unknowable and instead focus on the case where producers are mutable and can do whatever we want. The problem statement is to figure out what we want :)

Finally, worth mentioning is that in the dynamic cases where you want to vary the packing based on runtime parameters (cache line size, etc) you have several options. The simplest is just to take the packing parameter as an SSA value:

%lhs = linalg.foo
%rhs = linalg.bar
%cache_line_size_mumble = hal.device.query "cache_line_size_mumble" : index
%lhs_packed = linalg.pack %lhs, %cache_line_size_mumble
%rhs_packed = linalg.pack %rhs, %cache_line_size_mumble
%ret = linalg.matrix_multiply %lhs_packed, %rhs_packed : ...

->

%cache_line_size_mumble = hal.device.query "cache_line_size_mumble" : index
%lhs_packed = linalg.foo_and_pack %input0, %cache_line_size_mumble
%rhs_packed = linalg.bar_and_pack %input1, %cache_line_size_mumble
%ret = linalg.matrix_multiply %lhs_packed, %rhs_packed : ...

-> linalg ops with affine maps that have some slightly different math

If you specify the size at compile-time (for a specific arch) that hal.device.query folds to a constant and it's no different than if you had specified it directly, only now this code is tolerant of the runtime variants. No different than ruy: your BlockMap takes a bunch of parameters at runtime (rows, kernel_rows, etc) and does some math on them - the same is happening here. You could also specialize these cases if there was benefit - for example:

%cache_line_size_mumble = hal.device.query "cache_line_size_mumble" : index
%lhs_packed = switch...%cache_line_size_mumble... {
  case 32: return linalg.foo_and_pack %input0, 32
  case 64: return linalg.foo_and_pack %input0, 64
  default: return linalg.foo_and_pack %input0, %cache_line_size_mumble
}
%rhs_packed = switch...%cache_line_size_mumble... {
  case 32: return linalg.bar_and_pack %input1, 32
  case 64: return linalg.bar_and_pack %input1, 64
  default: return linalg.bar_and_pack %input1, %cache_line_size_mumble
}
%ret = linalg.matrix_multiply %lhs_packed, %rhs_packed : ...

(but none of that may be needed - just as you don't template your BlockMap)

Now, as Ahmed mentioned, for strict back-to-back execution of producer and consumer there's the possibility of keeping caches warmer by ensuring they run tilewise together, but in many cases that doesn't matter in the face of all the other extreme efficiencies now possible with this approach. There are also many other ways to control that behavior much more strictly. For example, in your loop you can easily ensure you get good locality on outputs by folding in subsequent operations:

for (int i = 0; i < M; i++) {
  for (int j = 0; j < N; j++) {
    float accum = initial_value;
    for (int k = 0; k < K; k++) {
      accum += producer_lhs(i, k) * producer_rhs(k, j);
    }
    result[i][j] = any_number_of_fused_elementwise_ops(accum);
  }
}

Better locality on the inputs can be done by rearranging your loops; if what you want to say is that all i's for a particular j,k should happen right after producer_rhs(k, j) is produced and in cache, then flip your loops:

for (int j = 0; j < N; j++) {
  for (int k = 0; k < K; k++) {
    rhs = producer_rhs(k, j);
    for (int i = 0; i < M; i++) {
      result[i][j] += producer_lhs(i, k) * rhs;
    }
  }
}

(don't concentrate on whether the specific example above is valid for matmul - I don't know - but there's a lot of good documentation spanning back 15 years on how this is done in the CUDA/GPGPU world and the point is to use linalg to do/discover such things :)

The point here is that yes you can do crazier things (cross-thread/dispatch fine-grained barriers for inter-tile communication) but there are also other ways and some of them may entirely obviate the need for that. First hit the wall and then find out what kind of door to put in it and how, and the wall is still a few km away at this point :)

@bjacob
Author

bjacob commented Feb 23, 2021

Thanks a lot for the explanation!

I understand (now - thanks) that packing just gets fused into the preceding op, so there's nothing special about it.

What's special is that it is consumed by a matmul.

If instead it were consumed by an elementwise op, we would never think of materializing its output in a buffer.

But because it's consumed by a matmul (and even if it isn't consumed by anything other than this single matmul), which makes repeated accesses, it becomes worth materializing its (entire) output in a (large) buffer.
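
To make that reuse asymmetry concrete, a small illustrative comparison (consumer_elementwise and out are hypothetical names, just to contrast the two access patterns):

// Elementwise consumer: each producer_lhs value is read exactly once,
// so there is nothing to gain from materializing its output.
for (int i = 0; i < M; i++)
  for (int k = 0; k < K; k++)
    out[i][k] = consumer_elementwise(producer_lhs(i, k));

// Matmul consumer: producer_lhs(i, k) is read once per j (N times) and
// producer_rhs(k, j) once per i (M times); recomputing them inline would
// multiply the producer's cost by N or M, whereas a materialized buffer
// amortizes it to once per element.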

How does the compiler know that? Is there some cost model? (Technically this could be seen from the actual accesses made by the matmul, but I suppose that information only becomes available at a late stage of lowering, and we need it earlier).

@asaadaldien

asaadaldien commented Feb 24, 2021

Thanks Benoit and Ben for starting this...

The problem I am trying to address here is the possibility of avoiding recomputing packing. Consider the following IR:

%0 = linalg.matmul(%lhs, %rhs) : (tensor<?x128xf32>, tensor<128x?xf32>) -> tensor<?x?xf32>

Now assume we have two (hypothetical, not implemented yet) ops: linalg.pack, which does the packing for operands, and linalg.packed_matmul, which operates on packed tensors. A matmul -> packed_matmul transformation would look like the following:

%0 = linalg.pack(%lhs, %c0) : tensor<?x128xf32> -> tensor<?x4x128xf32> 
%1 = linalg.pack(%rhs, %c1) : tensor<128x?xf32> -> tensor<?x128x4xf32>
%2 = linalg.packed_matmul(%0, %1) : (tensor<?x4x128xf32>, tensor<?x128x4xf32>) -> tensor<?x?xf32>

This can be naturally tiled on the outermost dim of both lhs and rhs, and it will look like the following:

// Now the tiled packed op loops will look like the following
scf.for %i = %c0 to %lhs_tiles_sizes step %c1 {    
    %packed_lhs = (a sequence of subtensor + pad ops operating on lhs tile %i) : tensor<?x4x128xf32>
    %lhs_view = subtensor %packed_lhs(%i, %c0, %c0) [4, 128] .... : tensor<?x4x128xf32> to tensor<4x128xf32>
    scf.for %j = %c0 to %rhs_tiles_size step %c1 {
        %packed_rhs = (a sequence of subtensor + pad ops operating on rhs tile %j) : tensor<?x128x4xf32>
        %rhs_view = subtensor %packed_rhs(%j, %c0, %c0) [128, 4] .... : tensor<128x4xf32>
        %tile_result = linalg.matmul(%lhs_view, %rhs_view) : (tensor<4x128xf32>, tensor<128x4xf32>) -> tensor<4x4xf32>
        // .. insert tile_result  in dst_tile(i, j)
    }
}

As you can see, distribution of these loops leads to recomputing packed_lhs & packed_rhs for each workgroup.
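
A rough C-level sketch of that redundancy (the pack_*_tile / matmul_tile helpers and the 4x128 tile shape are hypothetical, named only to count the repeated work):

// Hypothetical helpers: pack one 4x128 (lhs) or 128x4 (rhs) tile via
// subtensor + pad, and multiply one packed tile pair into dst tile (i, j).
void pack_lhs_tile(const float *lhs, int i, float *out);
void pack_rhs_tile(const float *rhs, int j, float *out);
void matmul_tile(const float *lhs_tile, const float *rhs_tile,
                 float *dst, int i, int j);

void tiled_matmul_fused_packing(const float *lhs, const float *rhs, float *dst,
                                int lhs_tiles, int rhs_tiles) {
  for (int i = 0; i < lhs_tiles; i++) {      // distributed across workgroups
    for (int j = 0; j < rhs_tiles; j++) {    // distributed across workgroups
      float lhs_tile[4 * 128];
      float rhs_tile[128 * 4];
      // Fused packing: lhs tile i is re-packed once per j and rhs tile j
      // once per i, so the packing work is repeated once per opposite-side
      // tile instead of being done once overall.
      pack_lhs_tile(lhs, i, lhs_tile);
      pack_rhs_tile(rhs, j, rhs_tile);
      matmul_tile(lhs_tile, rhs_tile, dst, i, j);
    }
  }
}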

One possibility to amortize the quadratic packing cost is by lifting packing loops:

%packed_lhs = scf.for %i = %c0 to %lhs_tiles_sizes step %c1 (%lhs) {
  // A sequence of subtensor + pad ops ...
} -> (tensor<?x128xf32>) -> tensor<?x4x128xf32>

%packed_rhs = scf.for %j = %c0 to %rhs_tiles_size step %c1 (%rhs) {
  // A sequence of subtensor + pad ops ...
} -> (tensor<128x?xf32>) -> tensor<?x128x4xf32>

scf.for %i = %c0 to %lhs_tiles_sizes step %c1 {    
    %lhs_view = subtensor %packed_lhs(%i, %c0, %c0) [4, 128] .... : tensor<?x4x128xf32> to tensor<4x128xf32>
    scf.for %j = %c0 to %rhs_tiles_size step %c1 {
        %rhs_view = subtensor %packed_rhs(%j, %c0, %c0) [128, 4] .... : tensor<128x4xf32>
        %tile_result = linalg.matmul(%lhs_view, %rhs_view) : (tensor<4x128xf32>, tensor<128x4xf32>) -> tensor<4x4xf32>
        // .. insert tile_result  in dst_tile(i, j)
    }
}

This can now be distributed into workgroups as 2 x 1-D grid dispatches for packing + a 2-D dispatch for computation, assuming dispatches can be executed according to a tile dependency where (i, j, t) depends on (i, 0:N-1, t-1). Or, if this isn't the preferred way of scheduling workloads, we have to have explicit synchronization. Otherwise we have to do the version where packing is fused with its consumer, which leads to redundant compute, which is what we have now.

@benvanik

benoit: I think that's what ntv was showing today - his rules for how to decide the "schedule" (I'm using that term loosely - not sure what he's calling it) of ops and how they are ordered are the mechanism. I've only seen it flying by, but iree-org/iree#4890 is the code, and specifically the -linalg-tensor-codegen-strategy="anchor-func=matmul anchor-op=linalg.matmul tile-sizes=256,256,256" stuff is the kind of information being specified. Now, how that translates to non-command-line code I don't know, but that's where to start chatting with him :) I think the sandbox he is building there is intended as the playground to start experimenting with the strategies that can then be migrated into the upstream linalg code.

ahmed: that's precisely what I'm saying not to worry about :) By construction you should (practically and maybe actually) never end up with an scf.for that contains the packing and matmul together as in your 2nd IR snippet, as the packing operations should be root ops that are merged with the producers of their inputs and not with the consumer matmul. What you show in the last snippet is what you should end up with by way of linalg transformations prior to tiling/distribution in my examples. That is to say, once you hit tile/distribute you are too late to start making global changes to the program: global changes are for prior to tile/distribute. There's no knowledge you get after tile/distribute that you don't have prior to it (that you are performing a matmul, the shapes/sizes of the matmul, the producers of its inputs and the consumer of its outputs, etc etc). Tile/distribute is a lossy operation and that's a good thing for many reasons: it just means you need to be doing that kind of transformation prior to crossing that (conceptual and physical) boundary.

@asaadaldien

Ben: Sorry, I was trying to be less confusing (and failed :D) by writing the tiled version with a serial loop, without distribution.

After distribution, every loop index here becomes a workgroup_id:

workgroup.dispatch {
    %packed_lhs = (a sequence of subtensor + pad ops operating on lhs tile %workgroup_id_y) : tensor<?x4x128xf32>
    %lhs_view = subtensor %packed_lhs(%workgroup_id_y, %c0, %c0) [4, 128] .... : tensor<?x4x128xf32> to tensor<4x128xf32>
    %packed_rhs = (a sequence of subtensor + pad ops operating on rhs tile %workgroup_id_x) : tensor<?x128x4xf32>
    %rhs_view = subtensor %packed_rhs(%workgroup_id_x, %c0, %c0) [128, 4] .... : tensor<128x4xf32>
    %tile_result = linalg.matmul(%lhs_view, %rhs_view) : (tensor<4x128xf32>, tensor<128x4xf32>) -> tensor<4x4xf32>
        // .. insert tile_result  in dst_tile(i, j)
 }

The second version with hoisted packing will look like:

workgroup.dispatch {    
  // A sequence of subtensor + pad ops ...
} -> (tensor<?x128xf32>) -> tensor<?x4x128xf32>

workgroup.dispatch {   
  // A sequence of subtensor + pad ops ...
} -> (tensor<128x?xf32>) -> tensor<?x128x4xf32>

workgroup.dispatch  {    
    %lhs_view = subtensor %packed_lhs(%workgroup_id_x, %c0, %c0) [4, 128] .... : tensor<?x4x128xf32> to tensor<4x128xf32>
    %rhs_view = subtensor %packed_rhs(%workgroup_id_y, %c0, %c0) [128, 4] .... : tensor<128x4xf32>
     %tile_result = linalg.matmul(%lhs_view, %rhs_view) : (tensor<4x128xf32>, tensor<128x4xf32>) -> tensor<4x4xf32>
        // .. insert tile_result  in dst_tile(i, j)
}

The second version is indeed something we can do now, but at the cost of the matmul waiting for the lhs and rhs packing.

Is that the cost you were saying not to worry about?
