@benvanik
Last active October 24, 2025 07:59
External transient storage design doc

External Transients Implementation Plan

Overview

Enable users to provide buffers for transient memory allocation in their functions, with generated query functions to calculate required sizes. This supports the kernel JIT use case where applications need control over transient allocations.

Motivation: users are building a kernel JIT on top of IREE: they provide IR of their linalg ops, we compile it into dispatches, and our host code schedules it with transient allocation. Users need to control transient memory ahead of time, so we provide size query functions and let them pass storage buffers to functions (making zero allocations in steady state).


Phase 0: Foundation - ABI & HAL Layer ✅ COMPLETED

Goal: Set up the high-level ABI support and HAL tensor operations before diving into Stream dialect analysis.

ABI Attribute Support

  • Add iree.abi.transients unit attribute definition
  • Update WrapEntryPointsPass to recognize iree.abi.transients on !hal.buffer arguments
  • Implement lowering logic in WrapEntryPointsPass to convert iree.abi.transients to hal.tensor.transients ops
  • Add validation that only one iree.abi.transients attribute exists per function
  • Write ABI-level tests for attribute parsing and validation

hal.tensor.transients Operation

  • Define hal.tensor.transients op in HAL dialect TableGen
    • Design Decision: Single-tensor version (not variadic) for cleaner integration with ShapeAwareOp
    • Takes storage buffer (!hal.buffer) and single tensor value
    • Returns same tensor value (preserves SSA use-def)
    • Example: %result = hal.tensor.transients %tensor : tensor<?xf32>{%dim} from %storage : !hal.buffer
    • Supports optional affinity: %result = hal.tensor.transients on(#hal.device.affinity<@dev>) %tensor : tensor<?xf32>{%dim} from %storage : !hal.buffer
  • Implement op verifier (storage must be !hal.buffer or !hal.buffer_view)
  • Add basic folders/canonicalizers:
    • Fold hal.tensor.transients + hal.tensor.transients → single hal.tensor.transients (outer storage wins)
  • Add pass-through behavior in Flow dialect transformations
  • Write LIT tests for:
    • Basic op construction and verification
    • Folding patterns
    • Integration with WrapEntryPointsPass
    • End-to-end: iree.abi.transients on function arg → hal.tensor.transients in lowered IR

Example IR (Phase 0 output):

// Input (using WrapEntryPointsPass):
util.func public @my_fn(%arg0: tensor<?xf32>, %arg1: index, %arg2: !hal.buffer {iree.abi.transients}) -> tensor<?xf32> {
  %t = ....(%arg0, %arg1) ...
  util.return %t : tensor<?xf32>
}

// After WrapEntryPointsPass lowering:
util.func public @my_fn(%arg0: !hal.buffer_view, %arg1: index, %arg2: !hal.buffer) -> !hal.buffer_view {
  %arg0_tensor = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<?xf32>{%arg1}
  ...
  // Note: Single transients op per result tensor (not variadic)
  %result_annotated = hal.tensor.transients %result : tensor<?xf32>{%result_dim} from %arg2 : !hal.buffer
  %result_view = hal.tensor.export %result_annotated : tensor<?xf32>{%result_dim} -> !hal.buffer_view
  util.return %result_view : !hal.buffer_view
}

Phase 1: Stream Layer Foundation ✅ COMPLETED

Goal: Establish the Stream dialect equivalent and conversion pipeline.

stream.resource.transients Operation

  • Define stream.resource.transients op in Stream dialect TableGen
    • Design Decision: Single-resource version (not variadic) matching HAL design
    • Timeline-Aware: Takes optional await timepoint, returns resource + result timepoint
    • Takes variadic storage operands (AnyType for flexibility), single resource with size
    • Returns same resource (tied operation preserving SSA use-def) + timepoint
    • Optional affinity attribute
    • Example: %result, %result_tp = stream.resource.transients await(%tp) => %source : !stream.resource<*>{%size} from %storage : !hal.buffer => !stream.timepoint
    • With affinity: %result, %result_tp = stream.resource.transients on(#hal.device.affinity<@dev>) await(%tp) => %source : !stream.resource<*>{%size} from %storage : !hal.buffer => !stream.timepoint
    • Variadic storage: %result, %result_tp = stream.resource.transients await(%tp) => %source : !stream.resource<*>{%size} from %storage1, %storage2 : !hal.buffer, !hal.buffer_view => !stream.timepoint
  • Implement op verifier (resource and result types must match)
  • Add canonicalization patterns:
    • Fold consecutive stream.resource.transients → single op (outer storage wins)
  • Timeline integration implemented (Stream_TimelineOp trait)

HAL → Stream Conversion

  • Implement hal.tensor.transients → stream.resource.transients conversion
    • Convert tensors to !stream.resource types using transferTensorOperands
    • Insert stream.timepoint.barrier before transients op (resource → resource+timepoint)
    • Insert stream.timepoint.await after transients op (resource+timepoint → resource)
    • Preserve storage operands through conversion
    • Handle affinity attributes (automatic transfer insertion when crossing devices)
    • Preserve SSA use-def chains through tied operation
  • Write LIT tests for:
    • Basic conversion patterns (static and dynamic tensors)
    • Affinity handling (cross-device transfers)
    • Storage operand preservation
    • Integration with existing Stream transformations
    • Timeline-aware barrier/await insertion

Example IR (Phase 1 output):

// After HAL→Stream conversion:
util.func public @my_fn(%arg0: !stream.resource<*>, %arg0_size: index, %storage: !hal.buffer)
    -> (!stream.resource<*>, index) {
  %transient, %alloca_timepoint = stream.resource.alloca ... !stream.resource<transient>{%transient_size}
  %exec_timepoint = stream.cmd.execute await(%alloca_timepoint) with(%transient) { ... }
  %dealloca_timepoint = stream.resource.dealloca await(%exec_timepoint) => %transient

  // Timeline-aware transient storage annotation
  // Insert barrier to materialize timepoint from result resource
  %result_with_tp, %result_tp = stream.timepoint.barrier %result : !stream.resource<*>{%result_size}
      => !stream.timepoint

  // Annotate with transient storage (timeline-aware - threads timepoint through)
  %result_annotated, %annotated_tp = stream.resource.transients await(%result_tp) =>
      %result_with_tp : !stream.resource<*>{%result_size}
      from %storage : !hal.buffer
      => !stream.timepoint

  // Await to resolve back to plain resource for return
  %final_result = stream.timepoint.await %annotated_tp => %result_annotated :
      !stream.resource<*>{%result_size}

  util.return %final_result, %result_size : !stream.resource<*>, index
}

Phase 2: Stubbed End-to-End Implementation ✅ COMPLETED

Goal: Get a simple working implementation for trivial cases to validate the IR design before building sophisticated analysis.

EmplaceTransientsPass ✅ COMPLETED

  • Scope Limitations (documented in pass):

    • Only handles public functions with stream.resource.transients ops
    • Assumes no function calls (single function only)
    • No complex timeline analysis needed
  • Core Transformation:

    • Find stream.resource.transients op and extract storage SSA value
    • Find all stream.resource.alloca ops in function (timeline traversal with Explorer)
    • SCF control flow support (scf.if, scf.for) with mutual exclusivity detection
    • Cross-region size hoisting with backward slicing
    • Create stream.resource.pack with non-overlapping liveness intervals:
      • Slot-based packing for mutually exclusive allocations
      • Conservative size hoisting (arith.maxui across branches)
      • Sequential ranges for non-exclusive allocations
      • Uses size SSA values from allocas (hoisted as needed)
      • Produces offsets + total size
    • Replace each stream.resource.alloca with stream.resource.subview:
      • Subview from pack result at computed offset
      • Timeline chain preservation (forward await timepoint)
    • Remove stream.resource.dealloca ops:
      • Forward await timepoint to users
    • Remove stream.resource.transients ops after emplacement
    • Take subview of user storage buffer for the pack
  • Write comprehensive LIT tests:

    • Single allocation case
    • Two allocations case
    • Zero allocations case (no-op)
    • Many allocations case
    • SCF control flow (11 tests in emplace_transients_scf.mlir)
    • Computed sizes across regions
    • Nested control flow
    • Error cases (function calls, private functions)

MaterializeTransientSizeQueriesPass ✅ COMPLETED

  • Walk Functions for Transient Packs:

    • Iterate over functions
    • Find stream.resource.pack ops with stream.experimental.transients attribute
  • Generate Size Query Function:

    • Create new function with all the same inputs as the original function
    • For each pack op found, clone the backward slice up to the input arguments into the new function
    • Add pack total size results in function order (if multiple packs found)
    • Return total size value(s)
  • Update Original Function:

    • Strip stream.experimental.transients attribute from pack op(s)
    • Add iree.reflection annotation of iree.abi.transients.size pointing to the size query function name
  • Write LIT tests:

    • Constant size query generation
    • Query function naming and annotations
    • Backward slicing for size computations
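The backward-slicing step above reduces to a transitive def walk. A minimal sketch, assuming a toy value→operands map in place of real MLIR use-def chains (the real pass clones ops; here we just collect what would need cloning):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// defs maps each value to the values it depends on; function arguments
// have no entry. Returns every producer that must be cloned into the
// query function to recompute `root` (a pack's total size) from the args.
static std::set<std::string>
backwardSlice(const std::map<std::string, std::vector<std::string>> &defs,
              const std::string &root) {
  std::set<std::string> slice;
  std::vector<std::string> worklist = {root};
  while (!worklist.empty()) {
    std::string value = worklist.back();
    worklist.pop_back();
    auto it = defs.find(value);
    if (it == defs.end()) continue;  // function argument: not cloned
    if (!slice.insert(value).second) continue;  // already visited
    for (const std::string &operand : it->second) worklist.push_back(operand);
  }
  return slice;
}
```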

AnnotateConstantTransientSizePass (Pulled Forward)

  • Iterate over transient size query functions
  • Check if function body folded to arith.constant returns
  • Add iree.reflection metadata with constant size values
  • Write tests for:
    • Constant size detection and annotation
    • Verification that annotation matches actual computation

End-to-End Validation

  • Create simple example program:
    util.func public @simple(%arg0: tensor<4xf32>, %storage: !hal.buffer {iree.abi.transients}) -> tensor<4xf32> {
      // Simple computation that creates 1-2 transients
    }
  • Verify full pipeline runs: ABI → HAL → Stream → Stubbed passes → Size query
  • Confirm IR constructs are correct
  • Document limitations of stubbed implementation

Key Insight: This phase validates the IR design works end-to-end for simple cases before investing in complex analysis. Can be merged as one PR to get early feedback.


Phase 3: Analysis Infrastructure

Goal: Build the sophisticated analysis needed to handle real-world cases.

Transient Storage Analysis

  • Design analysis data structures:
    • Map: stream.resource.alloca SSA value → transient storage SSA value
    • Map: stream.resource.alloca SSA value → deallocation ops
    • Tracking for which resources belong to which transient storage
  • Implement DFX-based solver:
    • Seed solver with stream.resource.transients await/result timepoints
    • Use Explorer to walk defining/using timeline ops in worklist
    • Track alloca/dealloca resource SSA values
    • Compute transient storage attribution for each allocation
  • Add utility functions for querying analysis results
  • Consider making this a reusable analysis (DFX attribute) for other passes
  • Reference: compiler/src/iree/compiler/Dialect/Stream/Transforms/ElideTimepoints.cpp

Timeline Liveness Analysis

  • Design liveness scope data structures:
    • Scope identification (where in IR to insert pack ops)
    • Alloca ops covered by each scope
    • Transient storage attribution
    • Live range information (start/end timepoints)
  • Implement timeline walking algorithm:
    • Build timeline ordering (global numbering with overlap handling)
    • Compute liveness ranges over async timeline (not IR ops)
    • Cluster allocations by transient storage
  • Use transient storage analysis for alloca/dealloca mapping
  • Produce per-region scope information
  • Make analysis results queryable and preservable across passes

VerifyTransientStoragePass

  • Implement verification pass that runs both analyses
  • Check that transient emplacement is possible:
    • All allocation sizes computable from inputs or immutable globals
    • No mutable global loads in size computation
    • No side-effecting external calls in size computation
    • No loads of stream resource values in size computation
  • Provide clear error messages when verification fails
  • Add flag for "unsafe mode" (trust user to provide correct buffer size)
  • Use markAllAnalysesPreserved() since analysis remains valid
  • Write tests for:
    • Valid programs (computable sizes)
    • Invalid programs (dynamic sizes without ranges)
    • Error message quality

AnnotateTransientStoragePass

  • Implement debugging/test pass (similar to AnnotateAffinitiesPass)
  • Add informational attributes to:
    • Function ops (which transients they use)
    • Scope regions (pack location, covered allocas)
    • Call sites (transient propagation)
    • stream.resource.alloca/dealloca ops (scope attribution)
    • stream.resource.transients ops (storage info)
  • Use markAllAnalysesPreserved()
  • Write comprehensive LIT tests using CHECK directives:
    • Simple single-scope examples
    • Multiple scopes in one function
    • Cross-function transient propagation
    • Complex timeline scenarios

Key Insight: Timeline-based liveness uses async timepoint use-def chains to track allocations through asynchronous execution, rather than analyzing IR op ordering. This is the core innovation that makes this work.


Phase 4: Production Implementation Passes

Goal: Replace stubbed passes with production-quality implementations that handle all cases.

Production EmplaceTransientsPass

  • Call Graph Analysis:

    • Identify all functions using transients (directly or transitively)
    • Verify call graph structure allows signature changes
    • Build propagation plan for !stream.resource<transient> arguments
  • Function Signature Updates:

    • Add !stream.resource<transient> + index size parameters to functions needing transients
    • Update all call sites to pass transient subviews
    • Handle public vs private function distinctions
  • Size Computation Hoisting:

    • Extract size calculations from stream.resource.pack operations
    • Hoist size math to callers (forward/backward slicing)
    • Fold and simplify hoisted arithmetic
    • Build nested stream.resource.pack in callers
      • For call tree A→B→C: A contains pack of B+C, B contains pack of C
      • This lets A compute subview for B, B compute subview for C
  • Per-Scope Transformations (using analysis):

    • Insert stream.resource.pack operations using liveness analysis:
      • Takes actual liveness ranges (may overlap) + size SSA values
      • Produces optimized offsets + total size
    • Replace stream.resource.alloca with stream.resource.subview:
      • Subview from pack result at computed offset
      • Replace timepoint result with await operand (or stream.timepoint.immediate)
    • Replace stream.resource.dealloca:
      • Remove deallocation op
      • Replace timepoint result with await operand
    • Update call sites:
      • Compute subview for callee from pack result
      • Pass subview + size to callee
  • Storage Buffer Handling:

    • Track user storage SSA value through IR (function arg, hal.buffer.allocate, etc.)
    • Generate subviews from user storage in functions with stream.resource.transients
    • Add size assertions (verify user buffer is large enough)
  • Integer Range Analysis Integration:

    • Use util.assume.int ops for dynamic size bounds
    • Allow unhoistable values if max value is known
    • Over-allocate based on maximum possible size
  • Use markAllAnalysesPreserved()

  • Write comprehensive LIT tests for:

    • Call graph propagation (A→B, A→B→C)
    • Nested pack generation
    • Overlapping liveness ranges (actual packing optimization)
    • Subview calculations through call graphs
    • Size assertion insertion
    • Dynamic sizes with integer range analysis

Production MaterializeTransientSizeQueriesPass

  • Find Public Functions with Transients:

    • Use analysis to identify public functions taking transient storage arguments
    • Extract required input arguments for size computation
  • Generate Size Query Functions:

    • Create new public function with iree.abi.reflection annotation
    • Function signature: takes required args (e.g., !hal.buffer_view for dynamic shapes)
    • Returns one index per transient storage used
    • Deterministic ordering (by argument position or other consistent rule)
  • Populate Size Computation:

    • Clone top-level stream.resource.pack nest into query function
    • Include nested packs from callees (A's query includes pack of B+C if A calls B calls C)
    • Extract total size from each pack
    • Return size values
  • Verification:

    • Ensure no device execution in query functions (pure arithmetic only)
    • Verify no resource allocations needed
    • Confirm all sizes are computable from inputs
  • Use markAllAnalysesPreserved()

  • Write LIT tests for:

    • Multiple transients in one function
    • Dynamic sizes based on tensor dimensions
    • Nested function call size aggregation
    • Complex call graphs

Key Insight: Hoisting strategy propagates size calculations up the call graph, enabling callers to compute nested transient requirements and properly subview storage for callees.


Phase 5: Integration & Testing

Replace Stubbed Passes

  • Update pass pipeline to use production EmplaceTransientsPass
  • Update pass pipeline to use production MaterializeTransientSizeQueriesPass
  • Remove or mark stubbed implementations as deprecated/testing-only

Pipeline Integration

  • Add all passes to appropriate pass pipelines in correct order:
    • VerifyTransientStoragePass (early, for user feedback)
    • EmplaceTransientsPass (after ScheduleAllocationPass)
    • MaterializeTransientSizeQueriesPass (after EmplaceTransientsPass)
    • LayoutSlicesPass (existing, turns stream.resource.pack into math)
    • CSE/canonicalization (existing, folds size calculations)
    • AnnotateConstantTransientSizePass (final, adds reflection metadata)
  • Determine flag/configuration for enabling external transients feature
  • Update pass manager documentation

End-to-End Testing

  • Revisit simple example from Phase 2, verify it still works with production passes
  • Complex example: call graph with multiple transients
  • Dynamic shapes with util.assume.int bounds
  • Constant size fast-path
  • Error cases (uncomputable sizes, verification failures)
  • Comparison: stubbed vs production pass output quality (verify packing optimization)

Documentation

  • User guide for iree.abi.transients attribute usage
  • Size query function API documentation
  • Limitations and requirements:
    • Computable sizes (from inputs or immutable globals)
    • Timeline structure must be analyzable
    • Call graph structure requirements
  • Kernel JIT use case example
  • Migration guide from stubbed to production implementation

Implementation Strategy

PR Structure:

  1. PR 1: Phase 0-1 (ABI/HAL/Stream foundation) - validates IR design
  2. PR 2: Phase 2 (stubbed implementation) - validates end-to-end flow
  3. PR 3: Phase 3 (analysis infrastructure) - comprehensive analysis with test pass
  4. PR 4: Phase 4-5 (production passes + integration) - completes feature

Development Workflow:

  • Each phase delivers independent value
  • Can merge incrementally for early feedback
  • Analysis is testable in isolation before integration
  • Stubbed implementation de-risks the IR design

Key Files to Reference:

  • Analysis example: compiler/src/iree/compiler/Dialect/Stream/Transforms/ElideTimepoints.cpp
  • Pass patterns: compiler/src/iree/compiler/Dialect/Stream/Transforms/ScheduleAllocation.cpp
  • Annotation pass: compiler/src/iree/compiler/Dialect/Stream/Transforms/AnnotateAffinities.cpp

Technical Notes

Timeline Liveness Analysis Details

Instead of building a liveness range over IR ops, we build it over the asynchronous timeline represented by timepoints:

  1. Globally number all timepoints
  2. Define an order with overlap
  3. Use use-def chain of timepoints to track liveness

Example:

%transient, %alloca_timepoint = stream.resource.alloca ... !stream.resource<transient>{%size}
%exec_timepoint = stream.cmd.execute await(%alloca_timepoint) with(%transient) { ... }
%dealloca_timepoint = stream.resource.dealloca await(%exec_timepoint) => %transient
// Timeline-aware transient storage annotation threads timepoint through
%result_annotated, %annotated_tp = stream.resource.transients await(%dealloca_timepoint) =>
    %result : !stream.resource<*>{%result_size}
    from %storage : !hal.buffer
    => !stream.timepoint

Note: stream.resource.transients is timeline-aware, taking an optional await timepoint and returning a result timepoint. This allows it to integrate with the async timeline for proper synchronization. The timeline liveness analysis will track transient allocations through their timepoint use-def chains to discover which allocations should be emplaced into user-provided storage.
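The numbering-and-ranges idea can be sketched without MLIR: number timeline events in order, then derive each allocation's live range from its alloca/dealloca events rather than from IR op order. Types and names here are illustrative, not the analysis's actual data structures:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A timeline event: an alloca, a dealloca, or any other timeline op.
// Resources are identified by name for illustration.
enum class Kind { Alloca, Dealloca, Other };
struct Event {
  Kind kind;
  std::string resource;
};

// Globally number events in timeline order and record [alloca#, dealloca#]
// per resource. Resources whose ranges do not overlap may share storage.
static std::map<std::string, std::pair<int, int>>
computeLiveRanges(const std::vector<Event> &timeline) {
  std::map<std::string, std::pair<int, int>> ranges;
  for (int i = 0; i < (int)timeline.size(); ++i) {
    const Event &e = timeline[i];
    if (e.kind == Kind::Alloca) ranges[e.resource] = {i, i};
    else if (e.kind == Kind::Dealloca) ranges[e.resource].second = i;
  }
  return ranges;
}

static bool overlaps(std::pair<int, int> a, std::pair<int, int> b) {
  return a.first <= b.second && b.first <= a.second;
}
```

In the real analysis the "numbering" must also handle overlap (concurrent timelines), but the interval test is the same: only overlapping ranges force disjoint pack offsets.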

Size Computation Hoisting

For call graph A→B→C where all need transients:

  • A gets: pack_A (includes nested pack_B which includes nested pack_C)
  • B gets: pack_B (includes nested pack_C)
  • C gets: pack_C (leaf)

This allows:

  • A to compute total size and create subview for B
  • B to compute its total size (from A's subview) and create subview for C
  • C to use its allocated subview directly
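The aggregation this nesting performs is a recursion over the call tree. A hypothetical sketch (`CallNode` and `ownBytes` are stand-ins for each function's local pack, not real pass data structures):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One node in the call tree: a function's own transient bytes plus its
// callees' requirements.
struct CallNode {
  uint64_t ownBytes;
  std::vector<CallNode> callees;
};

// A caller's pack must cover its own transients plus a nested pack for
// each callee; the caller then hands each callee a subview of its storage.
// This mirrors "A contains pack of B+C, B contains pack of C" above.
static uint64_t totalTransientBytes(const CallNode &node) {
  uint64_t total = node.ownBytes;
  for (const CallNode &callee : node.callees)
    total += totalTransientBytes(callee);
  return total;
}
```

(The real nested packs also account for alignment and liveness overlap; a plain sum is the conservative upper bound.)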

Verification Requirements

Program is valid for transient emplacement if:

  • All allocation sizes are SSA values derived from:
    • Function arguments (e.g., tensor dimensions)
    • Immutable global values
    • Pure arithmetic operations
  • No unhoistable operations (unless util.assume.int provides bounds)
  • Timeline structure allows liveness analysis
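These rules amount to a reachability check over each size value's def chain. As a sketch, with simplified string "kinds" standing in for the real SSA classification:

```cpp
#include <assert.h>

#include <map>
#include <string>
#include <vector>

// Simplified def record for a size SSA value: what produced it and its
// operands. Allowed roots: function arguments and immutable globals;
// allowed interior nodes: pure arithmetic. Everything else rejects.
struct Def {
  std::string kind;  // "arg", "immutable_global", "arith", "mutable_global_load", ...
  std::vector<std::string> operands;
};

static bool isSizeComputable(const std::map<std::string, Def> &defs,
                             const std::string &value) {
  const Def &def = defs.at(value);
  if (def.kind == "arg" || def.kind == "immutable_global") return true;
  if (def.kind != "arith") return false;  // mutable loads, calls, etc.
  for (const std::string &operand : def.operands)
    if (!isSizeComputable(defs, operand)) return false;
  return true;
}
```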

Current Status

Started: 2025-10-22
Completed: Phase 0 (ABI & HAL Layer) ✅, Phase 1 (Stream Layer Foundation) ✅, Phase 2 (Stubbed Implementation) ✅
Current Phase: Phase 3 - Analysis Infrastructure (TODO)
Next Steps:

  1. Implement AnnotateConstantTransientSizePass (Phase 2 remaining item)
  2. Design and implement Transient Storage Analysis (Phase 3)
  3. Design and implement Timeline Liveness Analysis (Phase 3)

Implementation Notes

Phase 0-1 Completed (2025-10-22):

  • Implemented single-tensor design (not variadic) for cleaner ShapeAwareOp integration
  • Added synchronization documentation (storage assumed immediately usable)
  • Created comprehensive LIT tests covering static/dynamic shapes, affinity, and conversions
  • All tests passing for HAL and Stream dialects

Timeline-Aware Fix (2025-10-22): After initial implementation, discovered that stream.resource.transients needed to be timeline-aware to properly integrate with Stream's async execution model:

  • Changed from Stream_PureOp to Stream_Op base class
  • Added Stream_TimelineOp trait and AttrSizedOperandSegments trait
  • Added optional await_timepoint input operand and result_timepoint output
  • Updated assembly format: await(%tp) => %resource : type{size} from storage => timepoint_type
  • Updated HAL→Stream conversion to insert stream.timepoint.barrier before and stream.timepoint.await after
  • Fixed TiedOpInterface to handle multiple results (only resource result is tied)
  • Updated all test expectations for timeline-aware format

Key Files Implemented:

  • compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.td - hal.tensor.transients definition
  • compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.cpp - HAL op implementation and folding
  • compiler/src/iree/compiler/Dialect/HAL/Transforms/WrapEntryPoints.cpp - ABI attribute handling
  • compiler/src/iree/compiler/Dialect/Stream/IR/StreamOps.td - stream.resource.transients definition (timeline-aware)
  • compiler/src/iree/compiler/Dialect/Stream/IR/StreamOps.cpp - Stream op implementation and canonicalization (timeline-aware)
  • compiler/src/iree/compiler/Dialect/Stream/Conversion/HALToStream/Patterns.cpp - HAL→Stream conversion (with barrier/await)
  • Test files: HAL/IR/test/{tensor_ops,tensor_folding,wrap_entry_points}.mlir, Stream/IR/test/{resource_ops,resource_folding}.mlir, Stream/Conversion/HALToStream/test/abi_ops.mlir

Phase 2 Preparation (2025-10-22):

  • Created pass definitions in Passes.td for all three Phase 2 passes
  • Created skeleton C++ implementations with TODO comments
  • Created comprehensive test files with positive and negative cases
  • Updated BUILD.bazel and regenerated CMakeLists.txt
  • All infrastructure compiles successfully

Phase 2 Implementation Progress (2025-10-23)

EmplaceTransientsPass - Timeline Traversal (COMPLETED)

Implementation Components:

  • ✅ Comprehensive LLVM_DEBUG logging with [emplace-transients] prefix and Explorer asmState
  • ✅ Timeline traversal using Explorer::walkDefiningOps for robust SSA use-def walking
  • ✅ Worklist algorithm to walk timeline backwards from stream.resource.transients ops
  • ✅ Full SCF control flow support (scf.if, scf.for) via Explorer's region branch handling
  • ✅ SetVector-based deduplication of alloca/dealloca ops (critical for loops)
  • ✅ Clean refactoring with TransientResource struct and gatherTransientResources function
  • ✅ Validation: public functions only, no function calls, single unique storage buffer
  • Fixed Explorer bug in compiler/src/iree/compiler/Dialect/Util/Analysis/Explorer.cpp:859: Added a bounds check for region ops with mismatched result/yield counts (fixes a crash on stream.cmd.execute)

Data Structure:

// Information about transient allocations associated with a storage buffer.
struct TransientResource {
  Value storage;                                    // User-provided storage buffer SSA value
  SetVector<ResourceAllocaOp> allocaOps;           // All discovered transient allocations (deduplicated)
  SetVector<ResourceDeallocaOp> deallocaOps;       // All discovered transient deallocations (deduplicated)
};

// Main gathering function - populates transientResources by walking timeline.
static LogicalResult gatherTransientResources(
    FunctionOpInterface funcOp,
    Explorer &explorer,
    SmallVector<TransientResource> &transientResources);

Algorithm:

  1. Find all stream.resource.transients ops, validate single unique storage buffer per function
  2. Seed worklist with await_timepoint operands from all transients ops
  3. Use Explorer::walkDefiningOps to walk backwards through SSA use-def graph:
    • For each timepoint, Explorer handles OpResults, BlockArguments, and RegionBranchOps
    • If defining op is stream.resource.alloca: add to allocaOps SetVector
    • If defining op is stream.resource.dealloca: add to deallocaOps SetVector
    • If defining op implements TimelineOpInterface: add its getAwaitTimepoints() to worklist
  4. Continue until worklist exhausted → complete set of alloca/dealloca ops discovered
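Stripped of MLIR types, the worklist above has the usual shape. A sketch with a toy timepoint→producer map in place of Explorer::walkDefiningOps (all names illustrative):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Minimal timeline graph: each timepoint maps to the op producing it, and
// each op lists the timepoints it awaited.
struct TimelineOp {
  std::string kind;  // "alloca", "dealloca", "execute", ...
  std::vector<std::string> awaitTimepoints;
};

// Walk backwards from the seed timepoints, collecting alloca/dealloca ops.
// The visited set deduplicates (critical for loops), playing the role of
// the SetVectors in the real implementation.
static std::set<std::string> gatherOps(
    const std::map<std::string, std::string> &producerOf,  // timepoint -> op
    const std::map<std::string, TimelineOp> &ops,
    std::vector<std::string> worklist) {
  std::set<std::string> found, visited;
  while (!worklist.empty()) {
    std::string tp = worklist.back();
    worklist.pop_back();
    if (!visited.insert(tp).second) continue;
    auto it = producerOf.find(tp);
    if (it == producerOf.end()) continue;  // e.g. block arg / immediate
    const TimelineOp &op = ops.at(it->second);
    if (op.kind == "alloca" || op.kind == "dealloca") found.insert(it->second);
    for (const std::string &await : op.awaitTimepoints)
      worklist.push_back(await);
  }
  return found;
}
```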

Test Coverage:

  • Simple linear timeline (single alloca/dealloca)
  • Multiple allocations (two sequential, many allocations)
  • SCF control flow (scf.if with allocation in one/both branches, scf.for with loop-carried allocations)
  • Negative tests: function calls, private functions, host synchronization

Files Modified:

  • compiler/src/iree/compiler/Dialect/Stream/Transforms/EmplaceTransients.cpp - Timeline traversal implementation
  • compiler/src/iree/compiler/Dialect/Util/Analysis/Explorer.cpp - Bug fix for region op bounds checking
  • compiler/src/iree/compiler/Dialect/Stream/Transforms/test/emplace_transients.mlir - Comprehensive test suite

Next Implementation Steps (TODO in EmplaceTransients.cpp):

  • Step 3: Hoist allocation sizes to function entry
    • Extract size values from each stream.resource.alloca op
    • Handle dynamic sizes (may need hoisting/cloning of computation)
    • Handle constant sizes (already at function scope)
  • Step 4: Create stream.resource.pack with all transient allocations
    • Add stream.experimental.transients UnitAttr for MaterializeTransientSizeQueriesPass
    • Compute offsets for non-overlapping layout (stubbed version: sequential packing)
  • Step 5: Replace each stream.resource.alloca with stream.resource.subview from pack result
    • Wire timepoints correctly (use await from transients or immediate)
  • Step 6: Remove all stream.resource.dealloca ops (memory externally managed)
  • Step 7: Subview user storage buffer for the packed range

MaterializeTransientSizeQueriesPass (STUBBED)

  • File: compiler/src/iree/compiler/Dialect/Stream/Transforms/MaterializeTransientSizeQueries.cpp
  • Status: Skeleton implementation with TODO comments

AnnotateConstantTransientSizePass (STUBBED)

  • File: compiler/src/iree/compiler/Dialect/Stream/Transforms/AnnotateConstantTransientSize.cpp
  • Status: Skeleton implementation with TODO comments

Recent Progress (2025-10-23)

API Simplification & Proper Dialect Layering ✅

Changed stream.resource.transients to single storage operand:

  • Removed variadic storage, added single storage: !stream.resource<transient> with explicit storage_size: index
  • Removed AttrSizedOperandSegments trait (no longer needed)
  • Assembly format now: from $storage : type($storage) {$storage_size}

Fixed HAL→Stream conversion to use proper imports:

  • Removed incorrect use of unrealized_conversion_cast
  • Now properly imports HAL buffers using stream.tensor.import with Lifetime::Transient
  • Handles both !hal.buffer and !hal.buffer_view inputs (extracts buffer with hal.buffer_view.buffer)
  • Uses hal.buffer.length to get storage size
  • Creates proper stream.resource<transient> for storage operand

Updated test coverage:

  • All HAL→Stream conversion tests now verify proper stream.tensor.import usage
  • Added test for !hal.buffer_view storage input (in addition to !hal.buffer)

Timeline Chain Preservation Fix ✅

Problem: EmplaceTransients was breaking timeline causality by replacing alloca timepoints with immediates.

Solution implemented:

  • replaceAllocasWithSubviews: Now forwards each alloca's await_timepoint to replace its result_timepoint
  • Preserves timeline: if alloca awaited %tp, downstream ops now await %tp directly
  • Deallocas: Already forward their await_timepoint (unchanged)
  • Transients ops: Removed after emplacement, forwarding result_timepoint → await_timepoint and result → resource

Files modified:

  • EmplaceTransients.cpp:413-459 - Updated replaceAllocasWithSubviews to preserve timeline
  • EmplaceTransients.cpp:688-717 - Added transients op removal with timepoint forwarding
  • TransientResource struct now stores transientsOps vector

Mutually Exclusive Branch Allocations ✅

Solution implemented (2025-10-23):

  • Mutual exclusivity detection: Uses RegionBranchOpInterface to identify when allocations are in sibling regions (e.g., then/else branches of scf.if)
  • Conservative size hoisting: Backward slicing + cloning to common dominator + arith.maxui of all possible sizes
  • Slot-based packing: Mutually exclusive allocations share the same pack "slot" with the hoisted max size, enabling memory reuse across exclusive paths
  • Cross-region size computation: Recursive handling of sizes defined in sibling regions
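The slot assignment can be illustrated in isolation: allocations in mutually exclusive regions collapse into one slot sized by an arith.maxui-style max. A plain C++ sketch, assuming exclusivity groups are given as input (the real pass derives them via RegionBranchOpInterface):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Each group holds mutually exclusive allocations (e.g. the then/else
// branches of an scf.if): they can never be live simultaneously, so they
// share one slot sized by the maximum over the group, matching the
// conservative arith.maxui hoisting. Slots are then laid out sequentially
// (alignment omitted for brevity).
static uint64_t
packSlots(const std::vector<std::vector<uint64_t>> &exclusiveGroups) {
  uint64_t total = 0;
  for (const auto &group : exclusiveGroups) {
    uint64_t slotSize = 0;
    for (uint64_t size : group) slotSize = std::max(slotSize, size);
    total += slotSize;
  }
  return total;
}
```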

Test coverage:

  • emplace_transients.mlir: 8 basic tests
  • emplace_transients_scf.mlir: 11 SCF tests (mutually exclusive branches, loops, nested control flow, computed sizes)
  • ✅ Test cleanup (2025-10-23): Updated storage subview CHECK patterns to include full type information (source size, result size from pack)

Code Quality Improvements (2025-10-24) ✅

Simplified isPureOp() implementations:

  • Removed manual recursive walking logic from both EmplaceTransients.cpp and MaterializeTransientSizeQueries.cpp
  • Now uses mlir::isMemoryEffectFree() which automatically handles HasRecursiveMemoryEffects trait (used by SCF ops like scf.if, scf.for)
  • Reduced implementation from ~35 lines to ~13 lines per function
  • Removed "DO NOT SUBMIT" comment from MaterializeTransientSizeQueries.cpp
  • Removed unused mlir/Dialect/Arith/IR/Arith.h include

Key insight: MLIR's isMemoryEffectFree() already recursively checks nested operations for ops with the HasRecursiveMemoryEffects trait, making manual walks unnecessary.

Files modified:

  • compiler/src/iree/compiler/Dialect/Stream/Transforms/EmplaceTransients.cpp:255-266
  • compiler/src/iree/compiler/Dialect/Stream/Transforms/MaterializeTransientSizeQueries.cpp:41-53

Users are requesting that we let them provide a buffer to their functions and have all transient memory in the function be allocated from that. We can then generate a function that calculates how much transient memory is required so the application can query it before calling.

The motivating use for all of this is building a kernel JIT on top of IREE: they provide IR of their linalg ops, we compile it into one or more dispatches, and then we have our host code that schedules it and does the transient allocation. They need to be able to control their transient memory ahead of time (required by their applications) so this lets us give them the size queries to know how much transient memory a particular function requires (the maximum, if bounded) and then when they call the function they control where that transient memory comes from (we make no transient allocations). This way in the steady state if they alias their function results and have our passes run they'll have zero allocations.

We currently use attributes like iree.abi.output in the native WrapEntryPointsPass that look for !hal.buffer/!hal.buffer_view and insert new HAL ops like hal.tensor.import to communicate the HAL semantics to lower levels of the stack. My thought here is to add a new hal.tensor.transients op that takes the buffer/buffer_view and a list of tensor SSA values. This says "any transient memory used in the production of these tensors should be suballocated from this buffer". Just as iree.abi.output is a helper that gets lowered into hal.tensor.alias while a frontend like torch can use hal.tensor.alias directly at a lower level, so too can users here: iree.abi.transients on a buffer/buffer_view arg gets lowered into hal.tensor.transients, and if the user wants more complex behavior they can emit the op themselves. The iree.abi.transients attr would be a UnitAttr on the arg: only one is allowed (when using iree.abi.*) and it lowers into hal.tensor.transients covering all arguments and all results. A power user could use hal.tensor.transients to separate some values from others (like in a NUMA system where they are sharding). We'd have some basic canonicalizers/folders that let us simplify the IR (hal.tensor.transients + hal.tensor.transients -> hal.tensor.transients) or propagate it (if we wanted). The requirement for the user here is that all allocation sizes have to be computable from the inputs or immutable globals - if not, we error out unless they opt in to unsafe behavior with a flag where we just trust the user to have passed a correctly sized buffer.

The hal.tensor.transients op would take the transient storage and one or more tensor values and return the tensor values (this keeps SSA use-def valid). e.g. %results:2 = hal.tensor.transients storage(%storage : !hal.buffer) %tensor0, %tensor1 : tensor<?xf32>, tensor<?xi32>.

So on input using the IREE WrapEntryPointsPass:

util.func public @my_fn(%arg0: tensor<?xf32>, %arg1: index, %arg2: !hal.buffer {iree.abi.transients}) -> tensor<?xf32> {
  %t = ....(%arg0, %arg1) ...
  util.return %t : tensor<?xf32>
}

Would lower to IR like this, which users could also annotate themselves if they wanted fine-grained control:

util.func public @my_fn(%arg0: !hal.buffer_view, %arg1: index, %arg2: !hal.buffer) -> (!hal.buffer_view) {
  %arg0_tensor = hal.tensor.import %arg0 ... : tensor<?xf32>
  ...
  %result_annotated = hal.tensor.transients storage(%arg2 : !hal.buffer) %result : tensor<?xf32>
  %result_view = hal.tensor.export %result_annotated : tensor<?xf32>
  util.return %result_view : !hal.buffer_view
}

The hal.tensor.transients op would pass through flow unchanged, and then like hal.tensor.import would be turned into a stream.resource.transients op when going flow->stream. This op, like all other stream ops, would take !stream.resource values instead of tensors. It'd be mostly untouched until after ScheduleAllocationPass, where we have a bunch of stream.resource.alloca / stream.resource.dealloca ops.

So we expect a function like this at some point:

util.func public @my_fn(...) -> !hal.buffer_view {
  %result, %result_timepoint = stream.resource.alloca ... <external>{%result_size}
  %transient, %alloca_timepoint = stream.resource.alloca ... <transient>{%transient_size}
  %exec_timepoint = stream.cmd.execute with(%transient) { ... }
  %dealloca_timepoint = stream.resource.dealloca %transient
  %result_annotated, %transient_timepoint = stream.resource.transients on(#hal...) storage(%arg2 : !hal.buffer) await(%exec_timepoint) %result : !stream.resource<external>{%result_size}
  %result_export = stream.tensor.export %result_annotated : tensor<?xf32> in !stream.resource<external>{%result_size} -> !hal.buffer_view
  util.return %result_export, %transient_timepoint : !hal.buffer_view, ...
}

The new stream.resource.transients op would mirror the hal.tensor.transients op but take the resources instead. When lowering from the hal.tensor.transients to stream.resource.transients we need the timepoints for each resource and can insert a stream.timepoint.barrier op for each + stream.timepoint.join to get it. This lets us be able to track from a particular resource through the timeline to where it was created, effectively using the use-def chain of the timepoints as the way of tracking liveness (the full slice of the use-def chain has all the alloca/dealloca on it).
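The barrier + join insertion during lowering might look like this sketch (syntax approximate; %r0/%r1 hypothetical resources annotated by one hal.tensor.transients op):

```mlir
// Each annotated resource gets a barrier to materialize the timepoint at
// which it is available; the join becomes the await operand of the new
// stream.resource.transients op, so liveness is trackable via timepoint
// use-def chains.
%r0_sync, %r0_tp = stream.timepoint.barrier %r0 : !stream.resource<external>{%r0_size} => !stream.timepoint
%r1_sync, %r1_tp = stream.timepoint.barrier %r1 : !stream.resource<external>{%r1_size} => !stream.timepoint
%join_tp = stream.timepoint.join max(%r0_tp, %r1_tp) => !stream.timepoint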

This lets us see that here that the %result transient comes from %dealloca_timepoint which is a dealloca of %transient, that comes from %exec_timepoint, which comes from %alloca_timepoint which is an alloca of %transient:

util.func public @my_fn(...) {
  %result, %result_timepoint = stream.resource.alloca ... <external>{%result_size}
  %transient, %alloca_timepoint = stream.resource.alloca ... <transient>{%transient_size}
  %exec_timepoint = stream.cmd.execute await(%alloca_timepoint) with(%transient) { ... }
  %dealloca_timepoint = stream.resource.dealloca await(%exec_timepoint) %transient
  %result_annotated, %transient_timepoint = stream.resource.transients storage(%transients) await(%dealloca_timepoint) %result
}

Instead of building a liveness range over the IR ops in the function, we can build a liveness range over the asynchronous timeline represented by all the timepoints (globally number all timepoints, define an order with overlap). This of course doesn't work in all programs but those are the ones we must verify before we attempt to do this.

With that, we can cluster each allocated value by which transients they are bound to (if any). That gives us a map of user-provided transient -> every allocation in the program that uses it. From there, we can use the call graph to find all functions transitively called by all functions with transients as arguments, which lets us know who needs to have the new !stream.resource<transient> argument added to pass the user-provided buffer around.

We can also use this to check whether it's even possible: if the user did not ask for a size function then we don't need to do anything, but if they did we would error out when the size can't be calculated without runtime computation (a mutable global load, side-effecting external call, or load of a stream resource value anywhere on the chain) - otherwise, we know the exact use-def chain of every size value in the program. We want a full forward/backward slice in each function (starting at the leaves) so that we can hoist that math up into callers. By doing so, we give the callers a chance to fold in that math and then continue hoisting. By the end we'll have a big set of arithmetic (or a single folded constant) of everything that feeds in. To build the user-requested size functions we'd take every transient required by a public function and make a function that has whatever arguments are required (if any, like a !hal.buffer_view to get dynamic shape dimensions) and returns the size of each transient. Since the math is all pure we should always be able to hoist, as we already verified that unhoistable stuff was not present. In the end we have a function the user can call to get the size of the transient buffer they need for their given inputs.

We can use the call graph to verify that any function transitively called from a function that marks transients is only reachable from functions that either need a transient buffer themselves or assign one - this way we can safely change the function signatures to take a !stream.resource (+ size index) for their transients. We'll want to update all calls to pass in one of their subviews, which the callee then subviews again for its own internal transients.

With the liveness range over the timeline we can now put together stream.resource.pack operations that take the liveness ranges + size SSA values and produce a list of offsets + the total size of the transient memory required. We'd replace each stream.resource.alloca with a stream.resource.subview and replace the timepoint result with the alloca's await operand (or stream.timepoint.immediate) - same for the dealloca (no subview, just replace the timepoint). In a function where the stream.resource.transients op exists we know the SSA value of the user storage (%arg0, a hal.buffer.allocate result, etc) and can take a subview of that. In functions called from functions using transients we need to add new !stream.resource<transient> arguments throughout the call graph so that we can always get the value to place within.

The big thing to solve here is that we need to know the size early so we can build subviews for callees. We want to keep the stream.resource.pack ops where they are locally so we get all of the offsets, but we need to repeatedly hoist the total size calculation all the way up to the root through all call edges to know our total required size (so we can subview from that). So in the end, if there's a call graph of A->B->C, A will end up with the stream.resource.pack of itself (A) as well as those of B and C. This lets A nest the stream.resource.pack of B in itself (which contains the nest of C) to get the subview to pass to B; B nests the stream.resource.pack of C in itself to get the subview for C; and C just has its own. Any function with no callees needing transients is untouched. If there's anything we can't hoist (pure ops on values always provided by callers, or computed from caller-provided values/immutable globals, are fine) we should have failed to get to this point. This super-hoisting operation is going to be tricky but we do it in some other places and it'd be good to have a robust utility for it. Since the user already allocated the buffer we can just add an assert that it's the right size against the final computed value of A's total size (or whichever subset of resources ended up in that transient storage).
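The nesting in an A->B->C call graph might look like this sketch in A (syntax approximate; %a_size, %b_total_size, and @b are hypothetical):

```mlir
// In A: pack A's own slice plus the hoisted total required by B (which
// already includes everything C needs).
%total_a, %off_a, %off_b = stream.resource.pack slices({
  [0, 1] = %a_size,       // A's own transient
  [1, 2] = %b_total_size  // B's (and transitively C's) hoisted total
}) : index
// Pass B a subview of A's storage; B subviews again for C internally.
%b_storage = stream.resource.subview %storage[%off_b] : !stream.resource<transient>{%total_a} -> !stream.resource<transient>{%b_total_size}
util.call @b(%args, %b_storage, %b_total_size) : (..., !stream.resource<transient>, index) -> ...
```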

We can probably handle some dynamic cases if we use integer range analysis (such as that provided by our util.assume.int ops). When hoisting something that relies on an unhoistable value, if we have an analysis giving it a max value we could use that instead. This lets users have dynamic values produced by side-effecting functions or device readback by telling us the expected range, and we just overallocate.
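For example (syntax approximate; @query_dynamic_dim is a hypothetical side-effecting call): the user asserts a bound on a runtime-produced value, and the size hoisting can substitute the umax to conservatively overallocate.

```mlir
// The dim itself is unhoistable (side-effecting call), but the assumed
// range is: size hoisting can use umax = 4096 in place of %dim.
%dim = util.call @query_dynamic_dim() : () -> index
%bounded = util.assume.int %dim<umin = 1, umax = 4096> : index
```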

The final user feature is an optional new public function we create with that compacted transient size calculation. This new function would take whatever were required for the original function (buffer views to query their dimensions, etc) and produce one size value for every transient used by the function. We'd just stamp out the transient size stream.resource.pack nests as we did before and since they aren't reliant on any of the actual tensor data there should never be any device execution or allocation performed - just arith ops.
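A generated size query function might look like this sketch (syntax approximate; the function name and element size are hypothetical):

```mlir
// Takes the same shape-bearing inputs as @my_fn and returns the required
// transient size; pure arith only, no device execution or allocation.
util.func public @my_fn__transient_size(%arg0: !hal.buffer_view) -> index {
  %c4 = arith.constant 4 : index  // f32 element size
  %dim = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index
  %size0 = arith.muli %dim, %c4 : index
  %total, %off0 = stream.resource.pack slices({
    [0, 1] = %size0
  }) : index
  util.return %total : index
}
```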

We'll want to try to split up this analysis and phases to try to make this all debuggable. I'm not quite sure how because the analysis will be so expensive and we don't want to muck up the IR too much and create trash between passes. The most expensive analysis is the initial SSA resource <-> SSA transient storage map and the timeline liveness analysis. We could use timeline liveness analysis in other passes too for things so it may be worth splitting out a DFX attribute for it that we'd be able to reuse (then we can also test that complex part independently in a test pass).

Example analysis (it'd be nice to share if possible):

  • /home/ben/src/iree/compiler/src/iree/compiler/Dialect/Stream/Transforms/ElideTimepoints.cpp

Basic steps:

  • Transient storage analysis:
    • For each stream.resource.transient op in the program:
      • Seed solver with the await and result timepoints
    • Use the Explorer to walk all defining/using timeline ops in a worklist:
      • Seed each stream.resource.alloca/stream.resource.dealloca op resource
    • Run the solver to compute:
      • For each alloca SSA result resource value:
        • Which ops deallocate it
        • Which transient storage SSA value it came from (anywhere in the program)
      • Whatever else we need for liveness analysis besides alloca/dealloca
  • Timeline liveness analysis:
    • Using the transient storage analysis for alloca/dealloca information
    • Walks timepoints to compute for each scope it can
    • Produces a set of scopes per region for specific transients, for each:
      • Where it should be inserted in the IR
      • What alloca ops it covers
      • Which transient they come from (maybe calculated - to make the analysis useful to other passes who don't care about transients)
  • VerifyTransientStoragePass
    • Run analysis and verify the program is possible to emplace
    • This is our user-visible verification pass
    • Use markAnalysisPreserved() since they are still valid for live range info
  • AnnotateTransientStoragePass
    • Optional debugging pass like AnnotateAffinitiesPass
    • Uses analysis to add informational attributes to each relevant op
      • Functions/scopes
      • Call sites
      • Alloca/dealloca ops
      • stream.resource.transient ops
    • This is our primary test pass (as we can get the exact results of the analysis for CHECK tests)
    • Use markAnalysisPreserved() since they are still valid for live range info
  • EmplaceTransientsPass (name open to suggestions if 'emplace' is the wrong verb)
    • Uses CallGraph + the scope analysis to identify all functions that use a transient that is non-local to them
    • Adds !stream.resource<transient>, index args to each function needing one (passing the size of the resource)
    • For each scope:
      • Inserts stream.resource.pack op using the liveness ranges/sizes from analysis
      • Replaces all stream.resource.alloca ops with stream.resource.subview from the stream.resource.pack
      • Replaces alloca/dealloca timepoint results with their await operands (handling immediate if needed)
      • For each call adds the subview of the required transient and its subview size calculated by the stream.resource.pack
    • Use markAnalysisPreserved() since they are still valid for live range info
  • MaterializeTransientSizeQueriesPass
    • Uses analysis to find the public functions that take transients as arguments
    • Creates new functions with special iree.abi.reflection annotation taking required args for the top-level transient ops
      • Each function gets the stream.resource.pack inserted for its top-level transients op
      • Returns the size of each transient
      • Order defined by some deterministic program order (order of the storage buffer argument in the function argument list, or something)
    • Use markAnalysisPreserved() since they are still valid for live range info

After all this the LayoutSlicesPass will run and turn the stream.resource.pack ops into a bunch of math. We'll run CSE/canonicalization to try to fold everything. In any case without dynamic sizes all the functions will fold to returning a single arith.constant - if they are dynamic the IR will at least be simpler.
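For a fully static program the query function should fold all the way down to something like this (constant value hypothetical):

```mlir
// After LayoutSlicesPass + CSE/canonicalization with static shapes:
util.func public @my_fn__transient_size() -> index {
  %c8448 = arith.constant 8448 : index
  util.return %c8448 : index
}
```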

There's then one final pass after all that:

  • AnnotateConstantTransientSizePass
    • For every new transient size query function:
      • If it folded to a constant:
        • Add iree.abi.reflection value for the constant value

Then if a user wants to fast path they can just check the reflection data at runtime and if it's constant not bother calling into the VM.

I want a detailed writeup of this plan with progress checkboxes on each phase so we can check them off as we go. I want to frontload everything besides the analysis/passes in stream so we'll first setup the iree.abi.transients support, add the hal.tensor.transients op (+ folders/etc), add tests for both, then add the stream.resource.transients op and hal->stream.resource.transients conversion, add tests for both, and then get into the meaty phase.

The primary phase will start with the analysis and AnnotateTransientStoragePass/VerifyTransientStoragePass - once we are sure we can calculate all the information we need we can build out the passes. We'll want to at least design the main passes in pseudo code to make sure our analysis provides the right information.

Final phase will be implementing the EmplaceTransientsPass and MaterializeTransientSizeQueriesPass. AnnotateConstantTransientSizePass will be an easy pass to add after that.
