Enable users to provide buffers for transient memory allocation in their functions, with generated query functions to calculate required sizes. This supports the kernel JIT use case where applications need control over transient allocations.
Motivation: We are building a kernel JIT on top of IREE: users provide IR for their linalg ops, we compile it into dispatches, and our host code schedules execution with transient allocations. Users need to control transient memory ahead of time, so we provide size query functions and let them pass storage buffers into functions (achieving zero allocations in steady state).
Goal: Set up the high-level ABI support and HAL tensor operations before diving into Stream dialect analysis.
- Add `iree.abi.transients` unit attribute definition
- Update `WrapEntryPointsPass` to recognize `iree.abi.transients` on `!hal.buffer` arguments
- Implement lowering logic in `WrapEntryPointsPass` to convert `iree.abi.transients` to `hal.tensor.transients` ops
- Add validation that only one `iree.abi.transients` attribute exists per function
- Write ABI-level tests for attribute parsing and validation
- Define `hal.tensor.transients` op in HAL dialect TableGen
  - Design Decision: Single-tensor version (not variadic) for cleaner integration with ShapeAwareOp
  - Takes storage buffer (`!hal.buffer`) and single tensor value
  - Returns same tensor value (preserves SSA use-def)
  - Example: `%result = hal.tensor.transients %tensor : tensor<?xf32>{%dim} from %storage : !hal.buffer`
  - Supports optional affinity: `%result = hal.tensor.transients on(#hal.device.affinity<@dev>) %tensor : tensor<?xf32>{%dim} from %storage : !hal.buffer`
- Implement op verifier (storage must be `!hal.buffer` or `!hal.buffer_view`)
- Add basic folders/canonicalizers:
  - Fold `hal.tensor.transients` + `hal.tensor.transients` → single `hal.tensor.transients` (outer storage wins; see the sketch after this list)
- Add pass-through behavior in Flow dialect transformations
- Write LIT tests for:
  - Basic op construction and verification
  - Folding patterns
  - Integration with `WrapEntryPointsPass`
  - End-to-end: `iree.abi.transients` on function arg → `hal.tensor.transients` in lowered IR
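A minimal sketch of the intended folding behavior (using the op syntax proposed above; values are illustrative):

```mlir
// Before folding: the same tensor value is annotated twice with different storage.
%a = hal.tensor.transients %tensor : tensor<?xf32>{%dim} from %storage0 : !hal.buffer
%b = hal.tensor.transients %a : tensor<?xf32>{%dim} from %storage1 : !hal.buffer

// After folding: a single annotation remains and the outer storage (%storage1) wins.
%b = hal.tensor.transients %tensor : tensor<?xf32>{%dim} from %storage1 : !hal.buffer
```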
Example IR (Phase 0 output):
```mlir
// Input (using WrapEntryPointsPass):
util.func public @my_fn(%arg0: tensor<?xf32>, %arg1: index, %arg2: !hal.buffer {iree.abi.transients}) -> tensor<?xf32> {
  %t = ....(%arg0, %arg1) ...
  util.return %t : tensor<?xf32>
}

// After WrapEntryPointsPass lowering:
util.func public @my_fn(%arg0: !hal.buffer_view, %arg1: index, %arg2: !hal.buffer) -> !hal.buffer_view {
  %arg0_tensor = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<?xf32>{%arg1}
  ...
  // Note: Single transients op per result tensor (not variadic)
  %result_annotated = hal.tensor.transients %result : tensor<?xf32>{%result_dim} from %arg2 : !hal.buffer
  %result_view = hal.tensor.export %result_annotated : tensor<?xf32>{%result_dim} -> !hal.buffer_view
  util.return %result_view : !hal.buffer_view
}
```

Goal: Establish the Stream dialect equivalent and conversion pipeline.
- Define `stream.resource.transients` op in Stream dialect TableGen
  - Design Decision: Single-resource version (not variadic) matching HAL design
  - Timeline-Aware: Takes optional await timepoint, returns resource + result timepoint
  - Takes variadic storage operands (`AnyType` for flexibility), single resource with size
  - Returns same resource (tied operation preserving SSA use-def) + timepoint
  - Optional affinity attribute
  - Example: `%result, %result_tp = stream.resource.transients await(%tp) => %source : !stream.resource<*>{%size} from %storage : !hal.buffer => !stream.timepoint`
  - With affinity: `%result, %result_tp = stream.resource.transients on(#hal.device.affinity<@dev>) await(%tp) => %source : !stream.resource<*>{%size} from %storage : !hal.buffer => !stream.timepoint`
  - Variadic storage: `%result, %result_tp = stream.resource.transients await(%tp) => %source : !stream.resource<*>{%size} from %storage1, %storage2 : !hal.buffer, !hal.buffer_view => !stream.timepoint`
- Implement op verifier (resource and result types must match)
- Add canonicalization patterns:
  - Fold consecutive `stream.resource.transients` → single op (outer storage wins)
- Timeline integration implemented (Stream_TimelineOp trait)
- Implement `hal.tensor.transients` → `stream.resource.transients` conversion
  - Convert tensors to `!stream.resource` types using `transferTensorOperands`
  - Insert `stream.timepoint.barrier` before transients op (resource → resource+timepoint)
  - Insert `stream.timepoint.await` after transients op (resource+timepoint → resource)
  - Preserve storage operands through conversion
  - Handle affinity attributes (automatic transfer insertion when crossing devices)
  - Preserve SSA use-def chains through tied operation
- Write LIT tests for:
  - Basic conversion patterns (static and dynamic tensors)
  - Affinity handling (cross-device transfers)
  - Storage operand preservation
  - Integration with existing Stream transformations
  - Timeline-aware barrier/await insertion
Example IR (Phase 1 output):
```mlir
// After HAL→Stream conversion:
util.func public @my_fn(%arg0: !stream.resource<*>, %arg0_size: index, %storage: !hal.buffer)
    -> (!stream.resource<*>, index) {
  %transient, %alloca_timepoint = stream.resource.alloca ... !stream.resource<transient>{%transient_size}
  %exec_timepoint = stream.cmd.execute await(%alloca_timepoint) with(%transient) { ... }
  %dealloca_timepoint = stream.resource.dealloca await(%exec_timepoint) => %transient

  // Timeline-aware transient storage annotation.
  // Insert barrier to materialize timepoint from result resource.
  %result_with_tp, %result_tp = stream.timepoint.barrier %result : !stream.resource<*>{%result_size}
      => !stream.timepoint
  // Annotate with transient storage (timeline-aware - threads timepoint through).
  %result_annotated, %annotated_tp = stream.resource.transients await(%result_tp) =>
      %result_with_tp : !stream.resource<*>{%result_size}
      from %storage : !hal.buffer
      => !stream.timepoint
  // Await to resolve back to plain resource for return.
  %final_result = stream.timepoint.await %annotated_tp => %result_annotated :
      !stream.resource<*>{%result_size}
  util.return %final_result, %result_size : !stream.resource<*>, index
}
```

Goal: Get a simple working implementation for trivial cases to validate the IR design before building sophisticated analysis.
- Scope Limitations (documented in pass):
  - Only handles public functions with `stream.resource.transients` ops
  - Assumes no function calls (single function only)
  - No complex timeline analysis needed
- Core Transformation (see the sketch after this list):
  - Find `stream.resource.transients` op and extract storage SSA value
  - Find all `stream.resource.alloca` ops in function (timeline traversal with Explorer)
  - SCF control flow support (scf.if, scf.for) with mutual exclusivity detection
  - Cross-region size hoisting with backward slicing
  - Create `stream.resource.pack` with non-overlapping liveness intervals:
    - Slot-based packing for mutually exclusive allocations
    - Conservative size hoisting (`arith.maxui` across branches)
    - Sequential ranges for non-exclusive allocations
    - Uses size SSA values from allocas (hoisted as needed)
    - Produces offsets + total size
  - Replace each `stream.resource.alloca` with `stream.resource.subview`:
    - Subview from pack result at computed offset
    - Timeline chain preservation (forward await timepoint)
  - Remove `stream.resource.dealloca` ops:
    - Forward await timepoint to users
  - Remove `stream.resource.transients` ops after emplacement
  - Take subview of user storage buffer for the pack
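A rough before/after sketch of the stubbed transformation (op assembly abbreviated and approximate; sizes, offsets, and lifetime intervals are illustrative):

```mlir
// Before: compiler-managed transient allocation on the device timeline.
%t0, %t0_tp = stream.resource.alloca uninitialized : !stream.resource<transient>{%size0} => !stream.timepoint
%exec_tp = stream.cmd.execute await(%t0_tp) with(%t0) { ... }
%dealloca_tp = stream.resource.dealloca await(%exec_tp) => %t0 : !stream.resource<transient>{%size0} => !stream.timepoint

// After (stubbed EmplaceTransientsPass): the allocation lives inside the user-provided storage.
%c0 = arith.constant 0 : index
// Lifetime intervals are chosen so simultaneously live slices receive distinct offsets.
%total, %off0 = stream.resource.pack slices({[0, 1] = %size0}) : index
%packed = stream.resource.subview %storage[%c0] : !stream.resource<transient>{%storage_size} -> !stream.resource<transient>{%total}
%t0 = stream.resource.subview %packed[%off0] : !stream.resource<transient>{%total} -> !stream.resource<transient>{%size0}
// Allocas/deallocas are removed; their await timepoints are forwarded to their users.
%exec_tp = stream.cmd.execute with(%t0) { ... }
```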
- Write comprehensive LIT tests:
  - Single allocation case
  - Two allocations case
  - Zero allocations case (no-op)
  - Many allocations case
  - SCF control flow (11 tests in emplace_transients_scf.mlir)
  - Computed sizes across regions
  - Nested control flow
  - Error cases (function calls, private functions)
- Walk Functions for Transient Packs:
  - Iterate over functions
  - Find `stream.resource.pack` ops with `stream.experimental.transients` attribute
- Generate Size Query Function (see the sketch after this list):
  - Create new function with all the same inputs as the original function
  - For each pack op found, clone the backward slice up to the input arguments into the new function
  - Add pack total size results in function order (if multiple packs found)
  - Return total size value(s)
- Update Original Function:
  - Strip `stream.experimental.transients` attribute from pack op(s)
  - Add `iree.reflection` annotation of `iree.abi.transients.size` pointing to the size query function name
- Write LIT tests:
  - Constant size query generation
  - Query function naming and annotations
  - Backward slicing for size computations
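A sketch of the intended output; the query function name and the exact reflection payload are illustrative placeholders, not a settled convention:

```mlir
// Original function: the pack loses its stream.experimental.transients marker and the
// reflection annotation points at the generated query function.
util.func public @my_fn(%arg0: !stream.resource<*>, %arg0_size: index) -> !stream.resource<*>
    attributes {iree.reflection = {iree.abi.transients.size = "my_fn__transient_size"}} {
  ...
}

// Generated query function: same inputs, body is the cloned backward slice of the
// pack's size computation (pure arithmetic), returning the pack total.
util.func public @my_fn__transient_size(%arg0: !stream.resource<*>, %arg0_size: index) -> index {
  %c4 = arith.constant 4 : index
  %size0 = arith.muli %arg0_size, %c4 : index
  %total, %off0 = stream.resource.pack slices({[0, 1] = %size0}) : index
  util.return %total : index
}
```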
- Iterate over transient size query functions
- Check if function body folded to `arith.constant` returns
- Add `iree.reflection` metadata with constant size values (see the sketch below)
- Write tests for:
  - Constant size detection and annotation
  - Verification that annotation matches actual computation
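A sketch of the constant fast-path; the reflection key and its placement on the query function are illustrative assumptions:

```mlir
// Query body canonicalized to a constant return, surfaced as reflection metadata
// so callers can skip invoking the query entirely.
util.func public @my_fn__transient_size() -> index
    attributes {iree.reflection = {iree.abi.transients.size.constant = 4096 : index}} {
  %c4096 = arith.constant 4096 : index
  util.return %c4096 : index
}
```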
- Create simple example program:

  ```mlir
  util.func public @simple(%arg0: tensor<4xf32>, %storage: !hal.buffer {iree.abi.transients}) -> tensor<4xf32> {
    // Simple computation that creates 1-2 transients
  }
  ```
- Verify full pipeline runs: ABI → HAL → Stream → Stubbed passes → Size query
- Confirm IR constructs are correct
- Document limitations of stubbed implementation
Key Insight: This phase validates the IR design works end-to-end for simple cases before investing in complex analysis. Can be merged as one PR to get early feedback.
Goal: Build the sophisticated analysis needed to handle real-world cases.
- Design analysis data structures:
  - Map: `stream.resource.alloca` SSA value → transient storage SSA value
  - Map: `stream.resource.alloca` SSA value → deallocation ops
  - Tracking for which resources belong to which transient storage
- Implement DFX-based solver:
  - Seed solver with `stream.resource.transients` await/result timepoints
  - Use Explorer to walk defining/using timeline ops in worklist
  - Track alloca/dealloca resource SSA values
  - Compute transient storage attribution for each allocation
- Add utility functions for querying analysis results
- Consider making this a reusable analysis (DFX attribute) for other passes
- Reference: `compiler/src/iree/compiler/Dialect/Stream/Transforms/ElideTimepoints.cpp`
- Design liveness scope data structures:
- Scope identification (where in IR to insert pack ops)
- Alloca ops covered by each scope
- Transient storage attribution
- Live range information (start/end timepoints)
- Implement timeline walking algorithm:
- Build timeline ordering (global numbering with overlap handling)
- Compute liveness ranges over async timeline (not IR ops)
- Cluster allocations by transient storage
- Use transient storage analysis for alloca/dealloca mapping
- Produce per-region scope information
- Make analysis results queryable and preservable across passes
- Implement verification pass that runs both analyses
- Check that transient emplacement is possible:
- All allocation sizes computable from inputs or immutable globals
- No mutable global loads in size computation
- No side-effecting external calls in size computation
- No loads of stream resource values in size computation
- Provide clear error messages when verification fails
- Add flag for "unsafe mode" (trust user to provide correct buffer size)
- Use `markAllAnalysesPreserved()` since analysis remains valid
- Write tests for:
- Valid programs (computable sizes)
- Invalid programs (dynamic sizes without ranges)
- Error message quality
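An illustrative pair of size computations that the verification described above would accept and reject (global and value names are hypothetical):

```mlir
// Accepted: size derived from a function argument and an immutable global via pure arithmetic.
util.global private @tile_size = 64 : index
%tile = util.global.load @tile_size : index
%ok_size = arith.muli %arg_dim, %tile : index

// Rejected: size depends on a mutable global, so it cannot be computed ahead of time.
util.global private mutable @dynamic_count : index
%count = util.global.load @dynamic_count : index
%c4 = arith.constant 4 : index
%bad_size = arith.muli %count, %c4 : index
```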
- Implement debugging/test pass (similar to `AnnotateAffinitiesPass`)
- Add informational attributes to:
  - Function ops (which transients they use)
  - Scope regions (pack location, covered allocas)
  - Call sites (transient propagation)
  - `stream.resource.alloca`/`dealloca` ops (scope attribution)
  - `stream.resource.transients` ops (storage info)
- Use `markAllAnalysesPreserved()`
- Simple single-scope examples
- Multiple scopes in one function
- Cross-function transient propagation
- Complex timeline scenarios
Key Insight: Timeline-based liveness uses async timepoint use-def chains to track allocations through asynchronous execution, rather than analyzing IR op ordering. This is the core innovation that makes this work.
Goal: Replace stubbed passes with production-quality implementations that handle all cases.
- Call Graph Analysis:
  - Identify all functions using transients (directly or transitively)
  - Verify call graph structure allows signature changes
  - Build propagation plan for `!stream.resource<transient>` arguments
- Function Signature Updates:
  - Add `!stream.resource<transient>` + `index` size parameters to functions needing transients
  - Update all call sites to pass transient subviews
  - Handle public vs private function distinctions
- Size Computation Hoisting:
  - Extract size calculations from `stream.resource.pack` operations
  - Hoist size math to callers (forward/backward slicing)
  - Fold and simplify hoisted arithmetic
  - Build nested `stream.resource.pack` in callers:
    - For call tree A→B→C: A contains pack of B+C, B contains pack of C
    - This lets A compute subview for B, B compute subview for C
- Per-Scope Transformations (using analysis):
  - Insert `stream.resource.pack` operations using liveness analysis:
    - Takes actual liveness ranges (may overlap) + size SSA values
    - Produces optimized offsets + total size
  - Replace `stream.resource.alloca` with `stream.resource.subview`:
    - Subview from pack result at computed offset
    - Replace timepoint result with await operand (or `stream.timepoint.immediate`)
  - Replace `stream.resource.dealloca`:
    - Remove deallocation op
    - Replace timepoint result with await operand
  - Update call sites:
    - Compute subview for callee from pack result
    - Pass subview + size to callee
- Storage Buffer Handling:
  - Track user storage SSA value through IR (function arg, `hal.buffer.allocate`, etc.)
  - Generate subviews from user storage in functions with `stream.resource.transients`
  - Add size assertions (verify user buffer is large enough)
- Integer Range Analysis Integration:
  - Use `util.assume.int` ops for dynamic size bounds
  - Allow unhoistable values if max value is known
  - Over-allocate based on maximum possible size
- Use `markAllAnalysesPreserved()`
- Write comprehensive LIT tests for:
  - Call graph propagation (A→B, A→B→C)
  - Nested pack generation
  - Overlapping liveness ranges (actual packing optimization)
  - Subview calculations through call graphs
  - Size assertion insertion
  - Dynamic sizes with integer range analysis
- Find Public Functions with Transients:
  - Use analysis to identify public functions taking transient storage arguments
  - Extract required input arguments for size computation
- Generate Size Query Functions (see the sketch after this list):
  - Create new public function with `iree.abi.reflection` annotation
  - Function signature: takes required args (e.g., `!hal.buffer_view` for dynamic shapes)
  - Returns one `index` per transient storage used
  - Deterministic ordering (by argument position or other consistent rule)
- Populate Size Computation:
  - Clone top-level `stream.resource.pack` nest into query function
  - Include nested packs from callees (A's query includes pack of B+C if A calls B calls C)
  - Extract total size from each pack
  - Return size values
- Verification:
  - Ensure no device execution in query functions (pure arithmetic only)
  - Verify no resource allocations needed
  - Confirm all sizes are computable from inputs
- Use `markAllAnalysesPreserved()`
- Write LIT tests for:
  - Multiple transients in one function
  - Dynamic sizes based on tensor dimensions
  - Nested function call size aggregation
  - Complex call graphs
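A sketch of a generated query for a dynamically shaped input (function name and slicing details are illustrative; the pack is later lowered to plain arithmetic by LayoutSlicesPass):

```mlir
// Host-only computation: reads the runtime dimension from the buffer view and
// returns one index per transient storage used (a single storage here).
util.func public @my_fn__transient_size(%arg0: !hal.buffer_view) -> index {
  %dim = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index
  %c4 = arith.constant 4 : index        // element byte width
  %size = arith.muli %dim, %c4 : index  // per-allocation size in bytes
  %total, %off = stream.resource.pack slices({[0, 1] = %size}) : index
  util.return %total : index
}
```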
Key Insight: Hoisting strategy propagates size calculations up the call graph, enabling callers to compute nested transient requirements and properly subview storage for callees.
- Update pass pipeline to use production EmplaceTransientsPass
- Update pass pipeline to use production MaterializeTransientSizeQueriesPass
- Remove or mark stubbed implementations as deprecated/testing-only
- Add all passes to appropriate pass pipelines in correct order:
- VerifyTransientStoragePass (early, for user feedback)
- EmplaceTransientsPass (after ScheduleAllocationPass)
- MaterializeTransientSizeQueriesPass (after EmplaceTransientsPass)
- LayoutSlicesPass (existing, turns `stream.resource.pack` into math)
- CSE/canonicalization (existing, folds size calculations)
- AnnotateConstantTransientSizePass (final, adds reflection metadata)
- Determine flag/configuration for enabling external transients feature
- Update pass manager documentation
- Revisit simple example from Phase 2, verify it still works with production passes
- Complex example: call graph with multiple transients
- Dynamic shapes with `util.assume.int` bounds
- Constant size fast-path
- Error cases (uncomputable sizes, verification failures)
- Comparison: stubbed vs production pass output quality (verify packing optimization)
- User guide for `iree.abi.transients` attribute usage
- Size query function API documentation
- Limitations and requirements:
- Computable sizes (from inputs or immutable globals)
- Timeline structure must be analyzable
- Call graph structure requirements
- Kernel JIT use case example
- Migration guide from stubbed to production implementation
PR Structure:
- PR 1: Phase 0-1 (ABI/HAL/Stream foundation) - validates IR design
- PR 2: Phase 2 (stubbed implementation) - validates end-to-end flow
- PR 3: Phase 3 (analysis infrastructure) - comprehensive analysis with test pass
- PR 4: Phase 4-5 (production passes + integration) - completes feature
Development Workflow:
- Each phase delivers independent value
- Can merge incrementally for early feedback
- Analysis is testable in isolation before integration
- Stubbed implementation de-risks the IR design
Key Files to Reference:
- Analysis example: `compiler/src/iree/compiler/Dialect/Stream/Transforms/ElideTimepoints.cpp`
- Pass patterns: `compiler/src/iree/compiler/Dialect/Stream/Transforms/ScheduleAllocation.cpp`
- Annotation pass: `compiler/src/iree/compiler/Dialect/Stream/Transforms/AnnotateAffinities.cpp`
Instead of building a liveness range over IR ops, we build it over the asynchronous timeline represented by timepoints:
- Globally number all timepoints
- Define an order with overlap
- Use use-def chain of timepoints to track liveness
Example:
```mlir
%transient, %alloca_timepoint = stream.resource.alloca ... !stream.resource<transient>{%size}
%exec_timepoint = stream.cmd.execute await(%alloca_timepoint) with(%transient) { ... }
%dealloca_timepoint = stream.resource.dealloca await(%exec_timepoint) => %transient
// Timeline-aware transient storage annotation threads timepoint through
%result_annotated, %annotated_tp = stream.resource.transients await(%dealloca_timepoint) =>
    %result : !stream.resource<*>{%result_size}
    from %storage : !hal.buffer
    => !stream.timepoint
```

Note: `stream.resource.transients` is timeline-aware, taking an optional await timepoint and returning a result timepoint. This allows it to integrate with the async timeline for proper synchronization. The timeline liveness analysis will track transient allocations through their timepoint use-def chains to discover which allocations should be emplaced into user-provided storage.
For call graph A→B→C where all need transients:
- A gets: pack_A (includes nested pack_B which includes nested pack_C)
- B gets: pack_B (includes nested pack_C)
- C gets: pack_C (leaf)
This allows:
- A to compute total size and create subview for B
- B to compute its total size (from A's subview) and create subview for C
- C to use its allocated subview directly
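A sketch of the nested pack structure (pack/subview assembly abbreviated; lifetime intervals and names are illustrative; the callee totals used by a caller come from the hoisted size math cloned into that caller):

```mlir
// In @C (leaf): pack of C's own transients.
%c_total, %c_off = stream.resource.pack slices({[0, 1] = %c_size}) : index

// In @B: B's own transients plus one slice reserved for C's total.
%b_total, %b_off, %b_off_c = stream.resource.pack slices({
  [0, 2] = %b_size,
  [0, 2] = %c_total
}) : index

// In @A: A's own transients plus one slice reserved for B's total (which already covers C).
%a_total, %a_off, %a_off_b = stream.resource.pack slices({
  [0, 3] = %a_size,
  [0, 3] = %b_total
}) : index

// A subviews its storage for B at B's slice; B does the same for C.
%b_storage = stream.resource.subview %a_storage[%a_off_b]
    : !stream.resource<transient>{%a_total} -> !stream.resource<transient>{%b_total}
```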
Program is valid for transient emplacement if:
- All allocation sizes are SSA values derived from:
- Function arguments (e.g., tensor dimensions)
- Immutable global values
- Pure arithmetic operations
- No unhoistable operations (unless `util.assume.int` provides bounds; see the sketch below)
- Timeline structure allows liveness analysis
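A sketch of how a bounded dynamic size could be admitted; the `util.assume.int` assembly shown here is approximate, and the key point is that the `umax` bound lets the verifier accept the value and the packer over-allocate to it:

```mlir
// %dynamic_size cannot be recomputed ahead of time, but its range is known.
%bounded_size = util.assume.int %dynamic_size<umin = 0, umax = 65536> : index
```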
Started: 2025-10-22
Completed: Phase 0 (ABI & HAL Layer) ✅, Phase 1 (Stream Layer Foundation) ✅, Phase 2 (Stubbed Implementation) ✅
Current Phase: Phase 3 - Analysis Infrastructure (TODO)
Next Steps:
- Implement AnnotateConstantTransientSizePass (Phase 2 remaining item)
- Design and implement Transient Storage Analysis (Phase 3)
- Design and implement Timeline Liveness Analysis (Phase 3)
Phase 0-1 Completed (2025-10-22):
- Implemented single-tensor design (not variadic) for cleaner ShapeAwareOp integration
- Added synchronization documentation (storage assumed immediately usable)
- Created comprehensive LIT tests covering static/dynamic shapes, affinity, and conversions
- All tests passing for HAL and Stream dialects
Timeline-Aware Fix (2025-10-22):
After initial implementation, discovered that stream.resource.transients needed to be timeline-aware to properly integrate with Stream's async execution model:
- Changed from `Stream_PureOp` to `Stream_Op` base class
- Added `Stream_TimelineOp` trait and `AttrSizedOperandSegments` trait
- Added optional `await_timepoint` input operand and `result_timepoint` output
- Updated assembly format: `await(%tp) => %resource : type{size} from storage => timepoint_type`
- Updated HAL→Stream conversion to insert `stream.timepoint.barrier` before and `stream.timepoint.await` after
- Fixed TiedOpInterface to handle multiple results (only resource result is tied)
- Updated all test expectations for timeline-aware format
Key Files Implemented:
- `compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.td` - `hal.tensor.transients` definition
- `compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.cpp` - HAL op implementation and folding
- `compiler/src/iree/compiler/Dialect/HAL/Transforms/WrapEntryPoints.cpp` - ABI attribute handling
- `compiler/src/iree/compiler/Dialect/Stream/IR/StreamOps.td` - `stream.resource.transients` definition (timeline-aware)
- `compiler/src/iree/compiler/Dialect/Stream/IR/StreamOps.cpp` - Stream op implementation and canonicalization (timeline-aware)
- `compiler/src/iree/compiler/Dialect/Stream/Conversion/HALToStream/Patterns.cpp` - HAL→Stream conversion (with barrier/await)
- Test files: `HAL/IR/test/{tensor_ops,tensor_folding,wrap_entry_points}.mlir`, `Stream/IR/test/{resource_ops,resource_folding}.mlir`, `Stream/Conversion/HALToStream/test/abi_ops.mlir`
Phase 2 Preparation (2025-10-22):
- Created pass definitions in `Passes.td` for all three Phase 2 passes
- Created skeleton C++ implementations with TODO comments
- Created comprehensive test files with positive and negative cases
- Updated BUILD.bazel and regenerated CMakeLists.txt
- All infrastructure compiles successfully
Implementation Components:
- ✅ Comprehensive LLVM_DEBUG logging with `[emplace-transients]` prefix and Explorer asmState
- ✅ Timeline traversal using `Explorer::walkDefiningOps` for robust SSA use-def walking
- ✅ Worklist algorithm to walk timeline backwards from `stream.resource.transients` ops
- ✅ Full SCF control flow support (scf.if, scf.for) via Explorer's region branch handling
- ✅ SetVector-based deduplication of alloca/dealloca ops (critical for loops)
- ✅ Clean refactoring with `TransientResource` struct and `gatherTransientResources` function
- ✅ Validation: public functions only, no function calls, single unique storage buffer
- ✅ Fixed Explorer bug in `compiler/src/iree/compiler/Dialect/Util/Analysis/Explorer.cpp:859`: added bounds check for region ops with mismatched result/yield counts (fixes crash on `stream.cmd.execute`)
Data Structure:
```cpp
// Information about transient allocations associated with a storage buffer.
struct TransientResource {
  Value storage;  // User-provided storage buffer SSA value
  SetVector<ResourceAllocaOp> allocaOps;      // All discovered transient allocations (deduplicated)
  SetVector<ResourceDeallocaOp> deallocaOps;  // All discovered transient deallocations (deduplicated)
};

// Main gathering function - populates transientResources by walking timeline.
static LogicalResult gatherTransientResources(
    FunctionOpInterface funcOp, Explorer &explorer,
    SmallVector<TransientResource> &transientResources);
```

Algorithm:
- Find all `stream.resource.transients` ops, validate single unique storage buffer per function
- Seed worklist with `await_timepoint` operands from all transients ops
- Use `Explorer::walkDefiningOps` to walk backwards through SSA use-def graph:
  - For each timepoint, Explorer handles OpResults, BlockArguments, and RegionBranchOps
  - If defining op is `stream.resource.alloca`: add to allocaOps SetVector
  - If defining op is `stream.resource.dealloca`: add to deallocaOps SetVector
  - If defining op implements `TimelineOpInterface`: add its `getAwaitTimepoints()` to worklist
- Continue until worklist exhausted → complete set of alloca/dealloca ops discovered
Test Coverage:
- Simple linear timeline (single alloca/dealloca)
- Multiple allocations (two sequential, many allocations)
- SCF control flow (`scf.if` with allocation in one/both branches, `scf.for` with loop-carried allocations)
- Negative tests: function calls, private functions, host synchronization
Files Modified:
- `compiler/src/iree/compiler/Dialect/Stream/Transforms/EmplaceTransients.cpp` - Timeline traversal implementation
- `compiler/src/iree/compiler/Dialect/Util/Analysis/Explorer.cpp` - Bug fix for region op bounds checking
- `compiler/src/iree/compiler/Dialect/Stream/Transforms/test/emplace_transients.mlir` - Comprehensive test suite
Next Implementation Steps (TODO in EmplaceTransients.cpp):
- Step 3: Hoist allocation sizes to function entry
  - Extract size values from each `stream.resource.alloca` op
  - Handle dynamic sizes (may need hoisting/cloning of computation)
  - Handle constant sizes (already at function scope)
- Step 4: Create `stream.resource.pack` with all transient allocations
  - Add `stream.experimental.transients` UnitAttr for MaterializeTransientSizeQueriesPass
  - Compute offsets for non-overlapping layout (stubbed version: sequential packing)
- Step 5: Replace each `stream.resource.alloca` with `stream.resource.subview` from pack result
  - Wire timepoints correctly (use await from transients or immediate)
- Step 6: Remove all `stream.resource.dealloca` ops (memory externally managed)
- Step 7: Subview user storage buffer for the packed range
- File: `compiler/src/iree/compiler/Dialect/Stream/Transforms/MaterializeTransientSizeQueries.cpp` - Status: Skeleton implementation with TODO comments
- File: `compiler/src/iree/compiler/Dialect/Stream/Transforms/AnnotateConstantTransientSize.cpp` - Status: Skeleton implementation with TODO comments
Changed `stream.resource.transients` to single storage operand:
- Removed variadic storage, added single `storage: !stream.resource<transient>` with explicit `storage_size: index`
- Removed `AttrSizedOperandSegments` trait (no longer needed)
- Assembly format now: `from $storage : type($storage) {$storage_size}` (see the sketch below)
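With the single-storage form, the earlier example reads roughly as:

```mlir
%result, %result_tp = stream.resource.transients await(%tp) =>
    %source : !stream.resource<*>{%size}
    from %storage : !stream.resource<transient>{%storage_size}
    => !stream.timepoint
```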
Fixed HAL→Stream conversion to use proper imports:
- Removed incorrect use of `unrealized_conversion_cast`
- Now properly imports HAL buffers using `stream.tensor.import` with `Lifetime::Transient`
- Handles both `!hal.buffer` and `!hal.buffer_view` inputs (extracts buffer with `hal.buffer_view.buffer`)
- Uses `hal.buffer.length` to get storage size
- Creates proper `stream.resource<transient>` for storage operand (see the sketch below)
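A sketch of the conversion sequence for a `!hal.buffer_view` storage argument (SSA names and the encoding type used for the raw import are illustrative assumptions):

```mlir
// Extract the backing buffer and its length, then import it as a transient resource.
%buffer = hal.buffer_view.buffer<%storage_view : !hal.buffer_view> : !hal.buffer
%storage_size = hal.buffer.length<%buffer : !hal.buffer> : index
%storage = stream.tensor.import %buffer : !hal.buffer ->
    tensor<?xi8>{%storage_size} in !stream.resource<transient>{%storage_size}
```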
Updated test coverage:
- All HAL→Stream conversion tests now verify proper `stream.tensor.import` usage
- Added test for `!hal.buffer_view` storage input (in addition to `!hal.buffer`)
Problem: EmplaceTransients was breaking timeline causality by replacing alloca timepoints with immediates.
Solution implemented:
- `replaceAllocasWithSubviews`: now forwards each alloca's `await_timepoint` to replace its `result_timepoint`
- Preserves timeline: if alloca awaited `%tp`, downstream ops now await `%tp` directly
- Deallocas: already forward their `await_timepoint` (unchanged)
- Transients ops: removed after emplacement, forwarding both `result_timepoint` → `await_timepoint` and `result` → `resource`
Files modified:
- `EmplaceTransients.cpp:413-459` - Updated `replaceAllocasWithSubviews` to preserve timeline
- `EmplaceTransients.cpp:688-717` - Added transients op removal with timepoint forwarding
- `TransientResource` struct now stores `transientsOps` vector
Solution implemented (2025-10-23):
- Mutual exclusivity detection: uses `RegionBranchOpInterface` to identify when allocations are in sibling regions (e.g., then/else branches of `scf.if`)
- Conservative size hoisting: backward slicing + cloning to common dominator + `arith.maxui` of all possible sizes (see the sketch below)
- Slot-based packing: mutually exclusive allocations share the same pack "slot" with the hoisted max size, enabling memory reuse across exclusive paths
- Cross-region size computation: Recursive handling of sizes defined in sibling regions
Test coverage:
- `emplace_transients.mlir`: 8 basic tests
- `emplace_transients_scf.mlir`: 11 SCF tests (mutually exclusive branches, loops, nested control flow, computed sizes)
- ✅ Test cleanup (2025-10-23): Updated storage subview CHECK patterns to include full type information (source size, result size from pack)
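A sketch of the conservative hoisting for mutually exclusive branches (sizes and intervals illustrative):

```mlir
// Sizes from the then/else branches of an scf.if are cloned to the common
// dominator and combined conservatively.
%slot_size = arith.maxui %size_then, %size_else : index
// Both exclusive allocations map to the same pack slot using the hoisted max,
// so the then-path and else-path transients reuse the same bytes.
%total, %slot_off = stream.resource.pack slices({[0, 1] = %slot_size}) : index
```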
Simplified isPureOp() implementations:
- Removed manual recursive walking logic from both `EmplaceTransients.cpp` and `MaterializeTransientSizeQueries.cpp`
- Now uses `mlir::isMemoryEffectFree()`, which automatically handles the `HasRecursiveMemoryEffects` trait (used by SCF ops like `scf.if`, `scf.for`)
- Reduced implementation from ~35 lines to ~13 lines per function
- Removed "DO NOT SUBMIT" comment from MaterializeTransientSizeQueries.cpp
- Removed unused `mlir/Dialect/Arith/IR/Arith.h` include
Key insight: MLIR's isMemoryEffectFree() already recursively checks nested operations for ops with the HasRecursiveMemoryEffects trait, making manual walks unnecessary.
Files modified:
- `compiler/src/iree/compiler/Dialect/Stream/Transforms/EmplaceTransients.cpp:255-266`
- `compiler/src/iree/compiler/Dialect/Stream/Transforms/MaterializeTransientSizeQueries.cpp:41-53`