@benvanik
Last active October 24, 2025 07:59
External transient storage design doc

External Transients Implementation Plan

Overview

Enable users to provide buffers for transient memory allocation in their functions, with generated query functions to calculate required sizes. This supports the kernel JIT use case where applications need control over transient allocations.

Motivation: users are building a kernel JIT on top of IREE: they provide IR of their linalg ops, we compile it into dispatches, and our host code schedules it with transient allocation. Users need to control transient memory ahead of time, so we provide size query functions and let them pass storage buffers to functions (making zero allocations in steady state).


Phase 0: Foundation - ABI & HAL Layer ✅ COMPLETED

Goal: Set up the high-level ABI support and HAL tensor operations before diving into Stream dialect analysis.

ABI Attribute Support

  • Add iree.abi.transients unit attribute definition
  • Update WrapEntryPointsPass to recognize iree.abi.transients on !hal.buffer arguments
  • Implement lowering logic in WrapEntryPointsPass to convert iree.abi.transients to hal.tensor.transients ops
  • Add validation that only one iree.abi.transients attribute exists per function
  • Write ABI-level tests for attribute parsing and validation

hal.tensor.transients Operation

  • Define hal.tensor.transients op in HAL dialect TableGen
    • Design Decision: Single-tensor version (not variadic) for cleaner integration with ShapeAwareOp
    • Takes storage buffer (!hal.buffer) and single tensor value
    • Returns same tensor value (preserves SSA use-def)
    • Example: %result = hal.tensor.transients %tensor : tensor<?xf32>{%dim} from %storage : !hal.buffer
    • Supports optional affinity: %result = hal.tensor.transients on(#hal.device.affinity<@dev>) %tensor : tensor<?xf32>{%dim} from %storage : !hal.buffer
  • Implement op verifier (storage must be !hal.buffer or !hal.buffer_view)
  • Add basic folders/canonicalizers:
    • Fold hal.tensor.transients + hal.tensor.transients → single hal.tensor.transients (outer storage wins)
  • Add pass-through behavior in Flow dialect transformations
  • Write LIT tests for:
    • Basic op construction and verification
    • Folding patterns
    • Integration with WrapEntryPointsPass
    • End-to-end: iree.abi.transients on function arg → hal.tensor.transients in lowered IR

Example IR (Phase 0 output):

// Input (using WrapEntryPointsPass):
util.func public @my_fn(%arg0: tensor<?xf32>, %arg1: index, %arg2: !hal.buffer {iree.abi.transients}) -> tensor<?xf32> {
  %t = ....(%arg0, %arg1) ...
  util.return %t : tensor<?xf32>
}

// After WrapEntryPointsPass lowering:
util.func public @my_fn(%arg0: !hal.buffer_view, %arg1: index, %arg2: !hal.buffer) -> !hal.buffer_view {
  %arg0_tensor = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<?xf32>{%arg1}
  ...
  // Note: Single transients op per result tensor (not variadic)
  %result_annotated = hal.tensor.transients %result : tensor<?xf32>{%result_dim} from %arg2 : !hal.buffer
  %result_view = hal.tensor.export %result_annotated : tensor<?xf32>{%result_dim} -> !hal.buffer_view
  util.return %result_view : !hal.buffer_view
}

Phase 1: Stream Layer Foundation ✅ COMPLETED

Goal: Establish the Stream dialect equivalent and conversion pipeline.

stream.resource.transients Operation

  • Define stream.resource.transients op in Stream dialect TableGen
    • Design Decision: Single-resource version (not variadic) matching HAL design
    • Timeline-Aware: Takes optional await timepoint, returns resource + result timepoint
    • Takes variadic storage operands (AnyType for flexibility), single resource with size
    • Returns same resource (tied operation preserving SSA use-def) + timepoint
    • Optional affinity attribute
    • Example: %result, %result_tp = stream.resource.transients await(%tp) => %source : !stream.resource<*>{%size} from %storage : !hal.buffer => !stream.timepoint
    • With affinity: %result, %result_tp = stream.resource.transients on(#hal.device.affinity<@dev>) await(%tp) => %source : !stream.resource<*>{%size} from %storage : !hal.buffer => !stream.timepoint
    • Variadic storage: %result, %result_tp = stream.resource.transients await(%tp) => %source : !stream.resource<*>{%size} from %storage1, %storage2 : !hal.buffer, !hal.buffer_view => !stream.timepoint
  • Implement op verifier (resource and result types must match)
  • Add canonicalization patterns:
    • Fold consecutive stream.resource.transients → single op (outer storage wins)
  • Timeline integration implemented (Stream_TimelineOp trait)

HAL → Stream Conversion

  • Implement hal.tensor.transients → stream.resource.transients conversion
    • Convert tensors to !stream.resource types using transferTensorOperands
    • Insert stream.timepoint.barrier before transients op (resource → resource+timepoint)
    • Insert stream.timepoint.await after transients op (resource+timepoint → resource)
    • Preserve storage operands through conversion
    • Handle affinity attributes (automatic transfer insertion when crossing devices)
    • Preserve SSA use-def chains through tied operation
  • Write LIT tests for:
    • Basic conversion patterns (static and dynamic tensors)
    • Affinity handling (cross-device transfers)
    • Storage operand preservation
    • Integration with existing Stream transformations
    • Timeline-aware barrier/await insertion

Example IR (Phase 1 output):

// After HAL→Stream conversion:
util.func public @my_fn(%arg0: !stream.resource<*>, %arg0_size: index, %storage: !hal.buffer)
    -> (!stream.resource<*>, index) {
  %transient, %alloca_timepoint = stream.resource.alloca ... !stream.resource<transient>{%transient_size}
  %exec_timepoint = stream.cmd.execute await(%alloca_timepoint) with(%transient) { ... }
  %dealloca_timepoint = stream.resource.dealloca await(%exec_timepoint) => %transient

  // Timeline-aware transient storage annotation
  // Insert barrier to materialize timepoint from result resource
  %result_with_tp, %result_tp = stream.timepoint.barrier %result : !stream.resource<*>{%result_size}
      => !stream.timepoint

  // Annotate with transient storage (timeline-aware - threads timepoint through)
  %result_annotated, %annotated_tp = stream.resource.transients await(%result_tp) =>
      %result_with_tp : !stream.resource<*>{%result_size}
      from %storage : !hal.buffer
      => !stream.timepoint

  // Await to resolve back to plain resource for return
  %final_result = stream.timepoint.await %annotated_tp => %result_annotated :
      !stream.resource<*>{%result_size}

  util.return %final_result, %result_size : !stream.resource<*>, index
}

Phase 2: Stubbed End-to-End Implementation ✅ COMPLETED

Goal: Get a simple working implementation for trivial cases to validate the IR design before building sophisticated analysis.

EmplaceTransientsPass ✅ COMPLETED

  • Scope Limitations (documented in pass):

    • Only handles public functions with stream.resource.transients ops
    • Assumes no function calls (single function only)
    • No complex timeline analysis needed
  • Core Transformation:

    • Find stream.resource.transients op and extract storage SSA value
    • Find all stream.resource.alloca ops in function (timeline traversal with Explorer)
    • SCF control flow support (scf.if, scf.for) with mutual exclusivity detection
    • Cross-region size hoisting with backward slicing
    • Create stream.resource.pack with non-overlapping liveness intervals:
      • Slot-based packing for mutually exclusive allocations
      • Conservative size hoisting (arith.maxui across branches)
      • Sequential ranges for non-exclusive allocations
      • Uses size SSA values from allocas (hoisted as needed)
      • Produces offsets + total size
    • Replace each stream.resource.alloca with stream.resource.subview:
      • Subview from pack result at computed offset
      • Timeline chain preservation (forward await timepoint)
    • Remove stream.resource.dealloca ops:
      • Forward await timepoint to users
    • Remove stream.resource.transients ops after emplacement
    • Take subview of user storage buffer for the pack
  • Write comprehensive LIT tests:

    • Single allocation case
    • Two allocations case
    • Zero allocations case (no-op)
    • Many allocations case
    • SCF control flow (11 tests in emplace_transients_scf.mlir)
    • Computed sizes across regions
    • Nested control flow
    • Error cases (function calls, private functions)

MaterializeTransientSizeQueriesPass ✅ COMPLETED

  • Walk Functions for Transient Packs:

    • Iterate over functions
    • Find stream.resource.pack ops with stream.experimental.transients attribute
  • Generate Size Query Function:

    • Create new function with all the same inputs as the original function
    • For each pack op found, clone the backward slice up to the input arguments into the new function
    • Add pack total size results in function order (if multiple packs found)
    • Return total size value(s)
  • Update Original Function:

    • Strip stream.experimental.transients attribute from pack op(s)
    • Add iree.reflection annotation of iree.abi.transients.size pointing to the size query function name
  • Write LIT tests:

    • Constant size query generation
    • Query function naming and annotations
    • Backward slicing for size computations
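The backward-slicing step above reduces to a transitive def walk. A minimal sketch, assuming a toy value→operands map in place of real MLIR use-def chains (the real pass clones ops; here we just collect what would need cloning):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// defs maps each value to the values it depends on; function arguments
// have no entry. Returns every producer that must be cloned into the
// query function to recompute `root` (a pack's total size) from the args.
static std::set<std::string>
backwardSlice(const std::map<std::string, std::vector<std::string>> &defs,
              const std::string &root) {
  std::set<std::string> slice;
  std::vector<std::string> worklist = {root};
  while (!worklist.empty()) {
    std::string value = worklist.back();
    worklist.pop_back();
    auto it = defs.find(value);
    if (it == defs.end()) continue;  // function argument: not cloned
    if (!slice.insert(value).second) continue;  // already visited
    for (const std::string &operand : it->second) worklist.push_back(operand);
  }
  return slice;
}
```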

AnnotateConstantTransientSizePass (Pulled Forward)

  • Iterate over transient size query functions
  • Check if function body folded to arith.constant returns
  • Add iree.reflection metadata with constant size values
  • Write tests for:
    • Constant size detection and annotation
    • Verification that annotation matches actual computation

End-to-End Validation

  • Create simple example program:
    util.func public @simple(%arg0: tensor<4xf32>, %storage: !hal.buffer {iree.abi.transients}) -> tensor<4xf32> {
      // Simple computation that creates 1-2 transients
    }
  • Verify full pipeline runs: ABI → HAL → Stream → Stubbed passes → Size query
  • Confirm IR constructs are correct
  • Document limitations of stubbed implementation

Key Insight: This phase validates the IR design works end-to-end for simple cases before investing in complex analysis. Can be merged as one PR to get early feedback.


Phase 3: Analysis Infrastructure

Goal: Build the sophisticated analysis needed to handle real-world cases.

Transient Storage Analysis

  • Design analysis data structures:
    • Map: stream.resource.alloca SSA value → transient storage SSA value
    • Map: stream.resource.alloca SSA value → deallocation ops
    • Tracking for which resources belong to which transient storage
  • Implement DFX-based solver:
    • Seed solver with stream.resource.transients await/result timepoints
    • Use Explorer to walk defining/using timeline ops in worklist
    • Track alloca/dealloca resource SSA values
    • Compute transient storage attribution for each allocation
  • Add utility functions for querying analysis results
  • Consider making this a reusable analysis (DFX attribute) for other passes
  • Reference: compiler/src/iree/compiler/Dialect/Stream/Transforms/ElideTimepoints.cpp

Timeline Liveness Analysis

  • Design liveness scope data structures:
    • Scope identification (where in IR to insert pack ops)
    • Alloca ops covered by each scope
    • Transient storage attribution
    • Live range information (start/end timepoints)
  • Implement timeline walking algorithm:
    • Build timeline ordering (global numbering with overlap handling)
    • Compute liveness ranges over async timeline (not IR ops)
    • Cluster allocations by transient storage
  • Use transient storage analysis for alloca/dealloca mapping
  • Produce per-region scope information
  • Make analysis results queryable and preservable across passes

VerifyTransientStoragePass

  • Implement verification pass that runs both analyses
  • Check that transient emplacement is possible:
    • All allocation sizes computable from inputs or immutable globals
    • No mutable global loads in size computation
    • No side-effecting external calls in size computation
    • No loads of stream resource values in size computation
  • Provide clear error messages when verification fails
  • Add flag for "unsafe mode" (trust user to provide correct buffer size)
  • Use markAllAnalysesPreserved() since analysis remains valid
  • Write tests for:
    • Valid programs (computable sizes)
    • Invalid programs (dynamic sizes without ranges)
    • Error message quality

AnnotateTransientStoragePass

  • Implement debugging/test pass (similar to AnnotateAffinitiesPass)
  • Add informational attributes to:
    • Function ops (which transients they use)
    • Scope regions (pack location, covered allocas)
    • Call sites (transient propagation)
    • stream.resource.alloca/dealloca ops (scope attribution)
    • stream.resource.transients ops (storage info)
  • Use markAllAnalysesPreserved()
  • Write comprehensive LIT tests using CHECK directives:
    • Simple single-scope examples
    • Multiple scopes in one function
    • Cross-function transient propagation
    • Complex timeline scenarios

Key Insight: Timeline-based liveness uses async timepoint use-def chains to track allocations through asynchronous execution, rather than analyzing IR op ordering. This is the core innovation that makes this work.


Phase 4: Production Implementation Passes

Goal: Replace stubbed passes with production-quality implementations that handle all cases.

Production EmplaceTransientsPass

  • Call Graph Analysis:

    • Identify all functions using transients (directly or transitively)
    • Verify call graph structure allows signature changes
    • Build propagation plan for !stream.resource<transient> arguments
  • Function Signature Updates:

    • Add !stream.resource<transient> + index size parameters to functions needing transients
    • Update all call sites to pass transient subviews
    • Handle public vs private function distinctions
  • Size Computation Hoisting:

    • Extract size calculations from stream.resource.pack operations
    • Hoist size math to callers (forward/backward slicing)
    • Fold and simplify hoisted arithmetic
    • Build nested stream.resource.pack in callers
      • For call tree A→B→C: A contains pack of B+C, B contains pack of C
      • This lets A compute subview for B, B compute subview for C
  • Per-Scope Transformations (using analysis):

    • Insert stream.resource.pack operations using liveness analysis:
      • Takes actual liveness ranges (may overlap) + size SSA values
      • Produces optimized offsets + total size
    • Replace stream.resource.alloca with stream.resource.subview:
      • Subview from pack result at computed offset
      • Replace timepoint result with await operand (or stream.timepoint.immediate)
    • Replace stream.resource.dealloca:
      • Remove deallocation op
      • Replace timepoint result with await operand
    • Update call sites:
      • Compute subview for callee from pack result
      • Pass subview + size to callee
  • Storage Buffer Handling:

    • Track user storage SSA value through IR (function arg, hal.buffer.allocate, etc.)
    • Generate subviews from user storage in functions with stream.resource.transients
    • Add size assertions (verify user buffer is large enough)
  • Integer Range Analysis Integration:

    • Use util.assume.int ops for dynamic size bounds
    • Allow unhoistable values if max value is known
    • Over-allocate based on maximum possible size
  • Use markAllAnalysesPreserved()

  • Write comprehensive LIT tests for:

    • Call graph propagation (A→B, A→B→C)
    • Nested pack generation
    • Overlapping liveness ranges (actual packing optimization)
    • Subview calculations through call graphs
    • Size assertion insertion
    • Dynamic sizes with integer range analysis

Production MaterializeTransientSizeQueriesPass

  • Find Public Functions with Transients:

    • Use analysis to identify public functions taking transient storage arguments
    • Extract required input arguments for size computation
  • Generate Size Query Functions:

    • Create new public function with iree.abi.reflection annotation
    • Function signature: takes required args (e.g., !hal.buffer_view for dynamic shapes)
    • Returns one index per transient storage used
    • Deterministic ordering (by argument position or other consistent rule)
  • Populate Size Computation:

    • Clone top-level stream.resource.pack nest into query function
    • Include nested packs from callees (A's query includes pack of B+C if A calls B calls C)
    • Extract total size from each pack
    • Return size values
  • Verification:

    • Ensure no device execution in query functions (pure arithmetic only)
    • Verify no resource allocations needed
    • Confirm all sizes are computable from inputs
  • Use markAllAnalysesPreserved()

  • Write LIT tests for:

    • Multiple transients in one function
    • Dynamic sizes based on tensor dimensions
    • Nested function call size aggregation
    • Complex call graphs

Key Insight: Hoisting strategy propagates size calculations up the call graph, enabling callers to compute nested transient requirements and properly subview storage for callees.


Phase 5: Integration & Testing

Replace Stubbed Passes

  • Update pass pipeline to use production EmplaceTransientsPass
  • Update pass pipeline to use production MaterializeTransientSizeQueriesPass
  • Remove or mark stubbed implementations as deprecated/testing-only

Pipeline Integration

  • Add all passes to appropriate pass pipelines in correct order:
    • VerifyTransientStoragePass (early, for user feedback)
    • EmplaceTransientsPass (after ScheduleAllocationPass)
    • MaterializeTransientSizeQueriesPass (after EmplaceTransientsPass)
    • LayoutSlicesPass (existing, turns stream.resource.pack into math)
    • CSE/canonicalization (existing, folds size calculations)
    • AnnotateConstantTransientSizePass (final, adds reflection metadata)
  • Determine flag/configuration for enabling external transients feature
  • Update pass manager documentation

End-to-End Testing

  • Revisit simple example from Phase 2, verify it still works with production passes
  • Complex example: call graph with multiple transients
  • Dynamic shapes with util.assume.int bounds
  • Constant size fast-path
  • Error cases (uncomputable sizes, verification failures)
  • Comparison: stubbed vs production pass output quality (verify packing optimization)

Documentation

  • User guide for iree.abi.transients attribute usage
  • Size query function API documentation
  • Limitations and requirements:
    • Computable sizes (from inputs or immutable globals)
    • Timeline structure must be analyzable
    • Call graph structure requirements
  • Kernel JIT use case example
  • Migration guide from stubbed to production implementation

Implementation Strategy

PR Structure:

  1. PR 1: Phase 0-1 (ABI/HAL/Stream foundation) - validates IR design
  2. PR 2: Phase 2 (stubbed implementation) - validates end-to-end flow
  3. PR 3: Phase 3 (analysis infrastructure) - comprehensive analysis with test pass
  4. PR 4: Phase 4-5 (production passes + integration) - completes feature

Development Workflow:

  • Each phase delivers independent value
  • Can merge incrementally for early feedback
  • Analysis is testable in isolation before integration
  • Stubbed implementation de-risks the IR design

Key Files to Reference:

  • Analysis example: compiler/src/iree/compiler/Dialect/Stream/Transforms/ElideTimepoints.cpp
  • Pass patterns: compiler/src/iree/compiler/Dialect/Stream/Transforms/ScheduleAllocation.cpp
  • Annotation pass: compiler/src/iree/compiler/Dialect/Stream/Transforms/AnnotateAffinities.cpp

Technical Notes

Timeline Liveness Analysis Details

Instead of building a liveness range over IR ops, we build it over the asynchronous timeline represented by timepoints:

  1. Globally number all timepoints
  2. Define an order with overlap
  3. Use use-def chain of timepoints to track liveness

Example:

%transient, %alloca_timepoint = stream.resource.alloca ... !stream.resource<transient>{%size}
%exec_timepoint = stream.cmd.execute await(%alloca_timepoint) with(%transient) { ... }
%dealloca_timepoint = stream.resource.dealloca await(%exec_timepoint) => %transient
// Timeline-aware transient storage annotation threads timepoint through
%result_annotated, %annotated_tp = stream.resource.transients await(%dealloca_timepoint) =>
    %result : !stream.resource<*>{%result_size}
    from %storage : !hal.buffer
    => !stream.timepoint

Note: stream.resource.transients is timeline-aware, taking an optional await timepoint and returning a result timepoint. This allows it to integrate with the async timeline for proper synchronization. The timeline liveness analysis will track transient allocations through their timepoint use-def chains to discover which allocations should be emplaced into user-provided storage.
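The numbering-and-ranges idea can be sketched without MLIR: number timeline events in order, then derive each allocation's live range from its alloca/dealloca events rather than from IR op order. Types and names here are illustrative, not the analysis's actual data structures:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A timeline event: an alloca, a dealloca, or any other timeline op.
// Resources are identified by name for illustration.
enum class Kind { Alloca, Dealloca, Other };
struct Event {
  Kind kind;
  std::string resource;
};

// Globally number events in timeline order and record [alloca#, dealloca#]
// per resource. Resources whose ranges do not overlap may share storage.
static std::map<std::string, std::pair<int, int>>
computeLiveRanges(const std::vector<Event> &timeline) {
  std::map<std::string, std::pair<int, int>> ranges;
  for (int i = 0; i < (int)timeline.size(); ++i) {
    const Event &e = timeline[i];
    if (e.kind == Kind::Alloca) ranges[e.resource] = {i, i};
    else if (e.kind == Kind::Dealloca) ranges[e.resource].second = i;
  }
  return ranges;
}

static bool overlaps(std::pair<int, int> a, std::pair<int, int> b) {
  return a.first <= b.second && b.first <= a.second;
}
```

In the real analysis the "numbering" must also handle overlap (concurrent timelines), but the interval test is the same: only overlapping ranges force disjoint pack offsets.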

Size Computation Hoisting

For call graph A→B→C where all need transients:

  • A gets: pack_A (includes nested pack_B which includes nested pack_C)
  • B gets: pack_B (includes nested pack_C)
  • C gets: pack_C (leaf)

This allows:

  • A to compute total size and create subview for B
  • B to compute its total size (from A's subview) and create subview for C
  • C to use its allocated subview directly
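The aggregation this nesting performs is a recursion over the call tree. A hypothetical sketch (`CallNode` and `ownBytes` are stand-ins for each function's local pack, not real pass data structures):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One node in the call tree: a function's own transient bytes plus its
// callees' requirements.
struct CallNode {
  uint64_t ownBytes;
  std::vector<CallNode> callees;
};

// A caller's pack must cover its own transients plus a nested pack for
// each callee; the caller then hands each callee a subview of its storage.
// This mirrors "A contains pack of B+C, B contains pack of C" above.
static uint64_t totalTransientBytes(const CallNode &node) {
  uint64_t total = node.ownBytes;
  for (const CallNode &callee : node.callees)
    total += totalTransientBytes(callee);
  return total;
}
```

(The real nested packs also account for alignment and liveness overlap; a plain sum is the conservative upper bound.)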

Verification Requirements

Program is valid for transient emplacement if:

  • All allocation sizes are SSA values derived from:
    • Function arguments (e.g., tensor dimensions)
    • Immutable global values
    • Pure arithmetic operations
  • No unhoistable operations (unless util.assume.int provides bounds)
  • Timeline structure allows liveness analysis
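These rules amount to a reachability check over each size value's def chain. As a sketch, with simplified string "kinds" standing in for the real SSA classification:

```cpp
#include <assert.h>

#include <map>
#include <string>
#include <vector>

// Simplified def record for a size SSA value: what produced it and its
// operands. Allowed roots: function arguments and immutable globals;
// allowed interior nodes: pure arithmetic. Everything else rejects.
struct Def {
  std::string kind;  // "arg", "immutable_global", "arith", "mutable_global_load", ...
  std::vector<std::string> operands;
};

static bool isSizeComputable(const std::map<std::string, Def> &defs,
                             const std::string &value) {
  const Def &def = defs.at(value);
  if (def.kind == "arg" || def.kind == "immutable_global") return true;
  if (def.kind != "arith") return false;  // mutable loads, calls, etc.
  for (const std::string &operand : def.operands)
    if (!isSizeComputable(defs, operand)) return false;
  return true;
}
```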

Current Status

Started: 2025-10-22
Completed: Phase 0 (ABI & HAL Layer) ✅, Phase 1 (Stream Layer Foundation) ✅, Phase 2 (Stubbed Implementation) ✅
Current Phase: Phase 3 - Analysis Infrastructure (TODO)
Next Steps:

  1. Implement AnnotateConstantTransientSizePass (Phase 2 remaining item)
  2. Design and implement Transient Storage Analysis (Phase 3)
  3. Design and implement Timeline Liveness Analysis (Phase 3)

Implementation Notes

Phase 0-1 Completed (2025-10-22):

  • Implemented single-tensor design (not variadic) for cleaner ShapeAwareOp integration
  • Added synchronization documentation (storage assumed immediately usable)
  • Created comprehensive LIT tests covering static/dynamic shapes, affinity, and conversions
  • All tests passing for HAL and Stream dialects

Timeline-Aware Fix (2025-10-22): After initial implementation, discovered that stream.resource.transients needed to be timeline-aware to properly integrate with Stream's async execution model:

  • Changed from Stream_PureOp to Stream_Op base class
  • Added Stream_TimelineOp trait and AttrSizedOperandSegments trait
  • Added optional await_timepoint input operand and result_timepoint output
  • Updated assembly format: await(%tp) => %resource : type{size} from storage => timepoint_type
  • Updated HAL→Stream conversion to insert stream.timepoint.barrier before and stream.timepoint.await after
  • Fixed TiedOpInterface to handle multiple results (only resource result is tied)
  • Updated all test expectations for timeline-aware format

Key Files Implemented:

  • compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.td - hal.tensor.transients definition
  • compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.cpp - HAL op implementation and folding
  • compiler/src/iree/compiler/Dialect/HAL/Transforms/WrapEntryPoints.cpp - ABI attribute handling
  • compiler/src/iree/compiler/Dialect/Stream/IR/StreamOps.td - stream.resource.transients definition (timeline-aware)
  • compiler/src/iree/compiler/Dialect/Stream/IR/StreamOps.cpp - Stream op implementation and canonicalization (timeline-aware)
  • compiler/src/iree/compiler/Dialect/Stream/Conversion/HALToStream/Patterns.cpp - HAL→Stream conversion (with barrier/await)
  • Test files: HAL/IR/test/{tensor_ops,tensor_folding,wrap_entry_points}.mlir, Stream/IR/test/{resource_ops,resource_folding}.mlir, Stream/Conversion/HALToStream/test/abi_ops.mlir

Phase 2 Preparation (2025-10-22):

  • Created pass definitions in Passes.td for all three Phase 2 passes
  • Created skeleton C++ implementations with TODO comments
  • Created comprehensive test files with positive and negative cases
  • Updated BUILD.bazel and regenerated CMakeLists.txt
  • All infrastructure compiles successfully

Phase 2 Implementation Progress (2025-10-23)

EmplaceTransientsPass - Timeline Traversal (COMPLETED)

Implementation Components:

  • ✅ Comprehensive LLVM_DEBUG logging with [emplace-transients] prefix and Explorer asmState
  • ✅ Timeline traversal using Explorer::walkDefiningOps for robust SSA use-def walking
  • ✅ Worklist algorithm to walk timeline backwards from stream.resource.transients ops
  • ✅ Full SCF control flow support (scf.if, scf.for) via Explorer's region branch handling
  • ✅ SetVector-based deduplication of alloca/dealloca ops (critical for loops)
  • ✅ Clean refactoring with TransientResource struct and gatherTransientResources function
  • ✅ Validation: public functions only, no function calls, single unique storage buffer
  • Fixed Explorer bug in compiler/src/iree/compiler/Dialect/Util/Analysis/Explorer.cpp:859: Added a bounds check for region ops with mismatched result/yield counts (fixes a crash on stream.cmd.execute)

Data Structure:

// Information about transient allocations associated with a storage buffer.
struct TransientResource {
  Value storage;                                    // User-provided storage buffer SSA value
  SetVector<ResourceAllocaOp> allocaOps;           // All discovered transient allocations (deduplicated)
  SetVector<ResourceDeallocaOp> deallocaOps;       // All discovered transient deallocations (deduplicated)
};

// Main gathering function - populates transientResources by walking timeline.
static LogicalResult gatherTransientResources(
    FunctionOpInterface funcOp,
    Explorer &explorer,
    SmallVector<TransientResource> &transientResources);

Algorithm:

  1. Find all stream.resource.transients ops, validate single unique storage buffer per function
  2. Seed worklist with await_timepoint operands from all transients ops
  3. Use Explorer::walkDefiningOps to walk backwards through SSA use-def graph:
    • For each timepoint, Explorer handles OpResults, BlockArguments, and RegionBranchOps
    • If defining op is stream.resource.alloca: add to allocaOps SetVector
    • If defining op is stream.resource.dealloca: add to deallocaOps SetVector
    • If defining op implements TimelineOpInterface: add its getAwaitTimepoints() to worklist
  4. Continue until worklist exhausted → complete set of alloca/dealloca ops discovered
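Stripped of MLIR types, the worklist above has the usual shape. A sketch with a toy timepoint→producer map in place of Explorer::walkDefiningOps (all names illustrative):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Minimal timeline graph: each timepoint maps to the op producing it, and
// each op lists the timepoints it awaited.
struct TimelineOp {
  std::string kind;  // "alloca", "dealloca", "execute", ...
  std::vector<std::string> awaitTimepoints;
};

// Walk backwards from the seed timepoints, collecting alloca/dealloca ops.
// The visited set deduplicates (critical for loops), playing the role of
// the SetVectors in the real implementation.
static std::set<std::string> gatherOps(
    const std::map<std::string, std::string> &producerOf,  // timepoint -> op
    const std::map<std::string, TimelineOp> &ops,
    std::vector<std::string> worklist) {
  std::set<std::string> found, visited;
  while (!worklist.empty()) {
    std::string tp = worklist.back();
    worklist.pop_back();
    if (!visited.insert(tp).second) continue;
    auto it = producerOf.find(tp);
    if (it == producerOf.end()) continue;  // e.g. block arg / immediate
    const TimelineOp &op = ops.at(it->second);
    if (op.kind == "alloca" || op.kind == "dealloca") found.insert(it->second);
    for (const std::string &await : op.awaitTimepoints)
      worklist.push_back(await);
  }
  return found;
}
```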

Test Coverage:

  • Simple linear timeline (single alloca/dealloca)
  • Multiple allocations (two sequential, many allocations)
  • SCF control flow (scf.if with allocation in one/both branches, scf.for with loop-carried allocations)
  • Negative tests: function calls, private functions, host synchronization

Files Modified:

  • compiler/src/iree/compiler/Dialect/Stream/Transforms/EmplaceTransients.cpp - Timeline traversal implementation
  • compiler/src/iree/compiler/Dialect/Util/Analysis/Explorer.cpp - Bug fix for region op bounds checking
  • compiler/src/iree/compiler/Dialect/Stream/Transforms/test/emplace_transients.mlir - Comprehensive test suite

Next Implementation Steps (TODO in EmplaceTransients.cpp):

  • Step 3: Hoist allocation sizes to function entry
    • Extract size values from each stream.resource.alloca op
    • Handle dynamic sizes (may need hoisting/cloning of computation)
    • Handle constant sizes (already at function scope)
  • Step 4: Create stream.resource.pack with all transient allocations
    • Add stream.experimental.transients UnitAttr for MaterializeTransientSizeQueriesPass
    • Compute offsets for non-overlapping layout (stubbed version: sequential packing)
  • Step 5: Replace each stream.resource.alloca with stream.resource.subview from pack result
    • Wire timepoints correctly (use await from transients or immediate)
  • Step 6: Remove all stream.resource.dealloca ops (memory externally managed)
  • Step 7: Subview user storage buffer for the packed range

MaterializeTransientSizeQueriesPass (STUBBED)

  • File: compiler/src/iree/compiler/Dialect/Stream/Transforms/MaterializeTransientSizeQueries.cpp
  • Status: Skeleton implementation with TODO comments

AnnotateConstantTransientSizePass (STUBBED)

  • File: compiler/src/iree/compiler/Dialect/Stream/Transforms/AnnotateConstantTransientSize.cpp
  • Status: Skeleton implementation with TODO comments

Recent Progress (2025-10-23)

API Simplification & Proper Dialect Layering ✅

Changed stream.resource.transients to single storage operand:

  • Removed variadic storage, added single storage: !stream.resource<transient> with explicit storage_size: index
  • Removed AttrSizedOperandSegments trait (no longer needed)
  • Assembly format now: from $storage : type($storage) {$storage_size}

Fixed HAL→Stream conversion to use proper imports:

  • Removed incorrect use of unrealized_conversion_cast
  • Now properly imports HAL buffers using stream.tensor.import with Lifetime::Transient
  • Handles both !hal.buffer and !hal.buffer_view inputs (extracts buffer with hal.buffer_view.buffer)
  • Uses hal.buffer.length to get storage size
  • Creates proper stream.resource<transient> for storage operand

Updated test coverage:

  • All HAL→Stream conversion tests now verify proper stream.tensor.import usage
  • Added test for !hal.buffer_view storage input (in addition to !hal.buffer)

Timeline Chain Preservation Fix ✅

Problem: EmplaceTransients was breaking timeline causality by replacing alloca timepoints with immediates.

Solution implemented:

  • replaceAllocasWithSubviews: Now forwards each alloca's await_timepoint to replace its result_timepoint
  • Preserves timeline: if alloca awaited %tp, downstream ops now await %tp directly
  • Deallocas: Already forward their await_timepoint (unchanged)
  • Transients ops: Removed after emplacement, forwarding result_timepoint → await_timepoint and result → resource

Files modified:

  • EmplaceTransients.cpp:413-459 - Updated replaceAllocasWithSubviews to preserve timeline
  • EmplaceTransients.cpp:688-717 - Added transients op removal with timepoint forwarding
  • TransientResource struct now stores transientsOps vector

Mutually Exclusive Branch Allocations ✅

Solution implemented (2025-10-23):

  • Mutual exclusivity detection: Uses RegionBranchOpInterface to identify when allocations are in sibling regions (e.g., then/else branches of scf.if)
  • Conservative size hoisting: Backward slicing + cloning to common dominator + arith.maxui of all possible sizes
  • Slot-based packing: Mutually exclusive allocations share the same pack "slot" with the hoisted max size, enabling memory reuse across exclusive paths
  • Cross-region size computation: Recursive handling of sizes defined in sibling regions
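The slot assignment can be illustrated in isolation: allocations in mutually exclusive regions collapse into one slot sized by an arith.maxui-style max. A plain C++ sketch, assuming exclusivity groups are given as input (the real pass derives them via RegionBranchOpInterface):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Each group holds mutually exclusive allocations (e.g. the then/else
// branches of an scf.if): they can never be live simultaneously, so they
// share one slot sized by the maximum over the group, matching the
// conservative arith.maxui hoisting. Slots are then laid out sequentially
// (alignment omitted for brevity).
static uint64_t
packSlots(const std::vector<std::vector<uint64_t>> &exclusiveGroups) {
  uint64_t total = 0;
  for (const auto &group : exclusiveGroups) {
    uint64_t slotSize = 0;
    for (uint64_t size : group) slotSize = std::max(slotSize, size);
    total += slotSize;
  }
  return total;
}
```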

Test coverage:

  • emplace_transients.mlir: 8 basic tests
  • emplace_transients_scf.mlir: 11 SCF tests (mutually exclusive branches, loops, nested control flow, computed sizes)
  • ✅ Test cleanup (2025-10-23): Updated storage subview CHECK patterns to include full type information (source size, result size from pack)

Code Quality Improvements (2025-10-24) ✅

Simplified isPureOp() implementations:

  • Removed manual recursive walking logic from both EmplaceTransients.cpp and MaterializeTransientSizeQueries.cpp
  • Now uses mlir::isMemoryEffectFree() which automatically handles HasRecursiveMemoryEffects trait (used by SCF ops like scf.if, scf.for)
  • Reduced implementation from ~35 lines to ~13 lines per function
  • Removed "DO NOT SUBMIT" comment from MaterializeTransientSizeQueries.cpp
  • Removed unused mlir/Dialect/Arith/IR/Arith.h include

Key insight: MLIR's isMemoryEffectFree() already recursively checks nested operations for ops with the HasRecursiveMemoryEffects trait, making manual walks unnecessary.

Files modified:

  • compiler/src/iree/compiler/Dialect/Stream/Transforms/EmplaceTransients.cpp:255-266
  • compiler/src/iree/compiler/Dialect/Stream/Transforms/MaterializeTransientSizeQueries.cpp:41-53

Users are requesting that we let them provide a buffer to their functions and have all transient memory in the function be allocated from that. We can then generate a function that calculates how much transient memory is required so the application can query it before calling.

The motivating use for all of this is building a kernel JIT on top of IREE: they provide IR of their linalg ops, we compile it into one or more dispatches, and then we have our host code that schedules it and does the transient allocation. They need to be able to control their transient memory ahead of time (required by their applications) so this lets us give them the size queries to know how much transient memory a particular function requires (the maximum, if bounded) and then when they call the function they control where that transient memory comes from (we make no transient allocations). This way in the steady state if they alias their function results and have our passes run they'll have zero allocations.

We currently use attributes like iree.abi.output in the native WrapEntryPointsPass that look for !hal.buffer/!hal.buffer_view and insert new HAL ops like hal.tensor.import to communicate the HAL semantics to lower levels of the stack. My thought here is to add a new hal.tensor.transients op that takes the buffer/buffer_view and a list of tensor SSA values. This says "any transient memory used in the production of these tensors should be suballocated from this buffer". Just as iree.abi.output is a helper that gets lowered into hal.tensor.alias while a frontend like torch can use hal.tensor.alias directly at a lower level, so too can users here: iree.abi.transients on a buffer/buffer_view arg gets lowered into hal.tensor.transients, and if the user wants more complex behavior they can emit the op themselves. The iree.abi.transients attr would be a UnitAttr on the arg: only one is allowed (when using iree.abi.*) and it lowers into hal.tensor.transients covering all arguments and all results. A power user could use hal.tensor.transients to separate some values from others (like in a NUMA system where they are sharding). We'd have some basic canonicalizers/folders that let us simplify the IR (hal.tensor.transients + hal.tensor.transients -> hal.tensor.transients) or propagate it (if we wanted). The requirement for the user here is that all allocation sizes have to be computable from the inputs or immutable globals - if not, we error out unless they opt in to unsafe behavior with a flag where we just trust the user to have passed a correctly sized buffer.

The hal.tensor.transients op would take the transient storage and one or more tensor values and return the tensor values (this keeps SSA use-def valid). e.g. %results:2 = hal.tensor.transients storage(%storage : !hal.buffer) %tensor0, %tensor1 : tensor<?xf32>, tensor<?xi32>.

So on input using the IREE WrapEntryPointsPass:

util.func public @my_fn(%arg0: tensor<?xf32>, %arg1: index, %arg2: !hal.buffer {iree.abi.transients}) -> tensor<?xf32> {
  %t = ....(%arg0, %arg1) ...
  util.return %t : tensor<?xf32>
}

Would lower to IR like this, which users could also annotate themselves if they wanted fine-grained control:

util.func public @my_fn(%arg0: !hal.buffer_view, %arg1: index, %arg2: !hal.buffer) -> (!hal.buffer_view) {
  %arg0_tensor = hal.tensor.import %arg0 ... : tensor<?xf32>
  ...
  %result_annotated = hal.tensor.transients storage(%arg2 : !hal.buffer) %result : tensor<?xf32>
  %result_view = hal.tensor.export %result_annotated : tensor<?xf32>
  util.return %result_view : !hal.buffer_view
}

The hal.tensor.transients op would pass through flow unchanged, and then like hal.tensor.import would be turned into a stream.resource.transients op when going flow->stream. This op, like all other stream ops, would take !stream.resource values instead of tensors. It'd be mostly untouched until after ScheduleAllocationPass, where we have a bunch of stream.resource.alloca / stream.resource.dealloca ops.

So we expect a function like this at some point:

util.func public @my_fn(...) -> !hal.buffer_view {
  %result, %result_timepoint = stream.resource.alloca ... <external>{%result_size}
  %transient, %alloca_timepoint = stream.resource.alloca ... <transient>{%transient_size}
  %exec_timepoint = stream.cmd.execute with(%transient) { ... }
  %dealloca_timepoint = stream.resource.dealloca %transient
  %result_annotated, %transient_timepoint = stream.resource.transients on(#hal...) storage(%arg2 : !hal.buffer) await(%exec_timepoint) %result : !stream.resource<external>{%result_size}
  %result_export = stream.tensor.export %result_annotated : tensor<?xf32> in !stream.resource<external>{%result_size} -> !hal.buffer_view
  util.return %result_export, %transient_timepoint : !hal.buffer_view, ...
}

The new stream.resource.transients op would mirror the hal.tensor.transients op but take the resources instead. When lowering from the hal.tensor.transients to stream.resource.transients we need the timepoints for each resource and can insert a stream.timepoint.barrier op for each + stream.timepoint.join to get it. This lets us be able to track from a particular resource through the timeline to where it was created, effectively using the use-def chain of the timepoints as the way of tracking liveness (the full slice of the use-def chain has all the alloca/dealloca on it).
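The barrier + join insertion during lowering might look like this sketch (syntax approximate; %r0/%r1 hypothetical resources annotated by one hal.tensor.transients op):

```mlir
// Each annotated resource gets a barrier to materialize the timepoint at
// which it is available; the join becomes the await operand of the new
// stream.resource.transients op, so liveness is trackable via timepoint
// use-def chains.
%r0_sync, %r0_tp = stream.timepoint.barrier %r0 : !stream.resource<external>{%r0_size} => !stream.timepoint
%r1_sync, %r1_tp = stream.timepoint.barrier %r1 : !stream.resource<external>{%r1_size} => !stream.timepoint
%join_tp = stream.timepoint.join max(%r0_tp, %r1_tp) => !stream.timepoint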

This lets us see that here that the %result transient comes from %dealloca_timepoint which is a dealloca of %transient, that comes from %exec_timepoint, which comes from %alloca_timepoint which is an alloca of %transient:

util.func public @my_fn(...) {
  %result, %result_timepoint = stream.resource.alloca ... <external>{%result_size}
  %transient, %alloca_timepoint = stream.resource.alloca ... <transient>{%transient_size}
  %exec_timepoint = stream.cmd.execute await(%alloca_timepoint) with(%transient) { ... }
  %dealloca_timepoint = stream.resource.dealloca await(%exec_timepoint) %transient
  %result_annotated, %transient_timepoint = stream.resource.transients storage(%transients) await(%dealloca_timepoint) %result
}

Instead of building a liveness range over the IR ops in the function, we can build a liveness range over the asynchronous timeline represented by all the timepoints (globally number all timepoints, define an order with overlap). This of course doesn't work in all programs but those are the ones we must verify before we attempt to do this.

With that, we can cluster each allocated value by which transients they are bound to (if any). That gives us a map of user-provided transient -> every allocation in the program that uses it. From there, we can use the call graph to find all functions transitively called by all functions with transients as arguments, which lets us know who needs to have the new !stream.resource<transient> argument added to pass the user-provided buffer around.

We can also use this to check whether it's even possible: if the user did not ask for a size function then we don't need to do anything, but if they did we would error out when the size can't be calculated without runtime computation (a mutable global load, side-effecting external call, or load of a stream resource value anywhere on the chain) - otherwise, we know the exact use-def chain of every size value in the program. We want a full forward/backward slice in each function (starting at the leaves) so that we can hoist that math up into callers. By doing so, we give the callers a chance to fold in that math and then continue hoisting. By the end we'll have a big set of arithmetic (or a single folded constant) of everything that feeds in. To build the user-requested size functions we'd take every transient required by a public function and make a function that has whatever arguments are required (if any, like a !hal.buffer_view to get dynamic shape dimensions) and returns the size of each transient. Since the math is all pure we should always be able to hoist, as we already verified that unhoistable stuff was not present. In the end we have a function the user can call to get the size of the transient buffer they need for their given inputs.

We can use the call graph to verify that any function transitively called from a function that marks transients is only reachable from functions that either need a transient buffer themselves or assign one - this way we can safely change the function signatures to take a !stream.resource (+ size index) for their transients. We'll want to update all calls to pass in one of their subviews, which the callee then subviews again for its own internal transients.

With the liveness range over the timeline we can now put together stream.resource.pack operations that take the liveness ranges + size SSA values and produce a list of offsets + the total size of the transient memory required. We'd replace each stream.resource.alloca with a stream.resource.subview and replace the timepoint result with the alloca's await operand (or stream.timepoint.immediate) - same for the dealloca (no subview, just replace the timepoint). In a function where the stream.resource.transients op exists we know the SSA value of the user storage (%arg0, a hal.buffer.allocate result, etc) and can take a subview of that. In functions called from functions using transients we need to add new !stream.resource<transient> arguments throughout the call graph so that we can always get the value to place within.

The big thing to solve here is that we need to know the size early so we can build subviews for callees. We want to keep the stream.resource.pack ops where they are locally so we get all of the offsets, but we need to repeatedly hoist the total size calculation all the way up to the root through all call edges to know our total required size (so we can subview from that). So in the end, if there's a call graph of A->B->C, A will end up with the stream.resource.pack of itself (A) as well as those of B and C. This lets A nest the stream.resource.pack of B in itself (which contains the nest of C) to get the subview to pass to B; B nests the stream.resource.pack of C in itself to get the subview for C; and C just has its own. Any function with no callees needing transients is untouched. If there's anything we can't hoist (pure ops on values always provided by callers, or computed from caller-provided values/immutable globals, are fine) we should have failed to get to this point. This super-hoisting operation is going to be tricky but we do it in some other places and it'd be good to have a robust utility for it. Since the user already allocated the buffer we can just add an assert that it's the right size against the final computed value of A's total size (or whichever subset of resources ended up in that transient storage).
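The nesting in an A->B->C call graph might look like this sketch in A (syntax approximate; %a_size, %b_total_size, and @b are hypothetical):

```mlir
// In A: pack A's own slice plus the hoisted total required by B (which
// already includes everything C needs).
%total_a, %off_a, %off_b = stream.resource.pack slices({
  [0, 1] = %a_size,       // A's own transient
  [1, 2] = %b_total_size  // B's (and transitively C's) hoisted total
}) : index
// Pass B a subview of A's storage; B subviews again for C internally.
%b_storage = stream.resource.subview %storage[%off_b] : !stream.resource<transient>{%total_a} -> !stream.resource<transient>{%b_total_size}
util.call @b(%args, %b_storage, %b_total_size) : (..., !stream.resource<transient>, index) -> ...
```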

We can probably handle some dynamic cases if we use integer range analysis (such as that provided by our util.assume.int ops). When hoisting something that relies on an unhoistable value, if we have an analysis giving it a max value we could use that instead. This lets users have dynamic values produced by side-effecting functions or device readback by telling us the expected range, and we just overallocate.
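For example (syntax approximate; @query_dynamic_dim is a hypothetical side-effecting call): the user asserts a bound on a runtime-produced value, and the size hoisting can substitute the umax to conservatively overallocate.

```mlir
// The dim itself is unhoistable (side-effecting call), but the assumed
// range is: size hoisting can use umax = 4096 in place of %dim.
%dim = util.call @query_dynamic_dim() : () -> index
%bounded = util.assume.int %dim<umin = 1, umax = 4096> : index
```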

The final user feature is an optional new public function we create with that compacted transient size calculation. This new function would take whatever were required for the original function (buffer views to query their dimensions, etc) and produce one size value for every transient used by the function. We'd just stamp out the transient size stream.resource.pack nests as we did before and since they aren't reliant on any of the actual tensor data there should never be any device execution or allocation performed - just arith ops.
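A generated size query function might look like this sketch (syntax approximate; the function name and element size are hypothetical):

```mlir
// Takes the same shape-bearing inputs as @my_fn and returns the required
// transient size; pure arith only, no device execution or allocation.
util.func public @my_fn__transient_size(%arg0: !hal.buffer_view) -> index {
  %c4 = arith.constant 4 : index  // f32 element size
  %dim = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index
  %size0 = arith.muli %dim, %c4 : index
  %total, %off0 = stream.resource.pack slices({
    [0, 1] = %size0
  }) : index
  util.return %total : index
}
```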

We'll want to try to split up this analysis and phases to try to make this all debuggable. I'm not quite sure how because the analysis will be so expensive and we don't want to muck up the IR too much and create trash between passes. The most expensive analysis is the initial SSA resource <-> SSA transient storage map and the timeline liveness analysis. We could use timeline liveness analysis in other passes too for things so it may be worth splitting out a DFX attribute for it that we'd be able to reuse (then we can also test that complex part independently in a test pass).

Example analysis (it'd be nice to share if possible):

  • /home/ben/src/iree/compiler/src/iree/compiler/Dialect/Stream/Transforms/ElideTimepoints.cpp

Basic steps:

  • Transient storage analysis:
    • For each stream.resource.transient op in the program:
      • Seed solver with the await and result timepoints
    • Use the Explorer to walk all defining/using timeline ops in a worklist:
      • Seed each stream.resource.alloca/stream.resource.dealloca op resource
    • Run the solver to compute:
      • For each alloca SSA result resource value:
        • Which ops deallocate it
        • Which transient storage SSA value it came from (anywhere in the program)
      • Whatever else we need for liveness analysis besides alloca/dealloca
  • Timeline liveness analysis:
    • Using the transient storage analysis for alloca/dealloca information
    • Walks timepoints to compute for each scope it can
    • Produces a set of scopes per region for specific transients, for each:
      • Where it should be inserted in the IR
      • What alloca ops it covers
      • Which transient they come from (maybe calculated - to make the analysis useful to other passes who don't care about transients)
  • VerifyTransientStoragePass
    • Run analysis and verify the program is possible to emplace
    • This is our user-visible verification pass
    • Use markAnalysisPreserved() since they are still valid for live range info
  • AnnotateTransientStoragePass
    • Optional debugging pass like AnnotateAffinitiesPass
    • Uses analysis to add informational attributes to each relevant op
      • Functions/scopes
      • Call sites
      • Alloca/dealloca ops
      • stream.resource.transient ops
    • This is our primary test pass (as we can get the exact results of the analysis for CHECK tests)
    • Use markAnalysisPreserved() since they are still valid for live range info
  • EmplaceTransientsPass (name open to suggestions if 'emplace' is the wrong verb)
    • Uses CallGraph + the scope analysis to identify all functions that use a transient that is non-local to them
    • Adds !stream.resource<transient>, index args to each function needing one (passing the size of the resource)
    • For each scope:
      • Inserts stream.resource.pack op using the liveness ranges/sizes from analysis
      • Replaces all stream.resource.alloca ops with stream.resource.subview from the stream.resource.pack
      • Replaces alloca/dealloca timepoint results with their await operands (handling immediate if needed)
      • For each call adds the subview of the required transient and its subview size calculated by the stream.resource.pack
    • Use markAnalysisPreserved() since they are still valid for live range info
  • MaterializeTransientSizeQueriesPass
    • Uses analysis to find the public functions that take transients as arguments
    • Creates new functions with special iree.abi.reflection annotation taking required args for the top-level transient ops
      • Each function gets the stream.resource.pack inserted for its top-level transients op
      • Returns the size of each transient
      • Order defined by some deterministic program order (order of the storage buffer argument in the function argument list, or something)
    • Use markAnalysisPreserved() since they are still valid for live range info

After all this the LayoutSlicesPass will run and turn the stream.resource.pack ops into a bunch of math. We'll run CSE/canonicalization to try to fold everything. In any case without dynamic sizes all the functions will fold to returning a single arith.constant - if they are dynamic the IR will at least be simpler.
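For a fully static program the query function should fold all the way down to something like this (constant value hypothetical):

```mlir
// After LayoutSlicesPass + CSE/canonicalization with static shapes:
util.func public @my_fn__transient_size() -> index {
  %c8448 = arith.constant 8448 : index
  util.return %c8448 : index
}
```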

There's then one final pass after all that:

  • AnnotateConstantTransientSizePass
    • For every new transient size query function:
      • If it folded to a constant:
        • Add iree.abi.reflection value for the constant value

Then if a user wants to fast path they can just check the reflection data at runtime and if it's constant not bother calling into the VM.

I want a detailed writeup of this plan with progress checkboxes on each phase so we can check them off as we go. I want to frontload everything besides the analysis/passes in stream so we'll first setup the iree.abi.transients support, add the hal.tensor.transients op (+ folders/etc), add tests for both, then add the stream.resource.transients op and hal->stream.resource.transients conversion, add tests for both, and then get into the meaty phase.

The primary phase will start with the analysis and AnnotateTransientStoragePass/VerifyTransientStoragePass - once we are sure we can calculate all the information we need we can build out the passes. We'll want to at least design the main passes in pseudo code to make sure our analysis provides the right information.

Final phase will be implementing the EmplaceTransientsPass and MaterializeTransientSizeQueriesPass. AnnotateConstantTransientSizePass will be an easy pass to add after that.
