@AmosLewis
Created October 20, 2022 00:56
OVERVIEW: MLIR modular optimizer driver
Available Dialects: acc, affine, amdgpu, amx, arith, arm_neon, arm_sve, async, bufferization, builtin, cf, complex, dlti, emitc, func, gpu, linalg, llvm, math, memref, ml_program, nvgpu, nvvm, omp, pdl, pdl_interp, quant, rocdl, scf, shape, sparse_tensor, spirv, tensor, tm_tensor, torch, torch_c, tosa, transform, vector, x86vector
USAGE: torch-mlir-opt [options] <input file>
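A typical invocation passes one or more of the pass flags listed below; this is an illustrative sketch (`input.mlir` and `output.mlir` are placeholder filenames):

```
# Run a single conversion pass on an input file (filenames are hypothetical)
torch-mlir-opt --convert-torch-to-linalg input.mlir -o output.mlir
```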
OPTIONS:
Color Options:
--color - Use colors in output (default=autodetect)
General options:
--allow-unregistered-dialect - Allow operation with no registered dialects
--disable-i2p-p2i-opt - Disables inttoptr/ptrtoint roundtrip optimization
--dot-cfg-mssa=<file name for generated dot file> - file name for generated dot file
--emit-bytecode - Emit bytecode when generating output
--generate-merged-base-profiles - When generating nested context-sensitive profiles, always generate extra base profile for function with all its context profiles merged into it.
--mlir-debug-counter=<string> - Comma separated list of debug counter skip and count arguments
--mlir-disable-threading - Disable multi-threading within MLIR, overrides any further call to MLIRContext::enableMultiThreading()
--mlir-elide-elementsattrs-if-larger=<uint> - Elide ElementsAttrs with "..." that have more elements than the given upper limit
--mlir-pass-pipeline-crash-reproducer=<string> - Generate a .mlir reproducer file at the given output path if the pass manager crashes or fails
--mlir-pass-pipeline-local-reproducer - When generating a crash reproducer, attempt to generate a reproducer with the smallest pipeline.
--mlir-pass-statistics - Display the statistics of each pass
--mlir-pass-statistics-display=<value> - Display method for pass statistics
=list - display the results in a merged list sorted by pass name
=pipeline - display the results with a nested pipeline view
--mlir-pretty-debuginfo - Print pretty debug info in MLIR output
--mlir-print-debug-counter - Print out debug counter information after all counters have been accumulated
--mlir-print-debuginfo - Print debug info in MLIR output
--mlir-print-elementsattrs-with-hex-if-larger=<long> - Print DenseElementsAttrs with a hex string that have more elements than the given upper limit (use -1 to disable)
--mlir-print-ir-after=<pass-arg> - Print IR after specified passes
--mlir-print-ir-after-all - Print IR after each pass
--mlir-print-ir-after-change - When printing the IR after a pass, only print if the IR changed
--mlir-print-ir-after-failure - When printing the IR after a pass, only print if the pass failed
--mlir-print-ir-before=<pass-arg> - Print IR before specified passes
--mlir-print-ir-before-all - Print IR before each pass
--mlir-print-ir-module-scope - When printing IR for print-ir-[before|after]{-all} always print the top-level operation
--mlir-print-local-scope - Print with local scope and inline information (eliding aliases for attributes, types, and locations)
--mlir-print-op-on-diagnostic - When a diagnostic is emitted on an operation, also print the operation as an attached note
--mlir-print-stacktrace-on-diagnostic - When a diagnostic is emitted, also print the stack trace as an attached note
--mlir-print-value-users - Print users of operation results and block arguments as a comment
--mlir-timing - Display execution times
--mlir-timing-display=<value> - Display method for timing data
=list - display the results in a list sorted by total time
=tree - display the results with a nested tree view
--no-implicit-module - Disable implicit addition of a top-level module op during parsing
-o <filename> - Output filename
--opaque-pointers - Use opaque pointers
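The IR-printing and statistics flags above combine naturally when debugging a pass pipeline; a sketch (the input filename is a placeholder):

```
# Show the IR after each pass, but only when a pass actually changed it,
# and report per-pass statistics at the end (filename is hypothetical)
torch-mlir-opt --mlir-print-ir-after-all --mlir-print-ir-after-change \
    --mlir-pass-statistics --canonicalize input.mlir
```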
Compiler passes to run
--pass-pipeline - A textual description of a pass pipeline to run
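The textual description names passes and nests them under the operation they anchor on; the exact nesting syntax varies between MLIR versions, so this is a sketch rather than a guaranteed form:

```
# Run canonicalize then cse as a textual pipeline (syntax may differ by MLIR version)
torch-mlir-opt --pass-pipeline='builtin.module(canonicalize,cse)' input.mlir
```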
Passes:
--affine-data-copy-generate - Generate explicit copying for affine memory operations
--fast-mem-capacity=<ulong> - Set fast memory space capacity in KiB (default: unlimited)
--fast-mem-space=<uint> - Fast memory space identifier for copy generation (default: 1)
--generate-dma - Generate DMA instead of point-wise copy
--min-dma-transfer=<int> - Minimum DMA transfer size supported by the target in bytes
--skip-non-unit-stride-loops - Testing purposes: avoid non-unit stride loop choice depths for copy placement
--slow-mem-space=<uint> - Slow memory space identifier for copy generation (default: 0)
--tag-mem-space=<uint> - Tag memory space identifier for copy generation (default: 0)
--affine-expand-index-ops - Lower affine operations operating on indices into more fundamental operations
--affine-loop-coalescing - Coalesce nested loops with independent bounds into a single loop
--affine-loop-fusion - Fuse affine loop nests
--fusion-compute-tolerance=<number> - Fractional increase in additional computation tolerated while fusing
--fusion-fast-mem-space=<uint> - Faster memory space number to promote fusion buffers to
--fusion-local-buf-threshold=<ulong> - Threshold size (KiB) for promoting local buffers to fast memory space
--fusion-maximal - Enables maximal loop fusion
--mode=<value> - fusion mode to attempt
=greedy - Perform greedy (both producer-consumer and sibling) fusion
=producer - Perform only producer-consumer fusion
=sibling - Perform only sibling fusion
--affine-loop-invariant-code-motion - Hoist loop invariant instructions outside of affine loops
--affine-loop-normalize - Apply normalization transformations to affine loop-like ops
--affine-loop-tile - Tile affine loop nests
--cache-size=<ulong> - Set size of cache to tile for in KiB (default: 512)
--separate - Separate full and partial tiles (default: false)
--tile-size=<uint> - Use this tile size for all loops
--tile-sizes=<uint> - List of tile sizes for each perfect nest (overridden by -tile-size)
--affine-loop-unroll - Unroll affine loops
--cleanup-unroll - Fully unroll the cleanup loop when possible.
--unroll-factor=<uint> - Use this unroll factor for all loops being unrolled
--unroll-full - Fully unroll loops
--unroll-full-threshold=<uint> - Unroll all loops with trip count less than or equal to this
--unroll-num-reps=<uint> - Unroll innermost loops repeatedly this many times
--unroll-up-to-factor - Allow unrolling up to the factor specified
--affine-loop-unroll-jam - Unroll and jam affine loops
--unroll-jam-factor=<uint> - Use this unroll jam factor for all loops (default 4)
--affine-parallelize - Convert affine.for ops into 1-D affine.parallel
--max-nested=<uint> - Maximum number of nested parallel loops to produce. Defaults to unlimited (UINT_MAX).
--parallel-reductions - Whether to parallelize reduction loops. Defaults to false.
--affine-pipeline-data-transfer - Pipeline non-blocking data transfers between explicitly managed levels of the memory hierarchy
--affine-scalrep - Replace affine memref accesses by scalars by forwarding stores to loads and eliminating redundant loads
--affine-simplify-structures - Simplify affine expressions in maps/sets and normalize memrefs
--affine-super-vectorize - Vectorize to a target independent n-D vector abstraction
--test-fastest-varying=<long> - Specify a 1-D, 2-D or 3-D pattern of fastest varying memory dimensions to match. See defaultPatterns in Vectorize.cpp for a description and examples. This is used for testing purposes
--vectorize-reductions - Vectorize known reductions expressed via iter_args. Switched off by default.
--virtual-vector-size=<long> - Specify an n-D virtual vector size for vectorization
--arith-bufferize - Bufferize Arith dialect ops.
--alignment=<uint> - Create global memrefs with a specified alignment
--arith-emulate-wide-int - Emulate 2*N-bit integer operations using N-bit operations
--widest-int-supported=<uint> - Widest integer type supported by the target
--arith-expand - Legalize Arith ops to be convertible to LLVM.
--arith-unsigned-when-equivalent - Replace signed ops with unsigned ones where they are proven equivalent
--arm-neon-2d-to-intr - Convert Arm NEON structured ops to intrinsics
--async-parallel-for - Convert scf.parallel operations to multiple async compute ops executed concurrently for non-overlapping iteration ranges
--async-dispatch - Dispatch async compute tasks using recursive work splitting. If `false`, async compute tasks will be launched using a simple for loop in the caller thread.
--min-task-size=<int> - The minimum task size for sharding parallel operation.
--num-workers=<int> - The number of available workers to execute async operations. If `-1` the value will be retrieved from the runtime.
--async-runtime-policy-based-ref-counting - Policy based reference counting for Async runtime operations
--async-runtime-ref-counting - Automatic reference counting for Async runtime operations
--async-runtime-ref-counting-opt - Optimize automatic reference counting operations for the Async runtime by removing redundant operations
--async-to-async-runtime - Lower high level async operations (e.g. async.execute) to the explicit async.runtime and async.coro operations
--eliminate-blocking-await-ops - Rewrite functions with blocking async.runtime.await as coroutines with async.runtime.await_and_resume.
--buffer-deallocation - Adds all required dealloc operations for all allocations in the input program
--buffer-hoisting - Optimizes placement of allocation operations by moving them into common dominators and out of nested regions
--buffer-loop-hoisting - Optimizes placement of allocation operations by moving them out of loop nests
--buffer-results-to-out-params - Converts memref-typed function results to out-params
--bufferization-bufferize - Bufferize the `bufferization` dialect
--canonicalize - Canonicalize operations
--disable-patterns=<string> - Labels of patterns that should be filtered out during application
--enable-patterns=<string> - Labels of patterns that should be used during application, all other patterns are filtered out
--max-iterations=<long> - Max. iterations between applying patterns / simplifying regions
--region-simplify - Perform control flow optimizations to the region tree
--top-down - Seed the worklist in general top-down order
--control-flow-sink - Sink operations into conditional blocks
--convert-affine-for-to-gpu - Convert top-level AffineFor Ops to GPU kernels
--gpu-block-dims=<uint> - Number of GPU block dimensions for mapping
--gpu-thread-dims=<uint> - Number of GPU thread dimensions for mapping
--convert-amdgpu-to-rocdl - Convert AMDGPU dialect to ROCDL dialect
--chipset=<string> - Chipset that these operations will run on
--convert-arith-to-llvm - Convert Arith dialect to LLVM dialect
--index-bitwidth=<uint> - Bitwidth of the index type, 0 to use size of machine word
--convert-arith-to-spirv - Convert Arith dialect to SPIR-V dialect
--emulate-non-32-bit-scalar-types - Emulate non-32-bit scalar types with 32-bit ones if missing native support
--enable-fast-math - Enable fast math mode (assuming no NaN and infinity for floating point values) when performing conversion
--convert-async-to-llvm - Convert the operations from the async dialect into the LLVM dialect
--convert-bufferization-to-memref - Convert operations from the Bufferization dialect to the MemRef dialect
--convert-cf-to-llvm - Convert ControlFlow operations to the LLVM dialect
--index-bitwidth=<uint> - Bitwidth of the index type, 0 to use size of machine word
--convert-cf-to-spirv - Convert ControlFlow dialect to SPIR-V dialect
--emulate-non-32-bit-scalar-types - Emulate non-32-bit scalar types with 32-bit ones if missing native support
--convert-complex-to-libm - Convert Complex dialect to libm calls
--convert-complex-to-llvm - Convert Complex dialect to LLVM dialect
--convert-complex-to-standard - Convert Complex dialect to standard dialect
--convert-elementwise-to-linalg - Convert ElementwiseMappable ops to linalg
--convert-func-to-llvm - Convert from the Func dialect to the LLVM dialect
--data-layout=<string> - String description (LLVM format) of the data layout that is expected on the produced module
--index-bitwidth=<uint> - Bitwidth of the index type, 0 to use size of machine word
--use-bare-ptr-memref-call-conv - Replace FuncOp's MemRef arguments with bare pointers to the MemRef element types
--convert-func-to-spirv - Convert Func dialect to SPIR-V dialect
--emulate-non-32-bit-scalar-types - Emulate non-32-bit scalar types with 32-bit ones if missing native support
--convert-gpu-launch-to-vulkan-launch - Convert gpu.launch_func to vulkanLaunch external call
--convert-gpu-to-nvvm - Generate NVVM operations for gpu operations
--index-bitwidth=<uint> - Bitwidth of the index type, 0 to use size of machine word
--convert-gpu-to-rocdl - Generate ROCDL operations for gpu operations
--chipset=<string> - Chipset that these operations will run on
--index-bitwidth=<uint> - Bitwidth of the index type, 0 to use size of machine word
--runtime=<value> - Runtime code will be run on (default is Unknown, can also use HIP or OpenCL)
=unknown - Unknown (default)
=HIP - HIP
=OpenCL - OpenCL
--use-bare-ptr-memref-call-conv - Replace memref arguments in GPU functions with bare pointers. All memrefs must have static shape
--convert-gpu-to-spirv - Convert GPU dialect to SPIR-V dialect
--convert-linalg-to-affine-loops - Lower the operations from the linalg dialect into affine loops
--convert-linalg-to-llvm - Convert the operations from the linalg dialect into the LLVM dialect
--convert-linalg-to-loops - Lower the operations from the linalg dialect into loops
--convert-linalg-to-parallel-loops - Lower the operations from the linalg dialect into parallel loops
--convert-linalg-to-spirv - Convert Linalg dialect to SPIR-V dialect
--convert-linalg-to-std - Convert the operations from the linalg dialect into the Standard dialect
--convert-math-to-funcs - Convert Math operations to calls of outlined implementations.
--convert-math-to-libm - Convert Math dialect to libm calls
--convert-math-to-llvm - Convert Math dialect to LLVM dialect
--convert-math-to-spirv - Convert Math dialect to SPIR-V dialect
--convert-memref-to-llvm - Convert operations from the MemRef dialect to the LLVM dialect
--index-bitwidth=<uint> - Bitwidth of the index type, 0 to use size of machine word
--use-aligned-alloc - Use aligned_alloc in place of malloc for heap allocations
--use-generic-functions - Use generic allocation and deallocation functions instead of the classic 'malloc', 'aligned_alloc' and 'free' functions
--convert-memref-to-spirv - Convert MemRef dialect to SPIR-V dialect
--bool-num-bits=<int> - The number of bits to store a boolean value
--convert-nvgpu-to-nvvm - Convert NVGPU dialect to NVVM dialect
--convert-openacc-to-llvm - Convert the OpenACC ops to LLVM dialect
--convert-openacc-to-scf - Convert the OpenACC ops to OpenACC with SCF dialect
--convert-openmp-to-llvm - Convert the OpenMP ops to OpenMP ops with LLVM dialect
--convert-parallel-loops-to-gpu - Convert mapped scf.parallel ops to gpu launch operations
--convert-pdl-to-pdl-interp - Convert PDL ops to PDL interpreter ops
--convert-scf-to-cf - Convert SCF dialect to ControlFlow dialect, replacing structured control flow with a CFG
--convert-scf-to-openmp - Convert SCF parallel loop to OpenMP parallel + workshare constructs.
--convert-scf-to-spirv - Convert SCF dialect to SPIR-V dialect.
--convert-shape-constraints - Convert shape constraint operations to the standard dialect
--convert-shape-to-std - Convert operations from the shape dialect into the standard dialect
--convert-spirv-to-llvm - Convert SPIR-V dialect to LLVM dialect
--convert-tensor-to-linalg - Convert some Tensor dialect ops to Linalg dialect
--convert-tensor-to-spirv - Convert Tensor dialect to SPIR-V dialect
--emulate-non-32-bit-scalar-types - Emulate non-32-bit scalar types with 32-bit ones if missing native support
--convert-torch-to-arith - Convert recognized Torch ops to Std ops
--convert-torch-to-linalg - Convert recognized Torch ops to Linalg ops
--convert-torch-to-mhlo - Convert Torch ops to MHLO ops
--enable-i32-index - Enable truncating index from i64 to i32 (unsafely)
--enable-static-shape - Enable static shape conversion
--convert-torch-to-scf - Convert recognized Torch ops to SCF ops
--convert-torch-to-tmtensor - Convert recognized Torch ops to TMTensor/Linalg ops
--convert-torch-to-tosa - Convert Torch ops to TOSA ops
--convert-vector-to-gpu - Lower the operations from the vector dialect into the GPU dialect
--use-nvgpu - convert to NvGPU ops instead of GPU dialect ops
--convert-vector-to-llvm - Lower the operations from the vector dialect into the LLVM dialect
--enable-amx - Enables the use of AMX dialect while lowering the vector dialect.
--enable-arm-neon - Enables the use of ArmNeon dialect while lowering the vector dialect.
--enable-arm-sve - Enables the use of ArmSVE dialect while lowering the vector dialect.
--enable-x86vector - Enables the use of X86Vector dialect while lowering the vector dialect.
--force-32bit-vector-indices - Allows compiler to assume vector indices fit in 32-bit if that yields faster code
--reassociate-fp-reductions - Allows llvm to reassociate floating-point reductions for speed
--convert-vector-to-scf - Lower the operations from the vector dialect into the SCF dialect
--full-unroll - Perform full unrolling when converting vector transfers to SCF
--lower-permutation-maps - Replace permutation maps with vector transposes/broadcasts before lowering transfer ops
--lower-tensors - Lower transfer ops that operate on tensors
--target-rank=<uint> - Target vector rank to which transfer ops should be lowered
--convert-vector-to-spirv - Convert Vector dialect to SPIR-V dialect
--cse - Eliminate common sub-expressions
--decorate-spirv-composite-type-layout - Decorate SPIR-V composite type with layout info
--drop-equivalent-buffer-results - Remove MemRef return values that are equivalent to a bbArg
--eliminate-alloc-tensors - Try to eliminate all alloc_tensor ops.
--empty-tensor-to-alloc-tensor - Replace all empty ops by alloc_tensor ops.
--finalizing-bufferize - Finalize a partial bufferization
--fold-memref-alias-ops - Fold memref alias ops into consumer load/store ops
--func-bufferize - Bufferize func/call/return ops
--gpu-async-region - Make GPU ops async
--gpu-kernel-outlining - Outline gpu.launch bodies to kernel functions
--data-layout-str=<string> - String containing the data layout specification to be attached to the GPU kernel module
--gpu-launch-sink-index-computations - Sink index computations into gpu.launch body
--gpu-map-parallel-loops - Greedily maps loops to GPU hardware dimensions.
--gpu-to-llvm - Convert GPU dialect to LLVM dialect with GPU runtime calls
--gpu-binary-annotation=<string> - Annotation attribute string for GPU binary
--use-bare-pointers-for-kernels - Use bare pointers to pass memref arguments to kernels. The kernel must use the same setting for this option.
--hlo-legalize-to-linalg - Legalize from HLO dialect to Linalg dialect.
--inline - Inline function calls
--default-pipeline=<string> - The default optimizer pipeline used for callables
--max-iterations=<uint> - Maximum number of iterations when inlining within an SCC
--op-pipelines=<pass-manager> - Callable operation specific optimizer pipelines (in the form of `dialect.op(pipeline)`)
--launch-func-to-vulkan - Convert vulkanLaunch external call to Vulkan runtime external calls
--linalg-bufferize - Bufferize the linalg dialect
--linalg-detensorize - Detensorize linalg ops
--aggressive-mode - Detensorize all ops that qualify for detensoring along with branch operands and basic-block arguments.
--linalg-fold-unit-extent-dims - Remove unit-extent dimension in Linalg ops on tensors
--fold-one-trip-loops-only - Only folds the one-trip loops from Linalg ops on tensors (for testing purposes only)
--linalg-fuse-elementwise-ops - Fuse elementwise operations on tensors
--linalg-generalize-named-ops - Convert named ops into generic ops
--linalg-inline-scalar-operands - Inline scalar operands into linalg generic ops
--linalg-named-op-conversion - Convert from one named linalg op to another.
--llvm-legalize-for-export - Legalize LLVM dialect to be convertible to LLVM IR
--llvm-optimize-for-nvvm-target - Optimize NVVM IR
--llvm-request-c-wrappers - Request C wrapper emission for all functions
--loop-invariant-code-motion - Hoist loop invariant instructions outside of the loop
--lower-affine - Lower Affine operations to a combination of Standard and SCF operations
--lower-host-to-llvm - Lowers the host module code and `gpu.launch_func` to LLVM
--map-memref-spirv-storage-class - Map numeric MemRef memory spaces to SPIR-V storage classes
--client-api=<string> - The client API to use for populating mappings
--memref-emulate-wide-int - Emulate 2*N-bit integer operations using N-bit operations
--widest-int-supported=<uint> - Widest integer type supported by the target
--memref-expand - Legalize memref operations to be convertible to LLVM.
--normalize-memrefs - Normalize memrefs
--nvgpu-optimize-shared-memory - Optimizes accesses to shared memory memrefs in order to reduce bank conflicts.
--one-shot-bufferize - One-Shot Bufferize
--allow-return-allocs - Allows returning/yielding new allocations from a block.
--allow-unknown-ops - Allows unknown (not bufferizable) ops in the input IR.
--analysis-fuzzer-seed=<uint> - Test only: Analyze ops in random order with a given seed (fuzzer)
--analysis-heuristic=<string> - Heuristic that controls the IR traversal during analysis
--bufferize-function-boundaries - Bufferize function boundaries (experimental).
--copy-before-write - Skip the analysis. Make a buffer copy on every write.
--create-deallocs - Specify if buffers should be deallocated. For compatibility with core bufferization passes.
--dialect-filter=<string> - Restrict bufferization to ops from these dialects.
--function-boundary-type-conversion=<string> - Controls layout maps when bufferizing function signatures.
--must-infer-memory-space - The memory space of memref types must always be inferred. If unset, a default memory space of 0 is used.
--print-conflicts - Test only: Annotate IR with RaW conflicts. Requires test-analysis-only.
--test-analysis-only - Test only: Only run inplaceability analysis and annotate IR
--unknown-type-conversion=<string> - Controls layout maps for non-inferrable memref types.
--outline-shape-computation - Using shape.func to preserve shape computation
--print-op-stats - Print statistics of operations
--json - print the stats as JSON
--promote-buffers-to-stack - Promotes heap-based allocations to automatically managed stack-based allocations
--max-alloc-size-in-bytes=<uint> - Maximal size in bytes to promote allocations to stack.
--max-rank-of-allocated-memref=<uint> - Maximal memref rank to promote dynamic buffers.
--reconcile-unrealized-casts - Simplify and eliminate unrealized conversion casts
--refback-expand-ops-for-llvm - Expand ops into more primitive ops before LLVM lowering.
--refback-generalize-tensor-pad - Convert tensor.pad to linalg ops
--refback-insert-rng-globals - Insert global variables and sequence to get the next global seed for RNG ops
--refback-munge-calling-conventions - Munge calling conventions for calling via ExecutionEngine
--refback-munge-memref-copy - Munge memref.copy to linalg.copy
--remove-shape-constraints - Replace all cstr_ ops with a true witness
--resolve-ranked-shaped-type-result-dims - Resolve memref.dim of result values of ranked shape type
--resolve-shaped-type-result-dims - Resolve memref.dim of result values
--sccp - Sparse Conditional Constant Propagation
--scf-bufferize - Bufferize the scf dialect.
--scf-for-loop-canonicalization - Canonicalize operations within scf.for loop bodies
--scf-for-loop-peeling - Peel `for` loops at their upper bounds.
--skip-partial - Do not peel loops inside of the last, partial iteration of another already peeled loop.
--scf-for-loop-range-folding - Fold add/mul ops into loop range
--scf-for-loop-specialization - Specialize `for` loops for vectorization
--scf-for-to-while - Convert SCF for loops to SCF while loops
--scf-parallel-loop-collapsing - Collapse parallel loops to use less induction variables
--collapsed-indices-0=<uint> - Which loop indices to combine into the position 0 loop index
--collapsed-indices-1=<uint> - Which loop indices to combine into the position 1 loop index
--collapsed-indices-2=<uint> - Which loop indices to combine into the position 2 loop index
--scf-parallel-loop-fusion - Fuse adjacent parallel loops
--scf-parallel-loop-specialization - Specialize parallel loops for vectorization
--scf-parallel-loop-tiling - Tile parallel loops
--no-min-max-bounds - Perform tiling with fixed upper bound with inbound check inside the internal loops
--parallel-loop-tile-sizes=<long> - Factors to tile parallel loops by
--shape-bufferize - Bufferize the shape dialect.
--shape-to-shape-lowering - Legalize Shape dialect to be convertible to Arith
--simplify-extract-strided-metadata - Simplify extract_strided_metadata ops
--snapshot-op-locations - Generate new locations from the current IR
--filename=<string> - The filename to print the generated IR
--tag=<string> - A tag to use when fusing the new locations with the original. If unset, the locations are replaced.
--sparse-buffer-rewrite - Rewrite sparse primitives on buffers to actual code
--sparse-tensor-codegen - Convert sparse tensors and primitives to actual code
--sparse-tensor-conversion - Convert sparse tensors and primitives to library calls
--s2s-strategy=<int> - Set the strategy for sparse-to-sparse conversion
--sparse-tensor-rewrite - Applies sparse tensor rewriting rules prior to sparsification
--enable-runtime-library - Enable runtime library for manipulating sparse tensors
--sparsification - Automatically generate sparse tensor code from sparse tensor types
--enable-runtime-library - Enable runtime library for manipulating sparse tensors
--enable-simd-index32 - Enable i32 indexing into vectors (for efficiency)
--enable-vla-vectorization - Enable vector length agnostic vectorization
--parallelization-strategy=<value> - Set the parallelization strategy
=none - Turn off sparse parallelization.
=dense-outer-loop - Enable dense outer loop sparse parallelization.
=any-storage-outer-loop - Enable sparse parallelization regardless of storage for the outer loop.
=dense-any-loop - Enable dense parallelization for any loop.
=any-storage-any-loop - Enable sparse parallelization for any storage and loop.
--vectorization-strategy=<value> - Set the vectorization strategy
=none - Turn off sparse vectorization.
=dense-inner-loop - Enable vectorization for dense inner loops.
=any-storage-inner-loop - Enable sparse vectorization for inner loops with any storage.
--vl=<int> - Set the vector length
--spirv-canonicalize-gl - Run canonicalization involving GLSL ops
--spirv-lower-abi-attrs - Decorate SPIR-V composite type with layout info
--spirv-rewrite-inserts - Rewrite sequential chains of spirv.CompositeInsert operations into spirv.CompositeConstruct operations
--spirv-unify-aliased-resource - Unify access of multiple aliased resources into access of one single resource
--spirv-update-vce - Deduce and attach minimal (version, capabilities, extensions) requirements to spirv.module ops
--strip-debuginfo - Strip debug info from all operations
--symbol-dce - Eliminate dead symbols
--symbol-privatize - Mark symbols private
--exclude=<string> - Comma separated list of symbols that should not be marked private
--symbolic-shape-optimization - Analyzes shapes and performs shape-related optimizations
--tensor-bufferize - Bufferize the `tensor` dialect
--tensor-copy-insertion - Make all tensor IR inplaceable by inserting copies
--allow-return-allocs - Allows returning/yielding new allocations from a block.
--bufferize-function-boundaries - Bufferize function boundaries (experimental).
--create-deallocs - Specify if new allocations should be deallocated.
--must-infer-memory-space - The memory space of memref types must always be inferred. If unset, a default memory space of 0 is used.
--tm-tensor-bufferize - Bufferize the TMTensor dialect
--tm-tensor-to-loops - Convert TMTensor ops to loops and Linalg ops.
--topological-sort - Sort regions without SSA dominance in topological order
--torch-adjust-calling-conventions - Adjust the calling conventions of functions
--torch-decompose-complex-ops - Decompose complicated torch operations
--legal-ops=<string> - List of operation names that should be considered legal
--torch-drop-shape-calculations - Drop reified shape calculations.
--torch-erase-module-initializer - Erase the `torch.global_slot.module_initializer` op.
--torch-finalizing-backend-type-conversion - Finalizes a partial conversion to builtin tensors
--torch-func-backend-type-conversion - Convert functions to operate on builtin tensors
--torch-globalize-object-graph - Converts TorchScript object graphs to a globalized form
--torch-inline-global-slots - Inlines torch.global_slot ops.
--torch-lower-to-backend-contract - Perform simplifications until the backend contract is satisfied.
--backend-legal-ops=<string> - List of ops to be considered legal for the backend.
--decompose - Decompose ops.
--max-iterations=<int> - Maximum number of invocations of the simplification pipeline.
--torch-maximize-value-semantics - Use value-semantic tensors where possible.
--torch-prepare-for-globalize-object-graph - Lowering in preparation for globalizing
--torch-reduce-op-variants - Reduces variants of ops to a smaller set of ops.
--torch-refine-public-return - Refine public return
--torch-refine-types - Refine types
--torch-reify-shape-calculations - Decompose complicated torch operations
--torch-simplify-shape-calculations - Simplify reified shape calculations.
--torch-verify-backend-contract - Check that program satisfies backend contract.
--torch-verify-linalg-on-tensors-backend-contract - Verifies conformity to the linalg-on-tensors backend contract
--torch-verify-mhlo-backend-contract - Verifies conformity to the mhlo backend contract
--torch-verify-tosa-backend-contract - Verifies conformity to the TOSA backend contract
--tosa-infer-shapes - Propagate shapes across TOSA operations
--tosa-layerwise-constant-fold - Fold layerwise operations on constant tensors
--tosa-make-broadcastable - Insert TOSA reshape ops to equalize ranks and enable broadcasting
--tosa-optional-decompositions - Applies Tosa operations optional decompositions
--tosa-to-arith - Lower TOSA to the Arith dialect
--include-apply-rescale - Whether to include the lowering for tosa.apply_rescale to arith
--use-32-bit - Whether to prioritize lowering to 32-bit operations
--tosa-to-linalg - Lower TOSA to LinAlg on tensors
--tosa-to-linalg-named - Lower TOSA to LinAlg named operations
--tosa-to-scf - Lower TOSA to the SCF dialect
--tosa-to-tensor - Lower TOSA to the Tensor dialect
--transform-dialect-check-uses - warn about potential use-after-free in the transform dialect
--vector-bufferize - Bufferize Vector dialect ops
--view-op-graph - Print Graphviz visualization of an operation
--max-label-len=<uint> - Limit attribute/type length to number of chars
--print-attrs - Print attributes of operations
--print-control-flow-edges - Print control flow edges
--print-data-flow-edges - Print data flow edges
--print-result-types - Print result types of operations
Pass Pipelines:
--sparse-compiler - The standard pipeline for taking sparsity-agnostic IR using the sparse-tensor type, and lowering it to LLVM IR with concrete representations and algorithms for sparse tensors.
--enable-amx - Enables the use of AMX dialect while lowering the vector dialect.
--enable-arm-neon - Enables the use of ArmNeon dialect while lowering the vector dialect.
--enable-arm-sve - Enables the use of ArmSVE dialect while lowering the vector dialect.
--enable-index-optimizations - Allows compiler to assume indices fit in 32-bit if that yields faster code
--enable-runtime-library - Enable runtime library for manipulating sparse tensors
--enable-simd-index32 - Enable i32 indexing into vectors (for efficiency)
--enable-vla-vectorization - Enable vector length agnostic vectorization
--enable-x86vector - Enables the use of X86Vector dialect while lowering the vector dialect.
--parallelization-strategy=<value> - Set the parallelization strategy
=none - Turn off sparse parallelization.
=dense-outer-loop - Enable dense outer loop sparse parallelization.
=any-storage-outer-loop - Enable sparse parallelization regardless of storage for the outer loop.
=dense-any-loop - Enable dense parallelization for any loop.
=any-storage-any-loop - Enable sparse parallelization for any storage and loop.
--reassociate-fp-reductions - Allows llvm to reassociate floating-point reductions for speed
--s2s-strategy=<int> - Set the strategy for sparse-to-sparse conversion
--test-bufferization-analysis-only - Run only the inplaceability analysis
--vectorization-strategy=<value> - Set the vectorization strategy
=none - Turn off sparse vectorization.
=dense-inner-loop - Enable vectorization for dense inner loops.
=any-storage-inner-loop - Enable sparse vectorization for inner loops with any storage.
--vl=<int> - Set the vector length
--torch-backend-to-linalg-on-tensors-backend-pipeline - Pipeline lowering torch backend contract to linalg-on-tensors backend contract.
--torch-backend-to-mhlo-backend-pipeline - Pipeline lowering torch backend contract to MHLO backend contract.
--enable-i32-index - Enable truncating index from i64 to i32 (unsafely)
--enable-static-shape - Enable static shape conversion.
--torch-backend-to-tosa-backend-pipeline - Pipeline lowering torch backend contract to TOSA backend contract.
--torch-function-to-torch-backend-pipeline - Pipeline lowering a Torch function to Torch backend form.
--backend-legal-ops=<string> - List of ops to be considered legal for the backend.
--decompose-complex-ops - Decompose complex operations.
--max-iterations=<int> - Maximum number of invocations of the simplification pipeline.
--torch-shape-refinement-pipeline - Pipeline refining shapes of tensors.
--torch-simplification-pipeline - Pipeline simplifying computations in the program.
--backend-legal-ops=<string> - List of ops to be considered legal for the backend.
--decompose-complex-ops - Decompose complex operations.
--max-iterations=<int> - Maximum number of invocations of the simplification pipeline.
--torchscript-module-to-torch-backend-pipeline - Pipeline lowering TorchScript object graph IR to Torch backend form.
--backend-legal-ops=<string> - List of ops to be considered legal for the backend.
--decompose-complex-ops - Decompose complex operations.
--max-iterations=<int> - Maximum number of invocations of the simplification pipeline.
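The pass pipelines above are registered as flags and can be chained like ordinary passes; a sketch lowering TorchScript object-graph IR down to the linalg-on-tensors contract (the input filename is a placeholder):

```
# Lower TorchScript IR to Torch backend form, then to the
# linalg-on-tensors backend contract (filename is hypothetical)
torch-mlir-opt --torchscript-module-to-torch-backend-pipeline \
    --torch-backend-to-linalg-on-tensors-backend-pipeline \
    model.mlir -o lowered.mlir
```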
--show-dialects - Print the list of registered dialects
--split-input-file - Split the input file into pieces and process each chunk independently
--verify-diagnostics - Check that emitted diagnostics match expected-* lines on the corresponding line
--verify-each - Run the verifier after each transformation pass
Generic Options:
--help - Display available options (--help-hidden for more)
--help-list - Display list of available options (--help-list-hidden for more)
--version - Display the version of this program