@AmosLewis
Created October 20, 2022 00:56
OVERVIEW: MLIR modular optimizer driver
Available Dialects: acc, affine, amdgpu, amx, arith, arm_neon, arm_sve, async, bufferization, builtin, cf, complex, dlti, emitc, func, gpu, linalg, llvm, math, memref, ml_program, nvgpu, nvvm, omp, pdl, pdl_interp, quant, rocdl, scf, shape, sparse_tensor, spirv, tensor, tm_tensor, torch, torch_c, tosa, transform, vector, x86vector
USAGE: torch-mlir-opt [options] <input file>
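A typical invocation passes one or more of the pass flags listed below; this is an illustrative sketch (`input.mlir` and `output.mlir` are placeholder filenames):

```
# Run a single conversion pass on an input file (filenames are hypothetical)
torch-mlir-opt --convert-torch-to-linalg input.mlir -o output.mlir
```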
OPTIONS:
Color Options:
--color - Use colors in output (default=autodetect)
General options:
--allow-unregistered-dialect - Allow operation with no registered dialects
--disable-i2p-p2i-opt - Disables inttoptr/ptrtoint roundtrip optimization
--dot-cfg-mssa=<file name for generated dot file> - file name for generated dot file
--emit-bytecode - Emit bytecode when generating output
--generate-merged-base-profiles - When generating nested context-sensitive profiles, always generate extra base profile for function with all its context profiles merged into it.
--mlir-debug-counter=<string> - Comma separated list of debug counter skip and count arguments
--mlir-disable-threading - Disable multi-threading within MLIR, overrides any further call to MLIRContext::enableMultiThreading()
--mlir-elide-elementsattrs-if-larger=<uint> - Elide ElementsAttrs with "..." that have more elements than the given upper limit
--mlir-pass-pipeline-crash-reproducer=<string> - Generate a .mlir reproducer file at the given output path if the pass manager crashes or fails
--mlir-pass-pipeline-local-reproducer - When generating a crash reproducer, attempt to generate a reproducer with the smallest pipeline.
--mlir-pass-statistics - Display the statistics of each pass
--mlir-pass-statistics-display=<value> - Display method for pass statistics
=list - display the results in a merged list sorted by pass name
=pipeline - display the results with a nested pipeline view
--mlir-pretty-debuginfo - Print pretty debug info in MLIR output
--mlir-print-debug-counter - Print out debug counter information after all counters have been accumulated
--mlir-print-debuginfo - Print debug info in MLIR output
--mlir-print-elementsattrs-with-hex-if-larger=<long> - Print DenseElementsAttrs with a hex string that have more elements than the given upper limit (use -1 to disable)
--mlir-print-ir-after=<pass-arg> - Print IR after specified passes
--mlir-print-ir-after-all - Print IR after each pass
--mlir-print-ir-after-change - When printing the IR after a pass, only print if the IR changed
--mlir-print-ir-after-failure - When printing the IR after a pass, only print if the pass failed
--mlir-print-ir-before=<pass-arg> - Print IR before specified passes
--mlir-print-ir-before-all - Print IR before each pass
--mlir-print-ir-module-scope - When printing IR for print-ir-[before|after]{-all} always print the top-level operation
--mlir-print-local-scope - Print with local scope and inline information (eliding aliases for attributes, types, and locations)
--mlir-print-op-on-diagnostic - When a diagnostic is emitted on an operation, also print the operation as an attached note
--mlir-print-stacktrace-on-diagnostic - When a diagnostic is emitted, also print the stack trace as an attached note
--mlir-print-value-users - Print users of operation results and block arguments as a comment
--mlir-timing - Display execution times
--mlir-timing-display=<value> - Display method for timing data
=list - display the results in a list sorted by total time
=tree - display the results with a nested tree view
--no-implicit-module - Disable implicit addition of a top-level module op during parsing
-o <filename> - Output filename
--opaque-pointers - Use opaque pointers
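The IR-printing and statistics flags above combine naturally when debugging a pass pipeline; a sketch (the input filename is a placeholder):

```
# Show the IR after each pass, but only when a pass actually changed it,
# and report per-pass statistics at the end (filename is hypothetical)
torch-mlir-opt --mlir-print-ir-after-all --mlir-print-ir-after-change \
    --mlir-pass-statistics --canonicalize input.mlir
```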
Compiler passes to run
--pass-pipeline - A textual description of a pass pipeline to run
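The textual description names passes and nests them under the operation they anchor on; the exact nesting syntax varies between MLIR versions, so this is a sketch rather than a guaranteed form:

```
# Run canonicalize then cse as a textual pipeline (syntax may differ by MLIR version)
torch-mlir-opt --pass-pipeline='builtin.module(canonicalize,cse)' input.mlir
```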
Passes:
--affine-data-copy-generate - Generate explicit copying for affine memory operations
--fast-mem-capacity=<ulong> - Set fast memory space capacity in KiB (default: unlimited)
--fast-mem-space=<uint> - Fast memory space identifier for copy generation (default: 1)
--generate-dma - Generate DMA instead of point-wise copy
--min-dma-transfer=<int> - Minimum DMA transfer size supported by the target in bytes
--skip-non-unit-stride-loops - Testing purposes: avoid non-unit stride loop choice depths for copy placement
--slow-mem-space=<uint> - Slow memory space identifier for copy generation (default: 0)
--tag-mem-space=<uint> - Tag memory space identifier for copy generation (default: 0)
--affine-expand-index-ops - Lower affine operations operating on indices into more fundamental operations
--affine-loop-coalescing - Coalesce nested loops with independent bounds into a single loop
--affine-loop-fusion - Fuse affine loop nests
--fusion-compute-tolerance=<number> - Fractional increase in additional computation tolerated while fusing
--fusion-fast-mem-space=<uint> - Faster memory space number to promote fusion buffers to
--fusion-local-buf-threshold=<ulong> - Threshold size (KiB) for promoting local buffers to fast memory space
--fusion-maximal - Enables maximal loop fusion
--mode=<value> - fusion mode to attempt
=greedy - Perform greedy (both producer-consumer and sibling) fusion
=producer - Perform only producer-consumer fusion
=sibling - Perform only sibling fusion
--affine-loop-invariant-code-motion - Hoist loop invariant instructions outside of affine loops
--affine-loop-normalize - Apply normalization transformations to affine loop-like ops
--affine-loop-tile - Tile affine loop nests
--cache-size=<ulong> - Set size of cache to tile for in KiB (default: 512)
--separate - Separate full and partial tiles (default: false)
--tile-size=<uint> - Use this tile size for all loops
--tile-sizes=<uint> - List of tile sizes for each perfect nest (overridden by -tile-size)
--affine-loop-unroll - Unroll affine loops
--cleanup-unroll - Fully unroll the cleanup loop when possible.
--unroll-factor=<uint> - Use this unroll factor for all loops being unrolled
--unroll-full - Fully unroll loops
--unroll-full-threshold=<uint> - Unroll all loops with trip count less than or equal to this
--unroll-num-reps=<uint> - Unroll innermost loops repeatedly this many times
--unroll-up-to-factor - Allow unrolling up to the factor specified
--affine-loop-unroll-jam - Unroll and jam affine loops
--unroll-jam-factor=<uint> - Use this unroll jam factor for all loops (default 4)
--affine-parallelize - Convert affine.for ops into 1-D affine.parallel
--max-nested=<uint> - Maximum number of nested parallel loops to produce. Defaults to unlimited (UINT_MAX).
--parallel-reductions - Whether to parallelize reduction loops. Defaults to false.
--affine-pipeline-data-transfer - Pipeline non-blocking data transfers between explicitly managed levels of the memory hierarchy
--affine-scalrep - Replace affine memref accesses by scalars by forwarding stores to loads and eliminating redundant loads
--affine-simplify-structures - Simplify affine expressions in maps/sets and normalize memrefs
--affine-super-vectorize - Vectorize to a target independent n-D vector abstraction
--test-fastest-varying=<long> - Specify a 1-D, 2-D or 3-D pattern of fastest varying memory dimensions to match. See defaultPatterns in Vectorize.cpp for a description and examples. This is used for testing purposes
--vectorize-reductions - Vectorize known reductions expressed via iter_args. Switched off by default.
--virtual-vector-size=<long> - Specify an n-D virtual vector size for vectorization
--arith-bufferize - Bufferize Arith dialect ops.
--alignment=<uint> - Create global memrefs with a specified alignment
--arith-emulate-wide-int - Emulate 2*N-bit integer operations using N-bit operations
--widest-int-supported=<uint> - Widest integer type supported by the target
--arith-expand - Legalize Arith ops to be convertible to LLVM.
--arith-unsigned-when-equivalent - Replace signed ops with unsigned ones where they are proven equivalent
--arm-neon-2d-to-intr - Convert Arm NEON structured ops to intrinsics
--async-parallel-for - Convert scf.parallel operations to multiple async compute ops executed concurrently for non-overlapping iteration ranges
--async-dispatch - Dispatch async compute tasks using recursive work splitting. If `false`, async compute tasks will be launched using a simple for loop in the caller thread.
--min-task-size=<int> - The minimum task size for sharding parallel operation.
--num-workers=<int> - The number of available workers to execute async operations. If `-1` the value will be retrieved from the runtime.
--async-runtime-policy-based-ref-counting - Policy based reference counting for Async runtime operations
--async-runtime-ref-counting - Automatic reference counting for Async runtime operations
--async-runtime-ref-counting-opt - Optimize automatic reference counting operations for the Async runtime by removing redundant operations
--async-to-async-runtime - Lower high level async operations (e.g. async.execute) to the explicit async.runtime and async.coro operations
--eliminate-blocking-await-ops - Rewrite functions with blocking async.runtime.await as coroutines with async.runtime.await_and_resume.
--buffer-deallocation - Adds all required dealloc operations for all allocations in the input program
--buffer-hoisting - Optimizes placement of allocation operations by moving them into common dominators and out of nested regions
--buffer-loop-hoisting - Optimizes placement of allocation operations by moving them out of loop nests
--buffer-results-to-out-params - Converts memref-typed function results to out-params
--bufferization-bufferize - Bufferize the `bufferization` dialect
--canonicalize - Canonicalize operations
--disable-patterns=<string> - Labels of patterns that should be filtered out during application
--enable-patterns=<string> - Labels of patterns that should be used during application, all other patterns are filtered out
--max-iterations=<long> - Max. iterations between applying patterns / simplifying regions
--region-simplify - Perform control flow optimizations to the region tree
--top-down - Seed the worklist in general top-down order
--control-flow-sink - Sink operations into conditional blocks
--convert-affine-for-to-gpu - Convert top-level AffineFor Ops to GPU kernels
--gpu-block-dims=<uint> - Number of GPU block dimensions for mapping
--gpu-thread-dims=<uint> - Number of GPU thread dimensions for mapping
--convert-amdgpu-to-rocdl - Convert AMDGPU dialect to ROCDL dialect
--chipset=<string> - Chipset that these operations will run on
--convert-arith-to-llvm - Convert Arith dialect to LLVM dialect
--index-bitwidth=<uint> - Bitwidth of the index type, 0 to use size of machine word
--convert-arith-to-spirv - Convert Arith dialect to SPIR-V dialect
--emulate-non-32-bit-scalar-types - Emulate non-32-bit scalar types with 32-bit ones if missing native support
--enable-fast-math - Enable fast math mode (assuming no NaN and infinity for floating point values) when performing conversion
--convert-async-to-llvm - Convert the operations from the async dialect into the LLVM dialect
--convert-bufferization-to-memref - Convert operations from the Bufferization dialect to the MemRef dialect
--convert-cf-to-llvm - Convert ControlFlow operations to the LLVM dialect
--index-bitwidth=<uint> - Bitwidth of the index type, 0 to use size of machine word
--convert-cf-to-spirv - Convert ControlFlow dialect to SPIR-V dialect
--emulate-non-32-bit-scalar-types - Emulate non-32-bit scalar types with 32-bit ones if missing native support
--convert-complex-to-libm - Convert Complex dialect to libm calls
--convert-complex-to-llvm - Convert Complex dialect to LLVM dialect
--convert-complex-to-standard - Convert Complex dialect to standard dialect
--convert-elementwise-to-linalg - Convert ElementwiseMappable ops to linalg
--convert-func-to-llvm - Convert from the Func dialect to the LLVM dialect
--data-layout=<string> - String description (LLVM format) of the data layout that is expected on the produced module
--index-bitwidth=<uint> - Bitwidth of the index type, 0 to use size of machine word
--use-bare-ptr-memref-call-conv - Replace FuncOp's MemRef arguments with bare pointers to the MemRef element types
--convert-func-to-spirv - Convert Func dialect to SPIR-V dialect
--emulate-non-32-bit-scalar-types - Emulate non-32-bit scalar types with 32-bit ones if missing native support
--convert-gpu-launch-to-vulkan-launch - Convert gpu.launch_func to vulkanLaunch external call
--convert-gpu-to-nvvm - Generate NVVM operations for gpu operations
--index-bitwidth=<uint> - Bitwidth of the index type, 0 to use size of machine word
--convert-gpu-to-rocdl - Generate ROCDL operations for gpu operations
--chipset=<string> - Chipset that these operations will run on
--index-bitwidth=<uint> - Bitwidth of the index type, 0 to use size of machine word
--runtime=<value> - Runtime code will be run on (default is Unknown, can also use HIP or OpenCL)
=unknown - Unknown (default)
=HIP - HIP
=OpenCL - OpenCL
--use-bare-ptr-memref-call-conv - Replace memref arguments in GPU functions with bare pointers. All memrefs must have static shape
--convert-gpu-to-spirv - Convert GPU dialect to SPIR-V dialect
--convert-linalg-to-affine-loops - Lower the operations from the linalg dialect into affine loops
--convert-linalg-to-llvm - Convert the operations from the linalg dialect into the LLVM dialect
--convert-linalg-to-loops - Lower the operations from the linalg dialect into loops
--convert-linalg-to-parallel-loops - Lower the operations from the linalg dialect into parallel loops
--convert-linalg-to-spirv - Convert Linalg dialect to SPIR-V dialect
--convert-linalg-to-std - Convert the operations from the linalg dialect into the Standard dialect
--convert-math-to-funcs - Convert Math operations to calls of outlined implementations.
--convert-math-to-libm - Convert Math dialect to libm calls
--convert-math-to-llvm - Convert Math dialect to LLVM dialect
--convert-math-to-spirv - Convert Math dialect to SPIR-V dialect
--convert-memref-to-llvm - Convert operations from the MemRef dialect to the LLVM dialect
--index-bitwidth=<uint> - Bitwidth of the index type, 0 to use size of machine word
--use-aligned-alloc - Use aligned_alloc in place of malloc for heap allocations
--use-generic-functions - Use generic allocation and deallocation functions instead of the classic 'malloc', 'aligned_alloc' and 'free' functions
--convert-memref-to-spirv - Convert MemRef dialect to SPIR-V dialect
--bool-num-bits=<int> - The number of bits to store a boolean value
--convert-nvgpu-to-nvvm - Convert NVGPU dialect to NVVM dialect
--convert-openacc-to-llvm - Convert the OpenACC ops to LLVM dialect
--convert-openacc-to-scf - Convert the OpenACC ops to OpenACC with SCF dialect
--convert-openmp-to-llvm - Convert the OpenMP ops to OpenMP ops with LLVM dialect
--convert-parallel-loops-to-gpu - Convert mapped scf.parallel ops to gpu launch operations
--convert-pdl-to-pdl-interp - Convert PDL ops to PDL interpreter ops
--convert-scf-to-cf - Convert SCF dialect to ControlFlow dialect, replacing structured control flow with a CFG
--convert-scf-to-openmp - Convert SCF parallel loop to OpenMP parallel + workshare constructs.
--convert-scf-to-spirv - Convert SCF dialect to SPIR-V dialect.
--convert-shape-constraints - Convert shape constraint operations to the standard dialect
--convert-shape-to-std - Convert operations from the shape dialect into the standard dialect
--convert-spirv-to-llvm - Convert SPIR-V dialect to LLVM dialect
--convert-tensor-to-linalg - Convert some Tensor dialect ops to Linalg dialect
--convert-tensor-to-spirv - Convert Tensor dialect to SPIR-V dialect
--emulate-non-32-bit-scalar-types - Emulate non-32-bit scalar types with 32-bit ones if missing native support
--convert-torch-to-arith - Convert recognized Torch ops to Std ops
--convert-torch-to-linalg - Convert recognized Torch ops to Linalg ops
--convert-torch-to-mhlo - Convert Torch ops to MHLO ops
--enable-i32-index - Enable truncating index from i64 to i32 (unsafely)
--enable-static-shape - Enable static shape conversion
--convert-torch-to-scf - Convert recognized Torch ops to SCF ops
--convert-torch-to-tmtensor - Convert recognized Torch ops to TMTensor/Linalg ops
--convert-torch-to-tosa - Convert Torch ops to TOSA ops
--convert-vector-to-gpu - Lower the operations from the vector dialect into the GPU dialect
--use-nvgpu - convert to NvGPU ops instead of GPU dialect ops
--convert-vector-to-llvm - Lower the operations from the vector dialect into the LLVM dialect
--enable-amx - Enables the use of AMX dialect while lowering the vector dialect.
--enable-arm-neon - Enables the use of ArmNeon dialect while lowering the vector dialect.
--enable-arm-sve - Enables the use of ArmSVE dialect while lowering the vector dialect.
--enable-x86vector - Enables the use of X86Vector dialect while lowering the vector dialect.
--force-32bit-vector-indices - Allows compiler to assume vector indices fit in 32-bit if that yields faster code
--reassociate-fp-reductions - Allows llvm to reassociate floating-point reductions for speed
--convert-vector-to-scf - Lower the operations from the vector dialect into the SCF dialect
--full-unroll - Perform full unrolling when converting vector transfers to SCF
--lower-permutation-maps - Replace permutation maps with vector transposes/broadcasts before lowering transfer ops
--lower-tensors - Lower transfer ops that operate on tensors
--target-rank=<uint> - Target vector rank to which transfer ops should be lowered
--convert-vector-to-spirv - Convert Vector dialect to SPIR-V dialect
--cse - Eliminate common sub-expressions
--decorate-spirv-composite-type-layout - Decorate SPIR-V composite type with layout info
--drop-equivalent-buffer-results - Remove MemRef return values that are equivalent to a bbArg
--eliminate-alloc-tensors - Try to eliminate all alloc_tensor ops.
--empty-tensor-to-alloc-tensor - Replace all empty ops by alloc_tensor ops.
--finalizing-bufferize - Finalize a partial bufferization
--fold-memref-alias-ops - Fold memref alias ops into consumer load/store ops
--func-bufferize - Bufferize func/call/return ops
--gpu-async-region - Make GPU ops async
--gpu-kernel-outlining - Outline gpu.launch bodies to kernel functions
--data-layout-str=<string> - String containing the data layout specification to be attached to the GPU kernel module
--gpu-launch-sink-index-computations - Sink index computations into gpu.launch body
--gpu-map-parallel-loops - Greedily maps loops to GPU hardware dimensions.
--gpu-to-llvm - Convert GPU dialect to LLVM dialect with GPU runtime calls
--gpu-binary-annotation=<string> - Annotation attribute string for GPU binary
--use-bare-pointers-for-kernels - Use bare pointers to pass memref arguments to kernels. The kernel must use the same setting for this option.
--hlo-legalize-to-linalg - Legalize from HLO dialect to Linalg dialect.
--inline - Inline function calls
--default-pipeline=<string> - The default optimizer pipeline used for callables
--max-iterations=<uint> - Maximum number of iterations when inlining within an SCC
--op-pipelines=<pass-manager> - Callable operation specific optimizer pipelines (in the form of `dialect.op(pipeline)`)
--launch-func-to-vulkan - Convert vulkanLaunch external call to Vulkan runtime external calls
--linalg-bufferize - Bufferize the linalg dialect
--linalg-detensorize - Detensorize linalg ops
--aggressive-mode - Detensorize all ops that qualify for detensoring along with branch operands and basic-block arguments.
--linalg-fold-unit-extent-dims - Remove unit-extent dimension in Linalg ops on tensors
--fold-one-trip-loops-only - Only folds the one-trip loops from Linalg ops on tensors (for testing purposes only)
--linalg-fuse-elementwise-ops - Fuse elementwise operations on tensors
--linalg-generalize-named-ops - Convert named ops into generic ops
--linalg-inline-scalar-operands - Inline scalar operands into linalg generic ops
--linalg-named-op-conversion - Convert from one named linalg op to another.
--llvm-legalize-for-export - Legalize LLVM dialect to be convertible to LLVM IR
--llvm-optimize-for-nvvm-target - Optimize NVVM IR
--llvm-request-c-wrappers - Request C wrapper emission for all functions
--loop-invariant-code-motion - Hoist loop invariant instructions outside of the loop
--lower-affine - Lower Affine operations to a combination of Standard and SCF operations
--lower-host-to-llvm - Lowers the host module code and `gpu.launch_func` to LLVM
--map-memref-spirv-storage-class - Map numeric MemRef memory spaces to SPIR-V storage classes
--client-api=<string> - The client API to use for populating mappings
--memref-emulate-wide-int - Emulate 2*N-bit integer operations using N-bit operations
--widest-int-supported=<uint> - Widest integer type supported by the target
--memref-expand - Legalize memref operations to be convertible to LLVM.
--normalize-memrefs - Normalize memrefs
--nvgpu-optimize-shared-memory - Optimizes accesses to shared memory memrefs in order to reduce bank conflicts.
--one-shot-bufferize - One-Shot Bufferize
--allow-return-allocs - Allows returning/yielding new allocations from a block.
--allow-unknown-ops - Allows unknown (not bufferizable) ops in the input IR.
--analysis-fuzzer-seed=<uint> - Test only: Analyze ops in random order with a given seed (fuzzer)
--analysis-heuristic=<string> - Heuristic that controls the IR traversal during analysis
--bufferize-function-boundaries - Bufferize function boundaries (experimental).
--copy-before-write - Skip the analysis. Make a buffer copy on every write.
--create-deallocs - Specify if buffers should be deallocated. For compatibility with core bufferization passes.
--dialect-filter=<string> - Restrict bufferization to ops from these dialects.
--function-boundary-type-conversion=<string> - Controls layout maps when bufferizing function signatures.
--must-infer-memory-space - The memory space of memref types must always be inferred. If unset, a default memory space of 0 is used.
--print-conflicts - Test only: Annotate IR with RaW conflicts. Requires test-analysis-only.
--test-analysis-only - Test only: Only run inplaceability analysis and annotate IR
--unknown-type-conversion=<string> - Controls layout maps for non-inferrable memref types.
--outline-shape-computation - Using shape.func to preserve shape computation
--print-op-stats - Print statistics of operations
--json - print the stats as JSON
--promote-buffers-to-stack - Promotes heap-based allocations to automatically managed stack-based allocations
--max-alloc-size-in-bytes=<uint> - Maximal size in bytes to promote allocations to stack.
--max-rank-of-allocated-memref=<uint> - Maximal memref rank to promote dynamic buffers.
--reconcile-unrealized-casts - Simplify and eliminate unrealized conversion casts
--refback-expand-ops-for-llvm - Expand ops into more primitive ops before LLVM lowering.
--refback-generalize-tensor-pad - Convert tensor.pad to linalg ops
--refback-insert-rng-globals - Insert global variables and sequence to get the next global seed for RNG ops
--refback-munge-calling-conventions - Munge calling conventions for calling via ExecutionEngine
--refback-munge-memref-copy - Munge memref.copy to linalg.copy
--remove-shape-constraints - Replace all cstr_ ops with a true witness
--resolve-ranked-shaped-type-result-dims - Resolve memref.dim of result values of ranked shape type
--resolve-shaped-type-result-dims - Resolve memref.dim of result values
--sccp - Sparse Conditional Constant Propagation
--scf-bufferize - Bufferize the scf dialect.
--scf-for-loop-canonicalization - Canonicalize operations within scf.for loop bodies
--scf-for-loop-peeling - Peel `for` loops at their upper bounds.
--skip-partial - Do not peel loops inside of the last, partial iteration of another already peeled loop.
--scf-for-loop-range-folding - Fold add/mul ops into loop range
--scf-for-loop-specialization - Specialize `for` loops for vectorization
--scf-for-to-while - Convert SCF for loops to SCF while loops
--scf-parallel-loop-collapsing - Collapse parallel loops to use less induction variables
--collapsed-indices-0=<uint> - Which loop indices to combine into the position 0 loop index
--collapsed-indices-1=<uint> - Which loop indices to combine into the position 1 loop index
--collapsed-indices-2=<uint> - Which loop indices to combine into the position 2 loop index
--scf-parallel-loop-fusion - Fuse adjacent parallel loops
--scf-parallel-loop-specialization - Specialize parallel loops for vectorization
--scf-parallel-loop-tiling - Tile parallel loops
--no-min-max-bounds - Perform tiling with fixed upper bound with inbound check inside the internal loops
--parallel-loop-tile-sizes=<long> - Factors to tile parallel loops by
--shape-bufferize - Bufferize the shape dialect.
--shape-to-shape-lowering - Legalize Shape dialect to be convertible to Arith
--simplify-extract-strided-metadata - Simplify extract_strided_metadata ops
--snapshot-op-locations - Generate new locations from the current IR
--filename=<string> - The filename to print the generated IR
--tag=<string> - A tag to use when fusing the new locations with the original. If unset, the locations are replaced.
--sparse-buffer-rewrite - Rewrite sparse primitives on buffers to actual code
--sparse-tensor-codegen - Convert sparse tensors and primitives to actual code
--sparse-tensor-conversion - Convert sparse tensors and primitives to library calls
--s2s-strategy=<int> - Set the strategy for sparse-to-sparse conversion
--sparse-tensor-rewrite - Applies sparse tensor rewriting rules prior to sparsification
--enable-runtime-library - Enable runtime library for manipulating sparse tensors
--sparsification - Automatically generate sparse tensor code from sparse tensor types
--enable-runtime-library - Enable runtime library for manipulating sparse tensors
--enable-simd-index32 - Enable i32 indexing into vectors (for efficiency)
--enable-vla-vectorization - Enable vector length agnostic vectorization
--parallelization-strategy=<value> - Set the parallelization strategy
=none - Turn off sparse parallelization.
=dense-outer-loop - Enable dense outer loop sparse parallelization.
=any-storage-outer-loop - Enable sparse parallelization regardless of storage for the outer loop.
=dense-any-loop - Enable dense parallelization for any loop.
=any-storage-any-loop - Enable sparse parallelization for any storage and loop.
--vectorization-strategy=<value> - Set the vectorization strategy
=none - Turn off sparse vectorization.
=dense-inner-loop - Enable vectorization for dense inner loops.
=any-storage-inner-loop - Enable sparse vectorization for inner loops with any storage.
--vl=<int> - Set the vector length
--spirv-canonicalize-gl - Run canonicalization involving GLSL ops
--spirv-lower-abi-attrs - Decorate SPIR-V composite type with layout info
--spirv-rewrite-inserts - Rewrite sequential chains of spirv.CompositeInsert operations into spirv.CompositeConstruct operations
--spirv-unify-aliased-resource - Unify access of multiple aliased resources into access of one single resource
--spirv-update-vce - Deduce and attach minimal (version, capabilities, extensions) requirements to spirv.module ops
--strip-debuginfo - Strip debug info from all operations
--symbol-dce - Eliminate dead symbols
--symbol-privatize - Mark symbols private
--exclude=<string> - Comma separated list of symbols that should not be marked private
--symbolic-shape-optimization - Analyzes shapes and performs shape-related optimizations
--tensor-bufferize - Bufferize the `tensor` dialect
--tensor-copy-insertion - Make all tensor IR inplaceable by inserting copies
--allow-return-allocs - Allows returning/yielding new allocations from a block.
--bufferize-function-boundaries - Bufferize function boundaries (experimental).
--create-deallocs - Specify if new allocations should be deallocated.
--must-infer-memory-space - The memory space of memref types must always be inferred. If unset, a default memory space of 0 is used.
--tm-tensor-bufferize - Bufferize the TMTensor dialect
--tm-tensor-to-loops - Convert TMTensor ops to loops and Linalg ops.
--topological-sort - Sort regions without SSA dominance in topological order
--torch-adjust-calling-conventions - Adjust the calling conventions of functions
--torch-decompose-complex-ops - Decompose complicated torch operations
--legal-ops=<string> - List of operation names that should be considered legal
--torch-drop-shape-calculations - Drop reified shape calculations.
--torch-erase-module-initializer - Erase the `torch.global_slot.module_initializer` op.
--torch-finalizing-backend-type-conversion - Finalizes a partial conversion to builtin tensors
--torch-func-backend-type-conversion - Convert functions to operate on builtin tensors
--torch-globalize-object-graph - Converts TorchScript object graphs to a globalized form
--torch-inline-global-slots - Inlines torch.global_slot ops.
--torch-lower-to-backend-contract - Perform simplifications until the backend contract is satisfied.
--backend-legal-ops=<string> - List of ops to be considered legal for the backend.
--decompose - Decompose ops.
--max-iterations=<int> - Maximum number of invocations of the simplification pipeline.
--torch-maximize-value-semantics - Use value-semantic tensors where possible.
--torch-prepare-for-globalize-object-graph - Lowering in preparation for globalizing
--torch-reduce-op-variants - Reduces variants of ops to a smaller set of ops.
--torch-refine-public-return - Refine public return
--torch-refine-types - Refine types
--torch-reify-shape-calculations - Decompose complicated torch operations
--torch-simplify-shape-calculations - Simplify reified shape calculations.
--torch-verify-backend-contract - Check that program satisfies backend contract.
--torch-verify-linalg-on-tensors-backend-contract - Verifies conformity to the linalg-on-tensors backend contract
--torch-verify-mhlo-backend-contract - Verifies conformity to the mhlo backend contract
--torch-verify-tosa-backend-contract - Verifies conformity to the TOSA backend contract
--tosa-infer-shapes - Propagate shapes across TOSA operations
--tosa-layerwise-constant-fold - Fold layerwise operations on constant tensors
--tosa-make-broadcastable - Insert TOSA reshape ops to equalize ranks and enable broadcasting
--tosa-optional-decompositions - Applies Tosa operations optional decompositions
--tosa-to-arith - Lower TOSA to the Arith dialect
--include-apply-rescale - Whether to include the lowering for tosa.apply_rescale to arith
--use-32-bit - Whether to prioritize lowering to 32-bit operations
--tosa-to-linalg - Lower TOSA to LinAlg on tensors
--tosa-to-linalg-named - Lower TOSA to LinAlg named operations
--tosa-to-scf - Lower TOSA to the SCF dialect
--tosa-to-tensor - Lower TOSA to the Tensor dialect
--transform-dialect-check-uses - warn about potential use-after-free in the transform dialect
--vector-bufferize - Bufferize Vector dialect ops
--view-op-graph - Print Graphviz visualization of an operation
--max-label-len=<uint> - Limit attribute/type length to number of chars
--print-attrs - Print attributes of operations
--print-control-flow-edges - Print control flow edges
--print-data-flow-edges - Print data flow edges
--print-result-types - Print result types of operations
Pass Pipelines:
--sparse-compiler - The standard pipeline for taking sparsity-agnostic IR using the sparse-tensor type, and lowering it to LLVM IR with concrete representations and algorithms for sparse tensors.
--enable-amx - Enables the use of AMX dialect while lowering the vector dialect.
--enable-arm-neon - Enables the use of ArmNeon dialect while lowering the vector dialect.
--enable-arm-sve - Enables the use of ArmSVE dialect while lowering the vector dialect.
--enable-index-optimizations - Allows compiler to assume indices fit in 32-bit if that yields faster code
--enable-runtime-library - Enable runtime library for manipulating sparse tensors
--enable-simd-index32 - Enable i32 indexing into vectors (for efficiency)
--enable-vla-vectorization - Enable vector length agnostic vectorization
--enable-x86vector - Enables the use of X86Vector dialect while lowering the vector dialect.
--parallelization-strategy=<value> - Set the parallelization strategy
=none - Turn off sparse parallelization.
=dense-outer-loop - Enable dense outer loop sparse parallelization.
=any-storage-outer-loop - Enable sparse parallelization regardless of storage for the outer loop.
=dense-any-loop - Enable dense parallelization for any loop.
=any-storage-any-loop - Enable sparse parallelization for any storage and loop.
--reassociate-fp-reductions - Allows llvm to reassociate floating-point reductions for speed
--s2s-strategy=<int> - Set the strategy for sparse-to-sparse conversion
--test-bufferization-analysis-only - Run only the inplaceability analysis
--vectorization-strategy=<value> - Set the vectorization strategy
=none - Turn off sparse vectorization.
=dense-inner-loop - Enable vectorization for dense inner loops.
=any-storage-inner-loop - Enable sparse vectorization for inner loops with any storage.
--vl=<int> - Set the vector length
--torch-backend-to-linalg-on-tensors-backend-pipeline - Pipeline lowering torch backend contract to linalg-on-tensors backend contract.
--torch-backend-to-mhlo-backend-pipeline - Pipeline lowering torch backend contract to MHLO backend contract.
--enable-i32-index - Enable truncating index from i64 to i32 (unsafely)
--enable-static-shape - Enable static shape conversion.
--torch-backend-to-tosa-backend-pipeline - Pipeline lowering torch backend contract to TOSA backend contract.
--torch-function-to-torch-backend-pipeline - Pipeline lowering a Torch function to Torch backend form.
--backend-legal-ops=<string> - List of ops to be considered legal for the backend.
--decompose-complex-ops - Decompose complex operations.
--max-iterations=<int> - Maximum number of invocations of the simplification pipeline.
--torch-shape-refinement-pipeline - Pipeline refining shapes of tensors.
--torch-simplification-pipeline - Pipeline simplifying computations in the program.
--backend-legal-ops=<string> - List of ops to be considered legal for the backend.
--decompose-complex-ops - Decompose complex operations.
--max-iterations=<int> - Maximum number of invocations of the simplification pipeline.
--torchscript-module-to-torch-backend-pipeline - Pipeline lowering TorchScript object graph IR to Torch backend form.
--backend-legal-ops=<string> - List of ops to be considered legal for the backend.
--decompose-complex-ops - Decompose complex operations.
--max-iterations=<int> - Maximum number of invocations of the simplification pipeline.
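The pass pipelines above are registered as flags and can be chained like ordinary passes; a sketch lowering TorchScript object-graph IR down to the linalg-on-tensors contract (the input filename is a placeholder):

```
# Lower TorchScript IR to Torch backend form, then to the
# linalg-on-tensors backend contract (filename is hypothetical)
torch-mlir-opt --torchscript-module-to-torch-backend-pipeline \
    --torch-backend-to-linalg-on-tensors-backend-pipeline \
    model.mlir -o lowered.mlir
```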
--show-dialects - Print the list of registered dialects
--split-input-file - Split the input file into pieces and process each chunk independently
--verify-diagnostics - Check that emitted diagnostics match expected-* lines on the corresponding line
--verify-each - Run the verifier after each transformation pass
Generic Options:
--help - Display available options (--help-hidden for more)
--help-list - Display list of available options (--help-list-hidden for more)
--version - Display the version of this program