// -----// IR Dump After CSE (cse) //----- //
func.func @__builtin_fill_i64(%arg0: !util.buffer, %arg1: !util.buffer, %arg2: !util.list<!util.buffer>, %arg3: i32, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: i32, %arg9: i32, %arg10: i32, %arg11: i32) attributes {translation_info = #iree_codegen.translation_info<pipeline = VMVXDefault>} {
%c12 = arith.constant 12 : index
%c8 = arith.constant 8 : index
%c4 = arith.constant 4 : index
%c0 = arith.constant 0 : index
%c32_i64 = arith.constant 32 : i64
%c1 = arith.constant 1 : index
%buffer_size = util.buffer.size %arg1 : !util.buffer
%0 = util.buffer.load %arg1[%c0 for %c4] : !util.buffer{%buffer_size} -> i32
bjacob / README.md
Created March 4, 2025 20:04
Each hardware-specific microscaling format is a different quantization scheme

This note is about what we can expect once workloads get optimized for microscaling.

Microscaling is about new hardware having new matrix-multiplication instructions on small-bit-width operands, plus separate scale factors. For instance, there is going to be an FP8 matrix multiplication instruction, accumulating in FP32, with additional "scale" FP32 operands applied as multipliers on the FP8 inputs just before multiply-accumulating them. There are also going to be new microscaling instructions for other small-bit-width floating-point and integer types.
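
As a rough illustration, here is a minimal C sketch of that instruction-level semantics. The FP8 inputs are modeled as already-decoded floats, and the function name, per-operand scales, and reduction length K are illustrative assumptions, not any particular ISA.

/* Minimal sketch, not any specific instruction: fp8 inputs are modeled as
   already-decoded floats, with one scale per operand and fp32 accumulation. */
void scaled_dot_accumulate(int K, const float *a_fp8, float a_scale,
                           const float *b_fp8, float b_scale, float *acc) {
  for (int k = 0; k < K; ++k) {
    /* Scales multiply the low-precision inputs just before the
       multiply-accumulate; the accumulator stays in fp32. */
    *acc += (a_scale * a_fp8[k]) * (b_scale * b_fp8[k]);
  }
}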

Different hardware has always had different tile sizes. Different hardware has also supported different element types, but that mostly meant vendors catching up to the element types that others already supported. Once the same element type was supported, the differences in tile sizes were layout differences.

 tools/iree-opt --iree-hal-conversion ~/b.mlir -debug 2>~/log.mlir

Why am I getting this error when building with this diff?

➜  iree-build ninja
[0/2] Re-checking globbed directories...
[181/317] Generating check_llvm-cpu_local-task_generic_success.mlir_module.vmfb from success.mlir
FAILED: runtime/src/iree/modules/check/test/check_llvm-cpu_local-task_generic_success.mlir_module.vmfb /home/benjacob/iree-build/runtime/src/iree/modules/check/test/check_llvm-cpu_local-task_generic_success.mlir_module.vmfb 
cd /home/benjacob/iree-build/runtime/src/iree/modules/check/test && /home/benjacob/iree-build/tools/iree-compile --output-format=vm-bytecode --mlir-print-op-on-diagnostic=false --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu=generic /home/benjacob/iree/runtime/src/iree/modules/check/test/success.mlir -o check_llvm-cpu_local-task_generic_success.mlir_module.vmfb --iree-hal-executable-object-search-path=\"/home/benjacob/iree-build\" --iree-llvmcpu-embedded-linker-path=\"/home/benjacob/iree-build/llvm-project/bin/lld\" --iree-llvmcpu-wasm-linker-path=\"/home/b
diff --git a/home/benjacob/interm-good/compiled_punet_compiled_punet_linked_rocm_hsaco_fb.linked.ll b/home/benjacob/interm-bad/compiled_punet_compiled_punet_linked_rocm_hsaco_fb.linked.ll
index 3aa3582..d2834a5 100644
--- a/home/benjacob/interm-good/compiled_punet_compiled_punet_linked_rocm_hsaco_fb.linked.ll
+++ b/home/benjacob/interm-bad/compiled_punet_compiled_punet_linked_rocm_hsaco_fb.linked.ll
@@ -43173,65 +43173,53 @@ define amdgpu_kernel void @"main$async_dispatch_57_elementwise_2x4096x2560_f16xf
%49 = load <8 x half>, ptr addrspace(1) %48, align 2
%50 = load <1 x float>, ptr addrspace(1) %2, align 4
%51 = fdiv <8 x half> %47, %19
- %52 = fcmp olt <8 x half> %51, zeroinitializer
- %53 = fneg <8 x half> %51
// -----// IR Dump After CSE (cse) //----- //
func.func @_check_reordering_dispatch_0_generic_384_f32() {
%0 = ub.poison : vector<f32>
%cst = arith.constant dense<-0.000000e+00> : vector<1xf32>
%c4 = arith.constant 4 : index
%c384 = arith.constant 384 : index
%c0 = arith.constant 0 : index
%1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, ReadOnly>, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : memref<384xf32>
memref.assume_alignment %1, 64 : memref<384xf32>
; Function Attrs: convergent mustprogress nofree norecurse nounwind willreturn memory(none)
define linkonce_odr protected float @__ocml_erf_f32(float noundef %0) local_unnamed_addr #2 {
%2 = tail call float @llvm.fabs.f32(float %0)
%3 = fcmp olt float %2, 1.000000e+00
br i1 %3, label %4, label %12
4: ; preds = %1
%5 = fmul float %0, %0
%6 = tail call float @__ocml_fmuladd_f32(float noundef %5, float noundef 0xBF4268BC20000000, float noundef 0x3F74208280000000) #15
%7 = tail call float @__ocml_fmuladd_f32(float noundef %5, float noundef %6, float noundef 0xBF9B593700000000) #15
// -----// IR Dump After NormalizeLoopBoundsPass (iree-codegen-normalize-loop-bounds) //----- //
func.func @matmul_i8_dispatch_3() attributes {translation_info = #iree_codegen.translation_info<pipeline = LLVMGPUTileAndFuse workgroup_size = [256, 1, 1] subgroup_size = 64, {gpu_pipeline_options = #iree_gpu.pipeline_options<prefetch_shared_memory = false, no_reduce_shared_memory_bank_conflicts = true, use_igemm_convolution = false>}>} {
%c16384_i32 = arith.constant 16384 : i32
%c4_i32 = arith.constant 4 : i32
%c1_i32 = arith.constant 1 : i32
%c8_i32 = arith.constant 8 : i32
%c2_i32 = arith.constant 2 : i32
%c0 = arith.constant 0 : index
%c16384 = arith.constant 16384 : index
bjacob / README.md
Last active January 21, 2025 16:16
Explore data tiling on CPU

Reproduce a data-tiled matmul on CPU

Test source program: matmul_i8_128x128.mlir:

func.func @matmul_i8(%lhs: tensor<128x128xi8>, %rhs: tensor<128x128xi8>, %acc: tensor<128x128xi32>) -> tensor<128x128xi32> {
  %result = linalg.matmul ins(%lhs, %rhs: tensor<128x128xi8>, tensor<128x128xi8>) outs(%acc: tensor<128x128xi32>) -> tensor<128x128xi32>
  return %result: tensor<128x128xi32>
}
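
For reference, this is the arithmetic that the linalg.matmul above performs, written as plain untiled C (i8 operands widened to i32 and accumulated in i32); data tiling changes only the memory layout, not this computation.

#include <stdint.h>

enum { N = 128 };

/* Plain, untiled reference for the 128x128 i8 matmul above:
   i8 operands are widened to i32 and accumulated into the i32 output. */
void matmul_i8_ref(const int8_t lhs[N][N], const int8_t rhs[N][N],
                   int32_t acc[N][N]) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      for (int k = 0; k < N; ++k)
        acc[i][j] += (int32_t)lhs[i][k] * (int32_t)rhs[k][j];
}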

Overview

At the moment, there is a disconnect in "fast math" semantics between what we do in MLIR rewrites and what we let LLVM do after we have lowered to LLVM IR:

  • In MLIR rewrites, we are performing many "fast math"-like transformations. For example, reassociation ((x+y)+z -> x+(y+z)); a small C example of why this is not exact follows this list.
    • That is necessary to implement something like a matrix multiplication efficiently, at multiple levels. At the instruction level, if we are targeting matrix-multiplication instructions, using them is in itself a reassociation. At the workgroup-distribution level, whenever we split a reduction dimension, that is a reassociation.
  • The LLVM IR that we bottom out on does not have fast-math flags.
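
To make the exactness trade-off concrete, here is a small standalone C example (values chosen purely for illustration) where the two groupings of the same sum give different results:

#include <stdio.h>

int main(void) {
  /* Reassociation is not exact in floating point: with these values the
     two groupings of the same sum differ. */
  float x = 1.0e8f, y = -1.0e8f, z = 1.0f;
  printf("(x+y)+z = %g\n", (x + y) + z); /* prints 1 */
  printf("x+(y+z) = %g\n", x + (y + z)); /* prints 0 */
  return 0;
}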

So we are not allowing LLVM to perform the same kind of optimizations that we allowed ourselves. That inconsistency is the topic of this issue. It matters because, while any choice of fast-math semantics is a trade-off on an axis between performance and exactness, an inconsistency means we have already paid the exactness cost in the MLIR rewrites without letting LLVM claim the corresponding performance.