Each hardware-specific microscaling format is a different quantization scheme

@bjacob, March 4, 2025

This note is about what we can expect once workloads get optimized for microscaling.

Microscaling is about new hardware having new matrix-multiplication instructions on small-bit-depth operands, plus separate scale factors. For instance, there is going to be an FP8 matrix multiplication instruction, accumulating in FP32, with additional "scale" FP32 operands applied as multipliers on the FP8 inputs just before multiply-accumulating them. There are also going to be new microscaling instructions for other small-bit-width floating-point and integer types.

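As a rough illustration of that arithmetic, here is a minimal NumPy sketch of a microscaled matmul emulated in fp32. It assumes a hypothetical block size of 32, with one fp32 scale per (row, block) of the left operand and per (block, column) of the right operand; real instructions operate on fixed hardware tiles and real fp8 storage, but the dataflow is the same.

```python
import numpy as np

def microscaled_matmul(a_fp8, a_scales, b_fp8, b_scales, block=32):
    """Emulate the arithmetic of a microscaled matmul instruction:
    low-bit-depth inputs (fp8 values, held here in float32 arrays),
    per-block fp32 scales applied as multipliers on the inputs just
    before multiply-accumulating, accumulation in fp32."""
    m, k = a_fp8.shape
    _, n = b_fp8.shape
    acc = np.zeros((m, n), dtype=np.float32)
    for kb in range(k // block):
        ks = slice(kb * block, (kb + 1) * block)
        # One scale per (row of A, k-block) and per (k-block, column of B).
        a_blk = a_fp8[:, ks] * a_scales[:, kb:kb + 1]
        b_blk = b_fp8[ks, :] * b_scales[kb:kb + 1, :]
        acc += a_blk @ b_blk
    return acc

# With unit scales this reduces to a plain fp32 matmul.
rng = np.random.default_rng(0)
a = rng.normal(size=(8, 64)).astype(np.float32)
b = rng.normal(size=(64, 8)).astype(np.float32)
sa = np.ones((8, 2), np.float32)
sb = np.ones((2, 8), np.float32)
print(np.allclose(microscaled_matmul(a, sa, b, sb), a @ b))
```
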
Different hardware has always had different tile sizes. Different hardware has also supported different element types, but that mostly meant that vendors eventually caught up to each other's element types. Once the same element type was supported everywhere, the remaining differences in tile sizes were just layout differences handled in codegen. The parameters learned during training were not concerned with that detail.

What is new with microscaling is that the quantization scheme becomes hardware-dependent, even after the element type is fixed. The specific arithmetic performed on the small-bit-depth values, multiplying them by specific scale values, is specific to one microscaling format. That means the values learned during training are format-specific.

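To make that concrete, here is a minimal sketch that quantizes the same fp32 weights under two made-up microscaling formats: block size 32 with fp32 scales versus block size 16 with power-of-two scales, both storing an fp8 E4M3-like payload. The block sizes, scale encodings, and the crude fp8 emulation are illustrative assumptions, not any vendor's actual format; the point is only that the stored bits differ even though the element type is the same.

```python
import numpy as np

def quantize_fp8_e4m3(x):
    """Crude round-to-nearest onto an fp8 E4M3-like grid (4 significant bits,
    magnitudes clamped to 448; subnormals not modelled)."""
    mant, exp = np.frexp(np.clip(x, -448.0, 448.0).astype(np.float32))
    return np.ldexp(np.round(mant * 16.0) / 16.0, exp).astype(np.float32)

def quantize_for_format(w, block, pow2_scales):
    """One hypothetical microscaling format = block size + scale encoding.
    Returns the stored fp8 payload and the per-block scales."""
    w = np.asarray(w, dtype=np.float32).reshape(-1, block)
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 448.0, 2.0**-20)
    if pow2_scales:  # e.g. an E8M0-style power-of-two scale
        scale = np.exp2(np.ceil(np.log2(scale)))
    return quantize_fp8_e4m3(w / scale), scale

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)
payload_a, _ = quantize_for_format(w, block=32, pow2_scales=False)
payload_b, _ = quantize_for_format(w, block=16, pow2_scales=True)
# Same fp8 element type, same original weights, yet the stored bits differ:
print("fraction of differing fp8 payload values:",
      (payload_a.reshape(-1) != payload_b.reshape(-1)).mean())
```
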
That format-dependence has two consequences:

  1. Regardless of how the training was implemented, the parameters themselves will be specific to one microscaling format.
    • That means the parameters we download from HuggingFace will be specific to one vendor's microscaling formats.
  2. Running inference on a different target, using instructions for a different microscaling format, will be lossy, because the low-bit-depth weights will need to be rescaled.
    • For example, for FP8 microscaling with FP32 scales, the conversion will involve this arithmetic: converted_fp8_weight = cast<fp8>(original_fp8_weight * fp32_factor). The cast<fp8> here will be lossy, and we will have to find ways to measure and minimize that loss; in the past, that kind of scenario was dealt with using either stochastic rounding or error diffusion (see the sketch after this list).
    • The fp32_factor itself is nontrivial to estimate: it is conceptually the quotient old_fp32_scale / new_fp32_scale, except that new_fp32_scale is itself part of what we are solving for here.
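
Here is a minimal sketch of that conversion, together with stochastic rounding as one way to keep the cast's error unbiased on average. The fp8 E4M3-like grid is a crude emulation (4 significant bits, magnitudes clamped to 448, no subnormals), and the factor is just a hand-picked number standing in for old_fp32_scale / new_fp32_scale; neither is meant to match any particular hardware.

```python
import numpy as np

def fp8_e4m3_neighbors(x):
    """The two neighbouring values on a crude fp8 E4M3-like grid
    (4 significant bits, magnitudes clamped to 448; subnormals not modelled)."""
    x = np.clip(np.asarray(x, dtype=np.float32), -448.0, 448.0)
    mant, exp = np.frexp(x)  # x == mant * 2**exp, 0.5 <= |mant| < 1
    lo = np.ldexp(np.floor(mant * 16.0) / 16.0, exp).astype(np.float32)
    hi = np.ldexp(np.ceil(mant * 16.0) / 16.0, exp).astype(np.float32)
    return lo, hi

def rescale_fp8_weights(w_fp8, fp32_factor, stochastic=False, rng=None):
    """converted_fp8_weight = cast<fp8>(original_fp8_weight * fp32_factor).
    The cast is lossy; stochastic rounding spreads the loss without bias
    instead of systematically rounding in one direction."""
    x = np.asarray(w_fp8, dtype=np.float32) * np.float32(fp32_factor)
    lo, hi = fp8_e4m3_neighbors(x)
    gap = hi - lo
    frac = np.where(gap > 0,
                    (np.clip(x, -448.0, 448.0) - lo) / np.where(gap > 0, gap, 1.0),
                    0.0)
    if stochastic:
        rng = rng if rng is not None else np.random.default_rng()
        pick_hi = rng.random(x.shape) < frac
    else:
        pick_hi = frac >= 0.5  # plain round-to-nearest
    return np.where(pick_hi, hi, lo)

# Example: weights quantized for one format, rescaled toward a different scale.
rng = np.random.default_rng(0)
original = rescale_fp8_weights(rng.normal(size=4096).astype(np.float32), 1.0)
factor = np.float32(0.8)  # stands in for old_fp32_scale / new_fp32_scale
converted = rescale_fp8_weights(original, factor, stochastic=True, rng=rng)
print("mean abs conversion error:", np.abs(converted - original * factor).mean())
```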