This note is about what we can expect once workloads get optimized for microscaling.
Microscaling refers to new hardware matrix-multiplication instructions that take small-bit-depth operands plus separate scale factors. For instance, there is going to be an FP8 matrix-multiplication instruction that accumulates in FP32 and takes additional FP32 "scale" operands, which are applied as multipliers to the FP8 inputs just before they are multiplied and accumulated. There are also going to be new microscaling instructions for other small-bit-width floating-point and integer types.
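As a rough illustration of the arithmetic described above, here is a minimal pure-Python sketch. It is not any vendor's instruction semantics. The function name `microscaled_matmul`, the `block` parameter, and the choice of one scale per group of `block` consecutive elements along the reduction dimension are all assumptions made for the example. Plain Python floats stand in for both the FP8 elements and the FP32 scales and accumulators, so no narrow-type rounding is modeled.

```python
def microscaled_matmul(a, a_scales, b, b_scales, block=2):
    """Emulate a scaled matrix multiplication C = A @ B.

    a        : M x K list of element values (stand-ins for FP8 values).
    a_scales : M x (K // block) scales, one per row per K-block (assumed layout).
    b        : K x N list of element values (stand-ins for FP8 values).
    b_scales : (K // block) x N scales, one per K-block per column (assumed layout).
    Returns an M x N list of accumulators (stand-ins for FP32).
    """
    m, k, n = len(a), len(b), len(b[0])
    assert k % block == 0, "K must be a multiple of the block size"
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for kk in range(k):
                blk = kk // block
                # Apply each scale to its input just before the multiply-accumulate.
                lhs = a[i][kk] * a_scales[i][blk]
                rhs = b[kk][j] * b_scales[blk][j]
                acc += lhs * rhs
            c[i][j] = acc
    return c


# Small usage example: a 1x4 by 4x1 product with two K-blocks of two elements.
a = [[1.0, 2.0, 3.0, 4.0]]
a_scales = [[0.5, 2.0]]
b = [[1.0], [1.0], [1.0], [1.0]]
b_scales = [[1.0], [0.25]]
result = microscaled_matmul(a, a_scales, b, b_scales, block=2)
```

The point of the sketch is only that the scales are extra operands travelling alongside the narrow elements, so the data an instruction consumes is no longer a single tile per input.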
Different hardware has always had different tile sizes. Different hardware has also supported different element types, but that mostly meant that vendors eventually caught up with the element types that other vendors already supported. Once the same element type was supported, the differences in tile sizes were layout differ