Motivation

Previously, we enabled the loading and inference of AutoRound-quantized models in SGLang. In 2026 Q1, we plan to contribute further to SGLang, enriching its model quantization methods powered by Intel® Neural Compressor and AutoRound.

Target Work

  • Consolidate Intel quantization support and add more quantization methods
  • Enable various formats (FP8/WNA16/MXFP4/MXFP8/NVFP4, plus advanced mixed-precision recipes) with good accuracy, powered by Neural Compressor's advanced AutoRound algorithm
  • Give users seamless access to state-of-the-art quantization that balances speed, accuracy, and ease of use

Motivation

Previously, we contributed to SGLang by enabling the loading and inference of AutoRound-quantized models. Building on this groundwork, we propose to add unified quantization support for Intel platforms by integrating the AutoRound quantization algorithm into SGLang.

As an advanced algorithm in Intel® Neural Compressor, AutoRound combines the advantages of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ), delivering strong results at 2- to 4-bit precision. Embedding AutoRound would enhance SGLang's online and offline quantization capabilities and provide users seamless access to state-of-the-art quantization that balances speed, accuracy, and ease of use.

Integration Proposal

One-step quantization (online)

Quantize the model during model loading.
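To make the idea concrete, here is a minimal NumPy sketch of what weight quantization at load time involves, using naive round-to-nearest (RTN) symmetric group-wise INT4 as the baseline. This is an illustration only, not the SGLang or AutoRound implementation; AutoRound improves on plain RTN by learning the rounding decisions and clipping ranges with a small calibration set.

```python
import numpy as np

def rtn_quantize(w, bits=4, group_size=128):
    """Naive round-to-nearest symmetric group-wise quantization.

    Returns integer codes and per-group scales; this is the baseline
    that algorithms like AutoRound refine with learned rounding.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for INT4
    rows, cols = w.shape
    wg = w.reshape(rows, cols // group_size, group_size)
    scale = np.abs(wg).max(axis=-1, keepdims=True) / qmax
    q = np.clip(np.round(wg / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Reconstruct an FP32 approximation of the original weights."""
    return (q.astype(np.float32) * scale).reshape(q.shape[0], -1)

# Simulate one linear layer's weight being quantized as it is loaded.
np.random.seed(0)
w = np.random.randn(8, 256).astype(np.float32)
q, s = rtn_quantize(w, bits=4, group_size=128)
w_hat = dequantize(q, s)
err = float(np.abs(w - w_hat).max())                # bounded by scale / 2 per group
```

In the proposed one-step flow, a transformation like this (driven by AutoRound rather than RTN) would run per layer during model loading, so users start from original FP16/FP32 checkpoints and serve a quantized model without a separate offline conversion step.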