Motivation

Previously, we enabled the loading and inference of AutoRound-quantized models in SGLang. In 2026 Q1, we plan to contribute further to SGLang, enriching its model quantization methods powered by Intel® Neural Compressor and AutoRound.

Target Work

  • Consolidate Intel quantization support and add more quantization methods
  • Enable various formats (FP8/WNA16/MXFP4/MXFP8/NVFP4, plus advanced mixed-precision recipes) with good accuracy, powered by Neural Compressor's advanced AutoRound algorithm
  • Give users seamless access to state-of-the-art quantization that balances speed, accuracy, and ease of use

Motivation

Previously, we contributed to SGLang by enabling the loading and inference of AutoRound-quantized models. Building on this groundwork, we propose to add unified quantization support for Intel platforms by integrating the AutoRound quantization algorithm into SGLang.

As an advanced algorithm in Intel® Neural Compressor, AutoRound combines the advantages of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ), delivering strong results at 2- to 4-bit precision. Embedding AutoRound would enhance SGLang's online and offline quantization capabilities and provide users seamless access to state-of-the-art quantization that balances speed, accuracy, and ease of use.

Integration Proposal

One-step quantization (online)

Quantize the model during model loading.
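To make the idea concrete, here is a minimal NumPy sketch of what weight quantization at load time involves, using naive round-to-nearest (RTN) symmetric group-wise INT4 as the baseline. This is an illustration only, not the SGLang or AutoRound implementation; AutoRound improves on plain RTN by learning the rounding decisions and clipping ranges with a small calibration set.

```python
import numpy as np

def rtn_quantize(w, bits=4, group_size=128):
    """Naive round-to-nearest symmetric group-wise quantization.

    Returns integer codes and per-group scales; this is the baseline
    that algorithms like AutoRound refine with learned rounding.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for INT4
    rows, cols = w.shape
    wg = w.reshape(rows, cols // group_size, group_size)
    scale = np.abs(wg).max(axis=-1, keepdims=True) / qmax
    q = np.clip(np.round(wg / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Reconstruct an FP32 approximation of the original weights."""
    return (q.astype(np.float32) * scale).reshape(q.shape[0], -1)

# Simulate one linear layer's weight being quantized as it is loaded.
np.random.seed(0)
w = np.random.randn(8, 256).astype(np.float32)
q, s = rtn_quantize(w, bits=4, group_size=128)
w_hat = dequantize(q, s)
err = float(np.abs(w - w_hat).max())                # bounded by scale / 2 per group
```

In the proposed one-step flow, a transformation like this (driven by AutoRound rather than RTN) would run per layer during model loading, so users start from original FP16/FP32 checkpoints and serve a quantized model without a separate offline conversion step.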