@andrewor14
Created August 21, 2025 22:38
Debug SmolLM3 QAT
$ GRADIO_SERVER_NAME="0.0.0.0" python test_sayak.py
/home/andrewor/local/ao/torchao/utils.py:408: UserWarning: TORCH_VERSION_AT_LEAST_2_8 is deprecated and will be removed in torchao 0.14.0
warnings.warn(self.msg)
/home/andrewor/local/ao/torchao/utils.py:408: UserWarning: TORCH_VERSION_AT_LEAST_2_7 is deprecated and will be removed in torchao 0.14.0
warnings.warn(self.msg)
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 65.28it/s]
Step 1: Applying QAT observers to the model...
/home/andrewor/local/ao/torchao/quantization/qat/utils.py:84: UserWarning: 'FakeQuantizeConfig' is deprecated and will be removed in a future release. Please use the following API instead:

    base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
    quantize_(model, QATConfig(base_config, step="prepare"))
    # train (not shown)
    quantize_(model, QATConfig(base_config, step="convert"))

Alternatively, if you prefer to pass in fake quantization configs:

    activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
    weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
    qat_config = QATConfig(
        activation_config=activation_config,
        weight_config=weight_config,
        step="prepare",
    )
    quantize_(model, qat_config)

Please see https://github.com/pytorch/ao/issues/2630 for more details.
  warnings.warn(
/home/andrewor/local/ao/torchao/quantization/qat/utils.py:84: UserWarning: 'IntXQuantizationAwareTrainingConfig' is deprecated and will be removed in a future release. Please use the following API instead:

    base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
    quantize_(model, QATConfig(base_config, step="prepare"))
    # train (not shown)
    quantize_(model, QATConfig(base_config, step="convert"))

Alternatively, if you prefer to pass in fake quantization configs:

    activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
    weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
    qat_config = QATConfig(
        activation_config=activation_config,
        weight_config=weight_config,
        step="prepare",
    )
    quantize_(model, qat_config)

Please see https://github.com/pytorch/ao/issues/2630 for more details.
  warnings.warn(
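Both warnings above point at the same replacement API. As a minimal sketch, the Step 1 "prepare" pass with the recommended (non-deprecated) torchao API would look roughly like the following; the checkpoint name and device are assumptions, since test_sayak.py itself is not shown in this log:

    import torch
    from transformers import AutoModelForCausalLM
    from torchao.quantization import Int8DynamicActivationInt4WeightConfig, quantize_
    from torchao.quantization.qat import QATConfig

    # Hypothetical checkpoint; the model actually loaded by test_sayak.py is not shown here.
    model = AutoModelForCausalLM.from_pretrained(
        "HuggingFaceTB/SmolLM3-3B", torch_dtype=torch.bfloat16, device_map="cuda"
    )

    # int8 dynamic per-token activations + int4 grouped weights (group_size=32),
    # matching the config printed in the deprecation message above.
    base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)

    # Insert fake quantization into every nn.Linear before fine-tuning.
    quantize_(model, QATConfig(base_config, step="prepare"))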
Step 2: Starting QAT fine-tuning...
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.21.1
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Run data is saved locally in /home/andrewor/scratch/wandb/offline-run-20250821_153507-plwm8t6w
* Trackio project initialized: huggingface
* Trackio metrics logged to: /home/andrewor/.cache/huggingface/trackio
* View dashboard by running in your terminal:
trackio show --project "huggingface"
* or by running in Python: trackio.show(project="huggingface")
0%| | 0/2 [00:00<?, ?it/s]NCCL version 2.27.5+cuda12.8
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.55s/it]wandb: WARNING URL not available in offline run
{'train_runtime': 21.6025, 'train_samples_per_second': 1.481, 'train_steps_per_second': 0.093, 'train_loss': 0.11030352115631104, 'num_tokens': 24576.0, 'epoch': 0.04}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:19<00:00, 2.55s/it]* Uploading logs to Trackio Space: http://localhost:7860/ (please wait...)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:20<00:00, 10.40s/it]
QAT fine-tuning complete.
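The metrics dict above (train_runtime, train_samples_per_second, ...) and the wandb/Trackio setup look like Hugging Face Trainer logging. A rough sketch of the fine-tuning step under that assumption; the hyperparameters, output path, and train_dataset are illustrative and not taken from test_sayak.py:

    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="./smollm-mini-qat",    # hypothetical path
        per_device_train_batch_size=2,     # illustrative hyperparameters
        gradient_checkpointing=True,       # matches the use_cache=False notice above
        max_steps=2,                       # the log shows a 2-step debug run
        bf16=True,
        report_to="wandb",
    )

    # train_dataset is a placeholder for whatever tokenized dataset the script uses.
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
    trainer.train()  # fake quantization stays active, so weights adapt to the quantization grid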
Step 3: Converting the model to a fully quantized state...
/home/andrewor/local/ao/torchao/quantization/qat/utils.py:84: UserWarning: 'FromIntXQuantizationAwareTrainingConfig' is deprecated and will be removed in a future release. Please use the following API instead:

    base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
    quantize_(model, QATConfig(base_config, step="prepare"))
    # train (not shown)
    quantize_(model, QATConfig(base_config, step="convert"))

Alternatively, if you prefer to pass in fake quantization configs:

    activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
    weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
    qat_config = QATConfig(
        activation_config=activation_config,
        weight_config=weight_config,
        step="prepare",
    )
    quantize_(model, qat_config)

Please see https://github.com/pytorch/ao/issues/2630 for more details.
  warnings.warn(
/home/andrewor/local/pytorch/torch/__init__.py:1605: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/andrewor/local/pytorch/aten/src/ATen/Context.cpp:80.)
_C._set_float32_matmul_precision(precision)
Model converted to INT8 successfully.
Sample of quantized model architecture:
SmolLM3ForCausalLM(
  (model): SmolLM3Model(
    (embed_tokens): Embedding(128256, 2048, padding_idx=128004)
    (layers): ModuleList(
      (0-35): 36 x SmolLM3DecoderLayer(
        (self_attn): SmolLM3Attention(
          (q_proj): Linear(in_features=2048, out_features=2048, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([2048, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (k_proj): Linear(in_features=2048, out_features=512, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([512, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (v_proj): Linear(in_features=2048, out_features=512, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([512, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (o_proj): Linear(in_features=2048, out_features=2048, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([2048, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
        )
        (mlp): LigerSwiGLUMLP(
          (gate_proj): Linear(in_features=2048, out_features=11008, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([11008, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (up_proj): Linear(in_features=2048, out_features=11008, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([11008, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (down_proj): Linear(in_features=11008, out_features=2048, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([2048, 11008]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (act_fn): SiLU()
        )
        (input_layernorm): LigerRMSNorm((2048,), eps=1e-06, offset=0.0, in_place=True, row_mode=None)
        (post_attention_layernorm): LigerRMSNorm((2048,), eps=1e-06, offset=0.0, in_place=True, row_mode=None)
      )
    )
    (norm): LigerRMSNorm((2048,), eps=1e-06, offset=0.0, in_place=True, row_mode=None)
    (rotary_emb): SmolLM3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([128256, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
)
✅ Final quantized model saved to: ./smollm-mini-qat-final
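Step 3 is the "convert" half of the prepare/train/convert workflow described in the warnings: the fake-quantize wrappers are swapped for real quantized weights (the LinearActivationQuantizedTensor / AffineQuantizedTensor entries in the printout above), and the result is written to ./smollm-mini-qat-final. A minimal sketch, assuming the same base_config as in the prepare step:

    from torchao.quantization import quantize_
    from torchao.quantization.qat import QATConfig

    # Replace fake quantization with real int8-activation / int4-weight quantized tensors.
    quantize_(model, QATConfig(base_config, step="convert"))

    # torchao tensor subclasses are not safetensors-compatible, so disable safe serialization.
    model.save_pretrained("./smollm-mini-qat-final", safe_serialization=False)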
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/andrewor/scratch/wandb/offline-run-20250821_153507-plwm8t6w
wandb: Find logs at: wandb/offline-run-20250821_153507-plwm8t6w/logs