Debug SmolLM3 QAT
$ GRADIO_SERVER_NAME="0.0.0.0" python test_sayak.py
/home/andrewor/local/ao/torchao/utils.py:408: UserWarning: TORCH_VERSION_AT_LEAST_2_8 is deprecated and will be removed in torchao 0.14.0
  warnings.warn(self.msg)
/home/andrewor/local/ao/torchao/utils.py:408: UserWarning: TORCH_VERSION_AT_LEAST_2_7 is deprecated and will be removed in torchao 0.14.0
  warnings.warn(self.msg)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 65.28it/s]
Step 1: Applying QAT observers to the model...
/home/andrewor/local/ao/torchao/quantization/qat/utils.py:84: UserWarning: 'FakeQuantizeConfig' is deprecated and will be removed in a future release. Please use the following API instead:

    base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
    quantize_(model, QATConfig(base_config, step="prepare"))
    # train (not shown)
    quantize_(model, QATConfig(base_config, step="convert"))

Alternatively, if you prefer to pass in fake quantization configs:

    activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
    weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
    qat_config = QATConfig(
        activation_config=activation_config,
        weight_config=weight_config,
        step="prepare",
    )
    quantize_(model, qat_config)

Please see https://github.com/pytorch/ao/issues/2630 for more details.
  warnings.warn(
/home/andrewor/local/ao/torchao/quantization/qat/utils.py:84: UserWarning: 'IntXQuantizationAwareTrainingConfig' is deprecated and will be removed in a future release. Please use the following API instead:

    base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
    quantize_(model, QATConfig(base_config, step="prepare"))
    # train (not shown)
    quantize_(model, QATConfig(base_config, step="convert"))

Alternatively, if you prefer to pass in fake quantization configs:

    activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
    weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
    qat_config = QATConfig(
        activation_config=activation_config,
        weight_config=weight_config,
        step="prepare",
    )
    quantize_(model, qat_config)

Please see https://github.com/pytorch/ao/issues/2630 for more details.
  warnings.warn(
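
Note: the two warnings above are the same migration notice, fired once for 'FakeQuantizeConfig' and once for 'IntXQuantizationAwareTrainingConfig'. Stitched into one runnable sketch (the quantize_/QATConfig calls are quoted from the warning text; the import paths and checkpoint name are my assumptions for a recent torchao and transformers), the recommended flow is:

    import torch
    from transformers import AutoModelForCausalLM
    from torchao.quantization import quantize_, Int8DynamicActivationInt4WeightConfig
    from torchao.quantization.qat import QATConfig

    # assumed checkpoint; the vocab/hidden sizes in this log match SmolLM3-3B
    model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B")

    # Step 1: insert fake-quantize observers before fine-tuning
    base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
    quantize_(model, QATConfig(base_config, step="prepare"))

    # Step 2: fine-tune as usual (training log below)

    # Step 3: materialize real quantized tensors; this same call replaces the
    # deprecated 'FromIntXQuantizationAwareTrainingConfig' warned about below
    quantize_(model, QATConfig(base_config, step="convert"))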
Step 2: Starting QAT fine-tuning...
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.21.1
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Run data is saved locally in /home/andrewor/scratch/wandb/offline-run-20250821_153507-plwm8t6w
* Trackio project initialized: huggingface
* Trackio metrics logged to: /home/andrewor/.cache/huggingface/trackio
* View dashboard by running in your terminal:
trackio show --project "huggingface"
* or by running in Python: trackio.show(project="huggingface")
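
Side note (not from the log): the interactive wandb prompt above can be skipped entirely by setting WANDB_MODE before anything initializes wandb, matching the offline mode this run ended up in:

    import os
    os.environ["WANDB_MODE"] = "offline"  # or "disabled" to turn W&B logging off entirely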
0%|          | 0/2 [00:00<?, ?it/s]NCCL version 2.27.5+cuda12.8
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
100%|███████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.55s/it]wandb: WARNING URL not available in offline run
{'train_runtime': 21.6025, 'train_samples_per_second': 1.481, 'train_steps_per_second': 0.093, 'train_loss': 0.11030352115631104, 'num_tokens': 24576.0, 'epoch': 0.04}
100%|███████████████████████████████████████████████████████████████████████████████| 2/2 [00:19<00:00, 2.55s/it]* Uploading logs to Trackio Space: http://localhost:7860/ (please wait...)
100%|███████████████████████████████████████████████████████████████████████████████| 2/2 [00:20<00:00, 10.40s/it]
QAT fine-tuning complete.
Step 3: Converting the model to a fully quantized state...
/home/andrewor/local/ao/torchao/quantization/qat/utils.py:84: UserWarning: 'FromIntXQuantizationAwareTrainingConfig' is deprecated and will be removed in a future release. Please use the following API instead:

    base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
    quantize_(model, QATConfig(base_config, step="prepare"))
    # train (not shown)
    quantize_(model, QATConfig(base_config, step="convert"))

Alternatively, if you prefer to pass in fake quantization configs:

    activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
    weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
    qat_config = QATConfig(
        activation_config=activation_config,
        weight_config=weight_config,
        step="prepare",
    )
    quantize_(model, qat_config)

Please see https://github.com/pytorch/ao/issues/2630 for more details.
  warnings.warn(
/home/andrewor/local/pytorch/torch/__init__.py:1605: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/andrewor/local/pytorch/aten/src/ATen/Context.cpp:80.)
  _C._set_float32_matmul_precision(precision)
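
The TF32 warning names its own replacements; as a short sketch using exactly the knobs quoted in the message (new-style per-backend precision controls in recent PyTorch builds):

    import torch
    torch.backends.cuda.matmul.fp32_precision = "tf32"  # or "ieee" for strict fp32 matmuls
    torch.backends.cudnn.conv.fp32_precision = "tf32"
    # instead of the settings deprecated after PyTorch 2.9:
    # torch.backends.cuda.matmul.allow_tf32 = True
    # torch.backends.cudnn.allow_tf32 = True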
Model converted to INT8 successfully.
Sample of quantized model architecture:
SmolLM3ForCausalLM(
  (model): SmolLM3Model(
    (embed_tokens): Embedding(128256, 2048, padding_idx=128004)
    (layers): ModuleList(
      (0-35): 36 x SmolLM3DecoderLayer(
        (self_attn): SmolLM3Attention(
          (q_proj): Linear(in_features=2048, out_features=2048, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([2048, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (k_proj): Linear(in_features=2048, out_features=512, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([512, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (v_proj): Linear(in_features=2048, out_features=512, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([512, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (o_proj): Linear(in_features=2048, out_features=2048, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([2048, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
        )
        (mlp): LigerSwiGLUMLP(
          (gate_proj): Linear(in_features=2048, out_features=11008, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([11008, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (up_proj): Linear(in_features=2048, out_features=11008, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([11008, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (down_proj): Linear(in_features=11008, out_features=2048, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([2048, 11008]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (act_fn): SiLU()
        )
        (input_layernorm): LigerRMSNorm((2048,), eps=1e-06, offset=0.0, in_place=True, row_mode=None)
        (post_attention_layernorm): LigerRMSNorm((2048,), eps=1e-06, offset=0.0, in_place=True, row_mode=None)
      )
    )
    (norm): LigerRMSNorm((2048,), eps=1e-06, offset=0.0, in_place=True, row_mode=None)
    (rotary_emb): SmolLM3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([128256, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
)
✅ Final quantized model saved to: ./smollm-mini-qat-final
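
Note on the dump above: quant_min=-8, quant_max=7 means the weight values are 4-bit, held in int8 storage, with int8 dynamic per-token activation quantization, i.e. the Int8DynamicActivationInt4Weight scheme rather than plain INT8. A quick spot-check that the conversion took (my addition; assumes `model` is the converted module from the script):

    # should print the LinearActivationQuantizedTensor / AffineQuantizedTensor
    # repr shown in the architecture dump above
    print(model.model.layers[0].self_attn.q_proj.weight)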
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/andrewor/scratch/wandb/offline-run-20250821_153507-plwm8t6w
wandb: Find logs at: wandb/offline-run-20250821_153507-plwm8t6w/logs