@andrewor14
Created August 21, 2025 22:38
Debug SmolLM3 QAT
$ GRADIO_SERVER_NAME="0.0.0.0" python test_sayak.py
/home/andrewor/local/ao/torchao/utils.py:408: UserWarning: TORCH_VERSION_AT_LEAST_2_8 is deprecated and will be removed in torchao 0.14.0
warnings.warn(self.msg)
/home/andrewor/local/ao/torchao/utils.py:408: UserWarning: TORCH_VERSION_AT_LEAST_2_7 is deprecated and will be removed in torchao 0.14.0
warnings.warn(self.msg)
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 65.28it/s]
Step 1: Applying QAT observers to the model...
/home/andrewor/local/ao/torchao/quantization/qat/utils.py:84: UserWarning: 'FakeQuantizeConfig' is deprecated and will be removed in a future release. Please use the following API instead:

    base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
    quantize_(model, QATConfig(base_config, step="prepare"))
    # train (not shown)
    quantize_(model, QATConfig(base_config, step="convert"))

Alternatively, if you prefer to pass in fake quantization configs:

    activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
    weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
    qat_config = QATConfig(
        activation_config=activation_config,
        weight_config=weight_config,
        step="prepare",
    )
    quantize_(model, qat_config)

Please see https://github.com/pytorch/ao/issues/2630 for more details.
  warnings.warn(
/home/andrewor/local/ao/torchao/quantization/qat/utils.py:84: UserWarning: 'IntXQuantizationAwareTrainingConfig' is deprecated and will be removed in a future release. Please use the following API instead:

    base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
    quantize_(model, QATConfig(base_config, step="prepare"))
    # train (not shown)
    quantize_(model, QATConfig(base_config, step="convert"))

Alternatively, if you prefer to pass in fake quantization configs:

    activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
    weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
    qat_config = QATConfig(
        activation_config=activation_config,
        weight_config=weight_config,
        step="prepare",
    )
    quantize_(model, qat_config)

Please see https://github.com/pytorch/ao/issues/2630 for more details.
  warnings.warn(
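Both warnings above point at the same replacement API. As a minimal sketch, the Step 1 "prepare" pass with the recommended (non-deprecated) torchao API would look roughly like the following; the checkpoint name and device are assumptions, since test_sayak.py itself is not shown in this log:

    import torch
    from transformers import AutoModelForCausalLM
    from torchao.quantization import Int8DynamicActivationInt4WeightConfig, quantize_
    from torchao.quantization.qat import QATConfig

    # Hypothetical checkpoint; the model actually loaded by test_sayak.py is not shown here.
    model = AutoModelForCausalLM.from_pretrained(
        "HuggingFaceTB/SmolLM3-3B", torch_dtype=torch.bfloat16, device_map="cuda"
    )

    # int8 dynamic per-token activations + int4 grouped weights (group_size=32),
    # matching the config printed in the deprecation message above.
    base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)

    # Insert fake quantization into every nn.Linear before fine-tuning.
    quantize_(model, QATConfig(base_config, step="prepare"))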
Step 2: Starting QAT fine-tuning...
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.21.1
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Run data is saved locally in /home/andrewor/scratch/wandb/offline-run-20250821_153507-plwm8t6w
* Trackio project initialized: huggingface
* Trackio metrics logged to: /home/andrewor/.cache/huggingface/trackio
* View dashboard by running in your terminal:
trackio show --project "huggingface"
* or by running in Python: trackio.show(project="huggingface")
0%| | 0/2 [00:00<?, ?it/s]NCCL version 2.27.5+cuda12.8
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.55s/it]wandb: WARNING URL not available in offline run
{'train_runtime': 21.6025, 'train_samples_per_second': 1.481, 'train_steps_per_second': 0.093, 'train_loss': 0.11030352115631104, 'num_tokens': 24576.0, 'epoch': 0.04}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:19<00:00, 2.55s/it]* Uploading logs to Trackio Space: http://localhost:7860/ (please wait...)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:20<00:00, 10.40s/it]
QAT fine-tuning complete.
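The metrics dict above (train_runtime, train_samples_per_second, ...) and the wandb/Trackio setup look like Hugging Face Trainer logging. A rough sketch of the fine-tuning step under that assumption; the hyperparameters, output path, and train_dataset are illustrative and not taken from test_sayak.py:

    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="./smollm-mini-qat",    # hypothetical path
        per_device_train_batch_size=2,     # illustrative hyperparameters
        gradient_checkpointing=True,       # matches the use_cache=False notice above
        max_steps=2,                       # the log shows a 2-step debug run
        bf16=True,
        report_to="wandb",
    )

    # train_dataset is a placeholder for whatever tokenized dataset the script uses.
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
    trainer.train()  # fake quantization stays active, so weights adapt to the quantization grid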
Step 3: Converting the model to a fully quantized state...
/home/andrewor/local/ao/torchao/quantization/qat/utils.py:84: UserWarning: 'FromIntXQuantizationAwareTrainingConfig' is deprecated and will be removed in a future release. Please use the following API instead:

    base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
    quantize_(model, QATConfig(base_config, step="prepare"))
    # train (not shown)
    quantize_(model, QATConfig(base_config, step="convert"))

Alternatively, if you prefer to pass in fake quantization configs:

    activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
    weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
    qat_config = QATConfig(
        activation_config=activation_config,
        weight_config=weight_config,
        step="prepare",
    )
    quantize_(model, qat_config)

Please see https://github.com/pytorch/ao/issues/2630 for more details.
  warnings.warn(
/home/andrewor/local/pytorch/torch/__init__.py:1605: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/andrewor/local/pytorch/aten/src/ATen/Context.cpp:80.)
_C._set_float32_matmul_precision(precision)
Model converted to INT8 successfully.
Sample of quantized model architecture:
SmolLM3ForCausalLM(
  (model): SmolLM3Model(
    (embed_tokens): Embedding(128256, 2048, padding_idx=128004)
    (layers): ModuleList(
      (0-35): 36 x SmolLM3DecoderLayer(
        (self_attn): SmolLM3Attention(
          (q_proj): Linear(in_features=2048, out_features=2048, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([2048, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (k_proj): Linear(in_features=2048, out_features=512, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([512, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (v_proj): Linear(in_features=2048, out_features=512, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([512, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (o_proj): Linear(in_features=2048, out_features=2048, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([2048, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
        )
        (mlp): LigerSwiGLUMLP(
          (gate_proj): Linear(in_features=2048, out_features=11008, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([11008, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (up_proj): Linear(in_features=2048, out_features=11008, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([11008, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (down_proj): Linear(in_features=11008, out_features=2048, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([2048, 11008]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
          (act_fn): SiLU()
        )
        (input_layernorm): LigerRMSNorm((2048,), eps=1e-06, offset=0.0, in_place=True, row_mode=None)
        (post_attention_layernorm): LigerRMSNorm((2048,), eps=1e-06, offset=0.0, in_place=True, row_mode=None)
      )
    )
    (norm): LigerRMSNorm((2048,), eps=1e-06, offset=0.0, in_place=True, row_mode=None)
    (rotary_emb): SmolLM3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, weight=LinearActivationQuantizedTensor(activation=<function _int8_asymm_per_token_quant at 0x7f85d48da170>, weight=AffineQuantizedTensor(shape=torch.Size([128256, 2048]), block_size=(1, 32), device=cuda:0, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=-8, quant_max=7)))
)
✅ Final quantized model saved to: ./smollm-mini-qat-final
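Step 3 is the "convert" half of the prepare/train/convert workflow described in the warnings: the fake-quantize wrappers are swapped for real quantized weights (the LinearActivationQuantizedTensor / AffineQuantizedTensor entries in the printout above), and the result is written to ./smollm-mini-qat-final. A minimal sketch, assuming the same base_config as in the prepare step:

    from torchao.quantization import quantize_
    from torchao.quantization.qat import QATConfig

    # Replace fake quantization with real int8-activation / int4-weight quantized tensors.
    quantize_(model, QATConfig(base_config, step="convert"))

    # torchao tensor subclasses are not safetensors-compatible, so disable safe serialization.
    model.save_pretrained("./smollm-mini-qat-final", safe_serialization=False)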
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/andrewor/scratch/wandb/offline-run-20250821_153507-plwm8t6w
wandb: Find logs at: wandb/offline-run-20250821_153507-plwm8t6w/logs