
# Gemma 3 4B Quantization Comparison on Mac (M4)

| model | arc_challenge | openbookqa | winogrande | average | tok/sec | RAM (GB) | Disk (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma-3-4b-it-qat-UD-Q4_K_XL | 0.5546 | 0.4520 | 0.6796 | 0.562 | 34.39 | 3 | 2.54 |
| gemma-3-4b-it-UD-Q2_K_XL | 0.4983 | 0.4440 | 0.6598 | 0.534 | 40.75 | 2.32 | 1.77 |
| gemma-3-4b-it-qat-UD-Q2_K_XL | 0.4881 | 0.4420 | 0.6575 | 0.5292 | 40.42 | 2.31 | 1.77 |
| gemma-3-4b-it-qat-bf16 | 0.4326 | 0.4340 | 0.6180 | 0.4949 | 12.94 | 9.5 | 10 |
| gemma-3-4b-it-qat-4bit | 0.4292 | 0.4240 | 0.6046 | 0.4859 | 42.54 | 2.6 | 3 |
| gemma-3-4b-it-4bit | 0.4232 | 0.4080 | 0.6014 | 0.4775 | 42.73 | 2.6 | 3.45 |
| gemma-3-4b-it-qat-awq (ultra) | 0.4104 | 0.4180 | 0.5841 | 0.4708 | - | - | 3.27 |
| gemma-3-4b-it-qat-awq (optimized) | 0.3951 | 0.4020 | 0.5856 | 0.4609 | - | - | 2.68 |
| gemma-3-4b-it-qat-awq (default) | 0.3387 | 0.3480 | 0.5485 | 0.4118 | 53.72 | 2.3 | 2.28 |

GGUF (llama.cpp) results:

| model | arc_challenge | openbookqa | winogrande | average |
| --- | --- | --- | --- | --- |
| gemma-3-4b-it-qat-UD-Q4_K_XL | 0.5546 | 0.4520 | 0.6796 | 0.562 |
| gemma-3-4b-it-UD-Q2_K_XL\* | 0.4983 | 0.4440 | 0.6598 | 0.534 |
| gemma-3-4b-it-qat-UD-Q2_K_XL | 0.4881 | 0.4420 | 0.6575 | 0.5292 |

\* non-QAT model

The GGUF models were evaluated with lm_eval against a local llama_cpp server (this combination is very slow on a Mac).

```bash
python -m llama_cpp.server --port 8000 \
    --n_batch 4096 --n_ubatch 4096 \
    --cache True --flash_attn True \
    --model model.gguf
```

```bash
lm_eval --model gguf \
    --model_args base_url=http://localhost:8000 \
    --tasks winogrande,openbookqa,arc_challenge \
    --trust_remote_code --seed 1234 --device mps
```
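The server can be sanity-checked before kicking off the multi-hour lm_eval run. A minimal sketch (not part of the original setup), assuming the OpenAI-compatible endpoints that llama_cpp.server exposes on port 8000:

```python
# Quick check that the llama_cpp server is up and can run inference.
import requests

BASE_URL = "http://localhost:8000"

# List the model(s) the server has loaded.
models = requests.get(f"{BASE_URL}/v1/models", timeout=10).json()
print([m["id"] for m in models["data"]])

# Run a tiny completion to verify inference works end to end.
resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={"prompt": "The capital of France is", "max_tokens": 8, "temperature": 0},
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```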
MLX results:

| model | arc_challenge | openbookqa | winogrande | average |
| --- | --- | --- | --- | --- |
| gemma-3-4b-it-qat-bf16 | 0.4326 | 0.4340 | 0.6180 | 0.4949 |
| gemma-3-4b-it-qat-4bit | 0.4292 | 0.4240 | 0.6046 | 0.4859 |
| gemma-3-4b-it-4bit\* | 0.4232 | 0.4080 | 0.6014 | 0.4775 |

\* non-QAT model

MLX evaluation was much faster (about 20 minutes instead of 11 hours).

```bash
mlx_lm.evaluate \
    --model mlx_model \
    --tasks arc_challenge winogrande openbookqa \
    --seed 1234
```
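The MLX models can also be smoke-tested directly from Python. A minimal sketch using mlx_lm's load/generate API (the "mlx_model" path matches the directory used above; exact generate keyword arguments may differ slightly between mlx_lm versions):

```python
# Load the quantized MLX model and generate a short answer to confirm
# the weights and chat template behave as expected.
from mlx_lm import load, generate

model, tokenizer = load("mlx_model")  # same directory as --model above

# gemma-3-4b-it is instruction-tuned, so wrap the question in the chat template.
messages = [{"role": "user", "content": "Which gas do plants absorb from the air?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=32)
print(text)
```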

## AWQ on MLX

| model | bits | embed bits | sequence length | n grid | arc_challenge | openbookqa | winogrande | average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemma-3-4b-it-qat-awq (ultra) | 4 | 8 | 1024 | 20 | 0.4104 | 0.4180 | 0.5841 | 0.4708 |
| gemma-3-4b-it-qat-awq (optimized) | 4 | 4 | 2048 | 20 | 0.3951 | 0.4020 | 0.5856 | 0.4609 |
| gemma-3-4b-it-qat-awq (default) | 4 | 4 | 2048 | 10 | 0.3387 | 0.3480 | 0.5485 | 0.4118 |

The same mlx_lm.evaluate command as above was used.

Default conversion command:

```bash
mlx_lm.awq --model mlx-community/gemma-3-4b-it-qat-bf16
# defaults: --bits 4 --group-size 64 --embed-bits 4 --embed-group-size 32 --num-samples 32 --sequence-length 2048 --n-grid 10 --seed 123
```

Optimized conversion command:

```bash
mlx_lm.awq --model mlx-community/gemma-3-4b-it-qat-bf16 \
    --bits 4 --group-size 64 \
    --embed-bits 4 --embed-group-size 32 \
    --num-samples 64 --sequence-length 2048 \
    --n-grid 20 --seed 123
```

Ultra conversion command (maximum values before OOM on 16 GB VRAM):

```bash
mlx_lm.awq --model mlx-community/gemma-3-4b-it-qat-bf16 \
    --bits 4 --group-size 64 \
    --embed-bits 8 --embed-group-size 64 \
    --num-samples 64 --sequence-length 1024 \
    --n-grid 20 --seed 123
```
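For a quick qualitative comparison of the three AWQ conversions, each converted model can be run on the same prompt. A sketch only; the directory names below are placeholders for wherever mlx_lm.awq wrote its output:

```python
# Generate from the same prompt with each AWQ variant to eyeball quality.
# The paths are placeholders; point them at the actual output directories
# produced by the mlx_lm.awq runs above.
from mlx_lm import load, generate

variants = {
    "default": "gemma-3-4b-it-qat-awq-default",
    "optimized": "gemma-3-4b-it-qat-awq-optimized",
    "ultra": "gemma-3-4b-it-qat-awq-ultra",
}

prompt = "Explain in one sentence why the sky is blue."

for name, path in variants.items():
    model, tokenizer = load(path)
    out = generate(model, tokenizer, prompt=prompt, max_tokens=48)
    print(f"[{name}] {out}\n")
```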