model | arc_challenge | openbookqa | winogrande | average | tok/sec | RAM (GB) | Disk (GB) |
---|---|---|---|---|---|---|---|
gemma-3-4b-it-qat-UD-Q4_K_XL | 0.5546 | 0.4520 | 0.6796 | 0.562 | 34.39 | 3 | 2.54 |
gemma-3-4b-it-UD-Q2_K_XL | 0.4983 | 0.4440 | 0.6598 | 0.534 | 40.75 | 2.32 | 1.77 |
gemma-3-4b-it-qat-UD-Q2_K_XL | 0.4881 | 0.4420 | 0.6575 | 0.5292 | 40.42 | 2.31 | 1.77 |
gemma-3-4b-it-qat-bf16 | 0.4326 | 0.4340 | 0.6180 | 0.4949 | 12.94 | 9.5 | 10 |
gemma-3-4b-it-qat-4bit | 0.4292 | 0.4240 | 0.6046 | 0.4859 | 42.54 | 2.6 | 3 |
gemma-3-4b-it-4bit | 0.4232 | 0.4080 | 0.6014 | 0.4775 | 42.73 | 2.6 | 3.45 |
gemma-3-4b-it-qat-awq (ultra) | 0.4104 | 0.4180 | 0.5841 | 0.4708 | - | - | 3.27 |
gemma-3-4b-it-qat-awq (optimized) | 0.3951 | 0.4020 | 0.5856 | 0.4609 | - | - | 2.68 |
gemma-3-4b-it-qat-awq (default) | 0.3387 | 0.3480 | 0.5485 | 0.4118 | 53.72 | 2.3 | 2.28 |
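The average column is simply the unweighted mean of the three task accuracies; for the top row, for example:

```python
# The "average" column is the unweighted mean of the three task accuracies
# (values from the gemma-3-4b-it-qat-UD-Q4_K_XL row above).
scores = {"arc_challenge": 0.5546, "openbookqa": 0.4520, "winogrande": 0.6796}
average = sum(scores.values()) / len(scores)
print(round(average, 3))  # 0.562
```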
model | arc_challenge | openbookqa | winogrande | average |
---|---|---|---|---|
gemma-3-4b-it-qat-UD-Q4_K_XL | 0.5546 | 0.4520 | 0.6796 | 0.562 |
gemma-3-4b-it-UD-Q2_K_XL* | 0.4983 | 0.4440 | 0.6598 | 0.534 |
gemma-3-4b-it-qat-UD-Q2_K_XL | 0.4881 | 0.4420 | 0.6575 | 0.5292 |
* non-QAT model
Evaluation was done with lm_eval against a llama_cpp server (this combination is very slow on a Mac).
python -m llama_cpp.server --port 8000 --n_batch 4096 --n_ubatch 4096 --cache True --flash_attn True --model model.gguf
lm_eval --model gguf --model_args base_url=http://localhost:8000 --tasks winogrande,openbookqa,arc_challenge --trust_remote_code --seed 1234 --device mps
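The same run can also be driven from Python through lm_eval's API; a minimal sketch, assuming the llama_cpp server above is already listening on port 8000:

```python
# Sketch: the three-task evaluation via lm_eval's Python API instead of the CLI,
# assuming the llama_cpp server started above is serving on localhost:8000.
import lm_eval

results = lm_eval.simple_evaluate(
    model="gguf",  # same GGUF-over-HTTP backend as the CLI command above
    model_args="base_url=http://localhost:8000",
    tasks=["winogrande", "openbookqa", "arc_challenge"],
)
print(results["results"])
```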
model | arc_challenge | openbookqa | winogrande | average |
---|---|---|---|---|
gemma-3-4b-it-qat-bf16 | 0.4326 | 0.4340 | 0.6180 | 0.4949 |
gemma-3-4b-it-qat-4bit | 0.4292 | 0.4240 | 0.6046 | 0.4859 |
gemma-3-4b-it-4bit* | 0.4232 | 0.4080 | 0.6014 | 0.4775 |
* non-QAT model
MLX evaluation was much faster (about 20 minutes instead of 11 hours).
mlx_lm.evaluate \
--model mlx_model \
--tasks arc_challenge winogrande openbookqa \
--seed 1234
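Before kicking off the full evaluation, a quick generation check confirms the converted model loads correctly; a minimal sketch, where mlx_model is assumed to be the local conversion output directory used above:

```python
# Quick smoke test of a converted MLX model before the longer evaluation run.
# "mlx_model" is assumed to be the local output directory referenced above.
from mlx_lm import load, generate

model, tokenizer = load("mlx_model")
reply = generate(model, tokenizer, prompt="Question: What is 2 + 2?\nAnswer:", max_tokens=32)
print(reply)
```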
model | bits | embed bits | num samples | sequence length | n-grid | arc_challenge | openbookqa | winogrande | average |
---|---|---|---|---|---|---|---|---|---|
gemma-3-4b-it-qat-awq (ultra) | 4 | 8 | 64 | 1024 | 20 | 0.4104 | 0.4180 | 0.5841 | 0.4708 |
gemma-3-4b-it-qat-awq (optimized) | 4 | 4 | 64 | 2048 | 20 | 0.3951 | 0.4020 | 0.5856 | 0.4609 |
gemma-3-4b-it-qat-awq (default) | 4 | 4 | 32 | 2048 | 10 | 0.3387 | 0.3480 | 0.5485 | 0.4118 |
The same mlx_lm.evaluate command was used.
Default conversion command (everything after the # is commented out, so the tool's default settings were used; the comment lists them for reference)
mlx_lm.awq --model mlx-community/gemma-3-4b-it-qat-bf16 # --bits 4 --group-size 64 --embed-bits 4 --embed-group-size 32 --num-samples 32 --sequence-length 2048 --n-grid 10 --seed 123
Optimized conversion command
mlx_lm.awq --model mlx-community/gemma-3-4b-it-qat-bf16 --bits 4 --group-size 64 --embed-bits 4 --embed-group-size 32 --num-samples 64 --sequence-length 2048 --n-grid 20 --seed 123
Ultra conversion command (maximum settings before hitting OOM on 16 GB of VRAM)
mlx_lm.awq --model mlx-community/gemma-3-4b-it-qat-bf16 --bits 4 --group-size 64 --embed-bits 8 --embed-group-size 64 --num-samples 64 --sequence-length 1024 --n-grid 20 --seed 123
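To double-check what actually got applied, the quantization settings recorded in the converted model can be read back; a minimal sketch, assuming the usual mlx_lm output layout with a quantization block in config.json, and with the directory name as a placeholder for wherever the conversion wrote its output:

```python
# Sketch: read back the quantization settings recorded by the conversion,
# assuming the standard mlx_lm layout with a "quantization" entry in config.json.
# "gemma-3-4b-it-qat-awq" is a placeholder for the local output directory.
import json

with open("gemma-3-4b-it-qat-awq/config.json") as f:
    config = json.load(f)

print(config.get("quantization"))  # bits / group size actually written out
```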