To offload a model to the GPU, it must fit in VRAM.
The following table lists models by the VRAM size they fit in, assuming Q4_K_M quantization. You can always run a model listed under a smaller VRAM size.
| VRAM | Models |
|---|---|
| 384GB | DeepSeek V3 671b |
| 128GB | Mistral Large 2411 123b |
| 64GB | Qwen2.5 72b |
| 48GB | Llama 3.3 70b |
| 24GB | QwQ Preview 32b, Qwen2.5 Coder 32b, Gemma2 27b |
| 16GB | Phi-4 14b, Qwen2.5 Coder 14b |
| 8GB | Granite 3.2 8b, Llama 3.1 8b, Qwen2.5 Coder 7b, Gemma2 9b |
| 4GB | Granite 3.2 3b, Llama 3.2 3b, Qwen2.5 Coder 3b, Gemma2 2b |
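As a rough sanity check, Q4_K_M averages on the order of 4.8 bits per weight (an approximation, not an exact figure), so a model's footprint is roughly parameters × bits ÷ 8. The one-liner below sketches this for a hypothetical 32b model, ignoring context and KV-cache overhead.

awk 'BEGIN { printf "~%.1f GB\n", 32 * 4.8 / 8 }' # 32b at ~4.8 bits/weight ≈ 19.2GB, fits in 24GB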
TL;DR: Prefer Q5_K_M when available, otherwise Q4_K_M.
- A quantized model takes less disk space, uses less VRAM, and runs faster.
- High-bit modern quantizations like Q5_K_M and Q6_K sacrifice almost no accuracy while gaining substantial efficiency.
- Low-bit quantizations like Q2_K sacrifice noticeable accuracy.
- Q4_K_M is a fast and light option that loses little; Q5_K_M loses almost nothing and is slightly preferred.
- Modern K-quantizations use bits more efficiently, allocating more bits to the more important tensors.
- Q4_K_M, for example, may keep the more compressible tensors in Q4_K, the more important tensors in Q5_K or Q6_K, and always some in unquantized f32.
- The "M" stands for medium. Q4_K is shorthand for Q4_K_M. "S" is small and "L" is large. "S" comes at a cost.
- Generally avoid legacy Q4_0-style quantization and prefer modern Q4_K-style quantization (see the quantize example after this list).
- Quantization is the process of converting a model's weights from floating point to a lower-precision format.
- A larger-parameter model at Q3_K_M is often preferable to a smaller-parameter model at a higher-precision quantization.
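If you quantize models yourself, llama.cpp ships a quantize tool for this. A minimal sketch, assuming a hypothetical f16 GGUF input named model-f16.gguf (the binary is called llama-quantize in current builds, plain quantize in older ones):

llama-quantize model-f16.gguf model-Q5_K_M.gguf Q5_K_M # write a Q5_K_M quant of the f16 model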
Apple Silicon has a unified memory architecture, meaning the CPU and GPU share the same memory. Three-quarters of the unified memory is available to the GPU by default. That means a 24GB M-series machine has the equivalent of 18GB of VRAM available, and a 128GB M-series has the equivalent of 96GB of VRAM, leaving 32GB to the CPU.
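To see that split for your own machine, the one-liner below assumes a 128GB machine and the default three-quarters GPU share.

awk 'BEGIN { total = 128; printf "GPU: %dGB, CPU: %dGB\n", total * 0.75, total * 0.25 }' # prints GPU: 96GB, CPU: 32GB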
The default amount can be increased by setting iogpu.wired_limit_mb to a higher value. Determine your target in GB and multiply it by 1024 to get the MB value; for example, a 116GB target is 116 × 1024 = 118784MB.
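A quick shell check of that arithmetic:

echo $((116 * 1024)) # prints 118784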
sudo sysctl iogpu.wired_limit_mb=118784 # Example 116GB for a 128GB M-series
Once you are comfortable with the chosen amount, you can persist the setting. Create an /etc/sysctl.conf file with the appropriate MB value:
echo "iogpu.wired_limit_mb=118784" | sudo tee -a /etc/sysctl.conf