LLM models high level overview

Models

To offload a model to the GPU, the model must fit in VRAM.

The following table lists models assuming Q4_K_M quantization. You can always use a model sized for a smaller VRAM tier, and a rough way to estimate requirements yourself follows the table.

VRAM    Models
384GB   DeepSeek V3 671b
128GB   Mistral Large 2411 123b
64GB    Qwen2.5 72b
48GB    Llama 3.3 70b
24GB    QwQ Preview 32b, Qwen2.5 Coder 32b, Gemma2 27b
16GB    Phi-4 14b, Qwen2.5 Coder 14b
8GB     Granite 3.2 8b, Llama 3.1 8b, Qwen2.5 Coder 7b, Gemma2 9b
4GB     Granite 3.2 3b, Llama 3.2 3b, Qwen2.5 Coder 3b, Gemma2 2b
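
As a rough sanity check before downloading, you can estimate the weight size yourself: parameter count times bits per weight, divided by 8, gives gigabytes of weights, with the KV cache and context adding overhead on top. The ~4.8 bits per weight used below is an approximation for Q4_K_M, not an exact figure.

# Approximate weight size for a 70b model at ~4.8 bits/weight (roughly Q4_K_M)
echo "scale=1; 70 * 4.8 / 8" | bc   # ~42.0GB before KV cache and context overhead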

Quantization

TL;DR: Prefer Q5_K_M when available, otherwise Q4_K_M.

  • A quantized model takes less disk space, uses less VRAM, and runs faster.
  • Modern high-bit quantizations like Q5_K_M and Q6_K sacrifice almost no accuracy while gaining substantial efficiency.
  • Low-bit quantizations like Q2_K sacrifice noticeable accuracy.
  • Q4_K_M is a fast and light option that loses little. Q5_K_M loses almost nothing and is slightly preferred.
  • Modern K-quantization uses its bits more efficiently, allocating more bits to the most important tensors.
  • A Q4_K_M model, for example, may keep the more compressible tensors in Q4_K, store the more important tensors in Q5_K or Q6_K, and always leaves some in unquantized f32.
  • The "M" stands for medium, and Q4_K is shorthand for Q4_K_M. "S" is small and "L" is large; "S" comes at a quality cost.
  • Generally avoid legacy Q4_0-style quantization and prefer modern Q4_K-style quantization.
  • Quantization is the process of converting a model's weights from floating-point to a lower-precision format.
  • A larger-parameter model at Q3_K_M is often preferable to a smaller-parameter model at a higher-precision quantization.
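
If you produce your own GGUF files with llama.cpp, picking a quantization is a single step after conversion. A minimal sketch, assuming a recent llama.cpp checkout (the script and binary names have changed across versions):

# Convert Hugging Face weights to an f16 GGUF, then quantize to Q5_K_M
python3 convert_hf_to_gguf.py ./my-model --outfile my-model-f16.gguf
./llama-quantize my-model-f16.gguf my-model-Q5_K_M.gguf Q5_K_M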

Apple Silicon

VRAM

Apple Silicon has a unified memory architecture, meaning the CPU and GPU share the same memory. By default, three-quarters of the unified memory is available to the GPU. That means a 24GB M-series Mac effectively has 18GB of VRAM available, and a 128GB M-series Mac has 96GB of VRAM, with 32GB reserved for the CPU.
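
For example, the default three-quarters split works out to:

echo $((128 * 3 / 4))   # 96GB available to the GPU on a 128GB machine
echo $((24 * 3 / 4))    # 18GB on a 24GB machine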

The default amount can be increased by setting iogpu.wired_limit_mb to a higher value. Determine your target in GB and multiply it by 1024 to get the MB value. For example, a 116GB VRAM target is 116*1024MB, which is 118784MB.
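
The conversion is simple shell arithmetic:

echo $((116 * 1024))   # 118784MB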

sudo sysctl iogpu.wired_limit_mb=118784 # Example 116GB for a 128GB M-series
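
You can check the current limit at any time; a reading of 0 should mean the built-in default split is in effect (treat that interpretation as an assumption to verify on your macOS version):

sysctl iogpu.wired_limit_mb   # prints the current wired limit in MB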

Once you are comfortable with the amount chosen, you can persist the setting. Create an /etc/sysctl.conf file with the appropriate MB value:

echo "iogpu.wired_limit_mb=118784" | sudo tee -a /etc/sysctl.conf