Running a 26B parameter model with only 6GB RAM using mmap
This guide covers running Gemma 4 26B MOE (Mixture of Experts) locally on an Apple Silicon Mac using llama.cpp with memory-mapped files. The MOE architecture activates only 8 of its 128 experts per token, so each token touches roughly 3.8B of the model's 25.2B parameters.
| Spec | Value |
|---|---|
| Total Parameters | 25.2B |
| Active Parameters | 3.8B (per token) |
| Experts | 128 total, 8 active |
| Context Window | 262K tokens |
| Model File | 25.9 GB |
| RAM Usage | ~6 GB (with mmap) |
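A quick sanity check on the ratios in the table above, using nothing but those numbers (bc ships with macOS): 8 of 128 experts is 6.25%, while 3.8B of 25.2B is about 15%; the gap is mostly the layers shared by every token (attention, embeddings), which count toward the active total regardless of expert routing.

```bash
# Ratios from the spec table (plain arithmetic, nothing model-specific)
echo "scale=4; 8 / 128" | bc      # .0625 -> 6.25% of experts fire per token
echo "scale=4; 3.8 / 25.2" | bc   # .1507 -> ~15% of all weights active per token
```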
Install llama.cpp with Homebrew:

```bash
brew install llama.cpp
```

Verify installation:
```bash
llama-cli --version
```

Requires version 8680+ for Gemma 4 support.
```bash
# Check RAM
system_profiler SPHardwareDataType | grep -E "Memory|Chip"

# Check disk space (need ~30GB free)
df -h /
```

Download the model:

```bash
mkdir -p ~/.cache/gguf
cd ~/.cache/gguf
curl -L -o gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  "https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf"
```

Size: 27.9 GB. Download time: ~40 minutes at 12 MB/s.
| Quantization | Size | Quality | Download |
|---|---|---|---|
| UD-Q8_K_XL | 27.9 GB | Best | Download |
| Q8_0 | 26.9 GB | Excellent | Download |
| UD-Q6_K | 22.9 GB | Very Good | Download |
| UD-Q5_K_M | 21.2 GB | Good | Download |
| UD-Q4_K_M | 16.9 GB | Balanced | Download |
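To confirm a file's size before committing the disk space, a HEAD request shows the content length. The URL below is the UD-Q8_K_XL file used in this guide; substitute the filename of the quantization you want (exact filenames are listed in the Hugging Face repo):

```bash
# -I sends a HEAD request, -L follows the redirect to the CDN; look for content-length
curl -sIL "https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf" \
  | grep -i content-length
```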
Start the server:

```bash
llama-server \
  -m ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8082
```
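The first start takes a little while because pages are faulted in on demand. Recent llama.cpp builds expose a /health endpoint you can poll until the model is ready (a small sketch; adjust the port if you changed it):

```bash
# Poll until llama-server returns HTTP 200 (it returns 503 while the model is still loading)
until curl -sf http://localhost:8082/health > /dev/null; do
  echo "waiting for model to load..."
  sleep 2
done
echo "server is ready"
```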
Query via curl:

```bash
curl http://localhost:8082/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4",
"messages": [{"role": "user", "content": "What is NSCLC?"}],
"max_tokens": 200
}'
```

Query via Python:

```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8082/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="gemma4",
messages=[{"role": "user", "content": "What is NSCLC?"}],
max_tokens=200
)
print(response.choices[0].message.content)
```

Run an interactive chat session with llama-cli:

```bash
llama-cli \
  -m ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 99 \
  --conversation
```

Benchmark throughput with llama-bench:

```bash
llama-bench \
  -m ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 99 \
  -n 128 \
  -p 32
```
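llama-bench accepts comma-separated values for most of its test parameters, so you can compare a CPU-only run with a fully offloaded one in a single invocation (a sketch; output columns vary slightly between builds):

```bash
# Run the same benchmark with no GPU offload and with all layers on Metal
llama-bench \
  -m ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 0,99 \
  -n 128
```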
Flag reference:

| Flag | Purpose |
|---|---|
| `-m <path>` | Path to GGUF model file |
| `-ngl 99` | Offload all layers to GPU (Metal) |
| `--host 0.0.0.0` | Listen on all interfaces (network access) |
| `--port <n>` | Server port |
| `--mlock` | Lock the model in RAM so pages can't be swapped out |
| `--no-mmap` | Disable memory mapping (slower load, more RAM) |
Note: mmap is ON by default. The model stays on SSD and pages are loaded on demand.
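You can see this yourself by comparing the model's size on disk with the server's resident memory while it is serving requests (a quick sketch using standard macOS tools; exact numbers vary with context length and load):

```bash
# Model size on disk
du -h ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf

# Resident memory of the running server, in GB (ps reports RSS in KB)
ps -o rss= -p "$(pgrep -x llama-server | head -n 1)" | awk '{printf "%.1f GB\n", $1 / 1024 / 1024}'
```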
| Metric | Value |
|---|---|
| Model Size | 25.9 GB |
| RAM Usage | 5.9 GB |
| Prompt Processing | 268 tok/s |
| Token Generation | 49.4 tok/s |
| Bits Per Weight | 8.83 |
| Metric | Q5_K_M | UD-Q8_K_XL |
|---|---|---|
| File Size | 17.8 GB | 25.9 GB |
| RAM (mmap) | 6.3 GB | 5.9 GB |
| Generation | 50.9 tok/s | 49.4 tok/s |
| Quality | 6.06 BPW | 8.83 BPW |
| Model | Size | RAM | Generation |
|---|---|---|---|
| MedGemma 1.5 4B (bf16) | 9.3 GB | 9.3 GB | 36 tok/s |
| Gemma 4 26B MOE (Q8) | 25.9 GB | 5.9 GB | 49 tok/s |
The 26B MOE model is faster and uses less RAM than the 4B dense model!
- 128 experts total, but only 8 activate per token
- Active parameters: 3.8B (not 25.2B)
- Inactive expert weights stay on SSD with mmap
- Model file stays on SSD
- Only accessed pages load into RAM
- Fast NVMe = negligible latency penalty
- Perfect for MOE where most weights are cold
- Unified memory architecture
- Metal GPU acceleration (`-ngl 99`)
- Fast NVMe storage (3+ GB/s)
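A back-of-the-envelope check of the numbers above: at 8.83 bits per weight, the 3.8B active parameters come to roughly 4.2 GB, which accounts for most of the ~5.9 GB resident set; the remainder is KV cache, compute buffers, and whichever recently used experts stay cached in RAM (a rough estimate, not an exact accounting):

```bash
# Active parameters (billions) * bits per weight / 8 = GB of weights touched per token
echo "scale=2; 3.8 * 8.83 / 8" | bc   # ≈ 4.19 GB
```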
Update llama.cpp:
```bash
brew upgrade llama.cpp
```

If the port is already in use:

```bash
# Find what's using the port
lsof -i :8082
# Kill llama servers
killall llama-server
```

Check GPU offloading is enabled:

```bash
# Should see "offloaded X/X layers to GPU"
llama-server -m <model> -ngl 99 2>&1 | grep offload
```

If disk space is tight:

```bash
# Check space
df -h /
# Clean up old models
ls -lh ~/.cache/gguf/
```

Optional launcher script. Save as ~/bin/gemma4-server:

```bash
#!/bin/bash
MODEL="${GEMMA4_MODEL:-$HOME/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf}"
PORT="${GEMMA4_PORT:-8082}"
if [ ! -f "$MODEL" ]; then
echo "Model not found. Downloading..."
mkdir -p ~/.cache/gguf
curl -L -o "$MODEL" \
"https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf"
fi
echo "Starting Gemma 4 26B MOE server on port $PORT..."
exec llama-server \
-m "$MODEL" \
-ngl 99 \
--host 0.0.0.0 \
--port "$PORT"Make it executable:
```bash
chmod +x ~/bin/gemma4-server
```

Last updated: 2026-04-14