Unsloth Gemma 4 MOE Setup Guide

Running a 26B-parameter model in about 6 GB of RAM using mmap

Overview

This guide covers running Gemma 4 26B MOE (Mixture of Experts) locally on an Apple Silicon Mac using llama.cpp with memory-mapped files. The MOE architecture activates only 8 of 128 experts per token, so only about 3.8B of the 25.2B parameters do work for any given token.

| Spec | Value |
| --- | --- |
| Total Parameters | 25.2B |
| Active Parameters | 3.8B (per token) |
| Experts | 128 total, 8 active |
| Context Window | 262K tokens |
| Model File | 25.9 GB |
| RAM Usage | ~6 GB (with mmap) |

Prerequisites

Install llama.cpp via Homebrew

brew install llama.cpp

Verify installation:

llama-cli --version

Gemma 4 support requires llama.cpp build 8680 or later.

Check Your System

# Check RAM
system_profiler SPHardwareDataType | grep -E "Memory|Chip"

# Check disk space (need ~30GB free)
df -h /

Download the Model

Recommended: Unsloth UD-Q8_K_XL (Highest Quality)

mkdir -p ~/.cache/gguf
cd ~/.cache/gguf

curl -L -o gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  "https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf"

Size: 27.9 GB
Download time: ~40 minutes at a sustained 12 MB/s

Alternative Quantizations

| Quantization | Size | Quality |
| --- | --- | --- |
| UD-Q8_K_XL | 27.9 GB | Best |
| Q8_0 | 26.9 GB | Excellent |
| UD-Q6_K | 22.9 GB | Very Good |
| UD-Q5_K_M | 21.2 GB | Good |
| UD-Q4_K_M | 16.9 GB | Balanced |

All variants are hosted in the same unsloth/gemma-4-26B-A4B-it-GGUF repository on Hugging Face used in the curl command above.
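If you prefer Hugging Face tooling over curl, the same file can be fetched from Python. This is a minimal sketch using hf_hub_download with the repo id and filename from the command above; for another quantization, substitute the corresponding filename from the repository.

import os
from huggingface_hub import hf_hub_download

# Download into ~/.cache/gguf, matching the curl example above.
path = hf_hub_download(
    repo_id="unsloth/gemma-4-26B-A4B-it-GGUF",
    filename="gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf",
    local_dir=os.path.expanduser("~/.cache/gguf"),
)
print(path)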

Running the Model

Option 1: OpenAI-Compatible Server

Start the server:

llama-server \
  -m ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8082

Query via curl:

curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [{"role": "user", "content": "What is NSCLC?"}],
    "max_tokens": 200
  }'

Query via Python:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8082/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="gemma4",
    messages=[{"role": "user", "content": "What is NSCLC?"}],
    max_tokens=200
)

print(response.choices[0].message.content)
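For interactive use, the same endpoint can stream tokens as they are generated. This is a minimal sketch using the standard OpenAI streaming protocol (stream=True), which llama-server's /v1/chat/completions endpoint supports:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8082/v1", api_key="not-needed")

# Ask for a streamed response; token fragments arrive as they are generated.
stream = client.chat.completions.create(
    model="gemma4",
    messages=[{"role": "user", "content": "What is NSCLC?"}],
    max_tokens=200,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()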

Option 2: CLI Chat

llama-cli \
  -m ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 99 \
  --conversation

Option 3: Benchmark

llama-bench \
  -m ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 99 \
  -n 128 \
  -p 32

Key Flags

| Flag | Purpose |
| --- | --- |
| -m <path> | Path to the GGUF model file |
| -ngl 99 | Offload all layers to the GPU (Metal) |
| --host 0.0.0.0 | Listen on all interfaces (network access) |
| --port <n> | Server port |
| --mlock | Lock the model in RAM so pages are never evicted (uses more RAM) |
| --no-mmap | Disable memory mapping (slower load, more RAM) |

Note: mmap is ON by default. The model stays on SSD and pages are loaded on demand.
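To see the effect of mmap yourself, compare the running server's resident memory (RSS) against the model file size. A rough sketch using the third-party psutil package (pip install psutil); the path below matches the download location used in this guide, and RSS also includes KV cache and other buffers:

import os
import psutil  # third-party: pip install psutil

model = os.path.expanduser(
    "~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf"
)
file_gb = os.path.getsize(model) / 1e9

# With mmap, the server's RSS should stay well below the file size,
# because untouched expert pages remain on disk.
for proc in psutil.process_iter(["name", "memory_info"]):
    if proc.info["name"] == "llama-server" and proc.info["memory_info"]:
        rss_gb = proc.info["memory_info"].rss / 1e9
        print(f"model file: {file_gb:.1f} GB, llama-server RSS: {rss_gb:.1f} GB")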


Benchmarks (Apple M1 Max, 64GB)

Unsloth UD-Q8_K_XL (Highest Quality)

| Metric | Value |
| --- | --- |
| Model Size | 25.9 GB |
| RAM Usage | 5.9 GB |
| Prompt Processing | 268 tok/s |
| Token Generation | 49.4 tok/s |
| Bits Per Weight | 8.83 |

Comparison with Q5_K_M

| Metric | Q5_K_M | UD-Q8_K_XL |
| --- | --- | --- |
| File Size | 17.8 GB | 25.9 GB |
| RAM (mmap) | 6.3 GB | 5.9 GB |
| Generation | 50.9 tok/s | 49.4 tok/s |
| Quality | 6.06 BPW | 8.83 BPW |

vs MedGemma 1.5 4B (MLX)

| Model | Size | RAM | Generation |
| --- | --- | --- | --- |
| MedGemma 1.5 4B (bf16) | 9.3 GB | 9.3 GB | 36 tok/s |
| Gemma 4 26B MOE (Q8) | 25.9 GB | 5.9 GB | 49 tok/s |

The 26B MOE model is faster and uses less RAM than the 4B dense model!


Why This Works

Mixture of Experts (MOE)

  • 128 experts total, but only 8 activate per token
  • Active parameters: 3.8B (not 25.2B)
  • Inactive expert weights stay on SSD with mmap

Memory-Mapped Files (mmap)

  • Model file stays on SSD
  • Only accessed pages load into RAM
  • Fast NVMe = negligible latency penalty
  • Perfect for MOE where most weights are cold

Apple Silicon Advantage

  • Unified memory architecture
  • Metal GPU acceleration (-ngl 99)
  • Fast NVMe storage (3+ GB/s)
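A quick back-of-envelope check ties these points to the numbers above. It uses only the active parameter count and the bits-per-weight figure from the benchmark table; the split between shared layers and expert weights is not broken out here, so treat this as an estimate rather than a measurement.

# Back-of-envelope estimate using figures from the tables above.
active_params = 3.8e9      # active parameters per token
bits_per_weight = 8.83     # UD-Q8_K_XL, from the benchmark table

active_gb = active_params * bits_per_weight / 8 / 1e9
print(f"weight bytes touched per token: ~{active_gb:.1f} GB")
# -> ~4.2 GB; consecutive tokens reuse the shared layers and the hottest
#    experts, so the resident working set settles near the ~6 GB reported above.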

Troubleshooting

"unknown model architecture: gemma4"

Update llama.cpp:

brew upgrade llama.cpp

Port already in use

# Find what's using the port
lsof -i :8082

# Kill llama servers
killall llama-server

Model too slow

Check GPU offloading is enabled:

# Should see "offloaded X/X layers to GPU"
llama-server -m <model> -ngl 99 2>&1 | grep offload

Out of disk space

# Check space
df -h /

# Clean up old models
ls -lh ~/.cache/gguf/

Quick Start Script

Save as ~/bin/gemma4-server:

#!/bin/bash
MODEL="${GEMMA4_MODEL:-$HOME/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf}"
PORT="${GEMMA4_PORT:-8082}"

if [ ! -f "$MODEL" ]; then
  echo "Model not found. Downloading..."
  mkdir -p ~/.cache/gguf
  curl -L -o "$MODEL" \
    "https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf"
fi

echo "Starting Gemma 4 26B MOE server on port $PORT..."
exec llama-server \
  -m "$MODEL" \
  -ngl 99 \
  --host 0.0.0.0 \
  --port "$PORT"

Make it executable:

chmod +x ~/bin/gemma4-server
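With ~/bin on your PATH, running gemma4-server downloads the model on first use (if it is missing) and then starts the server on port 8082, or on whatever GEMMA4_PORT is set to.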

Resources


Last updated: 2026-04-14
