Unsloth Gemma 4 MOE Setup Guide

Running a 26B-parameter model in about 6 GB of RAM using mmap

Overview

This guide covers running Gemma 4 26B MOE (Mixture of Experts) locally on an Apple Silicon Mac using llama.cpp with memory-mapped files. The MOE architecture activates only 8 of 128 experts per token, so only about 3.8B of the 25.2B parameters do work for any given token.

| Spec | Value |
| --- | --- |
| Total Parameters | 25.2B |
| Active Parameters | 3.8B (per token) |
| Experts | 128 total, 8 active |
| Context Window | 262K tokens |
| Model File | 25.9 GB |
| RAM Usage | ~6 GB (with mmap) |

Prerequisites

Install llama.cpp via Homebrew

brew install llama.cpp

Verify installation:

llama-cli --version

Gemma 4 support requires llama.cpp build 8680 or later.

Check Your System

# Check RAM
system_profiler SPHardwareDataType | grep -E "Memory|Chip"

# Check disk space (need ~30GB free)
df -h /

Download the Model

Recommended: Unsloth UD-Q8_K_XL (Highest Quality)

mkdir -p ~/.cache/gguf
cd ~/.cache/gguf

curl -L -o gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  "https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf"

Size: 27.9 GB
Download time: ~40 minutes at a sustained 12 MB/s

Alternative Quantizations

| Quantization | Size | Quality |
| --- | --- | --- |
| UD-Q8_K_XL | 27.9 GB | Best |
| Q8_0 | 26.9 GB | Excellent |
| UD-Q6_K | 22.9 GB | Very Good |
| UD-Q5_K_M | 21.2 GB | Good |
| UD-Q4_K_M | 16.9 GB | Balanced |

All variants are hosted in the same unsloth/gemma-4-26B-A4B-it-GGUF repository on Hugging Face used in the curl command above.
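If you prefer Hugging Face tooling over curl, the same file can be fetched from Python. This is a minimal sketch using hf_hub_download with the repo id and filename from the command above; for another quantization, substitute the corresponding filename from the repository.

import os
from huggingface_hub import hf_hub_download

# Download into ~/.cache/gguf, matching the curl example above.
path = hf_hub_download(
    repo_id="unsloth/gemma-4-26B-A4B-it-GGUF",
    filename="gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf",
    local_dir=os.path.expanduser("~/.cache/gguf"),
)
print(path)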

Running the Model

Option 1: OpenAI-Compatible Server

Start the server:

llama-server \
  -m ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8082

Query via curl:

curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [{"role": "user", "content": "What is NSCLC?"}],
    "max_tokens": 200
  }'

Query via Python:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8082/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="gemma4",
    messages=[{"role": "user", "content": "What is NSCLC?"}],
    max_tokens=200
)

print(response.choices[0].message.content)
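For interactive use, the same endpoint can stream tokens as they are generated. This is a minimal sketch using the standard OpenAI streaming protocol (stream=True), which llama-server's /v1/chat/completions endpoint supports:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8082/v1", api_key="not-needed")

# Ask for a streamed response; token fragments arrive as they are generated.
stream = client.chat.completions.create(
    model="gemma4",
    messages=[{"role": "user", "content": "What is NSCLC?"}],
    max_tokens=200,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()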

Option 2: CLI Chat

llama-cli \
  -m ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 99 \
  --conversation

Option 3: Benchmark

llama-bench \
  -m ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 99 \
  -n 128 \
  -p 32

Key Flags

| Flag | Purpose |
| --- | --- |
| -m <path> | Path to the GGUF model file |
| -ngl 99 | Offload all layers to the GPU (Metal) |
| --host 0.0.0.0 | Listen on all interfaces (network access) |
| --port <n> | Server port |
| --mlock | Lock the model in RAM so pages are never evicted (uses more RAM) |
| --no-mmap | Disable memory mapping (slower load, more RAM) |

Note: mmap is ON by default. The model stays on SSD and pages are loaded on demand.
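To see the effect of mmap yourself, compare the running server's resident memory (RSS) against the model file size. A rough sketch using the third-party psutil package (pip install psutil); the path below matches the download location used in this guide, and RSS also includes KV cache and other buffers:

import os
import psutil  # third-party: pip install psutil

model = os.path.expanduser(
    "~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf"
)
file_gb = os.path.getsize(model) / 1e9

# With mmap, the server's RSS should stay well below the file size,
# because untouched expert pages remain on disk.
for proc in psutil.process_iter(["name", "memory_info"]):
    if proc.info["name"] == "llama-server" and proc.info["memory_info"]:
        rss_gb = proc.info["memory_info"].rss / 1e9
        print(f"model file: {file_gb:.1f} GB, llama-server RSS: {rss_gb:.1f} GB")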


Benchmarks (Apple M1 Max, 64GB)

Unsloth UD-Q8_K_XL (Highest Quality)

| Metric | Value |
| --- | --- |
| Model Size | 25.9 GB |
| RAM Usage | 5.9 GB |
| Prompt Processing | 268 tok/s |
| Token Generation | 49.4 tok/s |
| Bits Per Weight | 8.83 |

Comparison with Q5_K_M

| Metric | Q5_K_M | UD-Q8_K_XL |
| --- | --- | --- |
| File Size | 17.8 GB | 25.9 GB |
| RAM (mmap) | 6.3 GB | 5.9 GB |
| Generation | 50.9 tok/s | 49.4 tok/s |
| Quality | 6.06 BPW | 8.83 BPW |

vs MedGemma 1.5 4B (MLX)

| Model | Size | RAM | Generation |
| --- | --- | --- | --- |
| MedGemma 1.5 4B (bf16) | 9.3 GB | 9.3 GB | 36 tok/s |
| Gemma 4 26B MOE (Q8) | 25.9 GB | 5.9 GB | 49 tok/s |

The 26B MOE model is faster and uses less RAM than the 4B dense model!


Why This Works

Mixture of Experts (MOE)

  • 128 experts total, but only 8 activate per token
  • Active parameters: 3.8B (not 25.2B)
  • Inactive expert weights stay on SSD with mmap

Memory-Mapped Files (mmap)

  • Model file stays on SSD
  • Only accessed pages load into RAM
  • Fast NVMe = negligible latency penalty
  • Perfect for MOE where most weights are cold

Apple Silicon Advantage

  • Unified memory architecture
  • Metal GPU acceleration (-ngl 99)
  • Fast NVMe storage (3+ GB/s)
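A quick back-of-envelope check ties these points to the numbers above. It uses only the active parameter count and the bits-per-weight figure from the benchmark table; the split between shared layers and expert weights is not broken out here, so treat this as an estimate rather than a measurement.

# Back-of-envelope estimate using figures from the tables above.
active_params = 3.8e9      # active parameters per token
bits_per_weight = 8.83     # UD-Q8_K_XL, from the benchmark table

active_gb = active_params * bits_per_weight / 8 / 1e9
print(f"weight bytes touched per token: ~{active_gb:.1f} GB")
# -> ~4.2 GB; consecutive tokens reuse the shared layers and the hottest
#    experts, so the resident working set settles near the ~6 GB reported above.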

Troubleshooting

"unknown model architecture: gemma4"

Update llama.cpp:

brew upgrade llama.cpp

Port already in use

# Find what's using the port
lsof -i :8082

# Kill llama servers
killall llama-server

Model too slow

Check GPU offloading is enabled:

# Should see "offloaded X/X layers to GPU"
llama-server -m <model> -ngl 99 2>&1 | grep offload

Out of disk space

# Check space
df -h /

# Clean up old models
ls -lh ~/.cache/gguf/

Quick Start Script

Save as ~/bin/gemma4-server:

#!/bin/bash
MODEL="${GEMMA4_MODEL:-$HOME/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf}"
PORT="${GEMMA4_PORT:-8082}"

if [ ! -f "$MODEL" ]; then
  echo "Model not found. Downloading..."
  mkdir -p ~/.cache/gguf
  curl -L -o "$MODEL" \
    "https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf"
fi

echo "Starting Gemma 4 26B MOE server on port $PORT..."
exec llama-server \
  -m "$MODEL" \
  -ngl 99 \
  --host 0.0.0.0 \
  --port "$PORT"

Make it executable:

chmod +x ~/bin/gemma4-server
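With ~/bin on your PATH, running gemma4-server downloads the model on first use (if it is missing) and then starts the server on port 8082, or on whatever GEMMA4_PORT is set to.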

Resources


Last updated: 2026-04-14
