DocShotgun / llamacpp-moe-offload-guide.md
Last active June 4, 2026 14:23
Guide to optimizing inference performance of large MoE models across CPU+GPU using llama.cpp and its derivatives

Performant local mixture-of-experts CPU inference with GPU acceleration in llama.cpp

Introduction

So you want to try one of those fancy huge mixture-of-experts (MoE) models locally? Well, whether you've got a gaming PC or a large multi-GPU workstation, we've got you covered. As long as you've downloaded enough RAM beforehand.

Anatomy of a MoE Model

MoE models are described in terms of their total parameters and active parameters - e.g. DeepSeek V3 671B A37B has 671B total parameters, but only about 37B of them are used for each token during a forward pass through the model.
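
To put those numbers in perspective, here is a rough back-of-the-envelope sketch of the total vs. active split. The 671B and 37B figures come from the example above; the 4.5 bits-per-weight value is a hypothetical quantization level picked purely for illustration, and the helper names are made up for this sketch. It only estimates weight storage and ignores the KV cache, compute buffers, and other overhead.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of the model's weights used for a single token's forward pass."""
    return active_params_b / total_params_b


def weight_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for params_b billion parameters."""
    return params_b * bits_per_weight / 8


total_b, active_b = 671.0, 37.0  # DeepSeek V3 671B A37B, from the text above
bpw = 4.5  # ASSUMPTION: hypothetical average bits per weight after quantization

print(f"active fraction: {active_fraction(total_b, active_b):.1%}")
print(f"total weights:  ~{weight_size_gb(total_b, bpw):.0f} GB")
print(f"active weights: ~{weight_size_gb(active_b, bpw):.0f} GB touched per token")
```

Under these assumptions, all of the weights still have to be stored somewhere, but only a small slice of them is read for any given token.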