DocShotgun / llamacpp-moe-offload-guide.md

Last active July 19, 2026 02:49

Guide to optimizing inference performance of large MoE models across CPU+GPU using llama.cpp and its derivatives

Performant local mixture-of-experts CPU inference with GPU acceleration in llama.cpp

Introduction

So you want to try one of those fancy huge mixture-of-experts (MoE) models locally? Well, whether you've got a gaming PC or a large multi-GPU workstation, we've got you covered. As long as you've downloaded enough RAM beforehand.

Anatomy of a MoE Model

MoE models are described in terms of their total parameters and their active parameters. For example, DeepSeek V3 671B A37B has 671B total parameters, but only 37B of them are active for any given token during a forward pass through the model.
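To put the total-versus-active split in perspective, here is a minimal sketch that computes what share of the weights actually participates in a single forward pass. The helper name `active_fraction` is made up for illustration; the only inputs are the 671B and 37B figures quoted above.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters used per forward pass.

    Both arguments are parameter counts in billions. This is a
    hypothetical helper for illustration, not part of llama.cpp.
    """
    if total_b <= 0 or not (0 < active_b <= total_b):
        raise ValueError("expected 0 < active_b <= total_b")
    return active_b / total_b


# DeepSeek V3: 671B total parameters, 37B active.
share = active_fraction(671, 37)
print(f"{share:.1%} of the parameters are active per forward pass")
```

For DeepSeek V3 this works out to roughly 5.5%, which is why the number of parameters touched per token is so much smaller than the size of the model you have to store.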