
@richin13
Created March 27, 2026 05:10

πŸ¦™ llama.cpp Cheatsheet


1. Installation

# macOS / Linux (easiest)
brew install llama.cpp

# Windows
winget install ggml-org.llama.cpp

# From source (with CUDA GPU support)
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON        # drop -DGGML_CUDA=ON for CPU-only
cmake --build build --config Release

# Python bindings (alternative)
pip install llama-cpp-python
# with CUDA:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

πŸ“– Full installation guide β€” covers Metal, ROCm, Vulkan, Docker, etc.


2. Finding Models on Hugging Face

Where to browse

| Resource | Link |
| --- | --- |
| All GGUF models | https://huggingface.co/models?library=gguf |
| ggml-org official quants | https://huggingface.co/ggml-org |
| bartowski (prolific quantizer) | https://huggingface.co/bartowski |
| unsloth (dynamic quants) | https://huggingface.co/unsloth |

Quantization picker

| Quant | Bits/Weight | Use Case |
| --- | --- | --- |
| Q4_K_M | ~4.9 bpw | Best balance of size vs quality (recommended default) |
| Q5_K_M | ~5.7 bpw | Higher quality, ~20% larger |
| Q6_K | ~6.6 bpw | High quality |
| Q8_0 | ~8.5 bpw | Near-original quality, 2× size of Q4 |

RAM rule of thumb

| RAM | Model Size |
| --- | --- |
| 8 GB | up to 7B @ Q4_K_M |
| 16 GB | up to 13B @ Q4_K_M |
| 32 GB | up to 34B @ Q4_K_M |
| 64 GB+ | 70B+ @ Q4_K_M |
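The rule of thumb above can be sanity-checked with back-of-envelope arithmetic: model bytes ≈ parameters × bits-per-weight ÷ 8, plus some headroom for KV cache and buffers. This is a rough sketch, not an official formula; the 20% overhead factor is an assumption:

```python
def approx_model_ram_gb(n_params_b: float, bpw: float, overhead: float = 1.2) -> float:
    """Rough GGUF memory estimate in GB.

    n_params_b: parameter count in billions (e.g. 7 for a 7B model)
    bpw:        bits per weight of the quant (e.g. 4.9 for Q4_K_M)
    overhead:   fudge factor for KV cache and runtime buffers (assumed ~20%)
    """
    weight_bytes = n_params_b * 1e9 * bpw / 8
    return round(weight_bytes * overhead / 1e9, 1)

# A 7B model at Q4_K_M (~4.9 bpw) needs roughly 5 GB -> fits the 8 GB row
print(approx_model_ram_gb(7, 4.9))
```

This is why the table steps roughly double with each RAM tier: model size scales linearly with parameter count at a fixed quant.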

πŸ“– Obtaining models guide


3. Running Models β€” One-shot Chat (CLI)

The -hf flag downloads + caches models automatically from Hugging Face:

# Simplest: auto-downloads and starts conversation mode
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

# Pick a specific quant
llama-cli -hf ggml-org/gemma-3-4b-it-GGUF:Q4_K_M

# With a system prompt
llama-cli -hf bartowski/Qwen3-8B-GGUF:Q4_K_M \
    -cnv -sys "You are a helpful coding assistant"

# Local file you already downloaded
llama-cli -m ./models/my-model.gguf -cnv

# Full GPU offload + large context
llama-cli -m model.gguf -cnv -ngl 99 -c 16384

Key CLI flags

| Flag | What it does |
| --- | --- |
| `-hf user/repo:quant` | Download & run from Hugging Face |
| `-m path.gguf` | Load a local GGUF file |
| `-cnv` | Conversation mode (multi-turn chat) |
| `-no-cnv` | Raw completion mode (no chat template) |
| `-sys "prompt"` | Set system prompt |
| `-ngl 99` | Offload all layers to GPU |
| `-c N` | Context window size (tokens) |
| `-n N` | Max tokens to generate (-1 = infinite) |
| `--temp N` | Temperature (0.0–2.0) |
| `-p "text"` | Initial prompt (completion mode) |

4. Serving Models via API (llama-server)

# Basic β€” starts OpenAI-compatible server on http://localhost:8080
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

# Production-ish setup: full GPU offload, 8K context, 4 parallel request
# slots, listening on all interfaces, flash attention, Prometheus metrics.
# (Comments can't follow a trailing backslash, so they live up here.)
llama-server \
    -m model.gguf \
    -ngl 99 \
    -c 8192 \
    -np 4 \
    --host 0.0.0.0 \
    --port 8080 \
    -fa \
    --metrics

# Multi-model router mode (no -m flag!)
llama-server --models-dir ./my-models --models-max 4

# Docker
docker run -p 8080:8080 -v /models:/models \
    ghcr.io/ggml-org/llama.cpp:server \
    -m /models/model.gguf -c 4096 --host 0.0.0.0

Key server flags

| Flag | What it does |
| --- | --- |
| `-np N` | Number of parallel slots (concurrent users) |
| `-fa` | Flash attention (recommended, saves VRAM) |
| `--metrics` | Enable /metrics endpoint (Prometheus) |
| `--api-key KEY` | Require API key for requests |
| `--embedding` | Enable /v1/embeddings endpoint |
| `--models-dir PATH` | Router mode: serve multiple GGUFs |
| `--models-max N` | Max models loaded simultaneously (default 4) |

πŸ“– Server README — full API documentation
πŸ“– Multi-model management blog


5. Configuration Tips

Disable Thinking/Reasoning (for Qwen3, DeepSeek-R1, etc.)

# Option A: at server startup (global)
llama-server -m model.gguf --reasoning-budget 0

# Option B: at server startup via template kwargs
llama-server -m model.gguf \
    --chat-template-kwargs '{"enable_thinking": false}'

# Option C: per-request in the API body (see curl/python below)
#   add: "chat_template_kwargs": {"enable_thinking": false}

Context Window

# Set context size (default is usually 2048–4096)
llama-server -m model.gguf -c 16384

# For models trained with extended context + RoPE scaling
llama-server -m model.gguf -c 32768 --rope-scale 4

# Rule: bigger context = more RAM. Use flash attention to reduce VRAM:
llama-server -m model.gguf -c 32768 -fa
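Why does a bigger context cost memory? The KV cache stores one key vector and one value vector per layer per position. A back-of-envelope estimate, assuming an f16 cache and a Llama-3-8B-like shape (32 layers, 8 KV heads via GQA, head dim 128; these numbers are illustrative, not read from any GGUF):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB.

    Factor of 2 covers the K and V tensors; bytes_per_elem=2 assumes f16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# 32K context on a Llama-3-8B-like config costs a few GB of cache alone
print(round(kv_cache_gb(32, 8, 128, 32768), 2))
```

The cache grows linearly with context length, which is why `-fa` (and quantized KV cache options, where supported) matter at large `-c` values.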

Temperature & Sampling

# Precise/deterministic output
llama-server -m model.gguf --temp 0.0

# Creative/diverse output
llama-server -m model.gguf --temp 1.2

These can also be set per-request in the API body (see Β§6 below):

| Parameter | Default | Notes |
| --- | --- | --- |
| `temperature` | 0.8 | 0 = greedy/deterministic, >1 = more creative |
| `top_p` | 0.95 | Nucleus sampling threshold |
| `top_k` | 40 | Limit sampling to the top K tokens |
| `min_p` | 0.05 | Minimum token probability (relative to best) |
| `repeat_penalty` | 1.1 | Penalize repeated tokens |
| `max_tokens` / `n_predict` | -1 | Max tokens to generate (-1 = unlimited) |
| `stream` | false | Stream tokens as SSE events |
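As a mental model of how these parameters interact, here is a simplified Python sketch of a temperature → top-k → min_p chain over raw logits. llama.cpp's real sampler chain is more elaborate and configurable; this only illustrates the three filters named above:

```python
import math

def sample_filter(logits, temperature=0.8, top_k=40, min_p=0.05):
    """Return {token_index: probability} after temperature, top-k and min_p.

    Simplified illustration, not llama.cpp's actual sampler implementation.
    """
    if temperature == 0:  # temp 0 = greedy: the best token gets all the mass
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}
    # Temperature-scaled softmax (subtract max for numerical stability)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = {i: e / total for i, e in enumerate(exps)}
    # top-k: keep only the K most probable tokens
    kept = dict(sorted(probs.items(), key=lambda kv: -kv[1])[:top_k])
    # min_p: drop tokens below min_p * (best surviving probability)
    cutoff = min_p * max(kept.values())
    kept = {i: p for i, p in kept.items() if p >= cutoff}
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}
```

With a sharply peaked distribution, `min_p` alone can prune everything but the top token, which is why low `min_p` values still behave sensibly across confident and uncertain steps.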

Per-model presets (.ini file)

# my-models.ini β€” use with: llama-server --models-preset my-models.ini
[coding-model]
model = /models/qwen-coder-7b-q4.gguf
ctx-size = 16384
n-gpu-layers = 99
temp = 0.2

[chat-model]
model = /models/llama-3-8b-instruct-q5.gguf
ctx-size = 8192
n-gpu-layers = 99
temp = 0.7

6. Calling the API

The server exposes OpenAI-compatible endpoints β€” you can use the openai Python library, curl, or any OpenAI-compatible client.

With curl

Chat Completion

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in 3 sentences."}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Chat Completion β€” Streaming

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": true
  }'
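With `"stream": true` the response body is OpenAI-style server-sent events: one `data: {json}` line per chunk, terminated by `data: [DONE]`. A minimal client-side parser sketch (the sample payload below is synthetic, trimmed to just the fields it reads):

```python
import json

def extract_deltas(sse_body: str) -> str:
    """Concatenate the content deltas from an OpenAI-style SSE stream body."""
    parts = []
    for line in sse_body.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"].get("content")
        if delta:
            parts.append(delta)
    return "".join(parts)

sample = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n'
    'data: {"choices":[{"delta":{"content":"lo"}}]}\n'
    'data: [DONE]\n'
)
print(extract_deltas(sample))  # Hello
```

In practice the `openai` client (Β§ below) handles this parsing for you; hand-rolling it is mainly useful with raw `curl` or `requests`.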

Disable Thinking Per-Request

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Text Completion (non-chat)

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The meaning of life is",
    "max_tokens": 128,
    "temperature": 0.9
  }'

Health Check

curl http://localhost:8080/health

With Python (openai library β€” recommended)

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="no-key-required",          # llama.cpp doesn't need a real key
)

# --- Chat completion ---
response = client.chat.completions.create(
    model="any-string",                 # model name is ignored (single-model mode)
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in 3 sentences."},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)

# --- Streaming ---
stream = client.chat.completions.create(
    model="any-string",
    messages=[{"role": "user", "content": "Write a short poem about Rust."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

With Python (requests β€” raw HTTP)

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.3,
        "max_tokens": 100,
        # Disable thinking for reasoning models:
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
data = resp.json()
print(data["choices"][0]["message"]["content"])

Multi-model Router Mode

When running in router mode (llama-server --models-dir ./models), specify which model to use in the model field:

response = client.chat.completions.create(
    model="ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",   # <-- matters now!
    messages=[{"role": "user", "content": "Hello!"}],
)
# Same with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

7. Quick-Reference: API Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| `/v1/chat/completions` | POST | Chat (OpenAI-compatible) |
| `/v1/completions` | POST | Text completion |
| `/v1/embeddings` | POST | Embeddings (needs `--embedding`) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check |
| `/metrics` | GET | Prometheus metrics (needs `--metrics`) |
| `/slots` | GET | Current slot/processing state |

8. Common Recipes

# Fastest possible: small model, full GPU, flash attention
llama-server -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 -fa

# Big context coding assistant
llama-server -hf bartowski/Qwen3-8B-GGUF:Q4_K_M \
    -c 32768 -ngl 99 -fa --reasoning-budget 0

# Deterministic JSON extraction (temp=0, no thinking)
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M \
    --temp 0.0

# Multi-user chat server
llama-server -m model.gguf -ngl 99 -c 16384 -np 8 \
    --host 0.0.0.0 --port 8080 -fa --metrics

πŸ“š Key Documentation Links

| Resource | URL |
| --- | --- |
| GitHub repo | https://github.com/ggml-org/llama.cpp |
| Server API docs (README) | https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md |
| Installation guide | https://mintlify.com/ggml-org/llama.cpp/installation |
| Obtaining models | https://mintlify.com/ggml-org/llama.cpp/models/obtaining-models |
| HF GGUF usage guide | https://huggingface.co/docs/hub/gguf-llamacpp |
| Browse all GGUF models | https://huggingface.co/models?library=gguf |
| llama-cpp-python (Python bindings) | https://llama-cpp-python.readthedocs.io/ |
| Multi-model management | https://huggingface.co/blog/ggml-org/model-management-in-llamacpp |