
@richin13
Created March 27, 2026 05:10

πŸ¦™ llama.cpp Cheatsheet


1. Installation

# macOS / Linux (easiest)
brew install llama.cpp

# Windows
winget install ggml-org.llama.cpp

# From source (with CUDA GPU support)
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON        # drop -DGGML_CUDA=ON for CPU-only
cmake --build build --config Release

# Python bindings (alternative)
pip install llama-cpp-python
# with CUDA:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

πŸ“– Full installation guide β€” covers Metal, ROCm, Vulkan, Docker, etc.


2. Finding Models on Hugging Face

Where to browse

| Resource | Link |
| --- | --- |
| All GGUF models | https://huggingface.co/models?library=gguf |
| ggml-org official quants | https://huggingface.co/ggml-org |
| bartowski (prolific quantizer) | https://huggingface.co/bartowski |
| unsloth (dynamic quants) | https://huggingface.co/unsloth |

Quantization picker

| Quant | Bits/Weight | Use Case |
| --- | --- | --- |
| Q4_K_M | ~4.9 bpw | Best balance of size vs quality (recommended default) |
| Q5_K_M | ~5.7 bpw | Higher quality, ~20% larger |
| Q6_K | ~6.6 bpw | High quality |
| Q8_0 | ~8.5 bpw | Near-original quality, 2× size of Q4 |

RAM rule of thumb

| RAM | Model Size |
| --- | --- |
| 8 GB | up to 7B @ Q4_K_M |
| 16 GB | up to 13B @ Q4_K_M |
| 32 GB | up to 34B @ Q4_K_M |
| 64 GB+ | 70B+ @ Q4_K_M |
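The rule of thumb above can be sanity-checked with back-of-envelope arithmetic: model bytes ≈ parameters × bits-per-weight ÷ 8, plus some headroom for KV cache and buffers. This is a rough sketch, not an official formula; the 20% overhead factor is an assumption:

```python
def approx_model_ram_gb(n_params_b: float, bpw: float, overhead: float = 1.2) -> float:
    """Rough GGUF memory estimate in GB.

    n_params_b: parameter count in billions (e.g. 7 for a 7B model)
    bpw:        bits per weight of the quant (e.g. 4.9 for Q4_K_M)
    overhead:   fudge factor for KV cache and runtime buffers (assumed ~20%)
    """
    weight_bytes = n_params_b * 1e9 * bpw / 8
    return round(weight_bytes * overhead / 1e9, 1)

# A 7B model at Q4_K_M (~4.9 bpw) needs roughly 5 GB -> fits the 8 GB row
print(approx_model_ram_gb(7, 4.9))
```

This is why the table steps roughly double with each RAM tier: model size scales linearly with parameter count at a fixed quant.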

πŸ“– Obtaining models guide


3. Running Models β€” One-shot Chat (CLI)

The -hf flag downloads + caches models automatically from Hugging Face:

# Simplest: auto-downloads and starts conversation mode
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

# Pick a specific quant
llama-cli -hf ggml-org/gemma-3-4b-it-GGUF:Q4_K_M

# With a system prompt
llama-cli -hf bartowski/Qwen3-8B-GGUF:Q4_K_M \
    -cnv -sys "You are a helpful coding assistant"

# Local file you already downloaded
llama-cli -m ./models/my-model.gguf -cnv

# Full GPU offload + large context
llama-cli -m model.gguf -cnv -ngl 99 -c 16384

Key CLI flags

| Flag | What it does |
| --- | --- |
| `-hf user/repo:quant` | Download & run from Hugging Face |
| `-m path.gguf` | Load a local GGUF file |
| `-cnv` | Conversation mode (multi-turn chat) |
| `-no-cnv` | Raw completion mode (no chat template) |
| `-sys "prompt"` | Set system prompt |
| `-ngl 99` | Offload all layers to GPU |
| `-c N` | Context window size (tokens) |
| `-n N` | Max tokens to generate (-1 = infinite) |
| `--temp N` | Temperature (0.0–2.0) |
| `-p "text"` | Initial prompt (completion mode) |

4. Serving Models via API (llama-server)

# Basic β€” starts OpenAI-compatible server on http://localhost:8080
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

# Production-ish setup: full GPU offload, 8K context, 4 parallel request
# slots, listening on all interfaces, flash attention, Prometheus metrics.
# (Comments can't follow a trailing backslash, so they live up here.)
llama-server \
    -m model.gguf \
    -ngl 99 \
    -c 8192 \
    -np 4 \
    --host 0.0.0.0 \
    --port 8080 \
    -fa \
    --metrics

# Multi-model router mode (no -m flag!)
llama-server --models-dir ./my-models --models-max 4

# Docker
docker run -p 8080:8080 -v /models:/models \
    ghcr.io/ggml-org/llama.cpp:server \
    -m /models/model.gguf -c 4096 --host 0.0.0.0

Key server flags

| Flag | What it does |
| --- | --- |
| `-np N` | Number of parallel slots (concurrent users) |
| `-fa` | Flash attention (recommended, saves VRAM) |
| `--metrics` | Enable /metrics endpoint (Prometheus) |
| `--api-key KEY` | Require API key for requests |
| `--embedding` | Enable /v1/embeddings endpoint |
| `--models-dir PATH` | Router mode: serve multiple GGUFs |
| `--models-max N` | Max models loaded simultaneously (default 4) |

πŸ“– Server README — full API documentation
πŸ“– Multi-model management blog


5. Configuration Tips

Disable Thinking/Reasoning (for Qwen3, DeepSeek-R1, etc.)

# Option A: at server startup (global)
llama-server -m model.gguf --reasoning-budget 0

# Option B: at server startup via template kwargs
llama-server -m model.gguf \
    --chat-template-kwargs '{"enable_thinking": false}'

# Option C: per-request in the API body (see curl/python below)
#   add: "chat_template_kwargs": {"enable_thinking": false}

Context Window

# Set context size (default is usually 2048–4096)
llama-server -m model.gguf -c 16384

# For models trained with extended context + RoPE scaling
llama-server -m model.gguf -c 32768 --rope-scale 4

# Rule: bigger context = more RAM. Use flash attention to reduce VRAM:
llama-server -m model.gguf -c 32768 -fa
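Why does a bigger context cost memory? The KV cache stores one key vector and one value vector per layer per position. A back-of-envelope estimate, assuming an f16 cache and a Llama-3-8B-like shape (32 layers, 8 KV heads via GQA, head dim 128; these numbers are illustrative, not read from any GGUF):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB.

    Factor of 2 covers the K and V tensors; bytes_per_elem=2 assumes f16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# 32K context on a Llama-3-8B-like config costs a few GB of cache alone
print(round(kv_cache_gb(32, 8, 128, 32768), 2))
```

The cache grows linearly with context length, which is why `-fa` (and quantized KV cache options, where supported) matter at large `-c` values.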

Temperature & Sampling

# Precise/deterministic output
llama-server -m model.gguf --temp 0.0

# Creative/diverse output
llama-server -m model.gguf --temp 1.2

These can also be set per-request in the API body (see Β§6 below):

| Parameter | Default | Notes |
| --- | --- | --- |
| `temperature` | 0.8 | 0 = greedy/deterministic, >1 = more creative |
| `top_p` | 0.95 | Nucleus sampling threshold |
| `top_k` | 40 | Limit sampling to the top K tokens |
| `min_p` | 0.05 | Minimum token probability (relative to best) |
| `repeat_penalty` | 1.1 | Penalize repeated tokens |
| `max_tokens` / `n_predict` | -1 | Max tokens to generate (-1 = unlimited) |
| `stream` | false | Stream tokens as SSE events |
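As a mental model of how these parameters interact, here is a simplified Python sketch of a temperature → top-k → min_p chain over raw logits. llama.cpp's real sampler chain is more elaborate and configurable; this only illustrates the three filters named above:

```python
import math

def sample_filter(logits, temperature=0.8, top_k=40, min_p=0.05):
    """Return {token_index: probability} after temperature, top-k and min_p.

    Simplified illustration, not llama.cpp's actual sampler implementation.
    """
    if temperature == 0:  # temp 0 = greedy: the best token gets all the mass
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}
    # Temperature-scaled softmax (subtract max for numerical stability)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = {i: e / total for i, e in enumerate(exps)}
    # top-k: keep only the K most probable tokens
    kept = dict(sorted(probs.items(), key=lambda kv: -kv[1])[:top_k])
    # min_p: drop tokens below min_p * (best surviving probability)
    cutoff = min_p * max(kept.values())
    kept = {i: p for i, p in kept.items() if p >= cutoff}
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}
```

With a sharply peaked distribution, `min_p` alone can prune everything but the top token, which is why low `min_p` values still behave sensibly across confident and uncertain steps.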

Per-model presets (.ini file)

# my-models.ini β€” use with: llama-server --models-preset my-models.ini
[coding-model]
model = /models/qwen-coder-7b-q4.gguf
ctx-size = 16384
n-gpu-layers = 99
temp = 0.2

[chat-model]
model = /models/llama-3-8b-instruct-q5.gguf
ctx-size = 8192
n-gpu-layers = 99
temp = 0.7

6. Calling the API

The server exposes OpenAI-compatible endpoints β€” you can use the openai Python library, curl, or any OpenAI-compatible client.

With curl

Chat Completion

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in 3 sentences."}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Chat Completion β€” Streaming

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": true
  }'
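With `"stream": true` the response body is OpenAI-style server-sent events: one `data: {json}` line per chunk, terminated by `data: [DONE]`. A minimal client-side parser sketch (the sample payload below is synthetic, trimmed to just the fields it reads):

```python
import json

def extract_deltas(sse_body: str) -> str:
    """Concatenate the content deltas from an OpenAI-style SSE stream body."""
    parts = []
    for line in sse_body.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"].get("content")
        if delta:
            parts.append(delta)
    return "".join(parts)

sample = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n'
    'data: {"choices":[{"delta":{"content":"lo"}}]}\n'
    'data: [DONE]\n'
)
print(extract_deltas(sample))  # Hello
```

In practice the `openai` client (Β§ below) handles this parsing for you; hand-rolling it is mainly useful with raw `curl` or `requests`.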

Disable Thinking Per-Request

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Text Completion (non-chat)

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The meaning of life is",
    "max_tokens": 128,
    "temperature": 0.9
  }'

Health Check

curl http://localhost:8080/health

With Python (openai library β€” recommended)

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="no-key-required",          # llama.cpp doesn't need a real key
)

# --- Chat completion ---
response = client.chat.completions.create(
    model="any-string",                 # model name is ignored (single-model mode)
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in 3 sentences."},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)

# --- Streaming ---
stream = client.chat.completions.create(
    model="any-string",
    messages=[{"role": "user", "content": "Write a short poem about Rust."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

With Python (requests β€” raw HTTP)

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.3,
        "max_tokens": 100,
        # Disable thinking for reasoning models:
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
data = resp.json()
print(data["choices"][0]["message"]["content"])

Multi-model Router Mode

When running in router mode (llama-server --models-dir ./models), specify which model to use in the model field:

response = client.chat.completions.create(
    model="ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",   # <-- matters now!
    messages=[{"role": "user", "content": "Hello!"}],
)
# Same with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

7. Quick-Reference: API Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| `/v1/chat/completions` | POST | Chat (OpenAI-compatible) |
| `/v1/completions` | POST | Text completion |
| `/v1/embeddings` | POST | Embeddings (needs `--embedding`) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check |
| `/metrics` | GET | Prometheus metrics (needs `--metrics`) |
| `/slots` | GET | Current slot/processing state |

8. Common Recipes

# Fastest possible: small model, full GPU, flash attention
llama-server -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 -fa

# Big context coding assistant
llama-server -hf bartowski/Qwen3-8B-GGUF:Q4_K_M \
    -c 32768 -ngl 99 -fa --reasoning-budget 0

# Deterministic JSON extraction (temp=0, no thinking)
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M \
    --temp 0.0

# Multi-user chat server
llama-server -m model.gguf -ngl 99 -c 16384 -np 8 \
    --host 0.0.0.0 --port 8080 -fa --metrics

πŸ“š Key Documentation Links

| Resource | URL |
| --- | --- |
| GitHub repo | https://github.com/ggml-org/llama.cpp |
| Server API docs (README) | https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md |
| Installation guide | https://mintlify.com/ggml-org/llama.cpp/installation |
| Obtaining models | https://mintlify.com/ggml-org/llama.cpp/models/obtaining-models |
| HF GGUF usage guide | https://huggingface.co/docs/hub/gguf-llamacpp |
| Browse all GGUF models | https://huggingface.co/models?library=gguf |
| llama-cpp-python (Python bindings) | https://llama-cpp-python.readthedocs.io/ |
| Multi-model management | https://huggingface.co/blog/ggml-org/model-management-in-llamacpp |