The -hf flag automatically downloads and caches models from Hugging Face:
# Simplest: auto-downloads and starts conversation mode
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
# Pick a specific quant
llama-cli -hf ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
# With a system prompt
llama-cli -hf bartowski/Qwen3-8B-GGUF:Q4_K_M \
-cnv -sys "You are a helpful coding assistant"

# Local file you already downloaded
llama-cli -m ./models/my-model.gguf -cnv
# Full GPU offload + large context
llama-cli -m model.gguf -cnv -ngl 99 -c 16384
# Option A: at server startup (global)
llama-server -m model.gguf --reasoning-budget 0
# Option B: at server startup via template kwargs
llama-server -m model.gguf \
--chat-template-kwargs '{"enable_thinking": false}'

# Option C: per-request in the API body (see curl/python below)
# add: "chat_template_kwargs": {"enable_thinking": false}
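As a sketch of Option C, the per-request toggle is just an extra key in the JSON body sent to /v1/chat/completions. The helper below (chat_body is a made-up name for illustration, not part of llama.cpp) assembles such a body:

```python
def chat_body(messages, enable_thinking=False, **params):
    """Build a /v1/chat/completions request body that toggles
    reasoning per request via chat_template_kwargs (Option C)."""
    body = {"messages": messages, **params}
    body["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return body

# Example: a request body with thinking disabled
body = chat_body(
    [{"role": "user", "content": "Hello"}],
    max_tokens=64,
)
```

The resulting dict can be passed as the `json=` argument to any HTTP client, as in the requests example further below.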
Context Window
# Set context size (default is usually 2048-4096)
llama-server -m model.gguf -c 16384
# For models trained with extended context + RoPE scaling
llama-server -m model.gguf -c 32768 --rope-scale 4
# Rule: bigger context = more RAM. Use flash attention to reduce VRAM:
llama-server -m model.gguf -c 32768 -fa
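To see why a bigger context costs more memory, a back-of-the-envelope KV-cache estimate helps. The numbers below (32 layers, 8 KV heads, head dim 128, f16 cache) are illustrative assumptions roughly in line with an 8B-class model, not values read from any particular GGUF:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: two tensors (K and V) per layer,
    each holding n_ctx x n_kv_heads x head_dim elements."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 8B-class model with a 32k context and an f16 KV cache:
size = kv_cache_bytes(n_ctx=32768, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # -> 4.0 GiB
```

Doubling -c doubles this figure, which is why long contexts can exhaust VRAM even when the weights themselves fit.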
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{ "prompt": "The meaning of life is", "max_tokens": 128, "temperature": 0.9 }'
Health Check
curl http://localhost:8080/health
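Scripts that launch llama-server and then call it often need to poll /health until the model finishes loading. This sketch takes the status fetch as a callable so it stays transport-agnostic; wait_until_healthy is a hypothetical helper name, not a llama.cpp API:

```python
import time

def wait_until_healthy(fetch_status, timeout_s=60.0, interval_s=1.0, sleep=time.sleep):
    """Poll fetch_status() until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch_status() == 200:
                return True
        except OSError:
            pass  # server not accepting connections yet
        sleep(interval_s)
    return False
```

With requests, for example: `wait_until_healthy(lambda: requests.get("http://localhost:8080/health", timeout=2).status_code)`.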
With Python (openai library, recommended)
pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="no-key-required",  # llama.cpp doesn't need a real key
)

# --- Chat completion ---
response = client.chat.completions.create(
    model="any-string",  # model name is ignored (single-model mode)
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in 3 sentences."},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
# --- Streaming ---
stream = client.chat.completions.create(
    model="any-string",
    messages=[{"role": "user", "content": "Write a short poem about Rust."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
With Python (requests, raw HTTP)
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.3,
        "max_tokens": 100,
        # Disable thinking for reasoning models:
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
data = resp.json()
print(data["choices"][0]["message"]["content"])
Multi-model Router Mode
When running in router mode (llama-server --models-dir ./models), specify which model to use in the model field of each request:
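A minimal sketch of such a request, assuming a served model named qwen3-8b exists under ./models (the name is illustrative):

```python
# In router mode the server picks the model from the "model" field,
# so it must match one of the models the server was started with.
payload = {
    "model": "qwen3-8b",  # illustrative; must name a model in --models-dir
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 32,
}

# Send it the same way as the single-model examples above, e.g.:
# requests.post("http://localhost:8080/v1/chat/completions", json=payload)
```

The openai client works the same way: pass the served model's name as the `model` argument instead of an arbitrary string.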