llama-cpp-python Vulkan Windows setup
  1. Install the Vulkan SDK from https://vulkan.lunarg.com/sdk/home (only the core is required; leave the optional components unchecked)
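
    To confirm the runtime and your GPU driver are working, you can run the vulkaninfo tool bundled with the SDK; your GPU should appear in the device list:

    vulkaninfo --summary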

  2. (Re)install llama-cpp-python (a Python venv is recommended, but not required)

    For PowerShell

    $env:CMAKE_ARGS="-DGGML_VULKAN=1"
    pip install llama-cpp-python --no-cache-dir --force-reinstall -v

    For CMD/Batch

    set "CMAKE_ARGS=-DGGML_VULKAN=1"
    pip install llama-cpp-python --no-cache-dir --force-reinstall -v
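
    To sanity-check the build, you can ask the bindings whether GPU offload is available. This is a sketch assuming your llama-cpp-python version exposes the low-level llama_supports_gpu_offload binding:

    python -c "import llama_cpp; print(llama_cpp.llama_supports_gpu_offload())"

    If this prints False, the wheel was likely built without Vulkan; re-run the install and look for GGML_VULKAN in the verbose compile output.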
  3. Import and initialize the model in your Python script

    from llama_cpp import Llama
    llm = Llama(
        model_path=model_path,  # Path to your gguf model file
        verbose=True,           # True is needed for GPU
        n_gpu_layers=-1,        # -1 tries to load all layers into VRAM
        n_ctx=2048,             # Context window
    )
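
    With verbose=True, the load logs report how many layers were offloaded to the GPU. If the model doesn't fit in VRAM, you can offload only part of it; the count below is a placeholder to tune for your GPU:

    llm = Llama(
        model_path=model_path,
        n_gpu_layers=20,  # offload only the first 20 layers instead of all of them
        n_ctx=2048,
    )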
  4. Use the LLM

    Either using a basic prompt

    output = llm("Q: Name the planets in the solar system? A: ")

    Or using chat completion

    prompt = [
        {"role": "system", "content": "You are a translator; respond to the user with the English translation of their sentence."},
        {"role": "user", "content": "圆圆"},
        {"role": "assistant", "content": "Yuanyuan"},  # few-shot example reply from the model
        {"role": "user", "content": "烧卖、虾饺"},
    ]
    output = llm.create_chat_completion(messages=prompt)
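
    Both calls return an OpenAI-style dict; assuming the default non-streaming return format, the generated text can be read like this:

    print(output["choices"][0]["text"])                # basic prompt
    print(output["choices"][0]["message"]["content"])  # chat completion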
  5. Try out different models and parameters

    1. If you have 32 GB of memory, I recommend Qwen2.5-14B-Instruct (or the 32B variant on Linux)
    2. Make sure the chat_format parameter matches your model's prompt template, and try different temperature values (see the sketch below)
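
    As a rough sketch of both parameters (the "qwen" format name and the 0.2 temperature are assumptions; pick values that match your model):

    llm = Llama(
        model_path=model_path,
        n_gpu_layers=-1,
        chat_format="qwen",  # must match the model's prompt template
    )
    output = llm.create_chat_completion(messages=prompt, temperature=0.2)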