llama-cpp-python Vulkan Windows setup

- Install Vulkan from https://vulkan.lunarg.com/sdk/home (only the core is required; leave everything else unchecked)
- (Re)Install llama-cpp-python (a Python venv is recommended, but not necessary)

  For PowerShell:

  ```powershell
  $env:CMAKE_ARGS = "-DGGML_VULKAN=1"
  pip install llama-cpp-python --no-cache-dir --force-reinstall -v
  ```

  For CMD/Batch:

  ```bat
  set "CMAKE_ARGS=-DGGML_VULKAN=1"
  pip install llama-cpp-python --no-cache-dir --force-reinstall -v
  ```
- Import and initialize the model in your Python script

  ```python
  from llama_cpp import Llama

  llm = Llama(
      model_path=model_path,  # Path to your gguf model file
      verbose=True,           # True is needed for GPU
      n_gpu_layers=-1,        # -1 tries to load all layers into VRAM
      n_ctx=2048,             # Context window
  )
  ```
- Use the LLM

  Either with a basic prompt:

  ```python
  output = llm("Q: Name the planets in the solar system? A: ")
  ```
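  The call returns an OpenAI-style completion dict; for example, the generated text sits in the first choice:

  ```python
  # `output` follows the OpenAI completion format
  print(output["choices"][0]["text"])
  ```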
  Or with chat completion:

  ```python
  prompt = [
      {"role": "system", "content": "You are a translator; respond to the user with the English translation of their sentence."},
      {"role": "user", "content": "圆圆"},
      {"role": "assistant", "content": "Yuanyuan"},
      {"role": "user", "content": "烧卖、虾饺"},
  ]
  output = llm.create_chat_completion(messages=prompt)
  ```
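  Chat completions use the same OpenAI-style dict, but nest the reply under "message" instead of "text":

  ```python
  # The assistant's reply is in the first choice's message
  print(output["choices"][0]["message"]["content"])
  ```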
- Try out different models and parameters
  - If you have 32 GB of memory, I recommend Qwen2.5-14B-Instruct (or 32B on Linux)
  - Make sure your chat_format parameter is correct, and try different temperature values (see the sketch below)
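  A minimal sketch of setting both, assuming a ChatML-based model such as Qwen (the model filename and temperature value here are placeholders, not recommendations):

  ```python
  from llama_cpp import Llama

  llm = Llama(
      model_path="qwen2.5-14b-instruct-q4_k_m.gguf",  # hypothetical filename
      chat_format="chatml",  # must match the model's prompt template
      n_gpu_layers=-1,
      n_ctx=2048,
      verbose=True,
  )

  output = llm.create_chat_completion(
      messages=[{"role": "user", "content": "Name the planets in the solar system."}],
      temperature=0.7,  # placeholder; lower values are more deterministic
  )
  print(output["choices"][0]["message"]["content"])
  ```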