llama-cpp-python Vulkan Windows setup
  1. Install the Vulkan SDK from https://vulkan.lunarg.com/sdk/home (only the core is required; leave the optional components unchecked)
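
    To confirm the runtime and your GPU driver are working, you can run the vulkaninfo tool bundled with the SDK; your GPU should appear in the device list:

    vulkaninfo --summary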

  2. (Re)install llama-cpp-python (a Python venv is recommended, but not required)

    For PowerShell

    $env:CMAKE_ARGS="-DGGML_VULKAN=1"
    pip install llama-cpp-python --no-cache-dir --force-reinstall -v

    For CMD/Batch

    set "CMAKE_ARGS=-DGGML_VULKAN=1"
    pip install llama-cpp-python --no-cache-dir --force-reinstall -v
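
    To sanity-check the build, you can ask the bindings whether GPU offload is available. This is a sketch assuming your llama-cpp-python version exposes the low-level llama_supports_gpu_offload binding:

    python -c "import llama_cpp; print(llama_cpp.llama_supports_gpu_offload())"

    If this prints False, the wheel was likely built without Vulkan; re-run the install and look for GGML_VULKAN in the verbose compile output.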
  3. Import and initialize the model in your Python script

    from llama_cpp import Llama
    llm = Llama(
        model_path=model_path,  # Path to your gguf model file
        verbose=True,           # True is needed for GPU
        n_gpu_layers=-1,        # -1 tries to load all layers into VRAM
        n_ctx=2048,             # Context window
    )
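
    With verbose=True, the load logs report how many layers were offloaded to the GPU. If the model doesn't fit in VRAM, you can offload only part of it; the count below is a placeholder to tune for your GPU:

    llm = Llama(
        model_path=model_path,
        n_gpu_layers=20,  # offload only the first 20 layers instead of all of them
        n_ctx=2048,
    )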
  4. Use the LLM

    Either using a basic prompt

    output = llm("Q: Name the planets in the solar system? A: ")

    Or using chat completion

    prompt = [
        {"role": "system", "content": "You are a translator; respond to the user with the English translation of their sentence."},
        {"role": "user", "content": "圆圆"},
        {"role": "assistant", "content": "Yuanyuan"},  # few-shot example reply from the model
        {"role": "user", "content": "烧卖、虾饺"},
    ]
    output = llm.create_chat_completion(messages=prompt)
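
    Both calls return an OpenAI-style dict; assuming the default non-streaming return format, the generated text can be read like this:

    print(output["choices"][0]["text"])                # basic prompt
    print(output["choices"][0]["message"]["content"])  # chat completion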
  5. Try out different models and parameters

    1. If you have 32 GB of memory, I recommend Qwen2.5-14B-Instruct (or the 32B variant on Linux)
    2. Make sure the chat_format parameter matches your model's prompt template, and try different temperature values (see the sketch below)
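
    As a rough sketch of both parameters (the "qwen" format name and the 0.2 temperature are assumptions; pick values that match your model):

    llm = Llama(
        model_path=model_path,
        n_gpu_layers=-1,
        chat_format="qwen",  # must match the model's prompt template
    )
    output = llm.create_chat_completion(messages=prompt, temperature=0.2)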