# Running llama.cpp on GPU on Intel Mac

I found a way to run llama.cpp with GPU support on my 2019 Intel MacBook Pro, with the help of Vulkan.

Thanks to ollama/ollama#1016 (comment) and the following comments, which helped me find the right build flags.

First, download MoltenVK-macos.tar (I used v1.3.0-rc1 at the time) from https://github.com/KhronosGroup/MoltenVK/releases.
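In case it helps, here is one way to fetch and unpack that release from the shell (a sketch; the asset URL assumes GitHub's usual release layout, so adjust the tag if you use a newer version):

```sh
# Download and extract the MoltenVK release (v1.3.0-rc1 here; pick the tag you want)
curl -LO https://github.com/KhronosGroup/MoltenVK/releases/download/v1.3.0-rc1/MoltenVK-macos.tar
tar -xf MoltenVK-macos.tar    # should unpack a MoltenVK/ directory into the current folder
```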

I'm not sure all of these packages are required, but I ran:

```sh
brew install libomp vulkan-headers glslang molten-vk shaderc vulkan-loader
```

You probably also need to install cmake, and potentially other tools I already had installed on my system.
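Before building, it may be worth checking that the Vulkan loader can actually see MoltenVK. This is an extra step of my own, not strictly required, and it assumes you also install vulkan-tools:

```sh
# Optional sanity check (needs: brew install vulkan-tools)
vulkaninfo --summary
# The output should list your GPU(s) as Vulkan devices backed by MoltenVK
```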

Then clone and build llama.cpp (replace /path/to/MoltenVK_macos with the path where you extracted MoltenVK_macos):

```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DLLAMA_CURL=1 -DGGML_METAL=OFF -DGGML_VULKAN=1 \
  -DVulkan_INCLUDE_DIR=/path/to/MoltenVK_macos/MoltenVK/include/ \
  -DVulkan_LIBRARY=/path/to/MoltenVK_macos/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib \
  -DVulkan_GLSLC_EXECUTABLE=$(brew --prefix)/opt/shaderc/bin/glslc \
  -DVulkan_GLSLANG_VALIDATOR_EXECUTABLE=$(brew --prefix)/opt/glslang/bin/glslangValidator \
  -DOpenMP_ROOT=$(brew --prefix)/opt/libomp \
  -DOpenMP_C_FLAGS="-Xpreprocessor -fopenmp $(brew --prefix)/opt/libomp/lib/libomp.dylib -I$(brew --prefix)/opt/libomp/include" \
  -DOpenMP_CXX_FLAGS="-Xpreprocessor -fopenmp $(brew --prefix)/opt/libomp/lib/libomp.dylib -I$(brew --prefix)/opt/libomp/include" \
  -DOpenMP_C_LIB_NAMES="libomp" \
  -DOpenMP_CXX_LIB_NAMES="libomp" \
  -DOpenMP_libomp_LIBRARY="$(brew --prefix)/opt/libomp/lib/libomp.dylib"
cmake --build build --config Release -j 8
```
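You also need a GGUF model to test with. Here is a sketch of how you could fetch one into the ../models directory the commands below expect; the Hugging Face repo path is a placeholder, not from the original write-up, and any Q5_K_M GGUF of Phi-4-mini-instruct that fits in 4 GB of VRAM should do:

```sh
# Placeholder URL: substitute a real Hugging Face repo hosting the GGUF file
mkdir -p ../models
curl -L -o ../models/Phi-4-mini-instruct.Q5_K_M.gguf \
  "https://huggingface.co/<user>/<repo>/resolve/main/Phi-4-mini-instruct.Q5_K_M.gguf"
```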

Then do a test run of llama.cpp in CLI mode:

```
./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 16384 --flash-attn -ctv q4_0 -ctk q4_0
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:          CPU model buffer size =   166.92 MiB
load_tensors:      Vulkan0 model buffer size =  1123.71 MiB
```

With that I see low CPU usage and ~90% GPU usage. A --ctx-size of 16384 has been working well for me (if you increase it, llama.cpp will place some buffers in RAM instead of VRAM). The --flash-attn -ctv q4_0 -ctk q4_0 flags help with tokens/second.
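If you want to quantify what those flags buy you, llama.cpp also ships a llama-bench tool. A minimal sketch (my suggestion, assuming llama-bench accepts the same cache-type flags as llama-cli):

```sh
# Throughput with flash attention + q4_0 KV cache, then with the defaults
./build/bin/llama-bench -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf -ngl 100 -fa 1 -ctk q4_0 -ctv q4_0
./build/bin/llama-bench -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf -ngl 100 -fa 0
```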

I tried different small models that fit on my 4 GB Radeon Pro 555X and found that Phi-4-mini-instruct.Q5_K_M.gguf had the best performance and usability for me.

One thing to note: when sending a large prompt (for example a YouTube transcript plus "Give me a bullet point list of the main things I should remember from this video transcript"), my Mac gets very slow until the first token is produced, especially when other apps are using the GPU. I noticed this in particular with the Brave browser visible on screen; having Safari open doesn't cause it.
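For long prompts like that, rather than pasting into the interactive prompt, one option (my own workflow sketch, not from the original steps) is to build the prompt in the shell from a transcript file:

```sh
# transcript.txt holds the pasted YouTube transcript (hypothetical file name)
./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 \
  --ctx-size 16384 --flash-attn -ctv q4_0 -ctk q4_0 \
  -p "$(cat transcript.txt)

Give me a bullet point list of the main things I should remember from this video transcript."
```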

## Other experiments

llama.cpp can also run on the integrated GPU (it favors the dedicated GPU by default):

```
GGML_VK_VISIBLE_DEVICES=1 ./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 4096 --flash-attn -ctv q4_0 -ctk q4_0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) UHD Graphics 630 (MoltenVK) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
```

--> pretty slow

Try running on both GPUs:

```
GGML_VK_VISIBLE_DEVICES=0,1 ./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 4096 --flash-attn -ctv q4_0 -ctk q4_0 --main-gpu 0
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Pro 555X (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = Intel(R) UHD Graphics 630 (MoltenVK) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
```

I noticed that:

  • it uses only the integrated GPU until the first token is out, then runs the integrated GPU at 99% and the dedicated GPU at 5-10%
  • it does 3.41 tok/s (versus 10 tok/s using only the AMD GPU)
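If you want to experiment further, llama-cli's --tensor-split option lets you bias how the layers are divided between the two devices. A sketch (I haven't verified this improves on the numbers above):

```sh
# Ask for ~80% of the split on device 0 (AMD) and ~20% on device 1 (Intel)
GGML_VK_VISIBLE_DEVICES=0,1 ./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf \
  --no-mmap -ngl 100 --ctx-size 4096 --flash-attn -ctv q4_0 -ctk q4_0 \
  --main-gpu 0 --tensor-split 80,20
```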