I found a way to run llama.cpp with GPU support on my 2019 Intel MacBook Pro, with the help of Vulkan.
Thanks to ollama/ollama#1016 (comment) and the following comments, which helped me find the right build flags.
First, download MoltenVK-macos.tar (I used v1.3.0-rc1 at the time) from https://github.com/KhronosGroup/MoltenVK/releases
Not sure all these packages are required, but I ran brew install libomp vulkan-headers glslang molten-vk shaderc vulkan-loader.
You probably also need to install cmake and potentially other tools I already had installed on my system.
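For reference, here is roughly how the MoltenVK archive can be unpacked so the paths used below line up; the ~/Downloads and ~/MoltenVK_macos locations are just example paths, and the folder layout inside the tar may differ between MoltenVK releases:
mkdir -p ~/MoltenVK_macos
tar -xf ~/Downloads/MoltenVK-macos.tar -C ~/MoltenVK_macos
# the dylib that the cmake command below points at should then be here:
ls ~/MoltenVK_macos/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib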
Then clone and build llama.cpp (replace /path/to/MoltenVK_macos with the actual path where you extracted MoltenVK_macos):
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DLLAMA_CURL=1 -DGGML_METAL=OFF -DGGML_VULKAN=1 \
-DVulkan_INCLUDE_DIR=/path/to/MoltenVK_macos/MoltenVK/include/ \
-DVulkan_LIBRARY=/path/to/MoltenVK_macos/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib \
-DVulkan_GLSLC_EXECUTABLE=$(brew --prefix)/opt/shaderc/bin/glslc \
-DVulkan_GLSLANG_VALIDATOR_EXECUTABLE=$(brew --prefix)/opt/glslang/bin/glslangValidator \
-DOpenMP_ROOT=$(brew --prefix)/opt/libomp \
-DOpenMP_C_FLAGS="-Xpreprocessor -fopenmp $(brew --prefix)/opt/libomp/lib/libomp.dylib -I$(brew --prefix)/opt/libomp/include" \
-DOpenMP_CXX_FLAGS="-Xpreprocessor -fopenmp $(brew --prefix)/opt/libomp/lib/libomp.dylib -I$(brew --prefix)/opt/libomp/include" \
-DOpenMP_C_LIB_NAMES="libomp" \
-DOpenMP_CXX_LIB_NAMES="libomp" \
-DOpenMP_libomp_LIBRARY="$(brew --prefix)/opt/libomp/lib/libomp.dylib"
cmake --build build --config Release -j 8
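After a successful build, you can sanity-check that the Vulkan backend sees your GPUs. I believe recent llama.cpp builds have a --list-devices flag for this; if yours doesn't, the same ggml_vulkan device lines shown further below are printed when llama-cli starts anyway:
./build/bin/llama-cli --list-devices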
Then do a test run of llama.cpp in CLI mode:
./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 16384 --flash-attn -ctv q4_0 -ctk q4_0
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU model buffer size = 166.92 MiB
load_tensors: Vulkan0 model buffer size = 1123.71 MiB
With that I see low CPU usage and about 90% GPU usage. A ctx-size of 16384 has been working well for me (if you increase it, some data gets stored in RAM instead of VRAM). The --flash-attn -ctv q4_0 -ctk q4_0 flags help with tokens/second performance.
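If you prefer an HTTP endpoint over the CLI, the same flags should also work with the bundled llama-server binary; a minimal sketch, with host and port as example values:
./build/bin/llama-server -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 16384 --flash-attn -ctv q4_0 -ctk q4_0 --host 127.0.0.1 --port 8080
Any OpenAI-compatible client can then be pointed at http://127.0.0.1:8080/v1.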
I tried different small models that fit on my Radeon Pro 555X 4 GB and found that Phi-4-mini-instruct.Q5_K_M.gguf had the best performance and usability for me.
- Phi-4-mini-instruct.Q5_K_M.gguf: pretty good performance, fully uses the GPU. I used it for summarizing ~20 min YouTube video transcripts. As this is not a reasoning model, it doesn't break like some of the models below.
  - Tried Phi-4-mini-instruct (4B) from https://huggingface.co/MaziyarPanahi/Phi-4-mini-instruct-GGUF/blob/main/Phi-4-mini-instruct.Q5_K_M.gguf (see the download note after this list).
- Qwen3-4B-Q6_K.gguf: GPU sits at about 10%, CPU is idle, and the model is very slow.
- Phi-4-mini-reasoning-UD-Q4_K_XL.gguf: when reasoning, it has a high chance of breaking (outputting gibberish) after a while, even when just asked to summarize YouTube transcripts.
- gemma-3n-E2B-it-GGUF: ran into "decode: failed to find a memory slot for batch of size 2048" followed by "main : failed to eval". Tried gemma-3n-E2B-it-UD-IQ2_XXS.gguf, same thing.
- SmolLM3-3B: when thinking, it has a chance of breaking (in my case an infinite loop printing the same word).
  - They say to use --jinja: ./build/bin/llama-cli -m ../SmolLM3-3B-Q6_K.gguf --no-mmap -ngl 100 --jinja
- Llama-3.2-1B-Instruct-Q4_K_M.gguf: okay, but its general knowledge is poorer than Phi-4-mini-instruct's in my experience.
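These GGUF files all come from Hugging Face. Since the build above enables LLAMA_CURL, llama-cli should also be able to fetch a model directly with the -hf flag instead of downloading the file by hand; I haven't checked which quant tags that particular repo exposes, so treat this as a sketch:
./build/bin/llama-cli -hf MaziyarPanahi/Phi-4-mini-instruct-GGUF:Q5_K_M --no-mmap -ngl 100 --ctx-size 16384 --flash-attn -ctv q4_0 -ctk q4_0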
One thing to note: when sending a large prompt (for example a YouTube transcript plus "Give me a bullet point list of the main things I should remember from this video transcript"), my Mac gets very slow until the first token is output, especially when other apps use the GPU. I noticed this in particular when the Brave browser was visible on screen; having Safari open doesn't cause the same slowdown.
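To avoid pasting a whole transcript into the terminal, llama-cli can read the prompt from a file with -f; transcript.txt below is just a placeholder for a file containing the transcript plus the summarization instruction:
./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 16384 --flash-attn -ctv q4_0 -ctk q4_0 -f transcript.txt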
You can also run on the integrated GPU (the dedicated GPU is favored by default):
GGML_VK_VISIBLE_DEVICES=1 ./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 4096 --flash-attn -ctv q4_0 -ctk q4_0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) UHD Graphics 630 (MoltenVK) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
--> pretty slow
Try running on both GPUs:
GGML_VK_VISIBLE_DEVICES=0,1 ./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 4096 --flash-attn -ctv q4_0 -ctk q4_0 --main-gpu 0
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Pro 555X (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = Intel(R) UHD Graphics 630 (MoltenVK) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
I noticed that:
- it uses only the integrated GPU until the first token is out, then the integrated GPU runs at 99% and the dedicated one at 5% or 10%
- it does 3.41 tok/s (versus 10 tok/s when using only the AMD GPU)
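If you want to experiment with how the model is split across the two GPUs, llama.cpp also has a --tensor-split option; the 3,1 ratio below is purely illustrative, and given the numbers above this is unlikely to beat the AMD-only run:
GGML_VK_VISIBLE_DEVICES=0,1 ./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 4096 --flash-attn -ctv q4_0 -ctk q4_0 --main-gpu 0 --tensor-split 3,1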