I found a way to run llama.cpp with GPU support on my 2019 Intel MacBook Pro, with the help of Vulkan.
Thanks to ollama/ollama#1016 (comment) and the following comments, which helped me find the right build flags.
First, download MoltenVK-macos.tar (I used v1.3.0-rc1 at the time) from https://github.com/KhronosGroup/MoltenVK/releases
Not sure all these packages are required, but I ran brew install libomp vulkan-headers glslang molten-vk shaderc vulkan-loader.
You probably also need to install cmake and potentially other tools I already had installed on my system.
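For reference, here is roughly how the MoltenVK archive can be unpacked so the paths used below line up; the ~/Downloads and ~/MoltenVK_macos locations are just example paths, and the folder layout inside the tar may differ between MoltenVK releases:
mkdir -p ~/MoltenVK_macos
tar -xf ~/Downloads/MoltenVK-macos.tar -C ~/MoltenVK_macos
# the dylib that the cmake command below points at should then be here:
ls ~/MoltenVK_macos/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib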
Then clone and build llama.cpp (replace /path/to/MoltenVK_macos with the actual path where you extracted MoltenVK_macos):
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DLLAMA_CURL=1 -DGGML_METAL=OFF -DGGML_VULKAN=1 \
-DVulkan_INCLUDE_DIR=/path/to/MoltenVK_macos/MoltenVK/include/ \
-DVulkan_LIBRARY=/path/to/MoltenVK_macos/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib \
-DVulkan_GLSLC_EXECUTABLE=$(brew --prefix)/opt/shaderc/bin/glslc \
-DVulkan_GLSLANG_VALIDATOR_EXECUTABLE=$(brew --prefix)/opt/glslang/bin/glslangValidator \
-DOpenMP_ROOT=$(brew --prefix)/opt/libomp \
-DOpenMP_C_FLAGS="-Xpreprocessor -fopenmp $(brew --prefix)/opt/libomp/lib/libomp.dylib -I$(brew --prefix)/opt/libomp/include" \
-DOpenMP_CXX_FLAGS="-Xpreprocessor -fopenmp $(brew --prefix)/opt/libomp/lib/libomp.dylib -I$(brew --prefix)/opt/libomp/include" \
-DOpenMP_C_LIB_NAMES="libomp" \
-DOpenMP_CXX_LIB_NAMES="libomp" \
-DOpenMP_libomp_LIBRARY="$(brew --prefix)/opt/libomp/lib/libomp.dylib"
cmake --build build --config Release -j 8
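After a successful build, you can sanity-check that the Vulkan backend sees your GPUs. I believe recent llama.cpp builds have a --list-devices flag for this; if yours doesn't, the same ggml_vulkan device lines shown further below are printed when llama-cli starts anyway:
./build/bin/llama-cli --list-devices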
Then do a test run of llama.cpp in CLI mode:
./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 16384 --flash-attn -ctv q4_0 -ctk q4_0
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU model buffer size = 166.92 MiB
load_tensors: Vulkan0 model buffer size = 1123.71 MiB
With that I see low CPU usage and about 90% GPU usage. A ctx-size of 16384 has been working well for me (if you increase it, some data gets stored in RAM instead of VRAM). The --flash-attn -ctv q4_0 -ctk q4_0 flags help with tokens/second performance.
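If you prefer an HTTP endpoint over the CLI, the same flags should also work with the bundled llama-server binary; a minimal sketch, with host and port as example values:
./build/bin/llama-server -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 16384 --flash-attn -ctv q4_0 -ctk q4_0 --host 127.0.0.1 --port 8080
Any OpenAI-compatible client can then be pointed at http://127.0.0.1:8080/v1.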
I tried different small models that fit on my Radeon Pro 555X 4 GB and found that Phi-4-mini-instruct.Q5_K_M.gguf had the best performance and usability for me.
- Phi-4-mini-instruct.Q5_K_M.gguf: pretty good performance, fully uses the GPU. I used it for summarizing ~20 min YouTube video transcripts. As this is not a reasoning model, it doesn't break like some of the models below.
  - Tried Phi-4-mini-instruct (4B) from https://huggingface.co/MaziyarPanahi/Phi-4-mini-instruct-GGUF/blob/main/Phi-4-mini-instruct.Q5_K_M.gguf (see the download note after this list).
- Qwen3-4B-Q6_K.gguf: GPU sits at about 10%, CPU is idle, and the model is very slow.
- Phi-4-mini-reasoning-UD-Q4_K_XL.gguf: when reasoning, it has a high chance of breaking (outputting gibberish) after a while, even when just asked to summarize YouTube transcripts.
- gemma-3n-E2B-it-GGUF: ran into "decode: failed to find a memory slot for batch of size 2048" followed by "main : failed to eval". Tried gemma-3n-E2B-it-UD-IQ2_XXS.gguf, same thing.
- SmolLM3-3B: when thinking, it has a chance of breaking (in my case an infinite loop printing the same word).
  - They say to use --jinja: ./build/bin/llama-cli -m ../SmolLM3-3B-Q6_K.gguf --no-mmap -ngl 100 --jinja
- Llama-3.2-1B-Instruct-Q4_K_M.gguf: okay, but its general knowledge is poorer than Phi-4-mini-instruct's in my experience.
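These GGUF files all come from Hugging Face. Since the build above enables LLAMA_CURL, llama-cli should also be able to fetch a model directly with the -hf flag instead of downloading the file by hand; I haven't checked which quant tags that particular repo exposes, so treat this as a sketch:
./build/bin/llama-cli -hf MaziyarPanahi/Phi-4-mini-instruct-GGUF:Q5_K_M --no-mmap -ngl 100 --ctx-size 16384 --flash-attn -ctv q4_0 -ctk q4_0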
One thing to note: when sending a large prompt (for example a YouTube transcript plus "Give me a bullet point list of the main things I should remember from this video transcript"), my Mac gets very slow until the first token is output, especially when other apps use the GPU. I noticed this in particular when the Brave browser was visible on screen; having Safari open doesn't cause the same slowdown.
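To avoid pasting a whole transcript into the terminal, llama-cli can read the prompt from a file with -f; transcript.txt below is just a placeholder for a file containing the transcript plus the summarization instruction:
./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 16384 --flash-attn -ctv q4_0 -ctk q4_0 -f transcript.txt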
You can also run on the integrated GPU (the dedicated GPU is favored by default):
GGML_VK_VISIBLE_DEVICES=1 ./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 4096 --flash-attn -ctv q4_0 -ctk q4_0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) UHD Graphics 630 (MoltenVK) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
--> pretty slow
Try running on both GPUs:
GGML_VK_VISIBLE_DEVICES=0,1 ./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 4096 --flash-attn -ctv q4_0 -ctk q4_0 --main-gpu 0
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Pro 555X (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = Intel(R) UHD Graphics 630 (MoltenVK) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
I noticed that:
- it uses only the integrated GPU until the first token is out, then the integrated GPU runs at 99% and the dedicated one at 5% or 10%
- it does 3.41 tok/s (versus 10 tok/s when using only the AMD GPU)
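If you want to experiment with how the model is split across the two GPUs, llama.cpp also has a --tensor-split option; the 3,1 ratio below is purely illustrative, and given the numbers above this is unlikely to beat the AMD-only run:
GGML_VK_VISIBLE_DEVICES=0,1 ./build/bin/llama-cli -m ../models/Phi-4-mini-instruct.Q5_K_M.gguf --no-mmap -ngl 100 --ctx-size 4096 --flash-attn -ctv q4_0 -ctk q4_0 --main-gpu 0 --tensor-split 3,1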