Based on ggerganov/llama.cpp#4167
PP means "prompt processing" (bs = 512), TG means "text-generation" (bs = 1), t/s means "tokens per second"
Device | BW [GB/s] | GPU Cores | F16 PP [t/s] | F16 TG [t/s] | Q8_0 PP [t/s] | Q8_0 TG [t/s] | Q4_0 PP [t/s] | Q4_0 TG [t/s] |
---|---|---|---|---|---|---|---|---|
✅ M1 Pro 16GB | 200 | 14 | 262.65 | 12.75 | 235.16 | 21.95 | 232.55 | 35.52 |
✅ M3 Pro 36GB | 150 | 18 | 357.45 | 9.89 | 344.66 | 17.53 | 341.67 | 30.74 |
✅ 3070 mobile 8GB* | 448 | 5120 | 14.35 | 0.82 | 56.12 | 39.48 | 1735.10 | 64.22 |
*With only 8GB of VRAM, the F16 model and the Q8_0 PP test did not fit in GPU memory, hence the low figures in those columns
This is a collection of short llama.cpp benchmarks on various hardware configurations. It can be useful for comparing the performance llama.cpp achieves across devices.
CPU and Apple Silicon (Metal)
```bash
git checkout 8e672efe
make clean && make -j llama-bench && ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null
```
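The commands above assume the F16 and quantized GGUF files already exist under ./models/llama-7b-v2/. If they don't, here is a rough sketch of producing them with the conversion and quantization tools as they were named around this commit (convert.py and quantize); the paths are assumptions, so check the repo at the pinned commit:

```bash
# Convert the original Llama-2-7B weights to an F16 GGUF, then derive the quantized variants.
python3 convert.py ./models/llama-7b-v2/ --outtype f16 \
  --outfile ./models/llama-7b-v2/ggml-model-f16.gguf
make -j quantize
./quantize ./models/llama-7b-v2/ggml-model-f16.gguf ./models/llama-7b-v2/ggml-model-q8_0.gguf q8_0
./quantize ./models/llama-7b-v2/ggml-model-f16.gguf ./models/llama-7b-v2/ggml-model-q4_0.gguf q4_0
```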
CUDA
```bash
git checkout 8e672efe
make clean && LLAMA_CUBLAS=1 make -j llama-bench && ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null
```
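`-ngl 99` offloads all layers to the GPU, which is how the table above was collected. On cards where a model does not fit (see the 8GB 3070 mobile footnote), llama-bench can offload only part of the model; the layer count below is purely illustrative, needs tuning per card, and produces numbers that are not directly comparable to the fully offloaded results:

```bash
# Offload only some layers when VRAM is limited (20 is an illustrative value, not a recommendation).
./llama-bench -m ./models/llama-7b-v2/ggml-model-f16.gguf \
  -p 512 -n 128 -ngl 20
```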
Make sure to run the benchmark on commit 8e672efe so the results stay comparable. The commands above run the following six tests:
model | size | params | test |
---|---|---|---|
llama 7B mostly F16 | 12.55 GiB | 6.74 B | pp 512 |
llama 7B mostly F16 | 12.55 GiB | 6.74 B | tg 128 |
llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | pp 512 |
llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | tg 128 |
llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | pp 512 |
llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | tg 128 |
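To collect results programmatically instead of copying the markdown table by hand, llama-bench can emit other output formats; the `-o` flag and its `json` value shown here are an assumption about the tool's options at this commit:

```bash
# Same benchmark, but write machine-readable JSON instead of the default markdown table.
./llama-bench -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 -o json > results.json
```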
Actual performance in use is a mix of PP and TG processing. Comparing the M1 Pro and M3 Pro machines in the table above, it can be seen that the M1 Pro performs better in TG thanks to its higher memory bandwidth (200 GB/s vs 150 GB/s), while the inverse is true in PP, where the M3 Pro benefits from more GPU cores and a newer architecture.
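As a rough sketch of how the two phases combine, the snippet below estimates end-to-end time for a single 512-token prompt and 128 generated tokens from the Q4_0 rates in the table; real applications differ in model, quantization, prompt length, and overhead, so this only illustrates the trade-off:

```bash
# t_total = prompt_tokens/PP + generated_tokens/TG, using the Q4_0 rates from the table above.
for dev in "M1_Pro 232.55 35.52" "M3_Pro 341.67 30.74"; do
  set -- $dev
  awk -v name="$1" -v pp="$2" -v tg="$3" \
    'BEGIN { printf "%s: %.1f s\n", name, 512 / pp + 128 / tg }'
done
```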
In actual use with a 7B-parameter OpenChat-3.5 model in LM Studio, the M3 Pro performs ~30% better than the M1 Pro on a single question and response, even though a single exchange is the worst-case PP/TG balance for the M3 Pro. In a multi-turn chat, PP grows relative to TG as the chat history becomes part of the prompt input, which favours the M3 Pro further.
M1 Pro (200 GB/s, 14 GPU cores): 23.35 W peak
M3 Pro (150 GB/s, 18 GPU cores): 21.19 W peak
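One way to observe GPU power on Apple Silicon while llama-bench runs is macOS's built-in powermetrics; this is an assumption about tooling, not necessarily how the peak figures above were captured:

```bash
# Sample Apple Silicon GPU power once per second (requires sudo); run alongside llama-bench.
sudo powermetrics --samplers gpu_power -i 1000
```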