Comparison of LLM performance on varied hardware

Based on ggerganov/llama.cpp#4167

LLaMA 7B

PP means "prompt processing" (bs = 512), TG means "text-generation" (bs = 1), t/s means "tokens per second"

| Device | BW [GB/s] | GPU Cores | F16 PP [t/s] | F16 TG [t/s] | Q8_0 PP [t/s] | Q8_0 TG [t/s] | Q4_0 PP [t/s] | Q4_0 TG [t/s] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✅ M1 Pro 16GB | 200 | 14 | 262.65 | 12.75 | 235.16 | 21.95 | 232.55 | 35.52 |
| M3 Pro 36GB | 150 | 18 | 357.45 | 9.89 | 344.66 | 17.53 | 341.67 | 30.74 |
| 3070 mobile 8GB* | 448 | 5120 | 14.35 | 0.82 | 56.12 | 39.48 | 1735.10 | 64.22 |

*8GB of VRAM means the F16 tests and the Q8_0 PP test did not fit entirely in GPU memory

Description

This is a collection of short llama.cpp benchmarks on various hardware configurations. It is useful for comparing the performance that llama.cpp achieves across devices.

CPU and Apple Silicon (Metal)

```sh
git checkout 8e672efe
make clean && make -j llama-bench && ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf  \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null
```

CUDA

```sh
git checkout 8e672efe
make clean && LLAMA_CUBLAS=1 make -j llama-bench && ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf  \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null
```

Make sure to run the benchmarks on commit 8e672efe so that results are comparable across devices.
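
The commands above assume F16, Q8_0 and Q4_0 GGUF files already exist under `./models/llama-7b-v2/`. If you are starting from the original Llama 2 7B weights, a sketch along the following lines should produce them; the source-weights directory name and the exact `convert.py`/`quantize` options at this commit are assumptions, so check `--help` before running.

```sh
# Sketch only: produce the three GGUF files used above from the original
# Llama 2 7B weights (the source directory name is an assumption).
python3 convert.py ./models/llama-7b-v2 \
  --outtype f16 --outfile ./models/llama-7b-v2/ggml-model-f16.gguf

# Build the quantize tool if needed, then derive the Q8_0 and Q4_0 variants.
make -j quantize
./quantize ./models/llama-7b-v2/ggml-model-f16.gguf ./models/llama-7b-v2/ggml-model-q8_0.gguf q8_0
./quantize ./models/llama-7b-v2/ggml-model-f16.gguf ./models/llama-7b-v2/ggml-model-q4_0.gguf q4_0
```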

Tests Used

| model | size | params | test |
| --- | --- | --- | --- |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | pp 512 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | tg 128 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | pp 512 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | tg 128 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | pp 512 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | tg 128 |
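
Each row above is one llama-bench test: `pp 512` comes from the `-p 512` setting (process a 512-token prompt in a single batch) and `tg 128` from `-n 128` (generate 128 tokens one at a time). To repeat the sweep for just one quantization rather than all three, pass a single `-m` flag, e.g.:

```sh
# Same flags as in the commands above, but benchmarking only the Q4_0 model
./llama-bench -m ./models/llama-7b-v2/ggml-model-q4_0.gguf -p 512 -n 128 -ngl 99
```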
andrewginns commented Jan 12, 2024

Actual performance in use is a mix of PP and TG processing. Comparing the M1 Pro and M3 Pro machines in the table above, it can be seen that the M1 Pro performs better in TG thanks to its higher memory bandwidth (200 GB/s vs 150 GB/s), while the inverse is true in PP, where the M3 Pro benefits from its higher GPU core count and newer architecture.

In actual use with a 7B-parameter OpenChat-3.5 model in LM Studio, the M3 Pro performs ~30% better than the M1 Pro on a single question and response, despite this being the worst-case balance between PP and TG for the M3 Pro. In a multi-turn chat, PP becomes larger relative to TG as the chat history becomes part of the prompt input.
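
A rough way to combine the PP and TG columns from the table above into one number is to estimate the time for a single exchange as `n_prompt / PP + n_gen / TG`. The sketch below uses the Q4_0 figures and treats the benchmark's 512-token prompt / 128-token response as a stand-in for one question and answer; the 4096-token case is an assumed illustration of a long multi-turn history, not a measurement.

```sh
# End-to-end latency estimate for one prompt + response:
#   t ≈ n_prompt / PP_rate + n_gen / TG_rate   (seconds)
estimate() { awk -v p="$1" -v pp="$2" -v n="$3" -v tg="$4" \
  'BEGIN { printf "%.1f s\n", p/pp + n/tg }'; }

estimate 512 232.55 128 35.52    # M1 Pro, Q4_0, single question -> ~5.8 s
estimate 512 341.67 128 30.74    # M3 Pro, Q4_0, single question -> ~5.7 s

# With a long accumulated chat history the PP term dominates, so the
# M3 Pro's PP advantage outweighs its lower TG rate:
estimate 4096 232.55 128 35.52   # M1 Pro -> ~21.2 s
estimate 4096 341.67 128 30.74   # M3 Pro -> ~16.2 s
```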

M1 Pro 200GB/s 14 GPU Cores - 23.35W peak
[screenshot of the M1 Pro run]

M3 Pro 150GB/s 18 GPU Cores - 21.19W peak
[screenshot of the M3 Pro run]
