Based on ggerganov/llama.cpp#4167
PP means "prompt processing" (bs = 512), TG means "text-generation" (bs = 1), t/s means "tokens per second"
Device | BW [GB/s] | GPU Cores | F16 PP [t/s] | F16 TG [t/s] | Q8_0 PP [t/s] | Q8_0 TG [t/s] | Q4_0 PP [t/s] | Q4_0 TG [t/s] |
---|---|---|---|---|---|---|---|---|
✅ M1 Pro 16GB | 200 | 14 | 262.65 | 12.75 | 235.16 | 21.95 | 232.55 | 35.52 |
✅ M3 Pro 36GB | 150 | 18 | 357.45 | 9.89 | 344.66 | 17.53 | 341.67 | 30.74 |
✅ 3070 mobile 8GB* | 448 | 5120 | 14.35 | 0.82 | 56.12 | 39.48 | 1735.10 | 64.22 |
*With only 8GB of VRAM, the F16 model and the Q8_0 PP test did not fit in GPU memory, hence the low figures in those columns
This is a collection of short llama.cpp benchmarks on various hardware configurations. It can be useful for comparing the performance llama.cpp achieves across devices.
CPU and Apple Silicon (Metal)
```bash
git checkout 8e672efe
make clean && make -j llama-bench && ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null
```
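The commands above assume the F16 and quantized GGUF files already exist under ./models/llama-7b-v2/. If they don't, here is a rough sketch of producing them with the conversion and quantization tools as they were named around this commit (convert.py and quantize); the paths are assumptions, so check the repo at the pinned commit:

```bash
# Convert the original Llama-2-7B weights to an F16 GGUF, then derive the quantized variants.
python3 convert.py ./models/llama-7b-v2/ --outtype f16 \
  --outfile ./models/llama-7b-v2/ggml-model-f16.gguf
make -j quantize
./quantize ./models/llama-7b-v2/ggml-model-f16.gguf ./models/llama-7b-v2/ggml-model-q8_0.gguf q8_0
./quantize ./models/llama-7b-v2/ggml-model-f16.gguf ./models/llama-7b-v2/ggml-model-q4_0.gguf q4_0
```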
CUDA
```bash
git checkout 8e672efe
make clean && LLAMA_CUBLAS=1 make -j llama-bench && ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null
```
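`-ngl 99` offloads all layers to the GPU, which is how the table above was collected. On cards where a model does not fit (see the 8GB 3070 mobile footnote), llama-bench can offload only part of the model; the layer count below is purely illustrative, needs tuning per card, and produces numbers that are not directly comparable to the fully offloaded results:

```bash
# Offload only some layers when VRAM is limited (20 is an illustrative value, not a recommendation).
./llama-bench -m ./models/llama-7b-v2/ggml-model-f16.gguf \
  -p 512 -n 128 -ngl 20
```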
Make sure to run the benchmark on commit 8e672efe so the results stay comparable. The commands above run the following six tests:
model | size | params | test |
---|---|---|---|
llama 7B mostly F16 | 12.55 GiB | 6.74 B | pp 512 |
llama 7B mostly F16 | 12.55 GiB | 6.74 B | tg 128 |
llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | pp 512 |
llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | tg 128 |
llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | pp 512 |
llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | tg 128 |
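To collect results programmatically instead of copying the markdown table by hand, llama-bench can emit other output formats; the `-o` flag and its `json` value shown here are an assumption about the tool's options at this commit:

```bash
# Same benchmark, but write machine-readable JSON instead of the default markdown table.
./llama-bench -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 -o json > results.json
```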
Actual performance in use is a mix of PP and TG processing. Comparing the M1 Pro and M3 Pro machines in the table above, it can be seen that the M1 Pro performs better in TG thanks to its higher memory bandwidth (200 GB/s vs 150 GB/s), while the inverse is true in PP, where the M3 Pro benefits from more GPU cores and a newer architecture.
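As a rough sketch of how the two phases combine, the snippet below estimates end-to-end time for a single 512-token prompt and 128 generated tokens from the Q4_0 rates in the table; real applications differ in model, quantization, prompt length, and overhead, so this only illustrates the trade-off:

```bash
# t_total = prompt_tokens/PP + generated_tokens/TG, using the Q4_0 rates from the table above.
for dev in "M1_Pro 232.55 35.52" "M3_Pro 341.67 30.74"; do
  set -- $dev
  awk -v name="$1" -v pp="$2" -v tg="$3" \
    'BEGIN { printf "%s: %.1f s\n", name, 512 / pp + 128 / tg }'
done
```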
In actual use with a 7B-parameter OpenChat-3.5 model in LM Studio, the M3 Pro performs ~30% better than the M1 Pro on a single question and response, even though a single exchange is the worst-case PP/TG balance for the M3 Pro. In a multi-turn chat, PP grows relative to TG as the chat history becomes part of the prompt input, which favours the M3 Pro further.
M1 Pro (200 GB/s, 14 GPU cores): 23.35 W peak
M3 Pro (150 GB/s, 18 GPU cores): 21.19 W peak
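One way to observe GPU power on Apple Silicon while llama-bench runs is macOS's built-in powermetrics; this is an assumption about tooling, not necessarily how the peak figures above were captured:

```bash
# Sample Apple Silicon GPU power once per second (requires sudo); run alongside llama-bench.
sudo powermetrics --samplers gpu_power -i 1000
```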