Context: testing flash attention from ggml-org/llama.cpp#5021
```sh
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git fetch origin
git checkout gg/flash-attn
git pull
make clean && make -j LLAMA_CURL=1 main llama-bench
./llama-bench -m models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -fa 0,1 -p 512,1024 -n 128,256,512,1024
```

M3 Pro, 36 GB, 18-core GPU:
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 0 | pp 512 | 308.84 ± 1.40 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 0 | tg 128 | 24.88 ± 0.14 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 0 | tg 256 | 24.88 ± 0.16 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 1 | pp 512 | 324.32 ± 1.02 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 1 | tg 128 | 25.75 ± 0.12 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 1 | tg 256 | 25.50 ± 0.33 |
M1 Ultra, 128 GB, 64-core GPU:
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 0 | pp 512 | 824.02 ± 13.65 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 0 | pp 1024 | 837.59 ± 5.64 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 0 | tg 128 | 66.80 ± 0.10 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 0 | tg 256 | 66.99 ± 0.12 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 0 | tg 512 | 66.48 ± 0.03 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 0 | tg 1024 | 65.23 ± 0.07 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 1 | pp 512 | 905.47 ± 0.70 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 1 | pp 1024 | 895.86 ± 0.62 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 1 | tg 128 | 70.67 ± 0.08 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 1 | tg 256 | 70.63 ± 0.03 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 1 | tg 512 | 70.31 ± 0.04 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Metal | 99 | 1 | tg 1024 | 69.68 ± 0.06 |
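The gains are easiest to see as percentages; a quick sketch, with the fa=0 / fa=1 means copied from the rows above (`pct_gain` is just an illustrative helper, not part of llama-bench):

```python
# Percent gain of fa=1 over fa=0, using mean t/s values from the tables above.
def pct_gain(fa0, fa1):
    return 100.0 * (fa1 - fa0) / fa0

for label, fa0, fa1 in [
    ("M3 Pro   pp 512", 308.84, 324.32),
    ("M3 Pro   tg 128", 24.88, 25.75),
    ("M1 Ultra pp 512", 824.02, 905.47),
    ("M1 Ultra tg 128", 66.80, 70.67),
]:
    print(f"{label}: {pct_gain(fa0, fa1):+.1f}%")
```

That works out to roughly +5% pp / +3.5% tg on the M3 Pro and +10% pp / +6% tg on the M1 Ultra at these sizes.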
End-to-end timing with hyperfine, comparing each case with and without `-fa` (the trailing comma in `-L flag -fa,` makes the second value empty):

```sh
hyperfine --warmup 1 --runs 10 \
  -L flag -fa, \
  -L model models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  -L n_prompt 10,100,1000 \
  -L n_predict 10,100,1000 \
  --setup 'python -c "print(list(dict(abc=1).keys())[0] * {n_prompt})" > prompt-abc-{n_prompt}.txt' \
  './main {flag} -f prompt-abc-{n_prompt}.txt -n {n_predict} -m {model} --seed 123 --top_p 0.0 --top_k 1 -c 2500' \
  --export-json results.json
python analyze_results.py results.json
```

M1 Ultra, 128 GB, 64-core GPU:
Percent change in mean run time with `-fa` (negative = faster with flash attention):
| n_prompt \ n_predict | 10 | 100 | 1000 |
|---|---|---|---|
| 10 | -0.68% | -1.43% | -5.33% |
| 100 | -0.70% | -3.88% | -5.57% |
| 1000 | -1.41% | -4.81% | -9.26% |
M3 Pro, 36 GB, 18-core GPU:
| n_prompt \ n_predict | 10 | 100 | 1000 |
|---|---|---|---|
| 10 | -2.26% | -2.39% | -9.45% |
| 100 | -0.85% | -2.41% | -7.69% |
| 1000 | -5.23% | -8.13% | -9.61% |
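The `analyze_results.py` script itself is not shown in the post; below is a minimal sketch of the comparison it presumably performs, assuming hyperfine's `--export-json` schema (each entry in `"results"` has a `"mean"` in seconds and a `"parameters"` dict holding the `-L` values). The name `fa_speedup` is illustrative, not from the post.

```python
# Sketch: group hyperfine results by (n_prompt, n_predict) and compute the
# percent change in mean wall time when -fa is enabled (negative = faster).
from collections import defaultdict

def fa_speedup(results):
    """Percent change in mean time with -fa, keyed by (n_prompt, n_predict)."""
    by_key = defaultdict(dict)
    for r in results:
        p = r["parameters"]
        by_key[(int(p["n_prompt"]), int(p["n_predict"]))][p["flag"]] = r["mean"]
    return {k: 100.0 * (m["-fa"] - m[""]) / m[""] for k, m in by_key.items()}

# In the real script: results = json.load(open("results.json"))["results"]
# Minimal inline stand-in for results.json:
sample = [
    {"parameters": {"flag": "",    "n_prompt": "10", "n_predict": "10"}, "mean": 1.00},
    {"parameters": {"flag": "-fa", "n_prompt": "10", "n_predict": "10"}, "mean": 0.95},
]
print(fa_speedup(sample))
```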