@mhitza
Last active August 12, 2025 00:42
RTX 4000 SFF Ada Generation (20 GB VRAM): performance comparison with the Framework Desktop

See geerlingguy/ai-benchmarks#21 (comment)
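
All RTX numbers below come from llama-bench out of a CUDA build of llama.cpp (build 9515c613 (6097), per the logs). In the commands, -p sets the prompt-processing sizes, -n the generation length, -pg a combined prompt-processing + generation test, -ngl how many layers to offload to the GPU, and -r the number of repetitions. As a rough reproduction sketch (the clone URL and paths are assumptions, not taken from the runs), a CUDA-enabled build looks like:

$ git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp/source
$ cd ~/llama.cpp/source
$ cmake -B build -DGGML_CUDA=ON
$ cmake --build build --config Release -j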

Llama 3.2 3B

Command and output:
$ wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
  
$ ~/llama.cpp/source/build/bin/llama-bench -m ~/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 3B Q4_K - Medium         |   1.87 GiB |     3.21 B | CUDA       |  99 |           pp512 |      5447.12 ± 21.94 |
| llama 3B Q4_K - Medium         |   1.87 GiB |     3.21 B | CUDA       |  99 |          pp4096 |       3451.02 ± 6.83 |
| llama 3B Q4_K - Medium         |   1.87 GiB |     3.21 B | CUDA       |  99 |           tg128 |        103.92 ± 0.34 |
| llama 3B Q4_K - Medium         |   1.87 GiB |     3.21 B | CUDA       |  99 |    pp4096+tg128 |       1520.61 ± 0.97 |

build: 9515c613 (6097)
| test         | RTX 4000 SFF Ada t/s | Framework Desktop t/s |
| ------------ | -------------------: | --------------------: |
| pp512        |      5447.12 ± 21.94 |       1581.18 ± 10.80 |
| pp4096       |       3451.02 ± 6.83 |        1059.94 ± 1.85 |
| tg128        |        103.92 ± 0.34 |          88.14 ± 1.28 |
| pp4096+tg128 |       1520.61 ± 0.97 |         652.67 ± 1.39 |
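
Relative to the Framework Desktop, that is roughly a 3.4x lead on pp512, 3.3x on pp4096, 1.2x on tg128, and 2.3x on the mixed pp4096+tg128 run. A throwaway shell helper (my own, not part of llama-bench) for turning two mean t/s values into a ratio:

$ speedup() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2fx\n", a / b }'; }
$ speedup 5447.12 1581.18
3.44x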

Qwen2.5 14B

Command and output:
$ wget https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf
  
$ ~/llama.cpp/source/build/bin/llama-bench -m ~/models/Qwen2.5-14B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | CUDA       |  99 |           pp512 |      1252.52 ± 21.25 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | CUDA       |  99 |          pp4096 |        919.03 ± 0.74 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | CUDA       |  99 |           tg128 |         27.23 ± 0.02 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | CUDA       |  99 |    pp4096+tg128 |        417.29 ± 1.54 |

build: 9515c613 (6097)
| test         | RTX 4000 SFF Ada t/s | Framework Desktop t/s |
| ------------ | -------------------: | --------------------: |
| pp512        |      1252.52 ± 21.25 |         321.48 ± 0.61 |
| pp4096       |        919.03 ± 0.74 |         266.84 ± 0.17 |
| tg128        |         27.23 ± 0.02 |          22.97 ± 0.16 |
| pp4096+tg128 |        417.29 ± 1.54 |         184.66 ± 0.36 |
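
By the same arithmetic, the RTX lead here is about 3.9x on pp512 and 3.4x on pp4096, but only about 1.2x on tg128, so the gap is mostly in prompt processing:

$ speedup 1252.52 321.48
3.90x
$ speedup 27.23 22.97
1.19x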

gpt-oss 20B

Command and output:
$ wget https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-F16.gguf

$ ~/llama.cpp/source/build/bin/llama-bench -m ~/models/gpt-oss-20b-F16.gguf --threads 32 -n 128 -p 512 -pg 512,128 -ngl 125 -r 2
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gpt-oss ?B F16                 |  12.83 GiB |    20.91 B | CUDA       | 125 |      32 |           pp512 |      2006.79 ± 11.89 |
| gpt-oss ?B F16                 |  12.83 GiB |    20.91 B | CUDA       | 125 |      32 |           tg128 |         57.69 ± 0.06 |
| gpt-oss ?B F16                 |  12.83 GiB |    20.91 B | CUDA       | 125 |      32 |     pp512+tg128 |        251.00 ± 0.29 |

build: 9515c613 (6097)
| test        | RTX 4000 SFF Ada t/s | Framework Desktop t/s |
| ----------- | -------------------: | --------------------: |
| pp512       |      2006.79 ± 11.89 |         564.23 ± 0.46 |
| tg128       |         57.69 ± 0.06 |          45.01 ± 0.05 |
| pp512+tg128 |        251.00 ± 0.29 |         167.77 ± 0.09 |
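
Same shape of result for gpt-oss 20B: roughly 3.6x on pp512 and 1.3x on tg128, with the combined pp512+tg128 run at about 1.5x:

$ speedup 2006.79 564.23
3.56x
$ speedup 57.69 45.01
1.28x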
