@mhitza
Last active August 12, 2025 00:42
RTX 4000 SFF Ada Generation (20 GB VRAM): performance comparison with the Framework Desktop

See geerlingguy/ai-benchmarks#21 (comment)
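
All RTX numbers below come from llama-bench out of a CUDA build of llama.cpp (build 9515c613 (6097), per the logs). In the commands, -p sets the prompt-processing sizes, -n the generation length, -pg a combined prompt-processing + generation test, -ngl how many layers to offload to the GPU, and -r the number of repetitions. As a rough reproduction sketch (the clone URL and paths are assumptions, not taken from the runs), a CUDA-enabled build looks like:

$ git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp/source
$ cd ~/llama.cpp/source
$ cmake -B build -DGGML_CUDA=ON
$ cmake --build build --config Release -j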

Llama 3.2 3B

Command and output:
$ wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
  
$ ~/llama.cpp/source/build/bin/llama-bench -m ~/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 3B Q4_K - Medium         |   1.87 GiB |     3.21 B | CUDA       |  99 |           pp512 |      5447.12 ± 21.94 |
| llama 3B Q4_K - Medium         |   1.87 GiB |     3.21 B | CUDA       |  99 |          pp4096 |       3451.02 ± 6.83 |
| llama 3B Q4_K - Medium         |   1.87 GiB |     3.21 B | CUDA       |  99 |           tg128 |        103.92 ± 0.34 |
| llama 3B Q4_K - Medium         |   1.87 GiB |     3.21 B | CUDA       |  99 |    pp4096+tg128 |       1520.61 ± 0.97 |

build: 9515c613 (6097)
| test         | RTX 4000 SFF Ada t/s | Framework Desktop t/s |
| ------------ | -------------------: | --------------------: |
| pp512        |      5447.12 ± 21.94 |       1581.18 ± 10.80 |
| pp4096       |       3451.02 ± 6.83 |        1059.94 ± 1.85 |
| tg128        |        103.92 ± 0.34 |          88.14 ± 1.28 |
| pp4096+tg128 |       1520.61 ± 0.97 |         652.67 ± 1.39 |
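
Relative to the Framework Desktop, that is roughly a 3.4x lead on pp512, 3.3x on pp4096, 1.2x on tg128, and 2.3x on the mixed pp4096+tg128 run. A throwaway shell helper (my own, not part of llama-bench) for turning two mean t/s values into a ratio:

$ speedup() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2fx\n", a / b }'; }
$ speedup 5447.12 1581.18
3.44x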

Qwen2.5 14B

Command and output:
$ wget https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf
  
$ ~/llama.cpp/source/build/bin/llama-bench -m ~/models/Qwen2.5-14B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | CUDA       |  99 |           pp512 |      1252.52 ± 21.25 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | CUDA       |  99 |          pp4096 |        919.03 ± 0.74 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | CUDA       |  99 |           tg128 |         27.23 ± 0.02 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | CUDA       |  99 |    pp4096+tg128 |        417.29 ± 1.54 |

build: 9515c613 (6097)
| test         | RTX 4000 SFF Ada t/s | Framework Desktop t/s |
| ------------ | -------------------: | --------------------: |
| pp512        |      1252.52 ± 21.25 |         321.48 ± 0.61 |
| pp4096       |        919.03 ± 0.74 |         266.84 ± 0.17 |
| tg128        |         27.23 ± 0.02 |          22.97 ± 0.16 |
| pp4096+tg128 |        417.29 ± 1.54 |         184.66 ± 0.36 |
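
By the same arithmetic, the RTX lead here is about 3.9x on pp512 and 3.4x on pp4096, but only about 1.2x on tg128, so the gap is mostly in prompt processing:

$ speedup 1252.52 321.48
3.90x
$ speedup 27.23 22.97
1.19x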

gpt-oss 20B

Command and output:
$ wget https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-F16.gguf

$ ~/llama.cpp/source/build/bin/llama-bench -m ~/models/gpt-oss-20b-F16.gguf --threads 32 -n 128 -p 512 -pg 512,128 -ngl 125 -r 2
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gpt-oss ?B F16                 |  12.83 GiB |    20.91 B | CUDA       | 125 |      32 |           pp512 |      2006.79 ± 11.89 |
| gpt-oss ?B F16                 |  12.83 GiB |    20.91 B | CUDA       | 125 |      32 |           tg128 |         57.69 ± 0.06 |
| gpt-oss ?B F16                 |  12.83 GiB |    20.91 B | CUDA       | 125 |      32 |     pp512+tg128 |        251.00 ± 0.29 |

build: 9515c613 (6097)
| test        | RTX 4000 SFF Ada t/s | Framework Desktop t/s |
| ----------- | -------------------: | --------------------: |
| pp512       |      2006.79 ± 11.89 |         564.23 ± 0.46 |
| tg128       |         57.69 ± 0.06 |          45.01 ± 0.05 |
| pp512+tg128 |        251.00 ± 0.29 |         167.77 ± 0.09 |
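
Same shape of result for gpt-oss 20B: roughly 3.6x on pp512 and 1.3x on tg128, with the combined pp512+tg128 run at about 1.5x:

$ speedup 2006.79 564.23
3.56x
$ speedup 57.69 45.01
1.28x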
