See geerlingguy/ai-benchmarks#21 (comment)
**Command and output:**
$ wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
  
$ ~/llama.cpp/source/build/bin/llama-bench -m ~/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 3B Q4_K - Medium         |   1.87 GiB |     3.21 B | CUDA       |  99 |           pp512 |      5447.12 ± 21.94 |
| llama 3B Q4_K - Medium         |   1.87 GiB |     3.21 B | CUDA       |  99 |          pp4096 |       3451.02 ± 6.83 |
| llama 3B Q4_K - Medium         |   1.87 GiB |     3.21 B | CUDA       |  99 |           tg128 |        103.92 ± 0.34 |
| llama 3B Q4_K - Medium         |   1.87 GiB |     3.21 B | CUDA       |  99 |    pp4096+tg128 |       1520.61 ± 0.97 |
build: 9515c613 (6097)

Compared with the Framework Desktop on the same model:

| test | RTX 4000 SFF Ada t/s | Framework Desktop t/s |
|---|---|---|
| pp512 | 5447.12 ± 21.94 | 1581.18 ± 10.80 | 
| pp4096 | 3451.02 ± 6.83 | 1059.94 ± 1.85 | 
| tg128 | 103.92 ± 0.34 | 88.14 ± 1.28 | 
| pp4096+tg128 | 1520.61 ± 0.97 | 652.67 ± 1.39 | 
**Command and output:**
$ wget https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf
  
$ ~/llama.cpp/source/build/bin/llama-bench -m ~/models/Qwen2.5-14B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | CUDA       |  99 |           pp512 |      1252.52 ± 21.25 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | CUDA       |  99 |          pp4096 |        919.03 ± 0.74 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | CUDA       |  99 |           tg128 |         27.23 ± 0.02 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | CUDA       |  99 |    pp4096+tg128 |        417.29 ± 1.54 |
build: 9515c613 (6097)

Compared with the Framework Desktop on the same model:

| test | RTX 4000 SFF Ada t/s | Framework Desktop t/s |
|---|---|---|
| pp512 | 1252.52 ± 21.25 | 321.48 ± 0.61 | 
| pp4096 | 919.03 ± 0.74 | 266.84 ± 0.17 | 
| tg128 | 27.23 ± 0.02 | 22.97 ± 0.16 | 
| pp4096+tg128 | 417.29 ± 1.54 | 184.66 ± 0.36 | 
**Command and output:**
$ wget https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-F16.gguf
$ ~/llama.cpp/source/build/bin/llama-bench -m ~/models/gpt-oss-20b-F16.gguf --threads 32 -n 128 -p 512 -pg 512,128 -ngl 125 -r 2
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gpt-oss ?B F16                 |  12.83 GiB |    20.91 B | CUDA       | 125 |      32 |           pp512 |      2006.79 ± 11.89 |
| gpt-oss ?B F16                 |  12.83 GiB |    20.91 B | CUDA       | 125 |      32 |           tg128 |         57.69 ± 0.06 |
| gpt-oss ?B F16                 |  12.83 GiB |    20.91 B | CUDA       | 125 |      32 |     pp512+tg128 |        251.00 ± 0.29 |
build: 9515c613 (6097)

Compared with the Framework Desktop on the same model:

| test | RTX 4000 SFF Ada t/s | Framework Desktop t/s |
|---|---|---|
| pp512 | 2006.79 ± 11.89 | 564.23 ± 0.46 | 
| tg128 | 57.69 ± 0.06 | 45.01 ± 0.05 | 
| pp512+tg128 | 251.00 ± 0.29 | 167.77 ± 0.09 |
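For a quick read on the gap, the per-test speedups can be derived from the mean t/s values in the tables above. A minimal sketch (the ± margins are ignored, and the model labels are shorthand, not llama-bench output):

```python
# Speedup of the RTX 4000 SFF Ada over the Framework Desktop,
# using the mean t/s values from the comparison tables above.
# Each entry is (rtx_4000_tps, framework_desktop_tps).
results = {
    "Llama-3.2-3B Q4_K_M": {
        "pp512": (5447.12, 1581.18),
        "pp4096": (3451.02, 1059.94),
        "tg128": (103.92, 88.14),
        "pp4096+tg128": (1520.61, 652.67),
    },
    "Qwen2.5-14B Q4_K_M": {
        "pp512": (1252.52, 321.48),
        "pp4096": (919.03, 266.84),
        "tg128": (27.23, 22.97),
        "pp4096+tg128": (417.29, 184.66),
    },
    "gpt-oss-20b F16": {
        "pp512": (2006.79, 564.23),
        "tg128": (57.69, 45.01),
        "pp512+tg128": (251.00, 167.77),
    },
}

for model, tests in results.items():
    for test, (rtx, fw) in tests.items():
        print(f"{model:20s} {test:14s} {rtx / fw:.2f}x")
```

The pattern that falls out: prompt processing (pp) favors the discrete GPU by roughly 3-4x, while token generation (tg128) is much closer, since it is bound by memory bandwidth rather than compute.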