@ubergarm
Last active August 23, 2025 04:18
Qwen3 235B and 30B MoE Quant Benchmarking Roundup

The Great Quant Wars of 2025

"All things leave behind them the Obscurity... and go forward to embrace the Brightness..." — Dao De Jing #42

tl;dr;

  • Q: Who provides the best GGUFs now?
  • A: They're all pretty good.

Skip down if you just want graphs and numbers comparing various Qwen3-30B-A3B GGUF quants.

Background

It's been well over a year since TheBloke uploaded his last quant to huggingface. The LLM landscape has changed markedly since then, with many new models released monthly, new inference engines targeting specific hardware optimizations, and ongoing evolution of quantization algorithms. Our community continues to grow and diversify at an amazing rate.

Fortunately, many folks and organizations have kindly stepped up to keep the quants cooking so we can all find an LLM sized just right to fit on our home rigs. Among them, bartowski and unsloth (Daniel and Michael's start-up company) have become the new "household names" for providing a variety of GGUF quantizations for popular model releases and even all those wild creative fine-tunes! (There are many more, including team mradermacher; too many to list everyone, sorry!)

Until recently, most GGUF quant recipes were "static", meaning that all the tensors and layers were quantized the same way (e.g. Q8_0) or with consistent patterns defined in llama.cpp's code. So all quants of a given size were mostly the same regardless of who cooked and uploaded them to huggingface.

Things began to change over a year ago with major advancements like importance matrix quantizations by ikawrakow in llama.cpp PR#4861 as well as new quant types (like the perennial favorite IQ4_XS) which have become the mainstay for users of llama.cpp, ollama, koboldcpp, lmstudio, etc. The entire GGUF ecosystem owes a big thanks not just to ggerganov but also to ikawrakow (as well as the many more contributors).

Very recently unsloth introduced a few changes to their quantization methodology that combine different imatrix calibration texts and context lengths along with making some tensors/layers different sizes than the regular llama.cpp code (they had a public fork with their branch, but have to update and re-push due to upstream changes). They have named this change in standard methodology Unsloth Dynamic 2.0 GGUFs as part of their start-up company's marketing strategy.

Around the same time bartowski has been experimenting with different imatrix calibration texts and opened a PR to llama.cpp modifying the default tensor/layer quantization recipes. I myself began experimenting with custom "dynamic" quantization recipes using ikawrakow's latest SOTA quants like iq4_k which to-date only work on his ik_llama.cpp fork.

While this is great news for all GGUF enjoyers, the friendly competition and additional options have led to some confusion and I dare say some "tribalism". (If part of your identity as a person depends on downloading quants from only one source, I suggest you google: "Nan Yar?").

So how can you, dear reader, decide which is the best quant of a given model for you to download? unsloth already did a great blog post discussing their own benchmarks and metrics. Open a tab to check out u/AaronFeng47's many other benchmarks. And finally, this post contains even more metrics and benchmarks. The best answer I have is "Nullius in verba" (Latin for "take nobody's word for it"), and that includes my word!

Unfortunately, this means there is no one-size-fits-all rule, "X" is not always better than "Y", and if you want to min-max-optimize your LLM for your specific use case on your specific hardware you probably will have to experiment and think critically. If you don't care too much, then pick any of the biggest quants that fit on your rig for the desired context length and you'll be fine because: they're all pretty good.

And with that, let's dive into the Qwen3-30B-A3B benchmarks below!

Quick Thanks

Shout out to Wendell and the Level1Techs crew, the L1T Forums, and the L1T YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make great quants available to the community!!!

Graphs

👈 Qwen3-30B-A3B Benchmark Suite Graphs

Note <think> mode was disabled for these tests to speed up benchmarking.

evalchemy-grouped-by-task

evalchemy-grouped-by-model

👈 Qwen3-30B-A3B Perplexity and KLD Graphs

The BF16 is the baseline for KLD stats. Also note that some quants scored a lower ("better") perplexity than the BF16, which is not typically the case unless there was possibly some QAT going on. As such, the chart is relative to the lowest perplexity score: PPL/min(PPL)-1 plus a small eps for scaling.
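
As a minimal sketch of the scaling described above, the snippet below computes PPL/min(PPL)-1+eps for a few wiki.test.raw values copied from the raw data table in this post. The `eps` value and the function name are assumptions for illustration; the eps actually used for the charts is not stated.

```python
def relative_ppl(ppl_by_model, eps=1e-3):
    """Return PPL/min(PPL) - 1 + eps per model (eps is an assumed value)."""
    lowest = min(ppl_by_model.values())
    return {name: ppl / lowest - 1.0 + eps for name, ppl in ppl_by_model.items()}

# wiki.test.raw perplexities copied from the raw data table in this post
wiki_ppl = {
    "Qwen/Qwen3-30B-A3B-BF16": 9.0703,
    "ubergarm/Qwen3-30B-A3B-Q8_0": 9.0740,
    "ubergarm/Qwen3-30B-A3B-IQ4_KS": 8.9862,
    "bartowski/Qwen3-30B-A3B-IQ2_M": 9.9788,
}

# Print models from lowest to highest relative perplexity
for name, score in sorted(relative_ppl(wiki_ppl).items(), key=lambda kv: kv[1]):
    print(f"{name:35s} {score:.4f}")
```

The model with the lowest perplexity lands exactly at eps, so every bar stays visible on a log-scaled axis.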

Perplexity

wiki.test.raw (lower is "better")

results-30B-ppl-wiki

ubergarm-kld-test-corpus.txt (lower is "better")

results-30B-ppl-ubergarm

KLD Stats

(lower is "better")

results-30B-kld

Δp Stats

(lower is "better")

results-30B-deltap

👈 Qwen3-235B-A22B Perplexity and KLD Graphs

Not as many data points here but just for comparison. Keep in mind the Q8_0 was the baseline for KLD stats given I couldn't easily run the full BF16.

Perplexity

wiki.test.raw (lower is "better")

results-235B-ppl-wiki

ubergarm-kld-test-corpus.txt (lower is "better")

results-235B-ppl-ubergarm

KLD Stats

(lower is "better")

results-235B-kld

Δp Stats

(lower is "better")

results-235B-deltap

👈 Qwen3-30B-A3B Speed llama-sweep-bench Graphs

llama-sweep-bench

llama.cpp

Qwen3-30B-A3B-mainline-gguf-roundup

ik_llama.cpp

NOTE: Keep in mind ik's fork is faster than mainline llama.cpp for many architectures and configurations especially only-CPU, hybrid-CPU+GPU, and DeepSeek MLA cases.

qwen3-30b-ik-sweep

Methodology

👈 Perplexity, KLD, and imatrix Methodology

PPL and KLD testing done with ik_llama.cpp@9ba36270.

Perplexity

I adjust ngl and threads for larger 235B models.

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
    -m "$model" \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1
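
For readers new to the metric: perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows only that definition. The function name and the toy log-probabilities are made up for illustration, and it is not the code llama-perplexity actually runs.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability of the observed tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Toy example: the model gave each of four tokens a probability of 0.25,
# so it is as "perplexed" as a uniform choice among four options.
toy_logprobs = [math.log(0.25)] * 4
print(perplexity(toy_logprobs))
```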

KLD

I adjust ngl and threads for larger 235B models. For 235B I had to use the Q8_0 as the baseline given this rig can't easily run the full 400+GiB BF16.

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
    -m "$model" \
    --kl-divergence-base /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-BF16-ubergarm-kld-test-corpus-base.dat \
    --kl-divergence \
    -f ubergarm-kld-test-corpus.txt \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1
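
To make the KLD and Δp statistics concrete, here is a rough sketch of the per-token quantities, assuming you already have the baseline and quantized logits for one position. It is plain Python written for illustration, not the code llama-perplexity runs, and the function names and toy logits are hypothetical. KLD is the Kullback-Leibler divergence from the baseline distribution to the quant's distribution, and Δp is taken here to be the change in probability assigned to the correct next token.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(base_logits, quant_logits):
    """KL(base || quant) in nats for one token position."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def delta_p(base_logits, quant_logits, correct_token):
    """Change in probability of the correct token, in percentage points."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    return 100.0 * (q[correct_token] - p[correct_token])

# Toy 3-token vocabulary; token 0 is the "correct" next token.
base = [2.0, 1.0, 0.1]
quant = [1.8, 1.1, 0.2]
print(f"KLD = {kl_divergence(base, quant):.6f}")
print(f"delta p = {delta_p(base, quant, 0):.3f}%")
```

The mean, median, and percentile columns in the raw data table are summaries of these per-token values over the whole test corpus.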

imatrix

This is how I make my imatrix using ik_llama.cpp to additionally print out cosine similarity data to inform possible custom quant strategies. I haven't seen how exactly unsloth makes their new recipe.

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-imatrix \
    --verbosity 1 \
    --layer-similarity \
    -m /mnt/raid/models/Qwen/Qwen3-30B-A3B/Qwen3-30B-A3B-BF16-00001-of-00002.gguf \
    -f calibration_data_v5_rc.txt \
    -o /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/imatrix-Qwen3-30B-A3B.dat \
    --ctx-size 512 \
    -ngl 36 \
    --threads 16

======================== sorted layer importances
  0: Layer   0, <cos_sim> = 0.32154
  1: Layer  47, <cos_sim> = 0.38473
  2: Layer   1, <cos_sim> = 0.736987
  3: Layer  28, <cos_sim> = 0.845492
  4: Layer   2, <cos_sim> = 0.847391
  5: Layer  29, <cos_sim> = 0.859291
  6: Layer   7, <cos_sim> = 0.861405
  7: Layer   3, <cos_sim> = 0.878313
  8: Layer   8, <cos_sim> = 0.893971
  9: Layer   6, <cos_sim> = 0.900308
 10: Layer  42, <cos_sim> = 0.911525
 11: Layer   5, <cos_sim> = 0.912156
 12: Layer  17, <cos_sim> = 0.913169
 13: Layer   4, <cos_sim> = 0.914095
 14: Layer  13, <cos_sim> = 0.92175
 15: Layer  46, <cos_sim> = 0.925283
 16: Layer  19, <cos_sim> = 0.926845
 17: Layer  18, <cos_sim> = 0.927019
 18: Layer  45, <cos_sim> = 0.928896
 19: Layer  40, <cos_sim> = 0.934481
 20: Layer  31, <cos_sim> = 0.934585
 21: Layer  14, <cos_sim> = 0.936932
 22: Layer  16, <cos_sim> = 0.940338
 23: Layer  25, <cos_sim> = 0.940477
 24: Layer  10, <cos_sim> = 0.942312
 25: Layer  38, <cos_sim> = 0.943166
 26: Layer   9, <cos_sim> = 0.943843
 27: Layer  11, <cos_sim> = 0.944233
 28: Layer  37, <cos_sim> = 0.944325
 29: Layer  20, <cos_sim> = 0.94612
 30: Layer  22, <cos_sim> = 0.946449
 31: Layer  41, <cos_sim> = 0.946775
 32: Layer  39, <cos_sim> = 0.947228
 33: Layer  44, <cos_sim> = 0.947687
 34: Layer  30, <cos_sim> = 0.947942
 35: Layer  23, <cos_sim> = 0.949102
 36: Layer  12, <cos_sim> = 0.951618
 37: Layer  21, <cos_sim> = 0.951701
 38: Layer  24, <cos_sim> = 0.952261
 39: Layer  43, <cos_sim> = 0.953357
 40: Layer  27, <cos_sim> = 0.953528
 41: Layer  26, <cos_sim> = 0.95575
 42: Layer  32, <cos_sim> = 0.956024
 43: Layer  15, <cos_sim> = 0.956915
 44: Layer  35, <cos_sim> = 0.959861
 45: Layer  36, <cos_sim> = 0.960591
 46: Layer  34, <cos_sim> = 0.961539
 47: Layer  33, <cos_sim> = 0.968161

======================== sorted attention importances
  0: Layer   0, <cos_sim> = 0.353019
  1: Layer  45, <cos_sim> = 0.638476
  2: Layer   1, <cos_sim> = 0.674894
  3: Layer  29, <cos_sim> = 0.686547
  4: Layer  17, <cos_sim> = 0.708034
  5: Layer   3, <cos_sim> = 0.718456
  6: Layer  21, <cos_sim> = 0.72082
  7: Layer  44, <cos_sim> = 0.732611
  8: Layer  22, <cos_sim> = 0.738435
  9: Layer  18, <cos_sim> = 0.742531
 10: Layer  42, <cos_sim> = 0.745018
 11: Layer   8, <cos_sim> = 0.746792
 12: Layer  24, <cos_sim> = 0.750162
 13: Layer  23, <cos_sim> = 0.750384
 14: Layer   9, <cos_sim> = 0.754324
 15: Layer  46, <cos_sim> = 0.758528
 16: Layer  33, <cos_sim> = 0.76019
 17: Layer  47, <cos_sim> = 0.760449
 18: Layer  27, <cos_sim> = 0.760966
 19: Layer   4, <cos_sim> = 0.761774
 20: Layer   2, <cos_sim> = 0.762337
 21: Layer   6, <cos_sim> = 0.763453
 22: Layer  34, <cos_sim> = 0.765167
 23: Layer  30, <cos_sim> = 0.768629
 24: Layer  25, <cos_sim> = 0.768819
 25: Layer  26, <cos_sim> = 0.769841
 26: Layer  20, <cos_sim> = 0.77039
 27: Layer  10, <cos_sim> = 0.772251
 28: Layer  41, <cos_sim> = 0.773975
 29: Layer  35, <cos_sim> = 0.774599
 30: Layer  43, <cos_sim> = 0.775401
 31: Layer  11, <cos_sim> = 0.776914
 32: Layer  28, <cos_sim> = 0.778543
 33: Layer  19, <cos_sim> = 0.781975
 34: Layer  36, <cos_sim> = 0.78645
 35: Layer  32, <cos_sim> = 0.790626
 36: Layer  15, <cos_sim> = 0.795375
 37: Layer  12, <cos_sim> = 0.797279
 38: Layer  16, <cos_sim> = 0.797483
 39: Layer  14, <cos_sim> = 0.797921
 40: Layer   7, <cos_sim> = 0.80098
 41: Layer   5, <cos_sim> = 0.802361
 42: Layer  37, <cos_sim> = 0.805299
 43: Layer  13, <cos_sim> = 0.806054
 44: Layer  31, <cos_sim> = 0.807454
 45: Layer  38, <cos_sim> = 0.808983
 46: Layer  40, <cos_sim> = 0.813216
 47: Layer  39, <cos_sim> = 0.816557

======================== sorted ffn importances
  0: Layer  47, <cos_sim> = 0.613059
  1: Layer  44, <cos_sim> = 0.630819
  2: Layer   0, <cos_sim> = 0.653987
  3: Layer  28, <cos_sim> = 0.686159
  4: Layer  16, <cos_sim> = 0.693473
  5: Layer   7, <cos_sim> = 0.694612
  6: Layer  43, <cos_sim> = 0.710648
  7: Layer  20, <cos_sim> = 0.71511
  8: Layer  21, <cos_sim> = 0.715567
  9: Layer  46, <cos_sim> = 0.71785
 10: Layer  45, <cos_sim> = 0.718143
 11: Layer   1, <cos_sim> = 0.726385
 12: Layer   3, <cos_sim> = 0.735632
 13: Layer   8, <cos_sim> = 0.736597
 14: Layer   2, <cos_sim> = 0.737616
 15: Layer  22, <cos_sim> = 0.739272
 16: Layer  33, <cos_sim> = 0.739951
 17: Layer  19, <cos_sim> = 0.740003
 18: Layer   9, <cos_sim> = 0.742748
 19: Layer  32, <cos_sim> = 0.747542
 20: Layer  23, <cos_sim> = 0.749229
 21: Layer  24, <cos_sim> = 0.755807
 22: Layer  41, <cos_sim> = 0.75653
 23: Layer  10, <cos_sim> = 0.757337
 24: Layer  34, <cos_sim> = 0.758472
 25: Layer  31, <cos_sim> = 0.759585
 26: Layer  40, <cos_sim> = 0.763913
 27: Layer  17, <cos_sim> = 0.768032
 28: Layer  26, <cos_sim> = 0.768999
 29: Layer  18, <cos_sim> = 0.771782
 30: Layer   6, <cos_sim> = 0.776553
 31: Layer   4, <cos_sim> = 0.777394
 32: Layer  27, <cos_sim> = 0.777827
 33: Layer  35, <cos_sim> = 0.778635
 34: Layer  42, <cos_sim> = 0.779552
 35: Layer  36, <cos_sim> = 0.779963
 36: Layer  25, <cos_sim> = 0.785371
 37: Layer  12, <cos_sim> = 0.785794
 38: Layer  29, <cos_sim> = 0.787757
 39: Layer   5, <cos_sim> = 0.79259
 40: Layer  11, <cos_sim> = 0.793774
 41: Layer  15, <cos_sim> = 0.796992
 42: Layer  30, <cos_sim> = 0.797935
 43: Layer  14, <cos_sim> = 0.7999
 44: Layer  39, <cos_sim> = 0.806665
 45: Layer  38, <cos_sim> = 0.813561
 46: Layer  13, <cos_sim> = 0.820982
 47: Layer  37, <cos_sim> = 0.830343
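
One possible way to use this output when designing a custom recipe is to pull out the layers with the lowest cosine similarity, since those are listed first as the most important. The sketch below parses lines in the format printed above and returns the first few layer indices. The function name, the regex, and the idea of treating these layers as candidates for larger quant types are assumptions for illustration, not a documented workflow.

```python
import re

# Matches lines such as "  0: Layer   0, <cos_sim> = 0.32154"
LINE_RE = re.compile(r"^\s*\d+:\s+Layer\s+(\d+),\s+<cos_sim>\s+=\s+([0-9.]+)\s*$")

def lowest_similarity_layers(log_text, top_n=3):
    """Return [(layer, cos_sim), ...] for the top_n lowest cos_sim entries."""
    rows = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append((int(match.group(1)), float(match.group(2))))
    return sorted(rows, key=lambda row: row[1])[:top_n]

# First few lines of the "sorted layer importances" section above
sample = """\
  0: Layer   0, <cos_sim> = 0.32154
  1: Layer  47, <cos_sim> = 0.38473
  2: Layer   1, <cos_sim> = 0.736987
  3: Layer  28, <cos_sim> = 0.845492
"""

print(lowest_similarity_layers(sample))
```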
👈 Benchmarking Methodology

Benchmark Suite

The benchmark client used is bartowski's patched evalchemy fork containing fixes for easier use across a variety of LLM server API endpoints.

Benchmark suite testing was done with llama.cpp@36667c8e on a subset of models.

For llama.cpp server:

cd llama.cpp
git checkout 36667c8e
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

model=/mnt/raid/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-IQ2_M.gguf
name=bartowski/Qwen3-30B-A3B-IQ2_M

CUDA_VISIBLE_DEVICES="1" \
./build/bin/llama-server \
  --model "$model" \
  --alias "$name" \
  --api-key super-secret-change-me \
  -fa \
  -ctk f16 -ctv f16 \
  -c 262144 \
  --parallel 8 \
  --slots \
  -ngl 99 \
  --threads 1 \
  --host 127.0.0.1 \
  --port 8088
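
A side note on the `-c 262144` and `--parallel 8` pairing above: assuming llama-server splits the total context evenly across parallel slots (how it is generally understood to behave, but treat it as an assumption rather than something verified against this commit), each of the 8 concurrent requests gets 32768 tokens, which lines up with the `--max-model-len 32768` used for the vllm server. A tiny sketch of that arithmetic, with a hypothetical helper name:

```python
def context_per_slot(total_ctx, parallel_slots):
    """Tokens available to each request if the context is split evenly."""
    if parallel_slots <= 0:
        raise ValueError("parallel_slots must be positive")
    return total_ctx // parallel_slots

# Values from the llama-server command above: -c 262144 --parallel 8
print(context_per_slot(262144, 8))
```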

For ik_llama.cpp server:

cd ik_llama.cpp
git checkout e3fec173
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

model=/mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf
name=ubergarm/Qwen3-30B-A3B-mix-IQ4_K

CUDA_VISIBLE_DEVICES="1" \
./build/bin/llama-server \
  --model "$model" \
  --alias "$name" \
  --api-key super-secret-change-me \
  -fmoe \
  -fa \
  -ctk f16 -ctv f16 \
  -c 262144 \
  --parallel 8 \
  -ngl 99 \
  --threads 1 \
  --host 127.0.0.1 \
  --port 8088

For vllm server:

CUDA_VISIBLE_DEVICES="1" \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
VLLM_USE_MODELSCOPE=True \
vllm \
  serve swift/Qwen3-30B-A3B-AWQ \
  --served-model-name Qwen3-30B-A3B-AWQ \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --max-num-seqs 64 \
  --api-key super-secret-change-me \
  --host 127.0.0.1 \
  --port 8080

👈 Speed Benchmark Methodology

Note: there was probably no warmup (I saw a PR on ik's fork about it), so the first data point trends low.

cd llama.cpp
git checkout ug/port-sweep-bench
# llama.cpp@814f795e + ug/port-sweep-bench
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

#model=/mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
#model=/mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q2_K_L.gguf
#model=/mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-IQ2_M.gguf

#model=/mnt/astrodata/llm/models/unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q2_K_XL.gguf
#model=/mnt/astrodata/llm/models/unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-IQ2_M.gguf
model=/mnt/astrodata/llm/models/unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL.gguf

CUDA_VISIBLE_DEVICES=0 \
./build/bin/llama-sweep-bench \
    --model "$model" \
    -fa \
    -ctk f16 -ctv f16 \
    -c 32768 \
    -ngl 99 \
    --threads 1
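
When reading the llama-sweep-bench tables in the raw data section, the speed columns are token counts divided by elapsed time: S_PP = PP / T_PP and S_TG = TG / T_TG, measured at each KV cache depth N_KV. The sketch below recomputes them from one row of the bartowski/Q4_K_M table. Small differences from the printed values come from the times being rounded to three decimals. The helper name is hypothetical.

```python
def tokens_per_second(n_tokens, seconds):
    """Throughput in tokens/second for one benchmark step."""
    return n_tokens / seconds

# First row of the bartowski/Q4_K_M table:
# PP=512 TG=128 N_KV=0 T_PP=0.186s S_PP=2746.40 T_TG=0.912s S_TG=140.37
s_pp = tokens_per_second(512, 0.186)
s_tg = tokens_per_second(128, 0.912)
print(f"S_PP ~ {s_pp:.1f} t/s, S_TG ~ {s_tg:.1f} t/s")
```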

Raw Data

👈 Perplexity, KLD, and Δp Raw Data Table

Parsed this data from a bunch of logs generated above. It is not in the most beautiful order so feel free to copy paste into google docs or however you'd like to make your own graphs.

| Model | Size (GiB) | 0.1% Δp | 1.0% KLD | 1.0% Δp | 10.0% KLD | 10.0% Δp | 25.0% Δp | 5.0% KLD | 5.0% Δp | 75.0% Δp | 90.0% Δp | 95.0% Δp | 99.0% KLD | 99.0% Δp | 99.9% KLD | 99.9% Δp | Maximum KLD | Maximum Δp | Mean KLD | Mean KLD uncertainty | Mean Δp | Mean Δp uncertainty | Mean PPL(Q) ubergarm-kld-test-corpus.txt | Mean PPL(Q) uncertainty ubergarm-kld-test-corpus.txt | Median KLD | Median Δp | Minimum KLD | Minimum Δp | PPL uncertainty wiki.test.raw | PPL wiki.test.raw | RMS Δp | RMS Δp uncertainty | Same top p | Same top p uncertainty |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen/Qwen3-235B-A22B-BF16 | 438 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| ubergarm/Qwen3-235B-A22B-Q8_0 | 233 | | | | | | | | | | | | | | | | | | | | | | 11.7194 | 0.07212 | | | | | 0.03321 | 5.3141 | | | | |
| ubergarm/Qwen3-235B-A22B-mix-IQ3_K | 107 | -18.276% | 0.000036 | -8.542% | 0.000940 | -2.631% | -0.686% | 0.000310 | -4.272% | 0.587% | 2.504% | 4.175% | 0.098368 | 8.257% | 0.296680 | 17.122% | 2.906263 | 63.764% | 0.014594 | 0.000064 | -0.049 | 0.006 | 11.788282 | 0.072648 | 0.008979 | -0.001% | -0.000039 | -72.329% | 0.03421 | 5.4403 | 2.846 | 0.017 | 93.459 | 0.056 |
| lmstudio-community/Qwen3-235B-A22B-Q3_K_L | 104 | -27.956% | 0.000083 | -14.266% | 0.002466 | -4.579% | -1.294% | 0.000766 | -7.290% | 0.786% | 3.742% | 6.267% | 0.219563 | 12.470% | 0.628216 | 24.126% | 8.358958 | 77.349% | 0.036266 | 0.000140 | -0.284 | 0.010 | 11.904309 | 0.073302 | 0.023930 | -0.010% | -0.000003 | -99.970% | 0.03584 | 5.6582 | 4.496 | 0.025 | 89.756 | 0.069 |
| unsloth/Qwen3-235B-A22B-UD-Q3_K_XL | 97 | -25.243% | 0.000060 | -12.180% | 0.001945 | -3.752% | -0.962% | 0.000612 | -6.159% | 0.874% | 3.649% | 5.976% | 0.180988 | 11.713% | 0.543533 | 22.421% | 5.471307 | 64.130% | 0.029122 | 0.000123 | -0.059 | 0.009 | 11.855173 | 0.073300 | 0.018888 | -0.000% | -0.000004 | -98.693% | 0.03524 | 5.5695 | 4.018 | 0.023 | 90.694 | 0.066 |
| Qwen/Qwen3-30B-A3B-BF16 | 56.9 | | | | | | | | | | | | | | | | | | | | | | 15.1443 | 0.10239 | | | | | 0.07223 | 9.0703 | | | | |
| ubergarm/Qwen3-30B-A3B-Q8_0 | 30.3 | -7.050% | 0.000001 | -3.834% | 0.000154 | -1.241% | -0.282% | 0.000038 | -2.035% | 0.231% | 1.176% | 1.964% | 0.013699 | 3.763% | 0.039718 | 7.128% | 0.359152 | 28.466% | 0.002337 | 0.000009 | -0.020 | 0.003 | 15.152095 | 0.102398 | 0.001587 | -0.000% | -0.000047 | -34.379% | 0.07228 | 9.0740 | 1.279 | 0.008 | 96.972 | 0.039 |
| ubergarm/Qwen3-30B-A3B-mix-IQ4_K | 17.7 | -11.731% | 0.000004 | -5.522% | 0.000298 | -1.645% | -0.376% | 0.000080 | -2.742% | 0.326% | 1.592% | 2.682% | 0.032109 | 5.373% | 0.104454 | 10.626% | 2.514502 | 39.508% | 0.004821 | 0.000024 | -0.025 | 0.004 | 15.218819 | 0.103071 | 0.002970 | -0.000% | -0.000048 | -44.213% | 0.07278 | 9.1184 | 1.818 | 0.011 | 95.945 | 0.045 |
| bartowski/Qwen3-30B-A3B-Q4_K_M | 17.4 | -16.135% | 0.000008 | -8.303% | 0.000652 | -2.643% | -0.645% | 0.000171 | -4.286% | 0.398% | 2.084% | 3.570% | 0.063238 | 7.356% | 0.195169 | 14.392% | 5.985787 | 61.522% | 0.010136 | 0.000053 | -0.158 | 0.006 | 15.194468 | 0.102605 | 0.006434 | -0.001% | -0.000032 | -88.357% | 0.07381 | 9.2092 | 2.619 | 0.018 | 94.329 | 0.053 |
| bartowski/Qwen3-30B-A3B-Q4_K_S | 16.8 | -18.122% | 0.000013 | -9.230% | 0.000862 | -3.006% | -0.780% | 0.000235 | -4.787% | 0.402% | 2.215% | 3.866% | 0.077885 | 7.972% | 0.233980 | 15.420% | 5.971601 | 66.795% | 0.012915 | 0.000065 | -0.227 | 0.007 | 15.202408 | 0.102513 | 0.008261 | -0.002% | -0.000038 | -87.019% | 0.07371 | 9.2232 | 2.885 | 0.019 | 93.804 | 0.055 |
| unsloth/Qwen3-30B-A3B-UD-Q4_K_XL | 16.5 | -21.984% | 0.000015 | -11.111% | 0.001152 | -3.508% | -0.938% | 0.000315 | -5.582% | 0.421% | 2.460% | 4.261% | 0.102021 | 8.910% | 0.305740 | 17.384% | 5.570370 | 67.990% | 0.016495 | 0.000071 | -0.320 | 0.008 | 15.281833 | 0.103140 | 0.010432 | -0.005% | -0.000016 | -85.356% | 0.07290 | 9.1688 | 3.333 | 0.020 | 93.169 | 0.058 |
| ubergarm/Qwen3-30B-A3B-IQ4_KS | 15.5 | -20.721% | 0.000018 | -10.000% | 0.001003 | -3.073% | -0.796% | 0.000292 | -5.017% | 0.442% | 2.398% | 4.167% | 0.094074 | 8.691% | 0.282245 | 16.987% | 6.828948 | 89.561% | 0.014617 | 0.000068 | -0.209 | 0.007 | 15.182811 | 0.102278 | 0.008934 | -0.003% | -0.000031 | -75.475% | 0.07061 | 8.9862 | 3.106 | 0.019 | 93.625 | 0.056 |
| ikawrakow/Qwen3-30B-A3B-IQ4_KS-Bartowski | 15.3 | -20.846% | 0.000021 | -10.497% | 0.001098 | -3.434% | -0.905% | 0.000316 | -5.433% | 0.421% | 2.427% | 4.216% | 0.099815 | 8.719% | 0.290617 | 17.546% | 6.971420 | 81.571% | 0.015818 | 0.000074 | -0.288 | 0.007 | 15.150462 | 0.101931 | 0.009988 | -0.004% | -0.000029 | -86.592% | 0.07078 | 9.0016 | 3.244 | 0.020 | 93.317 | 0.057 |
| ikawrakow/Qwen3-30B-A3B-IQ4_KS-IK | 15.3 | -21.414% | 0.000026 | -10.689% | 0.001192 | -3.461% | -0.959% | 0.000352 | -5.489% | 0.405% | 2.383% | 4.163% | 0.102473 | 8.750% | 0.301946 | 17.416% | 7.146766 | 58.365% | 0.016277 | 0.000074 | -0.323 | 0.007 | 15.161535 | 0.101972 | 0.010269 | -0.006% | -0.000007 | -90.822% | 0.07094 | 9.0177 | 3.265 | 0.019 | 93.216 | 0.057 |
| ikawrakow/Qwen3-30B-A3B-IQ4_KS-Unslolth | 15.3 | -21.919% | 0.000023 | -11.082% | 0.001218 | -3.610% | -1.015% | 0.000351 | -5.698% | 0.396% | 2.355% | 4.173% | 0.104796 | 8.799% | 0.314624 | 18.042% | 7.383745 | 78.742% | 0.016845 | 0.000077 | -0.366 | 0.008 | 15.109454 | 0.101327 | 0.010667 | -0.006% | -0.000012 | -86.065% | 0.06945 | 8.9171 | 3.331 | 0.020 | 93.217 | 0.057 |
| unsloth/Qwen3-30B-A3B-UD-IQ2_M | 10.1 | -47.141% | 0.000072 | -22.803% | 0.004283 | -6.698% | -1.739% | 0.001229 | -11.071% | 0.843% | 4.934% | 8.514% | 0.457244 | 17.671% | 1.370219 | 34.262% | 8.153114 | 88.509% | 0.066646 | 0.000267 | -0.607 | 0.015 | 15.889509 | 0.107834 | 0.039668 | -0.011% | -0.000011 | -99.283% | 0.08541 | 10.3726 | 6.627 | 0.033 | 87.029 | 0.077 |
| bartowski/Qwen3-30B-A3B-IQ2_M | 9.7 | -48.093% | 0.000068 | -24.583% | 0.005231 | -8.541% | -2.590% | 0.001459 | -13.210% | 0.538% | 4.031% | 7.477% | 0.432021 | 16.466% | 1.262156 | 31.659% | 8.695639 | 80.027% | 0.069100 | 0.000258 | -1.300 | 0.016 | 15.436905 | 0.102661 | 0.044448 | -0.039% | -0.000004 | -96.452% | 0.08036 | 9.9788 | 6.979 | 0.033 | 86.303 | 0.079 |
👈 Benchmark Suite Raw Data Table

TODO copy/paste it all somewhere if there is enough interest.

👈 llama-sweep-bench Speed Data

bartowski/Q4_K_M

PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 0.186 2746.40 0.912 140.37
512 128 512 0.189 2709.05 0.941 135.99
512 128 1024 0.190 2689.73 0.940 136.22
512 128 1536 0.195 2631.96 0.943 135.78
512 128 2048 0.197 2601.24 0.957 133.69
512 128 2560 0.201 2553.51 0.959 133.43
512 128 3072 0.203 2526.21 0.966 132.56
512 128 3584 0.207 2472.32 0.976 131.16
512 128 4096 0.210 2432.41 0.986 129.80
512 128 4608 0.213 2406.39 0.996 128.50
512 128 5120 0.215 2385.53 1.008 126.99
512 128 5632 0.218 2347.09 1.018 125.72
512 128 6144 0.221 2321.62 1.029 124.44
512 128 6656 0.224 2287.95 1.041 123.02
512 128 7168 0.227 2252.04 1.053 121.57
512 128 7680 0.231 2218.25 1.065 120.17
512 128 8192 0.233 2194.17 1.075 119.04
512 128 8704 0.235 2175.86 1.086 117.92
512 128 9216 0.240 2133.00 1.099 116.47
512 128 9728 0.241 2126.89 1.109 115.46
512 128 10240 0.245 2089.25 1.120 114.25
512 128 10752 0.249 2055.28 1.164 109.96
512 128 11264 0.252 2032.46 1.181 108.43
512 128 11776 0.254 2011.96 1.171 109.29
512 128 12288 0.257 1993.13 1.175 108.95
512 128 12800 0.260 1970.94 1.184 108.08
512 128 13312 0.264 1939.95 1.186 107.95
512 128 13824 0.265 1930.30 1.194 107.24
512 128 14336 0.270 1897.48 1.197 106.89
512 128 14848 0.272 1880.96 1.204 106.32
512 128 15360 0.276 1856.05 1.214 105.45
512 128 15872 0.279 1832.42 1.221 104.82
512 128 16384 0.283 1809.73 1.229 104.13
512 128 16896 0.285 1796.89 1.234 103.69
512 128 17408 0.288 1778.96 1.242 103.08
512 128 17920 0.293 1746.74 1.249 102.52
512 128 18432 0.296 1729.58 1.256 101.89
512 128 18944 0.298 1715.59 1.264 101.23
512 128 19456 0.302 1697.53 1.269 100.87
512 128 19968 0.304 1684.14 1.278 100.13
512 128 20480 0.307 1665.46 1.284 99.71
512 128 20992 0.311 1644.88 1.291 99.12
512 128 21504 0.314 1631.38 1.334 95.97
512 128 22016 0.317 1613.83 1.347 95.01
512 128 22528 0.321 1596.46 1.339 95.57
512 128 23040 0.322 1589.42 1.345 95.16
512 128 23552 0.325 1573.55 1.352 94.64
512 128 24064 0.329 1556.41 1.358 94.25
512 128 24576 0.333 1537.96 1.363 93.93
512 128 25088 0.335 1529.21 1.369 93.52
512 128 25600 0.340 1506.80 1.378 92.91
512 128 26112 0.343 1494.38 1.383 92.54
512 128 26624 0.347 1476.69 1.392 91.98
512 128 27136 0.350 1464.63 1.398 91.53
512 128 27648 0.353 1451.77 1.405 91.13
512 128 28160 0.355 1442.42 1.411 90.69
512 128 28672 0.359 1427.94 1.418 90.26
512 128 29184 0.362 1415.01 1.426 89.77
512 128 29696 0.364 1406.75 1.433 89.33
512 128 30208 0.367 1393.57 1.441 88.84
512 128 30720 0.371 1379.72 1.450 88.27
512 128 31232 0.374 1367.29 1.456 87.93
512 128 31744 0.378 1355.16 1.464 87.43
512 128 32256 0.381 1343.89 1.507 84.94

bartowski/Q2_K_L

PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 0.219 2342.04 0.940 136.14
512 128 512 0.221 2320.24 0.968 132.17
512 128 1024 0.222 2302.08 0.968 132.25
512 128 1536 0.228 2245.09 0.976 131.11
512 128 2048 0.230 2230.09 0.990 129.34
512 128 2560 0.233 2201.35 0.998 128.21
512 128 3072 0.236 2168.36 1.005 127.38
512 128 3584 0.240 2128.94 1.014 126.18
512 128 4096 0.243 2102.88 1.025 124.82
512 128 4608 0.245 2093.47 1.035 123.68
512 128 5120 0.248 2062.11 1.045 122.44
512 128 5632 0.251 2042.84 1.057 121.12
512 128 6144 0.254 2016.60 1.069 119.78
512 128 6656 0.256 1996.33 1.081 118.46
512 128 7168 0.260 1965.62 1.090 117.42
512 128 7680 0.264 1939.11 1.103 116.03
512 128 8192 0.267 1917.69 1.114 114.86
512 128 8704 0.269 1902.68 1.123 113.97
512 128 9216 0.275 1864.88 1.139 112.41
512 128 9728 0.275 1864.80 1.149 111.43
512 128 10240 0.280 1831.10 1.173 109.12
512 128 10752 0.282 1813.40 1.209 105.90
512 128 11264 0.286 1792.80 1.224 104.61
512 128 11776 0.289 1769.64 1.217 105.19
512 128 12288 0.291 1756.56 1.219 104.97
512 128 12800 0.296 1730.89 1.230 104.08
512 128 13312 0.298 1717.56 1.231 103.94
512 128 13824 0.299 1709.78 1.237 103.48
512 128 14336 0.304 1684.98 1.241 103.15
512 128 14848 0.306 1672.32 1.247 102.63
512 128 15360 0.309 1657.69 1.251 102.28
512 128 15872 0.312 1642.84 1.258 101.72
512 128 16384 0.316 1620.66 1.265 101.16
512 128 16896 0.319 1603.11 1.271 100.68
512 128 17408 0.322 1592.25 1.280 100.04
512 128 17920 0.325 1573.98 1.286 99.52
512 128 18432 0.328 1560.54 1.295 98.82
512 128 18944 0.331 1547.27 1.303 98.27
512 128 19456 0.336 1525.32 1.308 97.87
512 128 19968 0.336 1523.96 1.317 97.16
512 128 20480 0.339 1509.92 1.323 96.72
512 128 20992 0.342 1498.56 1.328 96.36
512 128 21504 0.344 1487.29 1.368 93.54
512 128 22016 0.348 1469.52 1.386 92.32
512 128 22528 0.351 1458.22 1.377 92.95
512 128 23040 0.354 1447.65 1.383 92.56
512 128 23552 0.357 1434.13 1.392 91.95
512 128 24064 0.361 1417.81 1.397 91.60
512 128 24576 0.365 1401.75 1.400 91.40
512 128 25088 0.367 1395.82 1.408 90.89
512 128 25600 0.369 1387.75 1.412 90.67
512 128 26112 0.374 1368.77 1.418 90.29
512 128 26624 0.377 1359.02 1.427 89.71
512 128 27136 0.380 1347.28 1.434 89.25
512 128 27648 0.383 1336.61 1.439 88.92
512 128 28160 0.387 1322.05 1.446 88.50
512 128 28672 0.389 1315.73 1.454 88.02
512 128 29184 0.392 1307.57 1.461 87.58
512 128 29696 0.395 1295.59 1.468 87.16
512 128 30208 0.400 1281.33 1.475 86.77
512 128 30720 0.403 1269.72 1.485 86.17
512 128 31232 0.406 1260.77 1.493 85.75
512 128 31744 0.411 1245.97 1.499 85.37
512 128 32256 0.411 1244.60 1.538 83.20

bartowski/IQ2_M

PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 0.199 2571.39 0.929 137.72
512 128 512 0.200 2558.87 0.958 133.66
512 128 1024 0.205 2502.88 0.958 133.60
512 128 1536 0.209 2449.39 0.966 132.45
512 128 2048 0.211 2424.91 0.979 130.70
512 128 2560 0.214 2387.42 0.981 130.51
512 128 3072 0.217 2359.21 0.990 129.36
512 128 3584 0.220 2322.95 1.001 127.93
512 128 4096 0.224 2281.51 1.011 126.63
512 128 4608 0.226 2264.66 1.020 125.44
512 128 5120 0.228 2246.85 1.031 124.21
512 128 5632 0.231 2218.24 1.040 123.07
512 128 6144 0.235 2177.99 1.054 121.47
512 128 6656 0.237 2158.85 1.065 120.14
512 128 7168 0.241 2124.91 1.078 118.72
512 128 7680 0.245 2088.47 1.094 116.98
512 128 8192 0.248 2066.12 1.106 115.68
512 128 8704 0.250 2044.39 1.117 114.54
512 128 9216 0.253 2023.04 1.130 113.27
512 128 9728 0.256 2002.81 1.141 112.18
512 128 10240 0.259 1980.01 1.154 110.94
512 128 10752 0.263 1945.18 1.198 106.84
512 128 11264 0.265 1928.54 1.211 105.70
512 128 11776 0.268 1908.01 1.204 106.28
512 128 12288 0.271 1891.82 1.207 106.08
512 128 12800 0.275 1861.92 1.216 105.27
512 128 13312 0.277 1846.15 1.219 104.99
512 128 13824 0.280 1829.45 1.226 104.43
512 128 14336 0.283 1807.34 1.229 104.17
512 128 14848 0.286 1789.55 1.233 103.77
512 128 15360 0.289 1774.14 1.241 103.12
512 128 15872 0.293 1750.23 1.248 102.55
512 128 16384 0.296 1730.68 1.256 101.88
512 128 16896 0.299 1713.86 1.261 101.49
512 128 17408 0.301 1700.49 1.271 100.72
512 128 17920 0.306 1671.47 1.281 99.93
512 128 18432 0.310 1652.08 1.291 99.17
512 128 18944 0.313 1637.83 1.299 98.53
512 128 19456 0.316 1618.98 1.302 98.32
512 128 19968 0.317 1612.79 1.314 97.42
512 128 20480 0.321 1595.76 1.319 97.04
512 128 20992 0.326 1572.01 1.327 96.43
512 128 21504 0.328 1561.24 1.369 93.51
512 128 22016 0.332 1543.74 1.383 92.57
512 128 22528 0.335 1529.05 1.373 93.23
512 128 23040 0.336 1524.73 1.374 93.17
512 128 23552 0.337 1517.70 1.386 92.33
512 128 24064 0.343 1493.95 1.387 92.27
512 128 24576 0.346 1481.52 1.393 91.88
512 128 25088 0.349 1466.47 1.401 91.37
512 128 25600 0.350 1462.59 1.406 91.06
512 128 26112 0.356 1438.68 1.413 90.61
512 128 26624 0.359 1425.06 1.418 90.29
512 128 27136 0.361 1417.08 1.426 89.75
512 128 27648 0.365 1403.93 1.433 89.33
512 128 28160 0.368 1389.95 1.442 88.74
512 128 28672 0.371 1380.36 1.454 88.02
512 128 29184 0.374 1369.27 1.458 87.79
512 128 29696 0.378 1355.92 1.465 87.36
512 128 30208 0.381 1345.24 1.471 87.01
512 128 30720 0.383 1336.71 1.482 86.39
512 128 31232 0.387 1324.60 1.486 86.11
512 128 31744 0.390 1311.28 1.494 85.65
512 128 32256 0.393 1302.29 1.535 83.40

unsloth/UD-Q4_K_XL

PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 0.185 2771.13 0.895 143.07
512 128 512 0.187 2735.63 0.923 138.71
512 128 1024 0.190 2699.01 0.921 138.95
512 128 1536 0.195 2627.30 0.930 137.64
512 128 2048 0.196 2614.49 0.943 135.73
512 128 2560 0.200 2560.59 0.947 135.10
512 128 3072 0.202 2528.42 0.954 134.19
512 128 3584 0.206 2481.69 0.964 132.77
512 128 4096 0.210 2443.23 0.974 131.47
512 128 4608 0.212 2413.67 0.985 129.96
512 128 5120 0.214 2394.67 0.995 128.61
512 128 5632 0.219 2340.45 1.015 126.14
512 128 6144 0.222 2306.96 1.024 125.01
512 128 6656 0.225 2273.36 1.035 123.64
512 128 7168 0.228 2242.54 1.050 121.92
512 128 7680 0.231 2212.63 1.060 120.71
512 128 8192 0.235 2182.09 1.068 119.82
512 128 8704 0.237 2157.82 1.082 118.25
512 128 9216 0.241 2123.14 1.097 116.72
512 128 9728 0.243 2109.32 1.104 115.90
512 128 10240 0.246 2077.16 1.119 114.35
512 128 10752 0.250 2049.47 1.168 109.62
512 128 11264 0.254 2017.75 1.183 108.21
512 128 11776 0.255 2009.66 1.173 109.13
512 128 12288 0.259 1976.27 1.176 108.86
512 128 12800 0.261 1957.95 1.186 107.97
512 128 13312 0.266 1926.83 1.187 107.84
512 128 13824 0.267 1914.87 1.191 107.45
512 128 14336 0.271 1888.06 1.196 107.00
512 128 14848 0.274 1869.73 1.202 106.49
512 128 15360 0.277 1849.09 1.209 105.84
512 128 15872 0.280 1828.40 1.215 105.35
512 128 16384 0.284 1801.44 1.224 104.57
512 128 16896 0.287 1781.87 1.229 104.13
512 128 17408 0.290 1767.18 1.239 103.35
512 128 17920 0.293 1747.06 1.245 102.83
512 128 18432 0.296 1731.39 1.252 102.25
512 128 18944 0.299 1712.43 1.259 101.64
512 128 19456 0.303 1690.65 1.265 101.17
512 128 19968 0.304 1682.41 1.276 100.31
512 128 20480 0.308 1660.25 1.280 99.99
512 128 20992 0.312 1641.94 1.285 99.57
512 128 21504 0.314 1628.35 1.331 96.17
512 128 22016 0.318 1611.79 1.346 95.11
512 128 22528 0.321 1596.28 1.337 95.72
512 128 23040 0.324 1580.92 1.340 95.54
512 128 23552 0.325 1573.30 1.351 94.74
512 128 24064 0.330 1552.94 1.350 94.81
512 128 24576 0.334 1534.84 1.355 94.48
512 128 25088 0.335 1526.93 1.361 94.06
512 128 25600 0.339 1511.89 1.366 93.70
512 128 26112 0.343 1492.70 1.383 92.55
512 128 26624 0.347 1476.86 1.387 92.27
512 128 27136 0.350 1462.35 1.397 91.63
512 128 27648 0.354 1446.91 1.404 91.16
512 128 28160 0.356 1438.02 1.412 90.66
512 128 28672 0.361 1419.66 1.418 90.26
512 128 29184 0.362 1413.92 1.426 89.77
512 128 29696 0.365 1401.20 1.433 89.32
512 128 30208 0.368 1391.23 1.439 88.97
512 128 30720 0.372 1377.54 1.450 88.29
512 128 31232 0.374 1369.93 1.453 88.09
512 128 31744 0.378 1356.09 1.462 87.56
512 128 32256 0.380 1347.04 1.503 85.14

unsloth/UD-Q2_K_XL

PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 0.211 2423.89 0.943 135.74
512 128 512 0.213 2399.31 0.971 131.85
512 128 1024 0.216 2374.33 0.969 132.11
512 128 1536 0.219 2340.30 0.979 130.80
512 128 2048 0.220 2325.55 0.991 129.11
512 128 2560 0.225 2276.44 0.994 128.82
512 128 3072 0.228 2247.84 1.001 127.89
512 128 3584 0.232 2207.44 1.011 126.60
512 128 4096 0.236 2170.20 1.023 125.06
512 128 4608 0.236 2166.89 1.032 124.00
512 128 5120 0.240 2131.06 1.044 122.58
512 128 5632 0.244 2102.12 1.054 121.41
512 128 6144 0.247 2076.33 1.063 120.43
512 128 6656 0.249 2055.14 1.077 118.82
512 128 7168 0.253 2024.47 1.088 117.68
512 128 7680 0.256 1996.90 1.099 116.45
512 128 8192 0.260 1967.17 1.114 114.93
512 128 8704 0.260 1967.20 1.122 114.06
512 128 9216 0.266 1922.64 1.135 112.81
512 128 9728 0.268 1911.09 1.147 111.63
512 128 10240 0.272 1885.44 1.157 110.64
512 128 10752 0.274 1865.36 1.202 106.45
512 128 11264 0.278 1844.60 1.217 105.18
512 128 11776 0.279 1836.43 1.208 105.93
512 128 12288 0.283 1810.13 1.213 105.57
512 128 12800 0.288 1780.11 1.229 104.16
512 128 13312 0.291 1758.14 1.229 104.12
512 128 13824 0.292 1753.98 1.238 103.39
512 128 14336 0.298 1718.12 1.241 103.10
512 128 14848 0.300 1706.26 1.247 102.61
512 128 15360 0.302 1693.28 1.254 102.07
512 128 15872 0.306 1673.01 1.262 101.46
512 128 16384 0.310 1650.90 1.268 100.96
512 128 16896 0.313 1638.03 1.275 100.41
512 128 17408 0.315 1625.29 1.281 99.90
512 128 17920 0.318 1609.23 1.289 99.31
512 128 18432 0.322 1589.10 1.297 98.68
512 128 18944 0.325 1575.42 1.302 98.29
512 128 19456 0.330 1553.28 1.310 97.73
512 128 19968 0.330 1552.98 1.319 97.05
512 128 20480 0.334 1531.58 1.324 96.67
512 128 20992 0.337 1518.07 1.332 96.12
512 128 21504 0.340 1507.15 1.373 93.25
512 128 22016 0.344 1488.06 1.385 92.41
512 128 22528 0.347 1477.13 1.378 92.88
512 128 23040 0.349 1467.54 1.384 92.47
512 128 23552 0.351 1459.50 1.394 91.80
512 128 24064 0.356 1440.13 1.397 91.61
512 128 24576 0.359 1426.95 1.401 91.36
512 128 25088 0.360 1423.59 1.409 90.82
512 128 25600 0.364 1405.52 1.413 90.62
512 128 26112 0.369 1388.93 1.419 90.18
512 128 26624 0.371 1379.47 1.426 89.79
512 128 27136 0.374 1369.38 1.434 89.28
512 128 27648 0.377 1357.58 1.441 88.85
512 128 28160 0.382 1342.07 1.447 88.44
512 128 28672 0.384 1333.90 1.455 87.99
512 128 29184 0.386 1326.66 1.461 87.62
512 128 29696 0.390 1313.92 1.468 87.22
512 128 30208 0.394 1298.28 1.483 86.34
512 128 30720 0.398 1286.81 1.488 86.02
512 128 31232 0.400 1280.36 1.494 85.70
512 128 31744 0.405 1263.20 1.502 85.21
512 128 32256 0.407 1257.02 1.545 82.83

unsloth/UD-IQ2_M

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 128 0 0.198 2588.92 0.982 130.29
512 128 512 0.199 2574.42 1.008 127.04
512 128 1024 0.203 2527.70 1.007 127.07
512 128 1536 0.206 2488.20 1.017 125.90
512 128 2048 0.207 2468.48 1.031 124.15
512 128 2560 0.211 2427.66 1.037 123.42
512 128 3072 0.215 2376.22 1.045 122.45
512 128 3584 0.218 2344.71 1.055 121.32
512 128 4096 0.220 2323.83 1.066 120.13
512 128 4608 0.224 2286.56 1.075 119.08
512 128 5120 0.226 2263.56 1.086 117.87
512 128 5632 0.229 2233.20 1.097 116.64
512 128 6144 0.231 2216.06 1.108 115.56
512 128 6656 0.235 2174.18 1.125 113.82
512 128 7168 0.239 2141.53 1.137 112.61
512 128 7680 0.244 2099.48 1.148 111.48
512 128 8192 0.245 2087.77 1.160 110.38
512 128 8704 0.247 2076.19 1.170 109.38
512 128 9216 0.251 2040.21 1.183 108.22
512 128 9728 0.252 2028.41 1.192 107.39
512 128 10240 0.255 2006.18 1.204 106.35
512 128 10752 0.257 1988.99 1.247 102.62
512 128 11264 0.261 1963.06 1.264 101.28
512 128 11776 0.262 1951.61 1.257 101.84
512 128 12288 0.265 1932.03 1.260 101.62
512 128 12800 0.269 1901.02 1.269 100.83
512 128 13312 0.272 1882.44 1.271 100.72
512 128 13824 0.273 1873.24 1.274 100.48
512 128 14336 0.277 1845.12 1.281 99.91
512 128 14848 0.280 1830.87 1.290 99.24
512 128 15360 0.282 1812.46 1.296 98.79
512 128 15872 0.286 1793.02 1.302 98.31
512 128 16384 0.288 1778.72 1.309 97.80
512 128 16896 0.293 1745.22 1.316 97.26
512 128 17408 0.295 1732.67 1.323 96.76
512 128 17920 0.299 1714.14 1.331 96.19
512 128 18432 0.301 1698.58 1.337 95.74
512 128 18944 0.306 1675.72 1.350 94.84
512 128 19456 0.307 1668.01 1.349 94.88
512 128 19968 0.313 1636.65 1.360 94.11
512 128 20480 0.314 1632.97 1.366 93.72
512 128 20992 0.316 1620.05 1.374 93.17
512 128 21504 0.319 1606.86 1.411 90.70
512 128 22016 0.322 1590.15 1.426 89.75
512 128 22528 0.327 1567.20 1.422 90.01
512 128 23040 0.330 1553.12 1.425 89.83
512 128 23552 0.333 1536.30 1.434 89.28
512 128 24064 0.337 1520.89 1.434 89.24
512 128 24576 0.339 1508.19 1.440 88.87
512 128 25088 0.343 1492.82 1.446 88.52
512 128 25600 0.344 1487.87 1.451 88.21
512 128 26112 0.350 1461.28 1.459 87.74
512 128 26624 0.350 1463.85 1.466 87.32
512 128 27136 0.354 1445.83 1.474 86.86
512 128 27648 0.357 1432.50 1.485 86.20
512 128 28160 0.363 1410.51 1.487 86.10
512 128 28672 0.365 1402.82 1.493 85.72
512 128 29184 0.368 1389.55 1.502 85.22
512 128 29696 0.371 1379.92 1.508 84.87
512 128 30208 0.374 1367.99 1.514 84.55
512 128 30720 0.377 1359.40 1.524 84.00
512 128 31232 0.378 1353.18 1.529 83.72
512 128 31744 0.382 1338.84 1.538 83.22
512 128 32256 0.386 1327.16 1.578 81.10
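
The rows above can be sanity-checked with a few lines of Python. This is only a sketch: it assumes the columns mean what the header suggests (`PP` and `TG` are token counts, `T_PP` and `T_TG` are seconds, `S_PP` and `S_TG` are tokens per second), and the `SweepRow` and `parse_row` names are made up for illustration. The two sample lines are the first and last rows of the unsloth/UD-IQ2_M table.

```python
from dataclasses import dataclass


@dataclass
class SweepRow:
    pp: int      # prompt tokens processed in this step (assumed meaning)
    tg: int      # tokens generated in this step (assumed meaning)
    n_kv: int    # tokens already in the KV cache
    t_pp: float  # prompt processing time, seconds
    s_pp: float  # prompt processing speed, tokens/second
    t_tg: float  # token generation time, seconds
    s_tg: float  # token generation speed, tokens/second


def parse_row(line: str) -> SweepRow:
    """Parse one whitespace-separated row of the tables above."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    return SweepRow(int(fields[0]), int(fields[1]), int(fields[2]),
                    float(fields[3]), float(fields[4]),
                    float(fields[5]), float(fields[6]))


first = parse_row("512 128 0 0.198 2588.92 0.982 130.29")
last = parse_row("512 128 32256 0.386 1327.16 1.578 81.10")

for row in (first, last):
    # Speed should be roughly tokens / seconds; the printed times are rounded,
    # so the recomputed values only match to within a fraction of a percent.
    print(row.n_kv, round(row.pp / row.t_pp, 1), round(row.tg / row.t_tg, 1))

# Fraction of the empty-cache generation speed that remains at the deepest context.
tg_retained = last.s_tg / first.s_tg
print(f"TG speed retained at N_KV={last.n_kv}: {tg_retained:.0%}")
```

Comparing the first and last rows this way gives a quick feel for how much speed each quant loses as the context fills up.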

Appendix and Definitions

👈 PPL, KLD, Δp Statistics

These metrics attempt to systematically measure the difference between an unquantized model and a given quantized version. In general lower is better, as it signals that the quantized version behaves more like the original.

Quantization is the process of compressing an original model's weights to shrink it down to run on limited hardware. Ideally the process minimizes errors and preserves the original uncompressed model's performance.

Perplexity (PPL)

Perplexity (PPL) is a metric used to evaluate how well a language model predicts text. It essentially measures how "surprised" the model is by a given text—if the model is good at predicting the next word, the perplexity is low. For example, a model that generates coherent, contextually accurate text will have lower perplexity than one that produces random or nonsensical output.
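
As a rough illustration of the definition above, perplexity is the exponential of the average negative log-probability the model assigned to each token that actually occurred. The sketch below uses made-up probabilities rather than real model output, and the `perplexity` helper is hypothetical, not a llama.cpp function.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability of the observed tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical probabilities assigned to the correct next token at each position.
confident = [math.log(p) for p in (0.90, 0.80, 0.95, 0.85)]
surprised = [math.log(p) for p in (0.20, 0.10, 0.30, 0.25)]

ppl_confident = perplexity(confident)
ppl_surprised = perplexity(surprised)
print(f"confident: {ppl_confident:.3f}  surprised: {ppl_surprised:.3f}")
```

A model that always gives the correct token probability 1/N has a perplexity of exactly N, which is why PPL is sometimes read as the effective number of choices the model is torn between.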

In the context of LLM quantization (e.g., reducing model precision to save resources), perplexity is used to check if the compressed model retains its language understanding. Generally the PPL of the unquantized model is expected to be lower than the PPL of a quantized version.

However, in quantization-aware training (QAT), the model is trained to handle lower-precision weights (e.g., from bf16 to int4) during training, simulating the effects of quantization. This helps the model adapt to the reduced precision, potentially maintaining performance even after quantization.

The PPL of the unquantized bf16 model might not always be lower, because a quantized model trained with QAT may retain performance close to the original bf16 model. If QAT is effective, the quantized model’s PPL could be similar to, or even lower than, the original’s, so the unquantized model’s PPL isn’t necessarily the lowest.

Kullback-Leibler Divergence (KLD)

KL-Divergence (KLD) is a statistical measure that quantifies how different two probability distributions are. In the context of Large Language Models (LLMs) and quantization, it’s used to compare how a compressed (quantized) model differs from the original (unquantized) model in terms of their output probabilities.

If two models produce nearly identical predictions (e.g., the same probabilities for words in a sentence), their KLD is low. If their predictions diverge significantly (e.g., the quantized model chooses different words more often), the KLD is high.

Typically a very large KLD baseline data file is generated on the original (or least quantized) version of the model. This baseline is then compared against quantized versions to measure KLD as well as Δp.
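
For a single token position, the comparison described above can be sketched as follows. The distributions are made up over a three-word vocabulary, the `kl_divergence` helper is hypothetical, and a real measurement would average this over many token positions using the baseline file.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats, with P the baseline and Q the quantized distribution."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    # Terms with p_i == 0 contribute nothing; eps guards against q_i == 0.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


baseline = [0.70, 0.20, 0.10]     # hypothetical unquantized next-token probabilities
close_quant = [0.68, 0.21, 0.11]  # a quant that barely moves them
far_quant = [0.30, 0.30, 0.40]    # a quant that changes the most likely token

kld_close = kl_divergence(baseline, close_quant)
kld_far = kl_divergence(baseline, far_quant)
print(f"close: {kld_close:.5f}  far: {kld_far:.5f}")
```

KLD is zero only when the two distributions are identical, and it is not symmetric, so the baseline should always be the first argument.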

Δp (Delta p)

Δp (token probability difference) refers to the difference in token probabilities between an unquantized (full-precision) model and a quantized model. It measures how much the probabilities assigned to individual tokens (e.g., words or subwords) change after quantization.

For example, for each token in a given input sequence, the unquantized model computes a probability distribution over the vocabulary (e.g., "the" has 10% chance, "cat" has 5%, etc.). The quantized model (e.g., IQ4_K or Q2_K_L) computes a similar distribution, but due to precision loss, the probabilities may shift. Δp is the absolute or relative difference between these two distributions for each token.

As a specific example, if the unquantized model assigns 0.2 to "cat" and the quantized model assigns 0.15, the absolute Δp for "cat" is 0.05.
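
The "cat" example can be written out as a small sketch. The probabilities for the other tokens are made up, and the `delta_p` helper is hypothetical. It returns a signed difference (quantized minus unquantized); this sign convention is an assumption, and taking `abs()` of each value gives the absolute form.

```python
def delta_p(p_base, p_quant):
    """Signed probability change per token: quantized minus unquantized."""
    return {tok: p_quant.get(tok, 0.0) - p for tok, p in p_base.items()}


# Hypothetical probabilities for a few candidate tokens at one position.
base = {"cat": 0.20, "the": 0.10, "dog": 0.05}
quant = {"cat": 0.15, "the": 0.12, "dog": 0.05}

dp = delta_p(base, quant)
for tok, diff in dp.items():
    print(f"{tok}: {diff:+.2f}")
```

A negative value means the quantized model became less confident in that token, and a value near zero means quantization left it essentially unchanged.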

👈 Benchmark Suites

Benchmarking Suite

GPQA Diamond

GPQA Diamond is a subset of 198 high-objectivity, challenging multiple-choice questions designed for advanced testing. Its difficulty aligns with college-level or higher expertise in biology, physics, and chemistry. It is intended for evaluating AI systems' ability to handle complex, domain-specific tasks requiring deep knowledge and critical thinking.

MBPP

MBPP (Mostly Basic Programming Problems) is a benchmark dataset designed to evaluate large language models (LLMs) on Python programming tasks. The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases.

MMLU-Pro

MMLU-Pro is an enhanced benchmark designed to evaluate language understanding models across broader and more challenging tasks. Building on the Massive Multitask Language Understanding (MMLU) dataset, MMLU-Pro integrates more challenging, reasoning-focused questions and increases the answer choices per question from four to ten, significantly raising the difficulty and reducing the chance of success through random guessing. MMLU-Pro comprises over 12,000 rigorously curated questions from academic exams and textbooks, spanning 14 diverse domains including Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Math, Philosophy, Physics, Psychology, and Others.
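
To put a number on the random-guessing point above, the sketch below computes the expected accuracy of a uniform random guesser. It assumes every question offers the full set of answer choices, and the `random_guess_accuracy` helper is hypothetical.

```python
def random_guess_accuracy(num_choices: int) -> float:
    """Expected accuracy when picking uniformly among num_choices options."""
    if num_choices < 1:
        raise ValueError("need at least one answer choice")
    return 1.0 / num_choices


mmlu_floor = random_guess_accuracy(4)       # four choices per question
mmlu_pro_floor = random_guess_accuracy(10)  # ten choices per question
print(f"MMLU floor: {mmlu_floor:.0%}  MMLU-Pro floor: {mmlu_pro_floor:.0%}")
```

The lower floor leaves more of the score range to reflect actual ability, which helps when comparing quants whose scores differ by only a few points.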

MT-Bench

MT-Bench is a benchmark designed to evaluate the multi-turn conversational abilities and instruction-following skills of large language models (LLMs). Unlike traditional benchmarks that focus on closed-ended tasks (e.g., multiple-choice questions), MT-Bench emphasizes open-ended, real-world interactions to measure how well models handle complex, dynamic dialogues. Based on a detailed analysis of real multi-turn dialogue data, its authors construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks.

MixEval

MixEval is a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures, which evaluates LLMs with a highly capable model ranking (i.e., 0.96 correlation with Chatbot Arena) while running locally and quickly (6% the time and cost of running MMLU), with its queries being stably and effortlessly updated every month to avoid contamination.

References

@leonbeckert

@ubergarm
thank you for the response!

Yes the raw benchmark data would be interesting, the other data is already very insightful but I was hoping to be able to gain some additional insights by interpreting them with some custom written scripts.

Yes I wanted to try your version out initially, but I found the ikawrakow KS quants to perform really well.

@abdurrahmanregi

Thank you for the benchmark. Do you also happen to test when the think mode was used? I am particularly interested to know Qwen 3 30B MoE IQ2_XS or IQ2_M for AIME '24 and '25, and also for GPQA Diamond.

@ubergarm
Author

ubergarm commented Jul 5, 2025

@abdurrahmanregi

Thank you for the benchmark. Do you also happen to test when the think mode was used? I am particularly interested to know Qwen 3 30B MoE IQ2_XS or IQ2_M for AIME '24 and '25, and also for GPQA Diamond.

No, thinking mode adds a lot of tokens and takes more time so it was not tested in this benchmark. If you look at the Qwen3 paper and official benchmarks, it suggests that thinking can help a lot for logic/math/coding type questions. iirc Qwen3-14B dense with thinking enabled scored better than Qwen3-32B dense without thinking.

Also note that Qwen3-30B-A3B is quite performant on ik_llama.cpp with CPU only inference so if you have a decent speed DDR5 RAM system you could run a higher sized quant reasonably fast if you prefer quality output at the cost of speed.
