Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggerganov/llama.cpp#5962
In the meantime, use the largest quant that fits fully in your GPU's VRAM. If you can comfortably fit Q4_K_S, try a model with more parameters instead.
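
A rough way to check "fits" before downloading anything is a minimal sketch like the one below. It assumes the file size is simply parameters times bits per weight, uses a handful of bits-per-weight figures from the table in this post, and treats the overhead allowance (KV cache, compute buffers) as a placeholder you should tune. The function names are made up for illustration and are not part of llama.cpp.

```python
GIB = 1024 ** 3

# Bits per weight, copied from the table in this post (subset).
BPW = {
    "IQ2_XS": 2.43,
    "IQ3_XS": 3.32,
    "IQ4_XS": 4.32,
    "Q4_K_S": 4.57,
    "Q4_K_M": 4.83,
    "Q5_K_M": 5.67,
    "Q6_K": 6.57,
}


def weights_gib(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bpw / 8 / GIB


def largest_fitting(n_params: float, vram_gib: float, overhead_gib: float = 1.5):
    """Return the highest-bpw quant whose weights fit in VRAM, or None.

    overhead_gib is an assumed allowance for KV cache and buffers,
    not a measured value.
    """
    budget = vram_gib - overhead_gib
    fitting = [
        (bpw, name)
        for name, bpw in BPW.items()
        if weights_gib(n_params, bpw) <= budget
    ]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    # 7.24e9 is an assumed parameter count for a 7B-class model.
    for vram in (4.0, 8.0):
        print(vram, "GiB ->", largest_fitting(7.24e9, vram))
```

If nothing fits, the function returns None, which is the cue to drop to a smaller model or accept partial offload.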
See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix
- Last updated 2024-02-27 (add IQ4_XS).
- imatrix from wiki.train, 200*512 tokens.
- KL-divergence measured on wiki.test.
Quant | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
---|---|---|---|---|---|
IQ1_S | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
IQ2_XS | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
IQ2_S | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
IQ2_M | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
Q2_K_S | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
Q2_K | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
IQ3_XS | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
Q3_K_S | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
IQ3_S | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
IQ3_M | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
Q3_K_M | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
Q3_K_L | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
IQ4_XS | 4.32 | 0.0088 | 0.1082 | 0.0606 | 0.0079 |
IQ4_NL | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
Q4_K_S | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
Q4_K_M | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
Q5_K_S | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
Q5_K_M | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
Q6_K | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |
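
For anyone wondering what the columns above mean in practice, here is a minimal sketch of how such numbers could be computed from per-position logits of the base and the quantized model. It is not the code used to produce the table. It assumes "Top tokens differ" is the fraction of positions where the most likely token changes, uses a simple nearest-rank rule for q99, and works on plain Python lists so it stays self-contained.

```python
import math
from statistics import median


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) for one token position, in nats."""
    lp = log_softmax(base_logits)
    lq = log_softmax(quant_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def compare(base, quant, targets):
    """base, quant: per-position logit lists; targets: true next-token ids."""
    kls = sorted(kl_divergence(b, q) for b, q in zip(base, quant))
    differ = sum(argmax(b) != argmax(q) for b, q in zip(base, quant))
    nll_base = -sum(log_softmax(b)[t] for b, t in zip(base, targets)) / len(targets)
    nll_quant = -sum(log_softmax(q)[t] for q, t in zip(quant, targets)) / len(targets)
    return {
        "kl_median": median(kls),
        # Nearest-rank 99th percentile (an assumed convention).
        "kl_q99": kls[min(len(kls) - 1, int(0.99 * len(kls)))],
        "top_tokens_differ": differ / len(base),
        # ln(PPL(Q)/PPL(base)) equals the difference of mean negative log-likelihoods.
        "ln_ppl_ratio": nll_quant - nll_base,
    }
```

The last entry relies on PPL being the exponential of the mean negative log-likelihood, so the log of the ratio is just the difference of the two means. That is also why it can go slightly negative, as in the Q6_K row.
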
- Last updated 2024-03-15 (bench #6083).
Quant | GiB | pp512 -ngl 99 | tg128 -ngl 99 | pp512 -ngl 0 | tg128 -ngl 0 | pp512 -ngl 0 #6083 |
---|---|---|---|---|---|---|
IQ1_S | 1.50 | 709.29 | 74.85 | 324.35 | 15.66 | 585.61 |
IQ2_XS | 2.05 | 704.52 | 58.44 | 316.10 | 15.11 | 557.68 |
IQ3_XS | 2.79 | 682.72 | 45.79 | 300.61 | 10.49 | 527.83 |
IQ4_XS | 3.64 | 712.96 | 64.17 | 292.36 | 11.06 | 495.92 |
Q4_0 | 3.83 | 870.44 | 63.42 | 310.94 | 10.44 | 554.56 |
Q5_K | 4.78 | 691.40 | 46.52 | 273.83 | 8.54 | 453.58 |
Q6_K | 5.53 | 661.98 | 47.57 | 261.16 | 7.34 | 415.22 |
Q8_0 | 7.17 | 881.95 | 39.74 | 270.70 | 5.74 | 440.44 |
f16 | 13.49 | 211.12 | 3.06 | 303.60 |
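
To get a feel for what these throughput numbers mean for an actual request, a back-of-the-envelope sketch follows. It assumes the pp512 and tg128 columns are tokens per second (prompt processing and token generation) and that throughput stays roughly constant at other prompt and output lengths, which is only approximately true. The function name is hypothetical.

```python
def estimated_seconds(n_prompt: int, n_gen: int, pp_tps: float, tg_tps: float) -> float:
    """Rough wall-clock estimate: prompt time plus generation time."""
    return n_prompt / pp_tps + n_gen / tg_tps


if __name__ == "__main__":
    # Q4_0 row from the table above.
    gpu = estimated_seconds(512, 128, pp_tps=870.44, tg_tps=63.42)  # -ngl 99
    cpu = estimated_seconds(512, 128, pp_tps=310.94, tg_tps=10.44)  # -ngl 0
    print(f"full offload: {gpu:.1f} s, CPU only: {cpu:.1f} s")
```

For a 512-token prompt and 128 generated tokens this gives about 2.6 s fully offloaded versus about 13.9 s on CPU, and almost all of the gap comes from token generation.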
Thank you! I have a question for you and advice for everyone else:
Question:
"I am partially offloading (running on CPU+GPU): use Q4_K_S"
What about the regular 2- and 3-bit K quants? I know they're slower, but if I truly have no more VRAM, do I want the FFN from these on the CPU, or fewer IQ layers? IQ is more expensive to calculate, but I don't know whether the hidden state getting squeezed through the PCIe tubes is any smaller. It could depend on where the bottleneck is.
Is IQ4_NL possibly faster and better? I thought it was supposed to be like Q4_0, which definitely makes CPUs happy?
Nuance:
Fitting as much as possible on the GPU involves:
but much less does it mean (i.e you don't need to:
Without getting technical (because this got changed: there used to be 3 of these layers, IIRC? Is one of them lm_head? Does it get bigger if you --leave-output-tensor?), keeping the KV cache and the Last Layer in RAM together doesn't add much slowdown compared to keeping just one or the other there**. (Seriously, subtract 1 from the total number of layers you see when you do a full offload, or however many non-repeating non-linear layers there are.) Both are very large and relatively light on CPU calculations.
This matters more with bigger models (where there are more layers) and with deeper quantization (where the layers are smaller in terms of memory usage), because these other two* become bigger and bigger contributors.
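
To put numbers on that, here is a minimal sketch of the usual KV cache size arithmetic. The model shape (32 layers, 8 KV heads, head dimension 128, f16 cache) is an assumption for a Mistral-7B-like model, and the weight sizes are the GiB figures for IQ4_XS and IQ2_XS from the benchmark table above. The function names are hypothetical.

```python
GIB = 1024 ** 3


def kv_cache_gib(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size in GiB.

    K and V each hold n_kv_heads * head_dim values per token per layer;
    bytes_per_elem=2 assumes an f16 cache.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / GIB


def kv_share(weights_gib, kv_gib):
    """Fraction of (weights + KV cache) memory taken by the KV cache."""
    return kv_gib / (weights_gib + kv_gib)


if __name__ == "__main__":
    kv = kv_cache_gib(n_layers=32, n_ctx=4096, n_kv_heads=8, head_dim=128)
    for name, size in (("IQ4_XS", 3.64), ("IQ2_XS", 2.05)):
        print(f"{name}: KV cache is {kv_share(size, kv):.0%} of the total")
```

The cache does not shrink when the weights are quantized harder, so its share of the total grows as the quant gets smaller. That matches the point above about it becoming a bigger contributor.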
None of this would matter at all if llama.cpp would grow some DeepSpeed-style architectural Grit and start shuffling the actual parameters out at inference time. Especially when the models have so many layers and they're each so tiny: is moving a few 50MB IQ2 layers from a 120B model up and down the PCIe bus once per token really too slow to countenance? Show us your war face and 842 them as well. yes YES. ROOFLINE IT BROTHER. HIT THAT ARITHMETIC INTENSITY LEVER AND JUST BROTLI-G RAW BF16 OUT OF NVME UNTIL 90% OF THE GPU IS DOING DECOMPRESSION YOU FILT-
aight peace
*For Mixtral, none of this is any more true than it is for Mistral. You won't save gigabytes here if you wouldn't with Mistral. That goes for both the KV cache and the Last Layer.
**I don't have a memory of explicitly testing this, but I remember quickly learning it just by fiddling while trying to fit a miqu on a 3090.