Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggml-org/llama.cpp#5962
In the meantime, use the largest quantization that fully fits in your GPU. If you can comfortably fit Q4_K_S, try using a model with more parameters instead.
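As a rough rule of thumb (a back-of-the-envelope estimate, not taken from the tables below): file size ≈ parameter count × bits per weight / 8, so a 7.24B-parameter model at Q4_K_M's ~4.83 bpw comes to roughly 7.24e9 × 4.83 / 8 ≈ 4.4 GB, and you still need headroom on top of that for the KV cache and compute buffers.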
See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix
- Last updated 2024-02-27 (add IQ4_XS).
- imatrix from wiki.train, 200*512 tokens.
- KL-divergence measured on wiki.test (a reproduction sketch follows the table).
Quantization | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base))
---|---|---|---|---|---
IQ1_S | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
IQ2_XS | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
IQ2_S | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
IQ2_M | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
Q2_K_S | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
Q2_K | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
IQ3_XS | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
Q3_K_S | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
IQ3_S | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
IQ3_M | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
Q3_K_M | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
Q3_K_L | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
IQ4_XS | 4.32 | 0.0088 | 0.1082 | 0.0606 | 0.0079 |
IQ4_NL | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
Q4_K_S | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
Q4_K_M | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
Q5_K_S | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
Q5_K_M | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
Q6_K | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |
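For reference, here is roughly how numbers like these are produced with the llama.cpp tools. A minimal sketch; the file names are placeholders, and the exact binary and flag names may differ between llama.cpp versions:

```sh
# build an importance matrix from wiki.train (about 200 chunks of 512 tokens)
./imatrix -m model-f16.gguf -f wiki.train.raw -o imatrix.dat --chunks 200

# quantize using that imatrix
./quantize --imatrix imatrix.dat model-f16.gguf model-IQ3_XXS.gguf IQ3_XXS

# save the base-model logits on wiki.test, then score the quantized model against them
./perplexity -m model-f16.gguf -f wiki.test.raw --kl-divergence-base logits-f16.bin
./perplexity -m model-IQ3_XXS.gguf -f wiki.test.raw --kl-divergence-base logits-f16.bin --kl-divergence
```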
- Last updated 2024-03-15 (bench #6083). pp512 = prompt processing of 512 tokens, tg128 = generation of 128 tokens, both in tokens/s; -ngl = number of layers offloaded to the GPU.
Quantization | GiB | pp512 -ngl 99 | tg128 -ngl 99 | pp512 -ngl 0 | tg128 -ngl 0 | pp512 -ngl 0 #6083
---|---|---|---|---|---|---
IQ1_S | 1.50 | 709.29 | 74.85 | 324.35 | 15.66 | 585.61 |
IQ2_XS | 2.05 | 704.52 | 58.44 | 316.10 | 15.11 | 557.68 |
IQ3_XS | 2.79 | 682.72 | 45.79 | 300.61 | 10.49 | 527.83 |
IQ4_XS | 3.64 | 712.96 | 64.17 | 292.36 | 11.06 | 495.92 |
Q4_0 | 3.83 | 870.44 | 63.42 | 310.94 | 10.44 | 554.56 |
Q5_K | 4.78 | 691.40 | 46.52 | 273.83 | 8.54 | 453.58 |
Q6_K | 5.53 | 661.98 | 47.57 | 261.16 | 7.34 | 415.22 |
Q8_0 | 7.17 | 881.95 | 39.74 | 270.70 | 5.74 | 440.44 |
f16 | 13.49 | | | 211.12 | 3.06 | 303.60 |
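These rows are llama-bench runs; a minimal sketch of the invocation (the model path is a placeholder, and pp512/tg128 correspond to -p 512 and -n 128):

```sh
# fully offloaded to the GPU
./llama-bench -m model-Q4_0.gguf -p 512 -n 128 -ngl 99
# CPU only
./llama-bench -m model-Q4_0.gguf -p 512 -n 128 -ngl 0
```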
You can import a GGUF model into ollama by creating a Modelfile that references the GGUF file. For example, suppose you have downloaded
https://huggingface.co/aifeifei799/Llama-3.1-8B-Instruct-Fei-v1-Uncensored and want to import it into ollama.
Assuming you have already produced the quantized GGUF (for example with llama.cpp):
File: Modelfile.Llama-3.1-8B-Instruct-Fei-v1-Uncensored.Q4_K_M.txt
```
# Modelfile Llama-3.1-8B-Instruct-Fei-v1-Uncensored.Q4_K_M
FROM /home/spanky/projects/models/llama/Llama-3.1-8B-Instruct-Fei-v1-Uncensored.Q4_K_M.gguf
...
```
Run the create command, which pulls in the model and repackages it for ollama; for example, using the Modelfile above:
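```sh
ollama create Llama-3.1-8B-Instruct-Fei-v1-Uncensored.Q4_K_M -f Modelfile.Llama-3.1-8B-Instruct-Fei-v1-Uncensored.Q4_K_M.txt
```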
Now you can use the model in ollama by specifying 'Llama-3.1-8B-Instruct-Fei-v1-Uncensored.Q4_K_M' as the model.
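For example:

```sh
ollama run Llama-3.1-8B-Instruct-Fei-v1-Uncensored.Q4_K_M
```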
You can also reuse this model in further variations with a custom template, system prompt, and parameters without much additional disk overhead. Create additional Modelfiles that reference your imported model instead of the gguf file on disk, and use a simpler naming scheme when you create them.
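A minimal sketch of such a derived Modelfile (the system prompt and parameter values here are placeholders):

```
# Modelfile.moo.txt: builds on the imported model instead of the gguf on disk
FROM Llama-3.1-8B-Instruct-Fei-v1-Uncensored.Q4_K_M
SYSTEM """You are a terse, helpful assistant."""
PARAMETER temperature 0.7
```

```sh
ollama create moo -f Modelfile.moo.txt
```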
Now you can use it in ollama by referencing 'moo:latest' as your model.