Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggerganov/llama.cpp#5962
In the meantime, use the largest quant that fully fits in your GPU. If you can comfortably fit Q4_K_S, consider a model with more parameters instead.
See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix
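To gauge what "fits in your GPU" means in practice, here is a minimal sizing sketch based on the bits-per-weight column below. It assumes a ~7B-parameter model; the 7.24e9 figure is back-calculated from the GiB and bits-per-weight columns in the benchmark table and ignores metadata overhead, so treat it as an estimate only.

```python
# Rough GGUF size estimate: params * bits-per-weight / 8 bytes, converted to GiB.
# Assumption: ~7.24e9 parameters (inferred from the GiB / bpw columns below);
# real files also carry some metadata overhead.

def gguf_size_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 2**30

n_params = 7.24e9
for name, bpw in [("IQ4_XS", 4.32), ("Q4_K_S", 4.57), ("Q6_K", 6.57)]:
    print(f"{name}: ~{gguf_size_gib(n_params, bpw):.2f} GiB")
# IQ4_XS: ~3.64 GiB -- matches the benchmark table, so a 4 GiB card is already tight.
```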
- Last updated 2024-02-27 (add IQ4_XS).
- imatrix from wiki.train, 200*512 tokens.
- KL-divergence measured on wiki.test; a sketch of how these columns can be computed follows the table below.
Quant | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
---|---|---|---|---|---|
IQ1_S | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
IQ2_XS | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
IQ2_S | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
IQ2_M | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
Q2_K_S | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
Q2_K | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
IQ3_XS | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
Q3_K_S | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
IQ3_S | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
IQ3_M | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
Q3_K_M | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
Q3_K_L | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
IQ4_XS | 4.32 | 0.0088 | 0.1082 | 0.0606 | 0.0079 |
IQ4_NL | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
Q4_K_S | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
Q4_K_M | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
Q5_K_S | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
Q5_K_M | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
Q6_K | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |
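For reference, here is a minimal sketch of how the columns above can be computed from per-token log-probabilities of the base (f16) and quantized models over the same evaluation text. This is illustrative only, not the actual measurement pipeline; the array names are hypothetical, and "Top tokens differ" is read here as the fraction of positions where the argmax token changes.

```python
# Illustrative sketch (not the actual measurement pipeline): given per-token
# log-probabilities from the base (f16) model and a quantized model over the
# same evaluation text, compute the statistics reported in the table above.
# `logp_base` and `logp_quant` are hypothetical arrays of shape
# (n_tokens, vocab_size), already log-softmaxed; `token_ids` holds the actual
# next token at each position.
import numpy as np

def quant_quality_stats(logp_base: np.ndarray, logp_quant: np.ndarray,
                        token_ids: np.ndarray) -> dict:
    p_base = np.exp(logp_base)
    # Per-token KL(base || quant), summed over the vocabulary.
    kl = np.sum(p_base * (logp_base - logp_quant), axis=-1)
    # Fraction of positions where the single most likely token changes.
    top_differs = np.mean(logp_base.argmax(-1) != logp_quant.argmax(-1))
    # ln(PPL(Q)/PPL(base)) = mean difference of negative log-likelihoods
    # of the actual next tokens.
    idx = np.arange(len(token_ids))
    nll_base = -logp_base[idx, token_ids]
    nll_quant = -logp_quant[idx, token_ids]
    return {
        "kl_median": float(np.median(kl)),
        "kl_q99": float(np.quantile(kl, 0.99)),
        "top_tokens_differ": float(top_differs),
        "ln_ppl_ratio": float(np.mean(nll_quant - nll_base)),
    }
```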
- Last updated 2024-03-15 (bench #6083).
Quant | GiB | pp512 -ngl 99 (t/s) | tg128 -ngl 99 (t/s) | pp512 -ngl 0 (t/s) | tg128 -ngl 0 (t/s) | pp512 -ngl 0 #6083 (t/s) |
---|---|---|---|---|---|---|
IQ1_S | 1.50 | 709.29 | 74.85 | 324.35 | 15.66 | 585.61 |
IQ2_XS | 2.05 | 704.52 | 58.44 | 316.10 | 15.11 | 557.68 |
IQ3_XS | 2.79 | 682.72 | 45.79 | 300.61 | 10.49 | 527.83 |
IQ4_XS | 3.64 | 712.96 | 64.17 | 292.36 | 11.06 | 495.92 |
Q4_0 | 3.83 | 870.44 | 63.42 | 310.94 | 10.44 | 554.56 |
Q5_K | 4.78 | 691.40 | 46.52 | 273.83 | 8.54 | 453.58 |
Q6_K | 5.53 | 661.98 | 47.57 | 261.16 | 7.34 | 415.22 |
Q8_0 | 7.17 | 881.95 | 39.74 | 270.70 | 5.74 | 440.44 |
f16 | 13.49 | | | 211.12 | 3.06 | 303.60 |
Any chance of seeing KL-divergence stats for larger models, in the 30B-to-Mixtral range?