@ubergarm
Last active May 4, 2025 16:13
Visualize importance score statistics for three Qwen3-30B-A3B llama-imatrix files.
  1. Used @EAddario's PR ggml-org/llama.cpp#12718 to generate the imatrix statistics.
  2. These were the imatrix data files used; they appear in each mosaic top to bottom in this order: bartowski, ubergarm, unsloth.
  3. Similar to https://huggingface.co/ikawrakow/Qwen3-30B-A3B, but I didn't use the 128k unsloth one and I didn't have ik's to run.

See the attached images below, generated using some python/matplotlib/imagemagick scripts vibe coded using ubergarm/Qwen3-30B-A3B-mix-IQ3_K. You can click them to load them larger; they are not too big at 100 dpi. You may need to shift-reload the page before clicking on them, as I possibly attached them while this gist was still being edited in private mode before making it public.
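(The actual plotting scripts aren't attached here; the following is a minimal illustrative sketch of just the plotting step, assuming the per-tensor statistics were first dumped to CSV files. The file names and column layout below are made up for illustration, not the real script.)

#!/usr/bin/env python3
# Rough sketch of the plotting step only. Assumes per-tensor importance
# statistics were dumped to CSV files named <source>_imatrix_stats.csv
# with columns: tensor,layer,mean_importance (hypothetical layout).
import matplotlib.pyplot as plt
import pandas as pd

SOURCES = ["bartowski", "ubergarm", "unsloth"]  # top-to-bottom order in each mosaic
TENSOR = "ffn_down_exps"                        # one mosaic image per tensor type

fig, axes = plt.subplots(len(SOURCES), 1, figsize=(12, 9), sharex=True, dpi=100)
for ax, source in zip(axes, SOURCES):
    df = pd.read_csv(f"{source}_imatrix_stats.csv")
    sub = df[df["tensor"].str.contains(TENSOR)].sort_values("layer")
    ax.bar(sub["layer"], sub["mean_importance"])
    ax.set_title(f"{TENSOR} ({source})")
    ax.set_ylabel("mean importance")
axes[-1].set_xlabel("layer")
fig.tight_layout()
fig.savefig(f"{TENSOR}_mosaic.png")

A separate imagemagick montage step (as mentioned above) can also stitch individually rendered per-source images into one mosaic instead of using subplots.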

Attached mosaics, one per tensor type:

  • attn_q_mosaic
  • attn_k_mosaic
  • attn_v_mosaic
  • attn_output_mosaic
  • ffn_gate_inp_mosaic
  • ffn_down_exps_mosaic
  • ffn_gate_exps_mosaic
  • ffn_up_exps_mosaic
  • output_mosaic (only ubergarm had the non-repeating output tensor, probably because I used ik's fork to make the imatrix; I arbitrarily mapped it to layer "99", and the graph x-axis shows decimals there, but ignore that.)

@bartowski1182

bartowski1182 commented May 4, 2025

The ffn gate and up experts for mine (assuming mine is on the bottom) and unsloth's are very strange in relation to each other :o

@ubergarm

ubergarm commented May 4, 2025

I updated the gist to list the imatrix sources in order top to bottom: bartowski, ubergarm, unsloth. If you click to enlarge an image, they are labeled in the individual subtitles.

Though interestingly, Dan (unsloth) seems to still be making more/new imatrix files for the Qwen3-235B/30B MoE models using longer context lengths than the default -c 512 that I'm using.

Going by this recent note here on the updated unsloth imatrix methodology, and given they have access to a 640GB VRAM machine, that is enough to calculate the imatrix on the 438G Qwen3-235B-A22B-BF16.

  1. In regards to PPL and KLD - yes KLD is better - but using our imatrix for these numbers is not correct - I used the chat template of the model itself and run imatrix on approx 6K to 12K context lengths, whilst I think the norm is to use 512 context length - comparing our imatrix is now not apples to apples anymore.

So presumably the command JohannesGaessler is asking for here might be something like:

# -f: text containing the chat template of the model
# --ctx-size: default is 512; the note above suggests unsloth is using 6144 - 12288
# --batch-size / --ubatch-size: probably something bigger than these defaults in an attempt to speed up on 8xH100s?
./build/bin/llama-imatrix \
    -m Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-BF16-00001-of-00011.gguf \
    -f unsloth_calibration_Qwen3-235B-A22B.txt \
    -o Qwen3-235B-A22B-GGUF/imatrix_unsloth.dat \
    --ctx-size 12288 \
    --batch-size 2048 \
    --ubatch-size 512 \
    -ngl 99 \
    --threads 1

tbh I'm not sure how changing context from the default of 512 to say 8k or 12k will affect PPL, KLD, benchmarks, and actual daily use for folks.
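(Aside, in case it helps frame the comparison: the KLD number here is the mean KL divergence between the baseline and quantized models' next-token distributions, averaged over token positions; llama.cpp's perplexity tooling computes it from saved baseline logits. A tiny toy sketch of the quantity itself, with random stand-in logits, not the actual tooling:)

# Toy sketch of the KLD metric: mean KL(P_base || P_quant) over token positions.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(base_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """base_logits, quant_logits: shape (n_tokens, n_vocab)."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    kld_per_token = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(kld_per_token.mean())

# random stand-in logits just to show the call shape
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 32))
quant = base + rng.normal(scale=0.05, size=base.shape)  # pretend quantization noise
print(f"mean KLD: {mean_kld(base, quant):.6f}")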

@bartowski1182

Ah, the length could also be related, yes. I had experimented a little with using different context lengths, but it's tricky because of the way the entries are stored; compilade may have a better explanation, but it can't just be done 1:1.

@ubergarm

ubergarm commented May 4, 2025

Cross-referencing my comment that might shed some more light on potential effects: ggml-org/llama.cpp#13199 (comment)

fwiw, my methodology:

  1. llama-sweep-bench stuff is all in logs here
  2. My imatrix command is as follows, running on the ik_llama.cpp fork. Note I currently don't have enough VRAM+RAM to use bf16 as the base for the 235B, but I do use the bf16 for the 30B.
./build/bin/llama-imatrix \
    --verbosity 1 \
    --layer-similarity \
    -m /mnt/raid/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-Q8_0.gguf \
    -f calibration_data_v5_rc.txt \
    -o /mnt/raid/models/ubergarm/Qwen3-235B-A22B-GGUF/imatrix-Qwen3-235B-A22B.dat \
    --ctx-size 512 \
    -ngl 34 \
    --threads 24
