@kalomaze
Last active September 26, 2025 17:02
LLM Samplers Explained

Every time a large language model makes a prediction, each of the thousands of tokens in its vocabulary is assigned some probability, from almost 0% to almost 100%. There are different ways to choose from those predictions. This process is known as "sampling", and I will cover the various strategies for it here.

OpenAI Samplers

Temperature

  • Temperature is a way to control the overall confidence of the model's scores (the logits): the logits are divided by the temperature before being converted to probabilities. If you use a value lower than 1.0, the relative distance between the tokens becomes larger (more deterministic), and if you use a value higher than 1.0, the relative distance between the tokens becomes smaller (less deterministic).
  • 1.0 Temperature is the original distribution that the model was trained to optimize for, since the scores remain the same.
  • Graph demonstration with voiceover: https://files.catbox.moe/6ht56x.mp4
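
A minimal sketch of temperature scaling, assuming the usual formulation in which the logits are divided by the temperature before softmax. The function names and the example logits are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Divide every logit by the temperature (assumed formulation)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [x / temperature for x in logits]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens

cold = softmax(apply_temperature(logits, 0.5))  # gaps widen, top token dominates more
base = softmax(apply_temperature(logits, 1.0))  # the unmodified distribution
hot = softmax(apply_temperature(logits, 2.0))   # gaps shrink, distribution flattens
```

Comparing `cold[0]`, `base[0]`, and `hot[0]` shows the top token's share shrinking as the temperature rises.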

Top K

  • Top K sets a hard limit on how many tokens can be selected. So top_k = 5 would mean you only allow the model to pick from the top 5 candidates and nothing else. This is considered a very "naive" and simplistic way to truncate choices.
  • Graph demonstration with voiceover: https://files.catbox.moe/g1qfgz.mp4
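
Top K can be sketched in a few lines. This version works on probabilities and renormalizes the survivors; real implementations often mask logits instead, and the names and numbers here are illustrative only.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k largest probabilities.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.4, 0.3, 0.15, 0.1, 0.05]  # hypothetical distribution
filtered = top_k_filter(probs, 2)    # only the top two tokens survive
```

Note that the cutoff ignores the shape of the distribution: with `k = 2` the third token is dropped whether it had 15% or 29%.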

Top P

  • Top P adds up the topmost tokens until hitting a target percentage. So for example, if you have 25%, 25%, 12.5%, 12.5%, and top_p = 0.50, it will only consider the top two candidates.
  • In my opinion, this is too linear and doesn't work very well compared to other techniques that have succeeded it. For example, 0.24 Top P would have led to only the top token being considered there, despite the fact 25% and 25% are equally probable events.
  • Graph demonstration with voiceover: https://files.catbox.moe/9f0zd7.mp4
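
The example above can be reproduced with a small sketch. It assumes the cutoff is reached once the running total is at least `top_p` (implementations differ on this boundary), and it pads the distribution with two extra 12.5% tokens so that it sums to 100%; both choices are assumptions for illustration.

```python
def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their total reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set()
    cumulative = 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # assumed boundary rule
            break
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

# 25%, 25%, 12.5%, 12.5%, plus two padding tokens so the total is 100%.
probs = [0.25, 0.25, 0.125, 0.125, 0.125, 0.125]

half = top_p_filter(probs, 0.50)    # keeps the two 25% tokens
narrow = top_p_filter(probs, 0.24)  # keeps only the first 25% token
```

The second call shows the tie problem: one of two equally likely tokens is discarded purely because of where the cutoff falls.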

Alternative Samplers

Min P

  • Min P is a sampler I designed that operates differently from Top K or Top P. We set a minimum percentage requirement to consider tokens relative to the largest token probability.
  • This means that, for example, Min P = 0.1 is equivalent to only considering tokens that have at least 1/10th of the top token's probability.
  • Graph demonstration with voiceover: https://files.catbox.moe/wx12cl.mp4
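
A minimal sketch of the rule described above: the threshold is `min_p` multiplied by the largest probability. The function name and example numbers are illustrative, and renormalizing the survivors is an assumption about what happens after truncation.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p * (top probability)."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.5, 0.3, 0.1, 0.06, 0.04]  # hypothetical distribution
filtered = min_p_filter(probs, 0.1)  # threshold is 0.1 * 0.5 = 0.05
```

Because the threshold scales with the top token, a confident distribution is truncated aggressively while a flat one keeps many candidates.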

Smooth Sampling / Quadratic Sampling

  • This sampling method differs from the truncation samplers (Top K, Top P, Min P) in that it does something fundamentally different: it reshapes the raw token scores instead of cutting tokens off.
  • We are tweaking the logits using a quadratic transformation, based on each token score's distance from the top token (the transformation centers on the top logit.) The coefficient is decided by the "smoothing factor" value.
  • This is hard to explain without looking at the visualization, but the idea is that we make the topmost tokens more evenly probable while reducing the probability of extremely unlikely tokens.
  • Higher values will be more deterministic, but it doesn't work quite like lower temperature would, as the scores of extremely closely competing top tokens will barely change. So if the original probabilities were 50/50 on the top two tokens, they will likely remain that way with higher smoothing factor values.
  • The idea is that this can be used as an "all in one" sampler by itself, or in tandem with other methods if desired.
  • Graph demonstration with voiceover: https://files.catbox.moe/x804ia.mp4
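
The exact formula is not spelled out above, so the following is only a sketch under an assumed form that matches the description: each logit becomes `top - smoothing_factor * (logit - top)^2`, a quadratic centered on the top logit with the smoothing factor as its coefficient. Names and numbers are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smooth_logits(logits, smoothing_factor):
    """Assumed quadratic transform centered on the top logit."""
    top = max(logits)
    return [top - smoothing_factor * (x - top) ** 2 for x in logits]

logits = [3.0, 2.5, -2.0]  # hypothetical: close runner-up, one outlier
before = softmax(logits)
after = softmax(smooth_logits(logits, 1.0))

# Two tied top tokens stay tied, even with a large smoothing factor.
tied = softmax(smooth_logits([3.0, 3.0, 0.0], 5.0))
```

Under this form the close runner-up moves nearer to the top token (a gap of 0.5 becomes 0.25) while the distant token is pushed much further away, which is the behaviour the bullets describe.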

Repetition Penalty

  • This is a technique that applies a small penalty (a negative bias) to every token that has appeared so far in the context, in order to discourage repetition. A value of 1.0 leaves the scores unchanged, while 1.25 is considered somewhat extreme.
  • I try to avoid this as much as possible or only use it in small amounts if necessary because of the adverse consequences that it tends to introduce.
  • This does not apply a bias based on how many times a word has occurred, as doing that would constantly break basic English grammar rules. You can think of it more as a "History Penalty" or a "Context Penalty".
  • It cannot be easily visualized because it does not stand alone; it depends on prior context to work.
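
No formula is given above, so this sketch assumes one widely used multiplicative formulation: a positive logit is divided by the penalty and a negative logit is multiplied by it, so either way the token becomes less likely. Each seen token is penalized once, regardless of how often it appeared. The function name and token ids are hypothetical.

```python
def apply_repetition_penalty(logits, context_token_ids, penalty):
    """Penalize every token id that appears in the context, once each."""
    out = list(logits)
    for token_id in set(context_token_ids):  # set: occurrence count is ignored
        if out[token_id] > 0:
            out[token_id] = out[token_id] / penalty
        else:
            out[token_id] = out[token_id] * penalty
    return out

logits = [2.0, -1.0, 0.5, 3.0]  # hypothetical scores for token ids 0..3
context = [0, 1, 1, 0]          # tokens 0 and 1 already appeared (twice each)

penalized = apply_repetition_penalty(logits, context, 1.25)
unchanged = apply_repetition_penalty(logits, context, 1.0)
```

With a penalty of 1.0 the scores come back untouched, and tokens 2 and 3 are never affected because they are absent from the context.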

Sampler Orders

  • The order in which samplers are applied matters and can meaningfully change the output.
  • For example, if Temperature comes first in the order before Min P, then your Temperature value would change the output probabilities that Min P judges, and it will truncate differently.
  • If Min P comes before Temperature, then the original probabilities are measured first, which means Temperature will only affect the tokens you decided to keep using Min P.
  • For this reason, text-generation-webui & koboldcpp both have customizable sampler orders that you can control. There is no objectively "correct" order, but Temperature last will keep Min P's measurements consistent no matter the Temperature if that is desired.
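
The effect of ordering can be sketched with Temperature and Min P applied in both orders to the same logits. This is an illustration with made-up numbers and helper names, not the code of text-generation-webui or koboldcpp; masked tokens are represented by negative infinity.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    return [x / temperature for x in logits]

def min_p_mask(logits, min_p):
    """Mask tokens below min_p * (top probability) with -inf."""
    probs = softmax(logits)
    threshold = min_p * max(probs)
    return [x if p >= threshold else -math.inf for x, p in zip(logits, probs)]

logits = [4.0, 3.0, 2.0, 0.5, 0.0]  # hypothetical scores

# Temperature first: Min P judges the flattened distribution.
temp_first = softmax(min_p_mask(apply_temperature(logits, 2.0), 0.1))
# Min P first: truncation uses the original distribution.
min_p_first = softmax(apply_temperature(min_p_mask(logits, 0.1), 2.0))

kept_temp_first = sum(1 for p in temp_first if p > 0)
kept_min_p_first = sum(1 for p in min_p_first if p > 0)

# Final step: draw one token at random, weighted by the surviving probabilities.
rng = random.Random(0)
choice = rng.choices(range(len(min_p_first)), weights=min_p_first, k=1)[0]
```

Here the high temperature lets all five tokens pass Min P when it runs first, while Min P first keeps only three, so the same settings produce different candidate pools.
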
@morbidCode

I am blind and use screen reading software. The voiceovers in the graphs are invaluable. Thanks!

@TheLonelyDevil9

Can you suggest the best sampler order assuming min_p and XTC are not in use here?

[Two screenshots attached: firefox_bbrTGjxNJF, firefox_CjeqFRK5iB]

@CHNtentes

Incredible, thanks for this!

A couple of quick questions, it would be great if you expand a little:

  1. Do current SOTA models use a combination of all of these approaches?
  2. When you have a set of tokens under consideration, let's say 5 from Top K or Top P sampling, how is the next token from that set actually selected?

  1. Models themselves don't set these parameters. You can manually set them when using the API from ChatGPT or Claude, or when calling model.generate() for local models.
  2. Randomly, weighted by the probabilities of those 5 tokens (renormalized over the surviving set).

@AphidGit

AphidGit commented Feb 3, 2025

These demonstration graphs are neat.
Would it be possible to provide a link to an interactive version?

@DomHudson

@CHNtentes

Models themselves don't set these parameters. You can manually set them when using the API from ChatGPT or Claude, or when calling model.generate() for local models.

The most popular way to interface with SOTA models is via consumer-friendly UIs, which don't expose these parameters to the user. Any idea which sampling combinations they (e.g. ChatGPT) use there?
