Every time a large language model makes a prediction, every one of the thousands of tokens in its vocabulary is assigned some degree of probability, from almost 0% to almost 100%. There are different ways you can decide to choose from those predictions. This process is known as "sampling", and there are various strategies you can use, which I will cover here.
- Temperature is a way to control the overall confidence of the model's scores (the logits): the logits are divided by the temperature before the softmax. If you use a value lower than 1.0, the relative distance between the token probabilities becomes larger (more deterministic), and if you use a value larger than 1.0, the relative distance becomes smaller (less deterministic). A code sketch follows the graph link below.
- A Temperature of 1.0 gives the original distribution that the model was trained to optimize for, since the scores remain unchanged.
- Graph demonstration with voiceover: https://files.catbox.moe/6ht56x.mp4
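To make this concrete, here is a minimal Python sketch (using numpy; the helper names are mine, not from any particular library) of how Temperature rescales the logits:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    exps = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exps / exps.sum()

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    # Dividing by temperature < 1.0 widens the gaps between logits
    # (more deterministic); temperature > 1.0 narrows them (less deterministic).
    return logits / temperature

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax(apply_temperature(logits, 1.0)))  # the original trained distribution
print(softmax(apply_temperature(logits, 0.5)))  # sharper: top token dominates more
print(softmax(apply_temperature(logits, 1.5)))  # flatter: probability spreads out
```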
- Top K sets a hard limit on how many tokens can be selected. So top_k = 5 means you only allow the model to pick from the top 5 candidates and nothing else. This is considered a very "naive" and simplistic way to truncate choices (see the sketch after the graph link below).
- Graph demonstration with voiceover: https://files.catbox.moe/g1qfgz.mp4
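A sketch of Top K under the same assumptions (numpy, hypothetical helper name): every token outside the top k is masked to -inf so it receives zero probability after the softmax.

```python
import numpy as np

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    # Keep only the k highest-scoring tokens; everything else is masked
    # to -inf so it gets exactly zero probability after softmax.
    filtered = np.full_like(logits, -np.inf)
    top_indices = np.argsort(logits)[-k:]  # indices of the k largest logits
    filtered[top_indices] = logits[top_indices]
    return filtered
```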
- Top P adds up the probabilities of the topmost tokens until a target percentage is reached. So for example, if the probabilities are 25%, 25%, 12.5%, 12.5%, and so on, then top_p = 0.50 will only consider the top two candidates.
- In my opinion, this is too linear and doesn't work very well compared to other techniques that have succeeded it. For example, top_p = 0.24 would have led to only the top token being considered there, despite the fact that 25% and 25% are equally probable events. The sketch below reproduces this example.
- Graph demonstration with voiceover: https://files.catbox.moe/9f0zd7.mp4
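Here is a numpy sketch of Top P (helper name is mine) that reproduces the example above:

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    # Sort tokens from most to least probable and keep the smallest set
    # whose cumulative probability reaches p, then renormalize.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # include the token that crosses p
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

# The example from above: 25%, 25%, 12.5%, 12.5%, with the remaining
# 25% spread across two more tokens.
probs = np.array([0.25, 0.25, 0.125, 0.125, 0.15, 0.10])
print(top_p_filter(probs, 0.50))  # keeps only the two 25% tokens
print(top_p_filter(probs, 0.24))  # keeps just one of the two equally probable tokens
```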
- Min P is a sampler I designed that operates differently from Top K or Top P. It sets a minimum probability requirement that tokens must meet, measured relative to the top token's probability.
- This means that, for example, min_p = 0.1 only considers tokens that are at least 1/10th as probable as the top token (see the sketch below).
- Graph demonstration with voiceover: https://files.catbox.moe/wx12cl.mp4
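The rule fits in a few lines (numpy sketch, hypothetical helper name):

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float) -> np.ndarray:
    # Discard any token whose probability is below min_p times the
    # probability of the most likely token, then renormalize.
    threshold = min_p * probs.max()
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

probs = np.array([0.60, 0.20, 0.08, 0.05, 0.04, 0.03])
print(min_p_filter(probs, 0.1))  # threshold = 0.06, so 0.60, 0.20, and 0.08 survive
```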
- Quadratic Sampling (controlled by the "smoothing factor") differs from the truncation samplers (Top K, Top P, Min P) in that it transforms the raw token scores rather than simply cutting tokens off.
- We tweak the logits using a quadratic transformation, based on each token score's distance from the top token (the transformation is centered on the top logit). The coefficient is set by the "smoothing factor" value.
- This is hard to explain without looking at the visualization, but the idea is that we make the topmost tokens more evenly probable while reducing the probability of extremely unlikely tokens.
- Higher values are more deterministic, but it doesn't work quite like lower Temperature would: the scores of extremely closely competing top tokens will barely change. So if the original probabilities were 50/50 on the top two tokens, they will likely remain that way even with higher smoothing factor values (see the sketch after the graph link below).
- The idea is that this can be used as an "all in one" sampler by itself, or in tandem with other methods if desired.
- Graph demonstration with voiceover: https://files.catbox.moe/x804ia.mp4
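In code, the core of the transformation looks something like this (a minimal numpy sketch; the exact formula used by a given backend's smoothing factor implementation may differ slightly):

```python
import numpy as np

def smooth_logits(logits: np.ndarray, smoothing_factor: float) -> np.ndarray:
    # Quadratic transformation centered on the top logit: each logit is
    # pulled down by the square of its distance from the top, scaled by
    # the smoothing factor. The top logit itself is unchanged, near-ties
    # barely move, and distant (unlikely) tokens are pushed down hard.
    top = logits.max()
    return top - smoothing_factor * (logits - top) ** 2

logits = np.array([2.0, 2.0, 0.0, -2.0])
print(smooth_logits(logits, 0.5))  # -> [ 2.  2.  0. -6.]
# The tied top tokens stay tied (still effectively 50/50), while the
# most unlikely token falls much further behind.
```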
- Repetition Penalty is a technique that applies a small negative bias to all tokens that have appeared in the context so far, to avoid repetition. A value of 1.0 implies no change to the scores, while 1.25 is considered somewhat extreme.
- I try to avoid this as much as possible, or only use it in small amounts if necessary, because of the adverse side effects it tends to introduce.
- This does not scale the bias by how many times a word has occurred, as doing that would constantly break basic English grammar rules. You can think of it more as a "History Penalty" or a "Context Penalty".
- This cannot be easily visualized because it does not stand alone by itself and depends on prior context to work, but the sketch below shows the mechanics.
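A sketch of the common multiplicative form (popularized by the CTRL paper and used in many backends; the helper name is mine). Note that each previously seen token is penalized once, regardless of how many times it occurred:

```python
import numpy as np

def apply_repetition_penalty(logits: np.ndarray, context_ids, penalty: float = 1.1) -> np.ndarray:
    # Every token that has appeared in the context gets its logit shrunk:
    # positive logits are divided by the penalty, negative ones multiplied,
    # so the bias is always downward. penalty = 1.0 is a no-op.
    penalized = logits.copy()
    for token_id in set(context_ids):  # set(): once per token, not per occurrence
        if penalized[token_id] > 0:
            penalized[token_id] /= penalty
        else:
            penalized[token_id] *= penalty
    return penalized
```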
- The order in which samplers are applied matters and can meaningfully change the output.
- For example, if Temperature comes before Min P in the order, then your Temperature value changes the probabilities that Min P judges, so Min P will truncate differently.
- If Min P comes before Temperature, then the original probabilities are measured first, which means Temperature only affects the tokens you decided to keep with Min P.
- For this reason, text-generation-webui and koboldcpp both have customizable sampler orders that you can control. There is no objectively "correct" order, but putting Temperature last will keep Min P's measurements consistent no matter the Temperature value, if that is desired. The sketch below compares the two orderings.
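Reusing the softmax, apply_temperature, and min_p_filter helpers from the sketches above, the two orderings look like this:

```python
import numpy as np

logits = np.array([2.0, 1.5, 0.2, -1.0])

# Pipeline A: Temperature first, then Min P. Min P judges the flattened
# (T = 1.8) distribution, so token ratios shrink and more tokens clear
# the min_p threshold.
probs_a = min_p_filter(softmax(apply_temperature(logits, 1.8)), 0.1)

# Pipeline B: Min P first, then Temperature. Min P truncates based on the
# original distribution; Temperature then only redistributes probability
# among the survivors.
survivors = min_p_filter(softmax(logits), 0.1)
masked = np.where(survivors > 0, logits, -np.inf)
probs_b = softmax(apply_temperature(masked, 1.8))

print(probs_a)  # all four tokens survive here
print(probs_b)  # only three tokens survive here
```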