[Feature Request] Dynamic temperature sampling for better coherence / creativity #3483
cc @KerfuffleV2
Definitely way more practical and the use case is also clearer to me than what you were talking about before. I also had the idea to do something similar with word boundaries. I.e. if you're generating something like "that is wh" then the temperature for tokens like …
I've been working on drafting this. Here's an interesting example of the Declaration of Independence and the measured top token probability for the next sentence when I gave it half of the first paragraph on Mistral 7b: As you can see, it is not deterministic enough with a low-ish temp sampler config to rule out hallucinations in quotations; some tokens are more like 95% or 90%, rather than the 99.9% someone I talked to theorized would be the case (and that it would only fall short of 100% because it has to avoid dividing by zero). Curiously, here's a natural language prompt: Quite bizarrely, the variance is incredibly high for natural language. Some tokens are quite undecided and go as low as 13% for their top token probability; some are very obvious (89%) in comparison. Standard deviation was proposed at first here to help measure 'confidence' and scale temperature accordingly, but that was more of a hunch and not necessarily the 'best' idea on how to implement the dynamic temp. There are a multitude of ways we could measure and score how confident a model is at predicting:
I will continue to update this if I make significant progress.
Are those values after softmax? If not, comparing the absolute values between different runs might not really be meaningful. It's the logit value relative to other logits that determines which token gets picked, not really the absolute value. You didn't show the code or process you used to generate that output, so it's hard to comment.
It was called right before temp sampling and with the other samplers (top p, etc.) disabled, but that might still not have been completely accurate; you're right on that. Though I didn't change the sampling settings between those two responses...
I have a test implementation of this feature hard-coded right now in this GUI fork of llama.cpp (koboldcpp): I am calling it 'greedy dynamic temperature'. It seems to be doing decently well so far with the provided test values (min temp 0.1, max temp 1.5) when I tried its ability to continue long-form text. By that, I'm referring to completing partial passages of text that LLMs have 'memorized perfectly' (things like the Declaration of Independence as I mentioned), on a non-instruct model (just for testing). It is also doing creative / open ended text generation properly and I'm not seeing much repetition there. Will do more tests and a more 'proper' implementation of it so that this is its own option and not hardcoded. Then, if it has a good reception, I will consider a PR on the main repository here. If not, I will rethink the approach of using only the top token.
I'm a bit confused by that code. The candidates aren't sorted until you call either llama_sample_softmax or llama_sample_top_k. What I'd do is just call the softmax sampler since you want the softmax value anyway. Then you'll know the logits are sorted and the softmax probabilities are already computed. Also, from the pull:

float prob_max_token_before_temp = expf(max_l - max_l) / sum_exp;

In other words:

float prob_max_token_before_temp = expf(0) / sum_exp;

Right?
I was not intending for this to have full compatibility with the other samplers until I could confirm that it was working to some degree; at that point I was going to make sure that it worked in tandem with different samplers (I'm prototyping without a very good knowledge of the general codebase). Also, yeah, that max_l minus max_l is probably redundant.
You don't need to do that, you just need to run the softmax sampler first. In other words, just add a llama_sample_softmax call before your sampler runs. Then you'll have the logits nicely sorted and the softmax values available in candidates->data[i].p.
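To make the idea concrete for readers following along, here is a minimal, self-contained sketch of the 'greedy dynamic temperature' approach described above: softmax the logits, read the top token's probability, and scale the temperature between a floor and a ceiling. The function name, the linear mapping, and the 0.1 to 1.5 range (taken from the test values mentioned earlier) are illustrative assumptions, not the actual koboldcpp code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

float dynamic_temperature(const std::vector<float>& logits,
                          float min_temp = 0.1f, float max_temp = 1.5f) {
    // Numerically stable softmax: subtract the max logit first.
    const float max_l = *std::max_element(logits.begin(), logits.end());
    float sum_exp = 0.0f;
    for (float l : logits) sum_exp += std::exp(l - max_l);
    const float top_prob = 1.0f / sum_exp; // exp(max_l - max_l) == 1

    // Confident top token (prob near 1) -> low temperature;
    // uncertain top token (prob near 0) -> high temperature.
    return max_temp - (max_temp - min_temp) * top_prob;
}

int main() {
    std::vector<float> logits = {4.0f, 1.0f, 0.5f, 0.1f}; // toy logits
    std::printf("temperature = %.3f\n", dynamic_temperature(logits));
    // The result would then be used to divide the logits before the
    // final softmax + sampling step.
}
```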
The test build of Koboldcpp is up: I've been getting a very positive reception so far, but the actual values used probably need better calibration.
After some more research trying to figure out how to properly score distributions where there are many 'bad candidates', I discovered that the Gini coefficient is a way to directly mathematically measure inequality in a distribution. Using that as the measurement might be superior to entropy if we want to measure overall 'uncertainty' because it would weigh the disproportionately probable tokens as being more important in its scoring. So for a theoretical distribution like
It would assign a lower value to this than entropy would, because entropy cares about the sum of the lower probability values. Gini is biased towards the higher probability values in the distribution when calculating its value, which is theoretically better for this use case. Also, I switched to a power function, which seems reasonable / simpler for experimenting compared to a sigmoid. I will be updating this page on my efforts / progress implementing dynamic temp for those interested:
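As an illustration of the Gini-based scoring, here is a standalone sketch that computes the Gini coefficient of the candidate probabilities and maps it to a temperature with a power function. The temperature range, the exponent, and the exact shape of the mapping are demonstration assumptions, not the values used in the koboldcpp experiment.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Gini coefficient of a discrete distribution (probabilities sum to ~1).
// 0 = all tokens equally likely; values near (n-1)/n = mass concentrated
// in a few tokens.
float gini(std::vector<float> p) {
    std::sort(p.begin(), p.end()); // ascending
    const size_t n = p.size();
    float weighted = 0.0f, total = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        weighted += (i + 1) * p[i];
        total    += p[i];
    }
    return (2.0f * weighted) / (n * total) - float(n + 1) / n;
}

float gini_dyn_temp(const std::vector<float>& probs,
                    float min_temp = 0.1f, float max_temp = 1.5f,
                    float exponent = 2.0f) {
    // High Gini (one dominant token) -> low temperature;
    // low Gini (many comparable tokens) -> high temperature.
    const float g = gini(probs);
    return min_temp + (max_temp - min_temp) * std::pow(1.0f - g, exponent);
}

int main() {
    std::vector<float> confident = {0.9f, 0.05f, 0.03f, 0.02f};
    std::vector<float> uncertain = {0.3f, 0.25f, 0.25f, 0.2f};
    std::printf("confident: %.2f  uncertain: %.2f\n",
                gini_dyn_temp(confident), gini_dyn_temp(uncertain));
}
```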
I'm guessing the idea here is conditional determinism? Like whenever it starts a piece of a larger word, you might want to ensure it finishes that word with a higher degree of determinism / lower temperature rather than creating a pseudo-word. Also, I've posted another koboldcpp build where you can try out the Gini sampling approach (as well as Entropy sampling, and the original DynaTemp implementation, but Gini seems superior). I've gotten positive feedback so far.
I did an experiment where I turned off Top K and Top P. No other samplers beyond dynamic temperature, ranging from 0.0 temp to 2.0 temp (a linear mapping of HHI, the sum of squared probabilities, a metric used to measure how concentrated the probabilities are). All 32,000 tokens were considered for the test. Strangely enough, the generations were either totally coherent and creative, or coherent for a bit but then started repeating 'nonsense'. So I measured the HHI distributions of both. Coherent Generations:
Incoherent Texts:
It is very interesting to me to see how one nonsensical token choice can totally break the rest of the generation. I wonder if a running HHI measurement that dials back the temperature scaling whenever it shifts too far from the mean could help prevent this... (it may not be worth it compared to just using Top P / Top K, but one might hope a universal sampler could exist)
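For reference, a small sketch of the HHI-based scaling described in this experiment: HHI is the sum of squared probabilities (1/N for a uniform distribution, 1.0 when a single token holds all the mass), mapped onto a 0.0 to 2.0 temperature range. The exact mapping used in the experiment isn't shown above, so the linear form here is an assumption.

```cpp
#include <cstdio>
#include <vector>

// Herfindahl-Hirschman Index: sum of squared probabilities.
float hhi(const std::vector<float>& probs) {
    float h = 0.0f;
    for (float p : probs) h += p * p;
    return h;
}

float hhi_dyn_temp(const std::vector<float>& probs, float max_temp = 2.0f) {
    // High concentration (HHI near 1) -> temperature near 0;
    // flat distribution (HHI near 0) -> temperature near max_temp.
    return max_temp * (1.0f - hhi(probs));
}

int main() {
    std::vector<float> peaked = {0.95f, 0.03f, 0.02f};
    std::vector<float> flat   = {0.34f, 0.33f, 0.33f};
    std::printf("peaked: %.2f  flat: %.2f\n",
                hhi_dyn_temp(peaked), hhi_dyn_temp(flat));
}
```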
I'm getting very good results with my Min P sampler implementation. https://github.com/kalomaze/koboldcpp/releases/tag/minP This is pretty close to the niche 'Top A' sampling method except scaled linearly, which seems a lot more appropriate considering the probability distributions I've measured. I can say with confidence that this generally isolates the best tokens for sampling more consistently than how Top P currently samples. Let me break this down: Let's say in theory we have this distribution (assuming 1.0 temperature):
If Top P is used with a distribution like this, with a typical value such as 0.90, that means that in theory, Top P would include most of the 1% probabilities while reaching for the total sum of 90%, making a bunch of low quality choices very likely. In practice, this does happen but to a less exaggerated extent; however, when the chance of choosing tail probabilities exists on every token (and is sometimes exaggerated with temperature scaling), this eventually leads to compounding failure. Min P works differently. Let's assume my default of 0.05 (5%) value of Min P is used. For the same probability distribution:
0.05 would be scaled by the top probability expressed as a decimal, 0.25. So 0.05 x 0.25 = 0.0125 (1.25%) becomes the minimum probability a token needs in order to be considered. Math-wise, it seems to handle probabilities much better on average if our assumed goal is to cut out the tail end of the distribution in a simple and effective manner. Also, this isn't really 'evidence' as much as it is 'opinions from people I trust to give good subjective analysis', but I am hearing positive reports on my Min P test build for koboldcpp. This is on top of my console logging showing that the method is cutting out the tail end very reasonably across probability distributions, in a similar fashion to Top P: I will put up a PR with the current code asking for feedback on how to properly integrate this within llama.cpp. Dynamic Temp will probably stay as an experimental side project for now, but Min P as a sampler option seems immediately relevant and useful. Also, I would just call this "linear version of Top A" or something along those lines, but the problem is search engines do not like "Top A"... so I think a rename to something more distinct for the implementation is in order for the sake of accessibility.
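A minimal sketch of the Min P rule as described: keep only tokens whose probability is at least min_p times the top token's probability, then renormalize. This illustrates the rule itself rather than llama.cpp's actual implementation; the toy distribution mirrors the 25%-top-token example above.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

std::vector<float> min_p_filter(std::vector<float> probs, float min_p = 0.05f) {
    const float top = *std::max_element(probs.begin(), probs.end());
    const float threshold = min_p * top; // e.g. 0.05 * 0.25 = 0.0125 (1.25%)

    float kept_mass = 0.0f;
    std::vector<float> kept;
    for (float p : probs) {
        if (p >= threshold) {
            kept.push_back(p);
            kept_mass += p;
        }
    }
    for (float& p : kept) p /= kept_mass; // renormalize the survivors
    return kept;
}

int main() {
    // Truncated toy distribution: a few decent candidates plus a 1% tail.
    std::vector<float> probs = {0.25f, 0.20f, 0.15f, 0.10f,
                                0.01f, 0.01f, 0.01f, 0.01f};
    for (float p : min_p_filter(probs)) std::printf("%.3f ", p);
    std::printf("\n"); // only the 25% / 20% / 15% / 10% candidates survive
}
```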
The "cutting the tail" part sounds very similar to the tail free sampler. Have you already looked at that? (Locally typical also isn't completely different either.) |
I indeed have looked at TFS & Typical sampling. They did not get me the results I was looking for in this department, and the results of how the values impacted those samplers didn't seem very easily interpretable, making them difficult to use as hyperparameters. I think for typical sampling specifically, a big problem is that it presumes "uncertain distributions mean that the generation is becoming atypical" rather than being open ended and having many valid choices. I will admit that TFS seems quite mathematically dense and I don't fully understand it, and I didn't get good [subjective] results back when I tested different values. However, this was when the Top K clamping bug was still in koboldcpp, so I'm not sure if that being fixed might affect the calculations of the derivatives in any meaningful way. This might be bias speaking, but I think I'm a fan of Occam's razor when it comes to sampler designs; it should be somewhat intuitive what a sampler is directly accomplishing, otherwise the parameter isn't reasonably interpretable to configure for different scenarios (e.g. deterministic, creative...) and it doesn't really see adoption. Also, the rate of change seems to be a much messier metric to use on today's models compared to what existed at the time when TFS was created (GPT-2); it seems less predictable across probability distributions on modern Llama models because the rate of change could be erratic or smooth? (Not confident about this, could be very wrong, calculus is not my strong suit lol)
I just tried the exllamav2 implementation and it's very lit. I need to test if it hurts generation speed but the results are almost worth it. The writing is much more creative and really comes alive.
It should be a very fast sampler. It just does softmax + sort (most existing samplers do this also), and then the worst case is to iterate the logits once. I doubt it would even have a measurable performance impact.
The performance impact is not measurable. Sampler math tends to be extremely lightweight.
It just bans tokens that don't match the grammar, it never makes any tokens more likely.
#3841 got merged. Please test :)
@kalomaze Just in case you're interested, I'm adding your Min-P sampler to my Rust sampling crate: KerfuffleV2/llm-samplers#9 (with appropriate credit of course) I've been using it lately and it seems very useful. I used to use TFS (tail-free), top-k, and top-p in combination, but now I'm able to disable those and min-p produces pretty much equivalent results. I know you developed this independently, but BlinkDL's Top-A sampler idea is pretty similar: https://github.com/BlinkDL/RWKV-LM#the-top-a-sampling-method - it just uses a formula for the "you have to be this tall" threshold instead of a flat value. I'm not sure which approach works better.
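To make the comparison concrete: Top A's cutoff scales with the square of the top probability, while Min P's scales linearly with it. A tiny sketch of the difference; the coefficients below are placeholders, not either project's defaults.

```cpp
#include <cstdio>

// Top A: threshold shrinks quadratically as the top probability falls.
float top_a_threshold(float p_max, float a) { return a * p_max * p_max; }
// Min P: threshold shrinks linearly with the top probability.
float min_p_threshold(float p_max, float m) { return m * p_max; }

int main() {
    const float examples[] = {0.9f, 0.5f, 0.25f};
    for (float p_max : examples) {
        std::printf("p_max=%.2f  top-a cutoff: %.4f  min-p cutoff: %.4f\n",
                    p_max,
                    top_a_threshold(p_max, 0.2f),
                    min_p_threshold(p_max, 0.05f));
    }
}
```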
So just to confirm. When using min-p, we should always disable top-k / top-p?
There's no inherent conflict. Those both run before min-P though, and softmax runs again, so you may need to keep that in mind when tuning the min-P threshold.
Yup, Min P is essentially a linear Top A. Looking at actual distributions, I think it makes more sense to just linearly scale a 'floor' (required probability) based on the 'ceiling' (top token probability). It's more directly interpretable that way (e.g., 0.25 Min P is 'you must have at least 1/4th the probability of the top token'). I think it's fair to say that Min P outperforms TFS / Top P; TFS relies on the rate of change, which can be rocky on modern models, and Top P isn't considering the possibility of divided concentration of a few good choices amongst a sea of bad ones...
As we reduce the pool of tokens (such as with min-p) we also increase the likelihood that the top token will be selected. I think there's an issue with this. We only really get options when we have a large token pool, but then there's a higher likelihood that the selected token will be of much lower quality. I think ideally we want to have a small pool of relatively high probability tokens and a higher likelihood that a non-top token will be selected. Temp helps with that, but not necessarily at the same rate that changes in pool size impact it, so right now there's not a lot of control over the final token likelihood. So, I think we need to combine something like min-P with something that dynamically adjusts temp based on some combination of pool size and something like the difference between top token probability and the next token probability. Thoughts?
It sounds like you kind of want to take min-p and then normalize the top X tokens so they have about the same priority as the top one? Or at least so the probabilities are closer?
I am using kalomaze's code for that in exllama. Adjusting temperature from about .5-2.0 based on entropy. It was picking really low temps so I bumped the 0 to .1 and then .5. There is an implementation for koboldcpp but not yet for llama.cpp main. They feel like the only samplers I need, tbh.
To me that sounds like doing a softmax after sampling and then maybe some >1.0 temperature.
Kerfuffle, yes essentially. I don't think we want them to be equivalently likely, as that is likely to lead to craziness, but we want the curve flattened quite a bit. Once we've min-p'd the list, we should be relatively happy with the selection. I feel like, in reality, we never really need more than the top handful of tokens as options, but we need those tokens to be fairly possible to choose between. PhorKoz, do you find that method gives you a reasonable chance of not selecting the top token without reducing its likelihood too much? I'm not sure if the dynamic temp sampling method got moved forward at all or dropped when Kalomaze moved on to min-p.
Still works in exllamav2 and I am getting the best replies I've ever gotten. I should definitely break out the logits viewer and see what's happening under the hood to make sure it isn't placebo.
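For those curious what the entropy-scaled temperature mentioned in the last few comments might look like, here is a minimal sketch: normalize the Shannon entropy of the candidate distribution by its maximum (log N) and map it onto a temperature range. The 0.5 to 2.0 range follows the values quoted above; the linear mapping is an assumption, not the exact exllamav2 / koboldcpp code.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

float entropy_dyn_temp(const std::vector<float>& probs,
                       float min_temp = 0.5f, float max_temp = 2.0f) {
    // Shannon entropy of the candidate distribution.
    float entropy = 0.0f;
    for (float p : probs) {
        if (p > 0.0f) entropy -= p * std::log(p);
    }
    // Normalize by the maximum possible entropy, log(N), so the value is in [0, 1].
    const float max_entropy = std::log((float)probs.size());
    const float norm = max_entropy > 0.0f ? entropy / max_entropy : 0.0f;
    // Low entropy (confident) -> min_temp; high entropy (uncertain) -> max_temp.
    return min_temp + (max_temp - min_temp) * norm;
}

int main() {
    std::vector<float> confident = {0.97f, 0.01f, 0.01f, 0.01f};
    std::vector<float> spread    = {0.25f, 0.25f, 0.25f, 0.25f};
    std::printf("confident: %.2f  spread: %.2f\n",
                entropy_dyn_temp(confident), entropy_dyn_temp(spread));
}
```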
@kalomaze You have since implemented this in koboldcpp, right? Can you upstream that?
@DutchEllie what specifically? min-p is in master (and used by default):
Line 17 in 8c58330
Kalomaze introduced min-p, but he also provided a DynaTemp implementation in koboldcpp (a fork of this) recently. From what I hear it's quite good, so if it could be merged upstream here that'd be nice.
Would it be possible to add @kalomaze's Cubic Sampling with Curve Params that he put up in text-generation-webui? I'm hearing people are getting good results: oobabooga/text-generation-webui#5551
In the latest llama.cpp, how do I use the "--dynatemp-range N dynamic temperature range (default: 0.0, 0.0 = disabled)" parameter? Suppose I want a range between 0 and 1? Is the correct format --dynatemp-range 0,1 ? I'm a bit confused as to how to use this with the llama.cpp command line and with batch files.
From the source, it appears to be a single +/- value. I assume it would work like:
--temp 0.7 --dynatemp-range 0.3
That should give a range of 0.4 to 1.0.
Yes, that's exactly how it works.
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Prerequisites
Feature Idea
Typical sampling methods for large language models, such as Top P and Top K (as well as alternative sampler modes like Mirostat that decide the Top K dynamically), are based on the assumption that a static temperature value (a consistently randomized probability distribution) is the ideal sampler conditioning. Mirostat, most notably, was designed to 'learn' a certain targeted level of 'entropy' over time; this helped the model find the most grammatically coherent selection of tokens to be considered by the sampler for good results. Most of these sampling implementations weren't designed to be used together. Some, like TFS, were created when the largest available models were smaller ones like GPT-2. Those models struggled a lot more when attempting to generalize in different directions, and it makes sense to me that they'd need unique sampler tricks to keep them grammatically coherent.
I've tested and played around with these settings for Llama models, and while Mirostat seemed like a step in the right direction, especially for preventing repetition, I realized that nobody had made a sampler mode that would control temperature directly per token. My implementation of this would be calculated based on a simple metric: take the standard deviation of the probabilities of all tokens being considered by your Top P / Top K before applying the temperature randomization, and, based on the 'confidence' of the model (as represented by the variation in choice), apply a temperature adjustment proportional to the variation in probability seen in the sampled set of tokens being chosen from.
The main idea is to encourage randomizing 'uncertain' probabilities (e.g., open-ended writing, abstract concepts that can be represented with many words, and that aren't deterministic by nature) while keeping the temperature low for more deterministic tokens, without having to find the ideal selection of candidates for sampling per token (which I believe is how Mirostat was designed to work).
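A rough sketch of what this proposal could look like, assuming a simple scaling from the standard deviation of the surviving candidate probabilities to a clamped temperature (the constants are placeholders, not a proposed default):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

float stddev_dyn_temp(const std::vector<float>& probs,
                      float base_temp = 1.0f, float scale = 4.0f,
                      float min_temp = 0.1f, float max_temp = 1.5f) {
    const float n = (float)probs.size();
    float mean = 0.0f;
    for (float p : probs) mean += p;
    mean /= n;
    float var = 0.0f;
    for (float p : probs) var += (p - mean) * (p - mean);
    const float sd = std::sqrt(var / n);
    // A large standard deviation means one or a few tokens dominate (the
    // model is 'confident'), so lower the temperature; a small one means
    // many comparable candidates, so keep the temperature higher.
    return std::clamp(base_temp - scale * sd, min_temp, max_temp);
}

int main() {
    std::vector<float> confident = {0.9f, 0.05f, 0.03f, 0.02f};
    std::vector<float> open      = {0.3f, 0.25f, 0.25f, 0.2f};
    std::printf("confident: %.2f  open-ended: %.2f\n",
                stddev_dyn_temp(confident), stddev_dyn_temp(open));
}
```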
List of possible advantages could be:
List of possible disadvantages could be: