
[Feature Request] Dynamic temperature sampling for better coherence / creativity #3483

Closed
kalomaze opened this issue Oct 5, 2023 · 47 comments
Comments

@kalomaze
Contributor

kalomaze commented Oct 5, 2023

Prerequisites

  • [✅] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Idea

Typical sampling methods for large language models, such as Top P and Top K (as well as alternative sampler modes that decide the Top K dynamically, like Mirostat), are based on the assumption that a static temperature value (a consistently randomized probability distribution) is the ideal sampler conditioning. Mirostat, most notably, was designed to 'learn' a targeted level of 'entropy' over time; this helped the model settle on the most grammatically coherent selection of tokens for the sampler to consider. Most of these sampling implementations weren't designed to be used together. Some, like TFS, were created when the largest available models were smaller ones like GPT-2. Those models struggled far more to generalize in different directions, and it makes sense to me that they'd need unique sampler tricks to stay grammatically coherent.

I've tested and played around with these settings for Llama models, and while Mirostat seemed like a step in the right direction, especially for preventing repetition, I realized that nobody had made a sampler mode that controls temperature directly per token. My implementation would be based on a simple metric: take the standard deviation of all token probabilities being considered by your Top P / Top K before applying the temperature randomization, and, based on the 'confidence' of the model (as represented by the variation in choice), apply a temperature adjustment proportional to the variation of probability seen in the sampled set of candidate tokens.

The main idea is to encourage randomizing 'uncertain' probabilities (e.g., open-ended writing, abstract concepts that can be represented with many words and aren't deterministic by nature) while keeping the temperature low for more deterministic tokens, without having to find the ideal selection of candidates for sampling per token (which I believe is how Mirostat was designed to work).
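For illustration, here is a minimal sketch of that idea, assuming the candidate probabilities have already gone through softmax and any Top K / Top P truncation; the function name and the normalization against the maximum possible standard deviation are my own choices, not from any existing implementation:

#include <algorithm>
#include <cmath>
#include <vector>

// Map the spread of the candidate probabilities onto [base_temp, max_temp]:
// a high standard deviation means one token dominates (confident -> low temp),
// a low one means a flat distribution (uncertain -> high temp).
float dynamic_temperature(const std::vector<float> & probs, float base_temp, float max_temp) {
    const float n = (float) probs.size();
    float mean = 0.0f;
    for (float p : probs) mean += p;
    mean /= n;
    float var = 0.0f;
    for (float p : probs) var += (p - mean) * (p - mean);
    const float sd = std::sqrt(var / n);
    // the largest possible sd for n probabilities summing to 1 is sqrt(n - 1) / n
    const float sd_max = std::sqrt(n - 1.0f) / n;
    const float confidence = sd_max > 0.0f ? std::min(sd / sd_max, 1.0f) : 1.0f;
    return max_temp - (max_temp - base_temp) * confidence;
}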

Possible advantages:

  • Having a definable range between the 'Base Temperature' and 'Maximum Temperature' could generally improve the creative problem solving ability of the model.
  • Certain tokens matter more to the context than others. For example, if the probability is randomized too far for even one token that should be deterministic, like a specific character in programming syntax, the failure rate for the rest of the generation goes up.
  • It could help prevent the model's generations from trending toward repetition, since a much broader range of probabilities could be considered without impacting the model's intelligence as broadly (e.g. a max temperature of 1.5 might not impact the model as strongly as sampling every token at that value). If this is the case, artificially biasing against repeated tokens through the Repetition Penalty would become less necessary.

Possible disadvantages:

  • A lot of faith is being put in the idea that strong variations of possibilities correlate with a high number of acceptable / reasonable tokens. If the correlation is mild, the default range values would have to be adjusted to accommodate this, but that could be mitigated by testing different values for the base/max temp range, or by benchmarking them individually.
  • The rate at which a model becomes more certain might not be linear; there might be a very short gap between 'low deviation' and 'high deviation' on unsampled probabilities.
  • Reproducibility might be more difficult, but I'm unsure of this. I'm guessing you could just use the same seed for every temperature value variation.
@kalomaze
Contributor Author

kalomaze commented Oct 5, 2023

cc @KerfuffleV2
I think this is a more realistic sampler modification to implement compared to my last issue, do you have any opinions on this?

@Zhuyuqii

Zhuyuqii commented Oct 5, 2023

https://arxiv.org/abs/2309.02772

@KerfuffleV2
Collaborator

I think this is a more realistic sampler modification to implement compared to my last issue, do you have any opinions on this?

Definitely way more practical and the use case is also clearer to me than what you were talking about before.

I also had the idea to do something similar with word boundaries, i.e. if you're generating something like "that is wh" then the temperature for tokens like en, ere, at shouldn't necessarily be the same as for dog, since [wh]en, [wh]ere, etc. complete the word. Also, if you have "that is", a token like en or ere shouldn't necessarily have the same temperature as something like n't, what, etc. So it matters whether you're in the middle of a word, and whether the token under consideration would complete a word or start a new one.

@kalomaze
Contributor Author

kalomaze commented Oct 6, 2023

I've been working on drafting this. Here's an interesting example of the Declaration of Independence and the measured top token probability for the next sentence when I gave it half of the first paragraph on Mistral 7b:

image

As you can see, even with a low-ish temperature sampler config it is not deterministic enough to prevent hallucinated quotations from being reasonably possible; some tokens come in around 90-95% rather than the 99.9% that someone I talked to theorized would be the case (and only that instead of 100% because the softmax has to avoid dividing by zero).

Curiously, here's a natural language prompt:

image

Quite bizarrely, the variance is incredibly high for natural language. Some predictions are quite undecided, with top token probabilities as low as 13%; others are very obvious (89%) by comparison.

Standard deviation was proposed at first here to help measure 'confidence' and scale temperature accordingly, but that was more of a hunch and not necessarily the 'best' idea on how to implement the dynamic temp.

There are a multitude of ways we could measure and score how confident a model is at predicting:

  • Standard deviation as already mentioned
  • A 'greedy' and simplistic method that only uses the top token's percentage and scales it. I was thinking of an exponential curve where 100% and 90% probability would both map close to zero, but the lower you go, the more aggressively it scales the temperature value up.
  • First-order derivatives (differences between adjacent probabilities) could be calculated to determine the rate of change of the probabilities, and this could be used as a general metric for 'confidence'. For instance, [80%, 20%] would be considered less confident than [80%, 10%, 10%].

I will continue to update this if I make significant progress.

@KerfuffleV2
Collaborator

Are those values after softmax? If not, comparing the absolute values between different runs might not really be meaningful. It's the logit value relative to other logits that determines which token gets picked, not really the absolute value.

You didn't show the code or process you used to generate that output, so it's hard to comment.

@kalomaze
Contributor Author

kalomaze commented Oct 7, 2023

It was called right before temp sampling, with the other samplers (top P, etc.) disabled, but you're right that it still might not have been completely accurate. I didn't change the sampling settings between those two responses, though...

@kalomaze
Contributor Author

kalomaze commented Oct 8, 2023

I have a test implementation of this feature hard-coded right now in this GUI fork of llama.cpp (koboldcpp):
LostRuins#464

I am calling it 'greedy dynamic temperature', because I'm only taking the top token's probability and scaling the temperature value based on that with an exponential curve.
The curve is applied so that high probabilities like 90% and 100% both land close to the minimum temperature value, while the difference between, say, 40% and 50% is more pronounced (the lower the confidence, the closer you get to the max temperature value).
I've labeled this approach 'greedy' since it relies solely on the top token's probability for the adjustment. But that could be all we need...
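As a rough sketch of the shape described above (the exact curve used in the fork isn't reproduced here; the function name and the exponent parameter k are placeholders):

#include <cmath>

// Only the top token's probability drives the temperature. With k > 1,
// p_top values of 90% and 100% both land near min_temp, while mid-range
// confidence spreads out and approaches max_temp as p_top drops toward zero.
float greedy_dyn_temp(float p_top, float min_temp, float max_temp, float k) {
    return min_temp + (max_temp - min_temp) * std::pow(1.0f - p_top, k);
}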

It seems to be doing decently well so far with the provided test values (min temp 0.1, max temp 1.5) when I tried its ability to continue long-form text. By that, I'm referring to completing partial passages of text that LLMs have 'memorized perfectly' (things like the Declaration of Independence as I mentioned), on a non-instruct model (just for testing). It is also doing creative / open ended text generation properly and I'm not seeing much repetition there.

Will do more tests and a more 'proper' implementation of it so that this is its own option and not hardcoded. Then, if it has a good reception, I will consider a PR on the main repository here. If not, I will rethink the approach of using only the top token.

@KerfuffleV2
Collaborator

I'm a bit confused by that code. The candidates aren't sorted until you call either llama_sample_softmax or llama_sample_top_k. Also, there are other samplers that can change the order, so they'll only be sorted if one of those two functions was called and no other order-changing sampler was called afterwards. There's also a candidates->sorted flag which you can use to check whether they're sorted. You can't necessarily assume they'll always be sorted. For example, when using Mirostat samplers, everything except temperature gets skipped, so in that case the temperature sampler gets called first and then the Mirostat sampler, meaning the logits won't be sorted at that point.

What I'd do is just call the softmax sampler since you want the softmax value anyway. Then you'll know the logits are sorted and candidates->data[i].p will have the softmax value.

Also, from the pull:

float prob_max_token_before_temp = expf(max_l - max_l) / sum_exp;

In other words:

float prob_max_token_before_temp = expf(0) / sum_exp;

Right? max_l - max_l has to be 0 (except if it was NaN but that shouldn't be a case you run into).

@kalomaze
Contributor Author

kalomaze commented Oct 8, 2023

I wasn't intending for this to have full compatibility with the other samplers until I could confirm it was working to some degree; at that point I was going to make sure it worked in tandem with the different samplers (I'm prototyping without a very good knowledge of the general codebase).
Also, I'm currently investigating a potential issue with how it scales (in terms of the curve), so that will have to wait. But thank you for pointing it out. As soon as I ensure I'm scaling the way I initially intended, I'll try to figure out how to make sure Mirostat is disabled when this is on and that softmax has been called when measuring.

Also, yeah, that max_l - max_l is indeed redundant.

@KerfuffleV2
Collaborator

I'll try to figure out how to make sure Mirostat is disabled when this is on

You don't need to do that, you just need to run llama_sample_softmax rather than assuming the logits are already sorted when your sampler is reached.

In other words just add llama_sample_softmax(nullptr, candidates); after the line near the top when you start timing the sampler.

Then you'll have the logits nicely sorted and the softmax values available in .p
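Put together, a custom sampler following this advice might start out like the sketch below; the function name llama_sample_dynatemp and the mapping from p_top to a temperature are hypothetical, while llama_sample_softmax and candidates->data[i].p are the existing API pieces mentioned above:

// Sort the candidates and fill in softmax probabilities before deriving a
// per-token temperature from them, then scale the logits like the regular
// temperature sampler does.
static void llama_sample_dynatemp(struct llama_context * ctx, llama_token_data_array * candidates, float min_temp, float max_temp) {
    (void) ctx;
    llama_sample_softmax(nullptr, candidates); // sorts descending and fills .p
    const float p_top = candidates->data[0].p; // top token's probability
    const float temp  = min_temp + (max_temp - min_temp) * (1.0f - p_top); // placeholder mapping
    for (size_t i = 0; i < candidates->size; ++i) {
        candidates->data[i].logit /= temp;
    }
}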

@kalomaze
Contributor Author

kalomaze commented Oct 8, 2023

I'll try to figure out how to make sure Mirostat is disabled when this is on

You don't need to do that, you just need to run llama_sample_softmax rather than assuming the logits are already sorted when your sampler is reached.

In other words just add llama_sample_softmax(nullptr, candidates); after the line near the top when you start timing the sampler.

Then you'll have the logits nicely sorted and the softmax values available in .p

Thank you!

Also, this test is not very comprehensive and isn't the most accurate, especially with a GPT-4 judge, but the intention was to see if there were any discernible general trends even without a lot of data (and quickly, without having to research quote origins...)

image

However, it has useful data, even if it's a bit misleading when taken too literally (like most LLM benchmarks).
The rate at which it completely makes up quotes was much higher with non-dynamic sampling (of course, both attempts produced a bunch of well-known misleading quotes, but for the purposes of what this test was benchmarking, that wasn't strictly important).

I used a pretty uninteresting Mistral finetune that typically suffers from sampling issues for my testing, because I've noticed that with a lower temp (e.g. 0.6) it starts repeating and hallucinating. I chose 0.9 for the non-dynamic temp to avoid that here.
A 2.0 max temp with a k scaling value of 2 doesn't struggle with that nearly as much on the exact same model, from other anecdotal testing (again, not thoroughly benchmarked, just testing the waters here).

Here's how that scale looks on a graph:
image

And the formula for it:
image

@kalomaze
Contributor Author

kalomaze commented Oct 9, 2023

The latest commit takes a different approach for the formula, which is now represented as a sigmoid function.
Two presets are mapped to the temperature values '1.93' and '1.94' for now; those values will (temporarily) trigger the presets until a proper full implementation is put into place.

image

This is the basic test preset for 1.93.
1.94 uses a max of 2.0 instead, and will be more dramatic temp scaling wise.

@kalomaze
Contributor Author

kalomaze commented Oct 10, 2023

The test build of Koboldcpp is up:
https://github.com/kalomaze/koboldcpp/releases/tag/dynamic-test

I've been getting a very positive reception so far, but the actual values used probably need better calibration.

@kalomaze
Contributor Author

kalomaze commented Oct 12, 2023

image
Entropy sampling!
My math might be wrong here in some fashion (the idea of Shannon entropy is new to me, as well as C++ in general... so bear with me lol) but I noticed that this implementation is working well so far from basic testing.

The concept is essentially that total evenness in the probability distribution, i.e. 'perfect' evenness, should theoretically scale all the way up to 2.0 temperature (or whatever you set as maxTemp).
The inverse, full confidence in just the top token with next to no variation at all, would get a temperature of nearly 0.
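A minimal sketch of that mapping, assuming the candidate probabilities are already softmaxed; the straight linear mapping from normalized entropy to temperature is an assumption here, since the actual build may shape the curve differently:

#include <cmath>
#include <vector>

// Normalized Shannon entropy of the candidates: log(n) for a perfectly flat
// distribution (ratio 1.0 -> maxTemp), 0 for a one-hot distribution (ratio 0.0 -> minTemp).
float entropy_dyn_temp(const std::vector<float> & probs, float min_temp, float max_temp) {
    double h = 0.0;
    for (float p : probs) {
        if (p > 0.0f) h -= (double) p * std::log((double) p);
    }
    const double h_max = std::log((double) probs.size());
    const double ratio = h_max > 0.0 ? h / h_max : 0.0;
    return min_temp + (max_temp - min_temp) * (float) ratio;
}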

This is to avoid the pitfalls of relying on a single top token:
image

I haven't pushed it yet, but I'll double check to see if my implementation for this is better than what I came up with before or if it needs adjustments implementation wise.

@kalomaze
Contributor Author

kalomaze commented Oct 12, 2023

image
Static Top K of 40 was used, no Top P (or other samplers) were used whatsoever.

This comparison is somewhat misleading, because whether it uses the pattern Celebrity - Birthday vs. Birthday - Celebrity absolutely matters semantically in terms of how the model learned... but it still shows that just one bad prediction, allowed to happen because of a too-sensitive temperature, can eventually lead to many bad predictions / compounding failure.
Also notice how it ends abruptly when using 1.0 temp.

@kalomaze
Contributor Author

kalomaze commented Oct 13, 2023

image
The green line here represents asking it to generate Python code, with multiple stopping points where it elaborates on what the code is doing. The blue line represents an essay generation. Both were stopped at 500 tokens.
You can observe that there's overall more certainty across the predictions on average for a code generation compared to the open-ended essay generation.
image

The model I am using is not finetuned for code and is biased toward storywriting / chat, so a better comparison would be using a code llama model and comparing that to a storywriting finetune, but you can still notice an obvious trend even on the same model using different prompts. To me, this is evidence it's a solid metric for scaling temperature randomization.

@kalomaze
Contributor Author

image
This might work better for entropy sampling and would be more straightforward to adjust. There'd simply be min and max temperature and the very start of the curve would decay quickly. Will test this out today

@kalomaze
Contributor Author

kalomaze commented Oct 15, 2023

After some more research into how to properly score distributions with many 'bad candidates', I discovered that the Gini coefficient is a direct mathematical measure of inequality in a distribution. Using it as the measurement might be superior to entropy if we want to measure overall 'uncertainty', because it weights the disproportionately probable tokens as more important in its scoring.

So for a theoretical distribution like

  1. 75%
  2. 2.5%
  3. 2.5%
  4. 2.5%
  5. 2.5%
  6. 2.5%
  7. 2.5%
  8. 2.5%
  9. 2.5%
  10. 2.5%
  11. 2.5%

It would assign this distribution a lower 'uncertainty' score than entropy would, because entropy cares about the sum of all the lower probability values, whereas Gini is biased toward the higher probability values in the distribution when calculating its value, which is theoretically better for this use case.
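For reference, a sketch of how the Gini coefficient of a candidate distribution could be computed (the standard formula over the sorted probabilities; the mapping from Gini to temperature is a separate choice and isn't shown):

#include <algorithm>
#include <vector>

// Gini coefficient of the candidate probabilities: 0 for a perfectly even
// distribution, approaching 1 when nearly all the mass sits on one token.
float gini(std::vector<float> probs) {
    if (probs.size() < 2) return 0.0f;
    std::sort(probs.begin(), probs.end()); // ascending order
    const size_t n = probs.size();
    double weighted = 0.0, total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        weighted += (double)(i + 1) * probs[i];
        total    += probs[i];
    }
    // G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, for x sorted ascending
    return (float)(2.0 * weighted / ((double) n * total) - (double)(n + 1) / (double) n);
}

A concentrated distribution (high Gini) would then map to a low temperature, and a flat one (low Gini) to a high temperature.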

image

Also, I switched to a power function which seems reasonable / simpler for experimenting compared to a sigmoid:
image

I will be updating this page on my efforts / progress implementing dynamic temp for those interested:
https://rentry.org/dynamic_temperature

@kalomaze
Contributor Author

kalomaze commented Oct 15, 2023

I also had the idea to do something similar with word boundaries, i.e. if you're generating something like "that is wh" then the temperature for tokens like en, ere, at shouldn't necessarily be the same as for dog, since [wh]en, [wh]ere, etc.

I'm guessing the idea here is conditional determinism? Like, whenever it starts a piece of a larger word, you might want to ensure it finishes that word with a higher degree of determinism / lower temperature rather than creating a pseudo-word.
If so, that reminds me of the AdapT paper posted at the start of this issue, where they were trying to find arbitrary 'conditions' to trigger a different temperature. That does work, but I'm thinking a generalized dynamic temp would be best.

Also, I've posted another koboldcpp build where you can try out the Gini sampling approach (as well as Entropy sampling, and the original DynaTemp implementation, but Gini seems superior). I've gotten positive feedback so far.

@kalomaze
Contributor Author

image

I did an experiment where I turned off Top K and Top P. No other samplers beyond dynamic temperature were used, mapping linearly from 0.0 temp to 2.0 temp based on HHI (the sum of squared probabilities, a metric used to measure how concentrated the probabilities are). All 32,000 tokens were considered for the test. Strangely enough, the generations were either totally coherent and creative, or coherent for a bit but then started repeating 'nonsense'. So I measured the HHI distributions of both.

Coherent Generations:

  • Mean HHI: ~0.309
  • Standard Deviation: ~0.288
  • Median (50%): ~0.236
  • The values range from 0 to ~0.892

Incoherent Texts:

  • Mean HHI: ~0.661
  • Standard Deviation: ~0.357
  • Median (50%): ~0.802
  • The values range from 0 to ~0.999

It is very interesting to me how one nonsensical token choice can totally break the rest of the generation. I wonder if keeping a running HHI measurement, and dialing back the temperature scaling whenever it shifts too far from the mean, could help prevent this... (it may not be worth it compared to just using Top P / Top K, but one might hope a universal sampler could exist)
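For clarity, here's a sketch of the HHI-to-temperature mapping as I've described it for this experiment (the exact linear form is my reading of 'linear mapping of HHI', so treat it as illustrative):

#include <vector>

// Sum of squared probabilities: 1.0 for a one-hot distribution, 1/n for a
// perfectly flat one. High concentration -> low temperature.
float hhi_dyn_temp(const std::vector<float> & probs, float min_temp, float max_temp) {
    float hhi = 0.0f;
    for (float p : probs) hhi += p * p;
    return min_temp + (max_temp - min_temp) * (1.0f - hhi);
}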

@kalomaze
Contributor Author

kalomaze commented Oct 28, 2023

I'm getting very good results with my Min P sampler implementation.

https://github.com/kalomaze/koboldcpp/releases/tag/minP
https://github.com/kalomaze/text-generation-webui/releases/tag/minp-exllama2

image

This is pretty close to the niche 'Top A' sampling method, except scaled linearly, which seems a lot more appropriate considering the probability distributions I've measured. I can say with confidence that this generally isolates the best tokens for sampling more consistently than Top P currently does. Let me break this down:

Let's say in theory we have this distribution (assuming 1.0 temperature):

    1. 25% Probability
    2. 24% Probability
    3. 1% Probability
      ... and then another 50 tokens that are all 1% ...

If Top P is used with a distribution like this, with a typical value such as 0.90, then in theory Top P would include most of the 1% probabilities in order to reach the total sum of 90%, making a bunch of low-quality choices very likely. In practice this does happen, though to a less exaggerated extent; however, when the chance of choosing tail probabilities exists at every token (and is sometimes exaggerated by temperature scaling), it eventually leads to compounding failure.

Min P works differently. Let's assume my default of 0.05 (5%) value of Min P is used. For the same probability distribution:

    1. 25% Probability
    2. 24% Probability
    3. 1% Probability
      ... and then another 50 tokens that are all 1% ...

0.05 would be scaled by the top probability expressed as a decimal, 0.25: 0.05 x 0.25 = 0.0125 (1.25%).
Therefore, only tokens with probabilities above 1.25% would be kept.

Math wise, it seems to handle probabilities much better on average if our assumed goal is to cut out the tail end of the distribution in a simple and effective manner.
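A minimal sketch of that filter, assuming the candidate probabilities are already sorted descending after softmax (the names here are mine, not the actual llama.cpp code):

#include <vector>

// Keep only candidates whose probability is at least min_p times the top
// token's probability, e.g. 0.05 * 0.25 = 0.0125 in the example above.
size_t min_p_filter(std::vector<float> & probs, float min_p) {
    if (probs.empty()) return 0;
    const float threshold = min_p * probs[0];
    size_t keep = 0;
    while (keep < probs.size() && probs[keep] >= threshold) ++keep;
    probs.resize(keep); // the survivors would then be re-softmaxed
    return keep;
}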

Also, this isn't really 'evidence' as much as it is 'opinions from people I trust to give good subjective analysis', but I am hearing positive reports on my Min P test build for koboldcpp.

image image

This is on top of my console logging showing that the method is cutting out the tail end very reasonably across probability distributions in a similar fashion to Top P:

image

I will put up a PR with the current code asking for feedback on how to properly integrate this within llama.cpp. Dynamic Temp will probably stay as an experimental side project for now, but Min P as a sampler option seems immediately relevant and useful.

Also, I would just call this a "linear version of Top A" or something along those lines, but the problem is that search engines do not like "Top A"... so I think a rename to something more distinct is in order for the sake of accessibility.
I am willing to hear other suggestions for what this could be named beyond "Min P".

image

@KerfuffleV2
Collaborator

The "cutting the tail" part sounds very similar to the tail free sampler. Have you already looked at that? (Locally typical also isn't completely different either.)

  1. https://trentbrick.github.io/Tail-Free-Sampling/
  2. https://arxiv.org/abs/2202.00666

@kalomaze
Contributor Author

kalomaze commented Oct 28, 2023

The "cutting the tail" part sounds very similar to the tail free sampler. Have you already looked at that? (Locally typical also isn't completely different either.)

  1. https://trentbrick.github.io/Tail-Free-Sampling/
  2. https://arxiv.org/abs/2202.00666

I have indeed looked at TFS & Typical sampling. They did not get me the results I was looking for in this department, and how the values affected those samplers didn't seem very easily interpretable, making them difficult to use as hyperparameters. I think for Typical sampling specifically, a big problem is that it presumes "uncertain distributions mean that the generation is becoming atypical" rather than that the text is open ended and has many valid choices.

I will admit that TFS seems quite mathematically dense and I don't fully understand it, and I didn't get good [subjective] results back when I tested different values. However, that was when the Top K clamping bug was still in koboldcpp, so I'm not sure whether that fix affects the calculation of the derivatives in any meaningful way.

This might be bias speaking, but I'm a fan of Occam's razor when it comes to sampler design; it should be somewhat intuitive what a sampler is directly accomplishing, otherwise the parameter isn't reasonably interpretable to configure for different scenarios (e.g. deterministic, creative...) and it doesn't really see adoption.

image

Also, consider that the rate of change (as a metric) seems much messier to use on today's models compared to what existed when TFS was created (GPT-2); it seems like a less predictable metric across probability distributions on modern Llama models, because the rate of change can be erratic or smooth. (Not confident about this, could be very wrong, calculus is not my strong suit lol)

@Ph0rk0z

Ph0rk0z commented Oct 29, 2023

I just tried the exllamav2 implementation and it's very lit. I need to test if it hurts generation speed but the results are almost worth it. The writing is much more creative and really comes alive.

@KerfuffleV2
Collaborator

It should be a very fast sampler. It just does softmax + sort (most existing samplers do this also), and then the worst case is to iterate the logits once. I doubt it would even have a measurable performance impact.

@kalomaze
Contributor Author

kalomaze commented Oct 29, 2023

I just tried the exllamav2 implementation and it's very lit. I need to test if it hurts generation speed but the results are almost worth it. The writing is much more creative and really comes alive.

The performance impact is not measurable; sampler math tends to be extremely lightweight.
There is an exception, though: the GBNF grammar sampler has some nested recursion at the moment, and I unfortunately get a huge degradation in token generation speed when using it (I also notice that it seems to randomly 'force' certain tokens to be chosen in OR conditions; perhaps a refactor is in order for that...)

image

@KerfuffleV2
Collaborator

I also notice that it seems to randomly 'force' certain tokens to be chosen

It just bans tokens that don't match the grammar, it never makes any tokens more likely.

@Green-Sky
Collaborator

#3841 got merged. Please test :)

@KerfuffleV2
Collaborator

KerfuffleV2 commented Nov 6, 2023

@kalomaze Just in case you're interested, I'm adding your Min-P sampler to my Rust sampling crate: KerfuffleV2/llm-samplers#9 (with appropriate credit of course)

I've been using it lately and it seems very useful. I used to use TFS (tail-free), top-k and top-p in combination, but now I'm able to disable those and min-p produces pretty much equivalent results.

I know you developed this independently, but BlinkDL's Top-A sampler idea is pretty similar: https://github.com/BlinkDL/RWKV-LM#the-top-a-sampling-method - it just uses a formula for the "you have to be this tall" threshold instead of a flat value. I'm not sure which approach works better.
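To put the two thresholds side by side (the Top-A form below is my paraphrase of the linked description, so double-check the exact constant and exponent against the RWKV repo):

Top-A: keep token i if p_i >= a * p_top * p_top   (threshold quadratic in the top probability)
Min-P: keep token i if p_i >= min_p * p_top       (threshold linear in the top probability)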

@Ph0rk0z

Ph0rk0z commented Nov 7, 2023

So just to confirm. When using min-p, we should always disable topk/top-P?

@KerfuffleV2
Collaborator

When using min-p, we should always disable topk/top-P?

There's no inherent conflict. Those both run before min-P, though, and softmax runs again, so you may need to keep that in mind when tuning the min-P threshold.

@kalomaze
Contributor Author

kalomaze commented Nov 8, 2023

@kalomaze Just in case you're interested, I'm adding your Min-P sampler to my Rust sampling crate: KerfuffleV2/llm-samplers#9 (with appropriate credit of course)

I've been using it lately and it seems very useful. I used to use TFS (tail-free), top-k and top-p in combination, but now I'm able to disable those and min-p produces pretty much equivalent results.

I know you developed this independently, but BlinkDL's Top-A sampler idea is pretty similar: https://github.com/BlinkDL/RWKV-LM#the-top-a-sampling-method - it just uses a formula for the "you have to be this tall" threshold instead of a flat value. I'm not sure which approach works better.

Yup, Min P is essentially a linear Top A. Looking at actual distributions, I think it makes more sense to just linearly scale a 'floor' (required probability) based on the 'ceiling' (top token probability). It's more directly interpretable that way (e.g, 0.25 Min P is 'you must have at least 1/4th the probability of the top token').

I think it's fair to say that Min P outperforms TFS / Top P; TFS relies on the rate of change, which can be rocky on modern models, and Top P doesn't account for the possibility of a few good choices being diluted amongst a sea of bad ones...

@pacmanincarnate

As we reduce the pool of tokens (such as with min-p), we also increase the likelihood that the top token will be selected. I think there's an issue with this: we only really get options when we have a large token pool, but then there's a higher likelihood that the selected token will be of much lower quality. Ideally we want a small pool of relatively high-probability tokens and a higher likelihood that a non-top token will be selected. Temp helps with that, but not necessarily at the same rate that changes in pool size impact it, so right now there's not a lot of control over the final token likelihood.

So, I think we need to combine something like min-P with something that dynamically adjusts temp based on some combination of pool size and something like the difference between top token probability and the next token probability.

thoughts?

@KerfuffleV2
Collaborator

It sounds like you kind of want to take min-p and then normalize the top X tokens so they have about the same priority as the top one? Or at least so the probabilities are closer?

@Ph0rk0z

Ph0rk0z commented Nov 19, 2023

I am using kalomaze's code for that in exllama, adjusting temperature from about 0.5-2.0 based on entropy. It was picking really low temps, so I bumped the minimum from 0 to 0.1 and then 0.5. There is an implementation for koboldcpp but not yet for llama.cpp main.

They feel like the only samplers I need, tbh.

@Green-Sky
Collaborator

To me that sounds like doing a softmax after sampling and then maybe some >1.0 temperature

@pacmanincarnate

Kerfuffle, yes, essentially. I don't think we want them to be equally likely, as that would probably lead to craziness, but we do want the curve flattened quite a bit.

Once we've min-p'd the list, we should be relatively happy with the selection. I feel like, in reality, we never really need more than the top handful of tokens as options, but we need those tokens to have a fair chance of being chosen.

Ph0rk0z, do you find that method gives you a reasonable chance of not selecting the top token without overly reducing its likelihood? I'm not sure if the dynamic temp sampling method got moved forward at all or was dropped when kalomaze moved on to min-p.

@Ph0rk0z

Ph0rk0z commented Nov 20, 2023

Still works in exllamav2 and I am getting the best replies I've gotten ever. I should definitely break out the logits viewer and see what's happening under the hood to make sure it isn't placebo.

@DutchEllie

@kalomaze You have since implemented this in koboldcpp, right? Can you upstream that?

@Green-Sky
Collaborator

@DutchEllie what specifically? min-p is in master. (and used by default)

float min_p = 0.05f; // 0.0 = disabled

@DutchEllie

@DutchEllie what specifically? min-p is in master. (and used by default)

float min_p = 0.05f; // 0.0 = disabled

Kalomaze introduced min-p, but he also recently provided a DynaTemp implementation in koboldcpp (a fork of this repo). From what I hear it's quite good, so it would be nice if it could be merged upstream here.

@igorbarshteyn

Would it be possible to add @kalomaze's Cubic Sampling with Curve Params that he put up in text-generation-webui? I'm hearing people are getting good results: oobabooga/text-generation-webui#5551

@joshknnd1982

In the latest llama.cpp, how do I use the "--dynatemp-range N   dynamic temperature range (default: 0.0, 0.0 = disabled)" parameter? Suppose I want a range between 0 and 1: is the correct format --dynatemp-range 0,1? I'm a bit confused about how to use this with the llama.cpp command line and with batch files.

@github-actions github-actions bot added the stale label Apr 16, 2024
@NeedsLoomis

In the latest "llama cpp" how do I use the "--dynatemp-range N dynamic temperature range (default: 0.0, 0.0 = disabled)" parameter?

From the source, it appears to be a single +- value. I assume it would work like:

--temp 0.7 --dynatemp-range 0.3

That should give a range of 0.4 - 1.0

@l3utterfly
Contributor

Yes, that's exactly how it works

@ZoomRmc

ZoomRmc commented Apr 26, 2024

I'm a bit confused as to how to use this with llama cpp command line and with batch files.

  • --dynatemp-range is a maximum deviation from the base --temp. This parameter is rather inaccurately named, as it's not a range but a swing from temp - dynatemp-range to temp + dynatemp-range. So, with temp=1.0 and dynatemp-range=0.5, the possible temperature values for tokens lie in the [0.5..1.5] range.
    A better name for the argument would be dynatemp-deviation or dynatemp-swing.

image

  • --dynatemp-exp controls the curvature, or rate of change, of the temperature for a token. It determines the shape of the curve relating a token's entropy (uncertainty) to the temperature value.
    A --dynatemp-exp below 1.0 results in a concave upward curve, where the temperature increases rapidly for lower entropy but levels off and increases more gradually as entropy approaches 1.0.
    A --dynatemp-exp above 1.0 produces a convex upward curve, where the temperature initially increases slowly for low entropy but then rises more steeply as it gets closer to 1.0.
    A --dynatemp-exp of 1.0 sets a constant rate of change between the entropy and the temperature, making the relation linear. (A sample invocation is sketched after this list.)
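For example, a hypothetical invocation (model path and prompt are placeholders) with a base temperature of 1.0, a ±0.5 swing, and a linear entropy-to-temperature curve would look something like:

./main -m model.gguf -p "Once upon a time" --temp 1.0 --dynatemp-range 0.5 --dynatemp-exp 1.0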

@github-actions github-actions bot removed the stale label Apr 27, 2024
@github-actions github-actions bot added the stale label May 28, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
