-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: fix incorrectly reported token probabilities #7125
server: fix incorrectly reported token probabilities #7125
Conversation
examples/server/server.cpp
Outdated
llama_sample_top_k(ctx, &cur_p, n_probs, 0); | ||
} | ||
|
||
if (slot.sparams.temp <= 0.0f) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this check just equality?
if (slot.sparams.temp <= 0.0f) { | |
if (slot.sparams.temp == 0.0f) { |
For sub-zero temperatures, we calculate the probabilities, though they are never used:
Lines 196 to 202 in 3af34c1
if (temp < 0.0) { | |
// greedy sampling, with probs | |
llama_sample_softmax(ctx_main, &cur_p); | |
id = cur_p.data[0].id; | |
} else if (temp == 0.0) { | |
// greedy sampling, no probs | |
id = llama_sample_token_greedy(ctx_main, &cur_p); |
I guess either way makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think that it's better to be consistent if at all possible. I changed it to an equality and added a note to the documentation. (I also noticed and fixed a bug where for temperature 0.0f the "top tokens" were wrong.)
Fixes #7093 .
The problem on master is that the server unconditionally reports the token probabilities stored for the top tokens even if they were not actually considered for sampling. This then, among other things, causes the sum of token probabilities to exceed 100%. In total this PR fixes the following issues:
To make this work I am extending
ctx_sampling
with a propertyn_considered
that stores how many of the top tokens were actually used for sampling. This can then be used to determine if and how many top tokens need to be fetched and starting at what index all token probabilities should be zero.