server: apply grammar before other samplers #12643

Conversation

JohannesGaessler
Collaborator

Context: I've recently started working on code for evaluating language model benchmarks such as MMLU using the llama.cpp server. The approach I'm taking is to present the model with a multiple-choice question, restrict the output to exactly the given choices with a grammar, and then retrieve the post-sampling probabilities for the choices. The individual probabilities for the choices should then add up to 1.
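
For concreteness, a request along these lines against the server's `/completion` endpoint might look as follows. This is only a sketch: the prompt is a placeholder, `grammar`, `n_predict`, and `n_probs` are existing request fields, but the exact shape of the returned `completion_probabilities` has changed between server versions, so the parsing here is an assumption:

```python
import requests

# GBNF grammar restricting the model's output to exactly the choice letters
GRAMMAR = 'root ::= "A" | "B" | "C" | "D"'

resp = requests.post(
    "http://localhost:8080/completion",  # assumes a locally running llama-server
    json={
        "prompt": "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer: ",
        "grammar": GRAMMAR,
        "n_predict": 1,  # one token is enough for the choice letter
        "n_probs": 8,    # also return the top post-sampling token probabilities
    },
).json()

print(resp["content"])
# With the grammar applied first, the probabilities reported here for the
# tokens the grammar allows should sum to 1:
print(resp["completion_probabilities"])
```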

I noticed that with the current server code, when all of the chain samplers except softmax are disabled and a grammar is used, the post-sampling probabilities of the tokens allowed by the grammar don't consistently add up to 1. The reason is that tokens disallowed by the grammar can be assigned non-zero probabilities. common_sampler_sample has a boolean parameter grammar_first, which the server sets to false. In that case the reported probabilities are those of the plain softmax: the function only checks whether the sampled token is allowed by the grammar and does not recalculate the probabilities. Only if the sampled token is disallowed by the grammar is sampling re-run with the grammar applied before the softmax, in which case the token probabilities are returned correctly.
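
To make that control flow concrete, here is a minimal Python paraphrase of the behavior described above (hypothetical and simplified: the real function is C++, and softmax stands in for the full sampler chain):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())  # -inf logits become probability 0
    return e / e.sum()

def sample_grammar_last(logits, allowed, rng):
    """Paraphrase of common_sampler_sample with grammar_first == False."""
    probs = softmax(logits)  # the chain runs on the unconstrained logits
    tok = int(rng.choice(len(probs), p=probs))
    if allowed[tok]:
        # The sampled token happens to satisfy the grammar: it is returned
        # together with the *unconstrained* probabilities. Mass on disallowed
        # tokens was never removed, so the allowed tokens sum to less than 1.
        return tok, probs
    # The sampled token violates the grammar: re-run with the grammar applied
    # before the chain. Only on this path are the probabilities correct.
    probs = softmax(np.where(allowed, logits, -np.inf))
    return int(rng.choice(len(probs), p=probs)), probs

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 3.0])  # token 2 is disallowed by the grammar
tok, probs = sample_grammar_last(logits, np.array([True, True, False]), rng)
```

Applying the grammar first would simply take the masked path unconditionally, so the reported probabilities would always be normalized over the tokens the grammar allows.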

I am not at all familiar with the history of the grammar code. My assumption would be that grammar_first exists for optimization purposes: the grammar is comparatively slow, so it is overall faster to apply it last, when there are fewer viable tokens to iterate over. However, I think the correct way to combine a grammar with other samplers is to apply the grammar first. With the grammar last, the current code distorts the token distribution in a way that is not well-defined and depends on the probabilities of tokens that the grammar disallows in the first place. For this reason I am simply changing the server code in this PR to apply the grammar first; this fixes the immediate issue of the post-sampling probabilities being wrong. I think the implementation with the grammar last can be salvaged, but it would require comparatively more effort. And if the only reason for that implementation is better performance, I would suggest that I continue working on my code first to determine whether such an optimization would be worthwhile in the first place.
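
To illustrate the distortion, here is a toy numerical example of my own (not from the PR) with a top-k sampler in the chain; it computes the exact marginal token distribution of the grammar-last accept/resample scheme and compares it to applying the grammar first:

```python
import numpy as np

logits  = {"A": 2.0, "B": 1.0, "X": 3.0}  # X is disallowed by the grammar
allowed = {"A", "B"}

def softmax(d):
    e = {t: np.exp(v) for t, v in d.items()}
    s = sum(e.values())
    return {t: v / s for t, v in e.items()}

def top_k(d, k):
    kept = dict(sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:k])
    s = sum(kept.values())
    return {t: v / s for t, v in kept.items()}

# grammar first: mask X, then run the chain (top-k) on what remains
p_first = top_k(softmax({t: v for t, v in logits.items() if t in allowed}), k=2)

# grammar last: the chain runs unconstrained, so top-k keeps {X, A} and drops
# the allowed token B entirely; if the disallowed X is sampled, sampling is
# re-run with the grammar applied first
p_stage1 = top_k(softmax(logits), k=2)
p_reject = sum(v for t, v in p_stage1.items() if t not in allowed)
p_last   = {t: p_stage1.get(t, 0.0) + p_reject * p_first[t] for t in allowed}

print("grammar first:", p_first)  # ~{'A': 0.731, 'B': 0.269}
print("grammar last: ", p_last)   # ~{'A': 0.803, 'B': 0.197}
```

With only softmax in the chain the two schemes happen to agree in distribution (though the probabilities reported on the accept path would still not sum to 1 over the allowed tokens); once a truncating sampler like top-k is present, the marginal itself shifts, by an amount that depends on how much probability mass the disallowed token X carries.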

@ggerganov
Member

The only reason for applying the grammar post-sampling is performance (#8643 (comment)), but it's indeed technically not 100% correct. Still, I think the performance gains are significant enough that we should keep it like this by default. If needed, we can add a flag that applies it pre-sampling.

@JohannesGaessler
Collaborator Author

Alright. Right now I'm still prototyping things anyway, so I'll just use a branch with this patch. I'll investigate grammar first vs. grammar last in terms of performance and how much it biases the results, and then come back to this.
