server: apply grammar before other samplers #12643
Closed
Context: I've recently started working on code for evaluating language model benchmarks such as MMLU using the llama.cpp server. The approach I'm taking is to present the model with a multiple choice question, restrict the output to exactly the given choices with a grammar, and then retrieve the post-sampling probabilities for the choices. The individual probabilities for the choices should then add up to 1.
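As an illustration of this setup (a hypothetical grammar for a question with four answer letters, not taken from the benchmark code itself), a GBNF grammar restricting the output to exactly the given choices could look like:

```
root ::= "A" | "B" | "C" | "D"
```

With such a grammar, every generated token is one of the four answer tokens, so their post-sampling probabilities should form a complete distribution.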
I noticed that with the current server code, when all chain samplers except softmax are disabled and a grammar is used, the post-sampling probabilities of the tokens allowed by the grammar do not consistently add up to 1. The reason is that tokens disallowed by the grammar can be assigned non-zero probabilities: `common_sampler_sample` has a boolean parameter `grammar_first`, which the server sets to false. The reported probabilities are then those of the softmax, and they are not recalculated when the function merely checks whether the sampled token is allowed by the grammar. Only if the sampled token is disallowed by the grammar is sampling re-run with the grammar applied before the softmax, in which case the token probabilities are returned correctly.

I am not at all familiar with the history of the grammar code. My assumption would be that `grammar_first` exists as an optimization: the grammar is comparatively slow, so it is overall faster to apply it last, when there are fewer viable tokens to iterate over. However, I think the correct way to combine a grammar with other samplers is to apply the grammar first. The current code, with the grammar last, distorts the token distribution in a way that is not well-defined and depends on the probabilities of tokens that the grammar disallows in the first place. For this reason, this PR simply changes the server code to apply the grammar first, which fixes the immediate issue of the post-sampling probabilities being wrong. I think the implementation with the grammar last could be salvaged, but it would require comparatively more effort. And if the only reason for that implementation is better performance, I would suggest that I continue working on my code first to determine whether such an optimization would be worthwhile in the first place.
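To make the distortion concrete, here is a minimal standalone sketch (plain Python with made-up logits, not llama.cpp code) comparing the two orderings. Applying softmax over the full vocabulary and then discarding grammar-disallowed tokens leaves the allowed tokens with a total probability mass below 1, while masking disallowed logits before the softmax yields a proper distribution over the allowed tokens:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary; suppose the grammar
# only allows tokens 0 and 1.
logits = [2.0, 1.0, 3.0, 0.5]
allowed = {0, 1}

# Grammar last: softmax over the full vocabulary, then discard
# disallowed tokens. The surviving probabilities do not sum to 1.
probs_last = softmax(logits)
mass_last = sum(p for i, p in enumerate(probs_last) if i in allowed)

# Grammar first: mask disallowed logits to -inf, then softmax.
# The allowed-token probabilities sum to 1 by construction.
masked = [l if i in allowed else float("-inf") for i, l in enumerate(logits)]
probs_first = softmax(masked)
mass_first = sum(p for i, p in enumerate(probs_first) if i in allowed)

print(f"grammar last:  allowed mass = {mass_last:.4f}")
print(f"grammar first: allowed mass = {mass_first:.4f}")
```

The "grammar last" numbers also depend on how much probability the disallowed tokens happened to receive, which is exactly the not-well-defined dependence described above.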