server: apply grammar before other samplers #12643
Closed
Context: I've recently started working on code for evaluating language model benchmarks such as MMLU using the llama.cpp server. The approach I'm taking is to present the model with a multiple choice question, restrict the output to exactly the given choices with a grammar, and then retrieve the post-sampling probabilities for the choices. The individual probabilities for the choices should then add up to 1.
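As an illustration of this setup (a hypothetical grammar for a question with four answer letters, not taken from the benchmark code itself), a GBNF grammar restricting the output to exactly the given choices could look like:

```
root ::= "A" | "B" | "C" | "D"
```

With such a grammar, every generated token is one of the four answer tokens, so their post-sampling probabilities should form a complete distribution.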
I noticed that with the current server code, when all chain samplers except softmax are disabled and a grammar is used, the post-sampling probabilities of the tokens allowed by the grammar do not consistently add up to 1. The reason is that tokens disallowed by the grammar can be assigned non-zero probabilities: `common_sampler_sample` has a boolean parameter `grammar_first`, which the server sets to false. The reported probabilities are then those of the softmax, and they are not recalculated when the function merely checks whether the sampled token is allowed by the grammar. Only if the sampled token is disallowed by the grammar is sampling re-run with the grammar applied before the softmax, in which case the token probabilities are returned correctly.

I am not at all familiar with the history of the grammar code. My assumption would be that `grammar_first` exists as an optimization: the grammar is comparatively slow, so it is overall faster to apply it last, when there are fewer viable tokens to iterate over. However, I think the correct way to combine a grammar with other samplers is to apply the grammar first. The current code, with the grammar last, distorts the token distribution in a way that is not well-defined and depends on the probabilities of tokens that the grammar disallows in the first place. For this reason, this PR simply changes the server code to apply the grammar first, which fixes the immediate issue of the post-sampling probabilities being wrong. I think the implementation with the grammar last could be salvaged, but it would require comparatively more effort. And if the only reason for that implementation is better performance, I would suggest that I continue working on my code first to determine whether such an optimization would be worthwhile in the first place.
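To make the distortion concrete, here is a minimal standalone sketch (plain Python with made-up logits, not llama.cpp code) comparing the two orderings. Applying softmax over the full vocabulary and then discarding grammar-disallowed tokens leaves the allowed tokens with a total probability mass below 1, while masking disallowed logits before the softmax yields a proper distribution over the allowed tokens:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary; suppose the grammar
# only allows tokens 0 and 1.
logits = [2.0, 1.0, 3.0, 0.5]
allowed = {0, 1}

# Grammar last: softmax over the full vocabulary, then discard
# disallowed tokens. The surviving probabilities do not sum to 1.
probs_last = softmax(logits)
mass_last = sum(p for i, p in enumerate(probs_last) if i in allowed)

# Grammar first: mask disallowed logits to -inf, then softmax.
# The allowed-token probabilities sum to 1 by construction.
masked = [l if i in allowed else float("-inf") for i, l in enumerate(logits)]
probs_first = softmax(masked)
mass_first = sum(p for i, p in enumerate(probs_first) if i in allowed)

print(f"grammar last:  allowed mass = {mass_last:.4f}")
print(f"grammar first: allowed mass = {mass_first:.4f}")
```

The "grammar last" numbers also depend on how much probability the disallowed tokens happened to receive, which is exactly the not-well-defined dependence described above.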