
sampling : avoid expensive softmax during greedy sampling #9605

Open · wants to merge 2 commits into base: master
Conversation

ggerganov
Owner

fix #9530

When the temperature is non-positive, we can simply sample greedily the token with the highest logit. But in some cases, the probabilities of the secondary tokens are also required (e.g. llama-server to display candidate probs, llama-speculative to perform stochastic speculative sampling). In such cases, we first filter the top sparams.n_probs tokens via a top-k sampler and then apply softmax to them, in order to avoid sorting the full vocabulary.
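
A minimal standalone sketch of the two paths, assuming plain logit vectors rather than the actual llama.cpp sampler API (all names below are illustrative): the greedy path is a single argmax scan with no exp() or sort over the full vocabulary, while the probs path partially sorts only the top k candidates and normalizes just those.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct token_prob { int id; float p; };

// Greedy path: one O(V) scan over the logits, no softmax needed.
static int sample_greedy(const std::vector<float> & logits) {
    return (int) std::distance(logits.begin(),
                               std::max_element(logits.begin(), logits.end()));
}

// Probs path: partial-sort the top k logits, then softmax over those k only.
static std::vector<token_prob> top_k_softmax(const std::vector<float> & logits, size_t k) {
    std::vector<token_prob> cand(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        cand[i] = { (int) i, logits[i] };
    }
    if (cand.empty()) {
        return cand;
    }
    k = std::min(k, cand.size());
    std::partial_sort(cand.begin(), cand.begin() + (ptrdiff_t) k, cand.end(),
                      [](const token_prob & a, const token_prob & b) { return a.p > b.p; });
    cand.resize(k);

    // numerically stable softmax over the k survivors (max logit is cand[0].p)
    const float max_l = cand[0].p;
    float sum = 0.0f;
    for (auto & c : cand) { c.p = std::exp(c.p - max_l); sum += c.p; }
    for (auto & c : cand) { c.p /= sum; }
    return cand;
}
```

This keeps the common greedy case entirely free of exp() calls and sorting, while the n_probs case pays only for the k tokens that will actually be displayed or resampled.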

Also add perf timings to test-sampling to keep track of the performance of the samplers.
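
As a rough illustration of what such timings measure (not the actual test-sampling code; the helper name is hypothetical), one can wall-clock a sampler over many iterations and report the per-call cost:

```cpp
#include <chrono>
#include <cstdio>

// Run sample_once() n_iter times and print the average time per iteration.
template <typename F>
static void bench_sampler(const char * name, int n_iter, F && sample_once) {
    const auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n_iter; ++i) {
        sample_once();
    }
    const auto t1 = std::chrono::high_resolution_clock::now();
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    printf("%-16s: %8.3f us/iter\n", name, us / (double) n_iter);
}
```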

github-actions bot added the testing (Everything test related) and examples labels on Sep 23, 2024
Successfully merging this pull request may close these issues.

Bug: Lower performance in pre-built binary llama-server, Since llama-b3681-bin-win-cuda-cu12.2.0-x64