perplexity : faster Winogrande via batching #5024
Conversation
    batch.logits + i,
    0, 0, 0, // unused
};
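For context, the quoted hunk is the tail of a batch-view construction used to decode a large batch in chunks. A minimal sketch of the full pattern, assuming the `llama_batch` layout from `llama.h` at the time of this PR (`decode_in_chunks` is a hypothetical wrapper name, not code from the PR):

```cpp
#include <algorithm>
#include "llama.h"

// Hedged sketch: decode a large llama_batch in chunks of at most n_batch
// tokens by offsetting the parent batch's arrays, without copying them.
// The initializer order follows struct llama_batch in llama.h of this era.
static bool decode_in_chunks(llama_context * ctx, llama_batch & batch, int32_t n_batch) {
    for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
        const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);

        llama_batch batch_view = {
            n_tokens,
            batch.token    + i,
            nullptr,           // embd: unused, tokens are passed instead
            batch.pos      + i,
            batch.n_seq_id + i,
            batch.seq_id   + i,
            batch.logits   + i,
            0, 0, 0,           // unused legacy fields (all_pos_0, all_pos_1, all_seq_id)
        };

        if (llama_decode(ctx, batch_view) != 0) {
            return false;      // decode failed
        }
    }
    return true;
}
```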
The `llama_kv_cache_seq_rm` call is no longer needed here?
It's not needed because we clear the entire KV cache before each batch:

llama.cpp/examples/perplexity/perplexity.cpp, lines 942 to 944 in 9e4ad80:

    llama_kv_cache_clear(ctx);

In the old implementation, it was reusing tokens from a previous batch, so `llama_kv_cache_seq_rm` was used to evict the unused ones (i.e. the second sentence).
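A rough sketch of the two strategies side by side (the sequence id and the position range are illustrative, not taken from the PR):

```cpp
// Old strategy (sketch): keep the shared prefix in the KV cache and evict
// only the positions belonging to the previously evaluated ending:
//     llama_kv_cache_seq_rm(ctx, /*seq_id=*/0, /*p0=*/n_prefix, /*p1=*/-1);
//
// New strategy: start every batch from an empty cache, so no per-sequence
// eviction is needed:
llama_kv_cache_clear(ctx);
```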
The score with Mistral 7B has changed in the "positive" direction as well, closer to the HF leaderboard results.
Yes, I saw this change, but could not understand the reason.
* perplexity : faster Winogrande via batching ggml-ci
* perplexity : remove unused function
* perplexity : only tokenize selected tasks for Winogrande
Similar to #5017 but for the Winogrande evaluation. Here the performance gain is bigger since the tasks are smaller and we can fit more in a batch.
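For illustration, a minimal sketch of the batching idea, assuming the `llama_batch_clear`/`llama_batch_add` helpers from common/common.h at the time; `task_tokens` and `first_scored_pos` are hypothetical names, not identifiers from the PR:

```cpp
// Sketch: pack several whole Winogrande tasks into one llama_batch,
// one sequence id per task, and decode them in a single call.
llama_batch_clear(batch);

for (size_t k = 0; k < task_tokens.size(); ++k) {
    const auto & tokens = task_tokens[k];       // hypothetical: tokens of task k
    const llama_seq_id seq = (llama_seq_id) k;  // one sequence per task

    for (size_t pos = 0; pos < tokens.size(); ++pos) {
        // request logits only where they are needed for scoring
        const bool need_logits = pos + 1 >= first_scored_pos[k]; // hypothetical
        llama_batch_add(batch, tokens[pos], (llama_pos) pos, { seq }, need_logits);
    }
}

llama_kv_cache_clear(ctx); // fresh cache per batch, see the discussion above
llama_decode(ctx, batch);  // evaluate all packed tasks at once
```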
I'm not 100% sure about the correctness of this implementation - the final value is different from `master`. It's possible that it fixes some existing issue, but it is also possible that it introduces a bug. The softmax optimization is not applied yet; it can be applied in a follow-up PR.
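If the softmax optimization is the usual log-softmax trick (an assumption; the PR does not spell it out), the idea is to score a token without normalizing the full distribution:

```cpp
#include <algorithm>
#include <cmath>

// Hedged sketch of the log-softmax trick (an assumption about what the
// "softmax optimization" refers to): the log-probability of token t is
//     log p(t) = logits[t] - log(sum_j exp(logits[j]))
// Subtracting the max logit first keeps the exponentials from overflowing.
static float log_prob_of(const float * logits, int n_vocab, int t) {
    float max_logit = logits[0];
    for (int j = 1; j < n_vocab; ++j) {
        max_logit = std::max(max_logit, logits[j]);
    }
    double sum_exp = 0.0;
    for (int j = 0; j < n_vocab; ++j) {
        sum_exp += exp(logits[j] - max_logit);
    }
    return logits[t] - max_logit - (float) log(sum_exp);
}
```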