Token generation broken on CUDA when offload_kqv is false #4991
Comments
Duplicate of #4983?
Yup, will be looking into this. If anyone has additional insights, such as which models/params work and which do not, that would be helpful.
@ggerganov I tried with the model and params below, and it ran into the generation error.
Hopefully some useful info: this occurs on every type of model I've tried (llama, mistral, mixtral, even llava), regardless of quantization. When running a modified […]. Looking at different values for […].
This should have been fixed in #5049 (already merged); please let me know if you find any other issues.
Forgot to close this; it works, thanks for the fix!
Originally spotted by @iamlemec in abetlen/llama-cpp-python#1089; reproduced with llama.cpp by passing --no_kv_offload to ./main. The bug causes the model to generate repeated '#' characters instead of a valid completion.