Contact Details
marcello.seri@gmail.com
What happened?
Loading any GGUF with --cache-type-k q8_0 --cache-type-v q8_0 (or any other quantization) makes the server segfault. Instead of crashing, it should fail with a message that KV cache quantization only works with flash attention (--flash_attn).

Invoking the CLI with --cache-type-k q8_0 --cache-type-v q8_0 --flash_attn appears to work, but the answers are complete rubbish (e.g., an infinite stream of exclamation marks or similar).

The same flags work fine in llama.cpp, which is why I decided to raise the issue.
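For reference, a minimal reproduction sketch (model.gguf is a placeholder path; any GGUF model reproduces it for me, and only the cache-type/flash-attn flags matter):

llamafile --server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0
# segfaults; expected: an error explaining KV quantization requires --flash_attn

llamafile -m model.gguf -p "Hello" --cache-type-k q8_0 --cache-type-v q8_0 --flash_attn
# runs without crashing, but the output is garbage (endless exclamation marks)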
Version
llamafile v0.8.16 (main branch HEAD at 0995343)
What operating system are you seeing the problem on?
No response
Relevant log output