I am running llama.cpp for GLM-4.7 with Unsloth Q4 quants on 2x RTX 3090, built from today's master head (a3e8128):

./llama.cpp/llama-server \
-hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL --host 0.0.0.0 \
--alias "unsloth/GLM-4.7-Flash" \
--threads -1 \
--fit on \
--seed 3407 \
--temp 0.7 \
--top-k 50 \
--top-p 1.0 \
--min-p 0.01 \
--dry-multiplier 0.0 \
--ctx-size 128000 \
--jinja

I found that the very first request after booting the server ran at 100+ tokens/s, while subsequent requests dropped to around 20 tokens/s. What could be causing this? 🤔
Answered by akumaburn, Jan 23, 2026
Answer selected by wey-gu


Did you download the latest quants? There was a bug fixed recently in llama.cpp that caused issues like looping in Unsloth's quants. See the Jan 21 update: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
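If the server was started with -hf before that update, it may still be loading an older cached GGUF. A minimal sketch of how to refresh it, assuming the default llama.cpp cache location (~/.cache/llama.cpp, or $LLAMA_CACHE if set) and that the cached filename contains the repo name:

```bash
# List the cache and drop the stale copy so -hf fetches the updated quant on the next start
# (path and filename pattern are assumptions; check your cache dir first)
ls ~/.cache/llama.cpp/
rm ~/.cache/llama.cpp/*GLM-4.7-Flash*UD-Q4_K_XL*

# Alternatively, pull the updated files explicitly with the Hugging Face CLI
huggingface-cli download unsloth/GLM-4.7-Flash-GGUF --include "*UD-Q4_K_XL*"
```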
Try setting the flag --parallel 1 to ensure there aren't any parallel requests going on.
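For instance, building on the command from the question (a sketch, not a verified config), the flag is simply appended to the existing invocation; --parallel 1 gives the server a single slot so only one request is decoded at a time:

```bash
./llama.cpp/llama-server \
  -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL --host 0.0.0.0 \
  --alias "unsloth/GLM-4.7-Flash" \
  --ctx-size 128000 \
  --jinja \
  --parallel 1   # single slot: concurrent requests queue instead of sharing compute
```

If a client or agent is quietly sending several requests at once, with more than one slot they share the GPUs and each stream slows down, which could look exactly like the fast-then-slow behavior you describe. As far as I know, llama-server also divides --ctx-size across slots, so a single slot keeps the full 128000-token context for each request.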