I am running llama.cpp for GLM-4.7 with Unsloth Q4 quants on 2x RTX 3090, built from today's master head (a3e8128):

./llama.cpp/llama-server \
-hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL --host 0.0.0.0 \
--alias "unsloth/GLM-4.7-Flash" \
--threads -1 \
--fit on \
--seed 3407 \
--temp 0.7 \
--top-k 50 \
--top-p 1.0 \
--min-p 0.01 \
--dry-multiplier 0.0 \
--ctx-size 128000 \
--jinja

I found that the very first request after booting the server ran at 100+ tokens/s, while subsequent requests dropped to around 20 tokens/s. What could be causing this? 🤔
Answered by akumaburn, Jan 23, 2026
Answer selected by wey-gu


Did you download the latest quants? There was a bug fixed recently in llama.cpp that caused issues like looping in Unsloth's quants. See the Jan 21 update: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
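If the server was started with -hf before that update, it may still be loading an older cached GGUF. A minimal sketch of how to refresh it, assuming the default llama.cpp cache location (~/.cache/llama.cpp, or $LLAMA_CACHE if set) and that the cached filename contains the repo name:

```bash
# List the cache and drop the stale copy so -hf fetches the updated quant on the next start
# (path and filename pattern are assumptions; check your cache dir first)
ls ~/.cache/llama.cpp/
rm ~/.cache/llama.cpp/*GLM-4.7-Flash*UD-Q4_K_XL*

# Alternatively, pull the updated files explicitly with the Hugging Face CLI
huggingface-cli download unsloth/GLM-4.7-Flash-GGUF --include "*UD-Q4_K_XL*"
```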
Try setting the flag --parallel 1 to ensure there aren't any parallel requests going on.
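For instance, building on the command from the question (a sketch, not a verified config), the flag is simply appended to the existing invocation; --parallel 1 gives the server a single slot so only one request is decoded at a time:

```bash
./llama.cpp/llama-server \
  -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL --host 0.0.0.0 \
  --alias "unsloth/GLM-4.7-Flash" \
  --ctx-size 128000 \
  --jinja \
  --parallel 1   # single slot: concurrent requests queue instead of sharing compute
```

If a client or agent is quietly sending several requests at once, with more than one slot they share the GPUs and each stream slows down, which could look exactly like the fast-then-slow behavior you describe. As far as I know, llama-server also divides --ctx-size across slots, so a single slot keeps the full 128000-token context for each request.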