Hello,

For `llama-server`, `-np` sets the number of requests processed at the same time and `-c` sets the context length — but how do the two interact?
At first I thought `-c` was the context length per request. However, when I start `llama-server -m granite-4.0-h-1b-bf16.gguf -c 65536 -np 4` and then send a request of a bit over 30k tokens, it fails on context size.
Then I was told that `-c` is the context length for the entire batch, so the actual context length per request is `-c / -np`. Thread #4130 also seems to say something like that. But that does not match how much memory is consumed depending on the parameters.
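If that reading is right (a hypothesis, not confirmed llama-server behavior), the per-slot arithmetic for the failing run above would look like this:

```python
# Hedged sketch: assumes -c is the TOTAL context, split evenly across the
# -np slots, as thread #4130 suggests. Not confirmed llama-server behavior.
def ctx_per_slot(n_ctx: int, n_parallel: int) -> int:
    # each of the -np slots would get an equal share of -c
    return n_ctx // n_parallel

# The failing run: -c 65536 -np 4
slot_ctx = ctx_per_slot(65536, 4)
print(slot_ctx)          # 16384 tokens per slot
print(slot_ctx < 30000)  # True: a ~30k-token request cannot fit in one slot
```

Under this interpretation the 30k-token request failing would be expected, since each slot only sees 16384 tokens of context.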
Yet the memory footprint tells a different story. Each of these gives a different amount of memory reserved for the context:

`llama-server -m granite-4.0-h-1b-bf16.gguf -c 128000 -ngl 99 -fa off` gives:

`llama-server -m granite-4.0-h-1b-bf16.gguf -c 128000 -ngl 99 -fa off -np 2` gives:

`llama-server -m granite-4.0-h-1b-bf16.gguf -c 128000 -ngl 99 -fa off -np 8` gives:

So, judging by the memory consumed for the context, `-c` would seem to denote the context for a single request after all? But the server does not actually seem to offer that much context per request.
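One rough way to read those logs is to compare them against a back-of-envelope KV-cache estimate under each interpretation of `-c`. The layer/head numbers below are purely illustrative, not granite-4.0-h-1b's real config (which is a hybrid model with recurrent layers, so the real numbers will differ):

```python
# Hedged back-of-envelope KV-cache estimate for a plain transformer.
# ILLUSTRATIVE hyperparameters only -- not granite-4.0-h-1b's actual config.
def kv_cache_bytes(n_tokens, n_layers=24, n_kv_heads=8, head_dim=64,
                   bytes_per_elem=2):  # 2 bytes per element for an f16 cache
    # one K and one V vector per token, per layer, per KV head
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

c = 128000
for n_par in (1, 2, 8):
    if_c_is_total = kv_cache_bytes(c)              # constant across -np
    if_c_is_per_slot = kv_cache_bytes(c * n_par)   # grows linearly with -np
    print(n_par, if_c_is_total, if_c_is_per_slot)
```

If the logged cache size grows with `-np` at fixed `-c`, that would be consistent with the per-slot reading; if it stays constant, with the total reading.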
What is actually going on here? How do I properly set the context per request and the number of parallel requests? I would very much appreciate some enlightenment.
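For what it's worth, under the total-context reading the sizing I was hoping for would be computed like this (again a hypothesis, not confirmed behavior):

```python
# Hypothetical sizing helper, assuming -c is the total context that
# llama-server divides evenly across the -np slots.
def required_total_ctx(ctx_per_request: int, n_parallel: int) -> int:
    return ctx_per_request * n_parallel

# e.g. to guarantee ~32k tokens to each of 4 concurrent requests:
print(required_total_ctx(32768, 4))  # 131072, i.e. pass: -c 131072 -np 4
```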