Hello,

For `llama-server`, `-np` sets the number of requests processed at the same time and `-c` sets the context length — but how do the two interact?
At first I thought `-c` was the context length per request. However, when I start `llama-server -m granite-4.0-h-1b-bf16.gguf -c 65536 -np 4` and then send a request of a bit over 30k tokens, it fails on context size.
Then I was told that `-c` is the context length for the entire batch, so the actual context length per request is `-c / -np`. Thread #4130 also seems to say something like that. But that does not match how much memory is consumed depending on the parameters.
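If that reading is right (a hypothesis, not confirmed llama-server behavior), the per-slot arithmetic for the failing run above would look like this:

```python
# Hedged sketch: assumes -c is the TOTAL context, split evenly across the
# -np slots, as thread #4130 suggests. Not confirmed llama-server behavior.
def ctx_per_slot(n_ctx: int, n_parallel: int) -> int:
    # each of the -np slots would get an equal share of -c
    return n_ctx // n_parallel

# The failing run: -c 65536 -np 4
slot_ctx = ctx_per_slot(65536, 4)
print(slot_ctx)          # 16384 tokens per slot
print(slot_ctx < 30000)  # True: a ~30k-token request cannot fit in one slot
```

Under this interpretation the 30k-token request failing would be expected, since each slot only sees 16384 tokens of context.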
Yet the memory footprint tells a different story. Each of these gives a different amount of memory reserved for the context:

`llama-server -m granite-4.0-h-1b-bf16.gguf -c 128000 -ngl 99 -fa off` gives:

`llama-server -m granite-4.0-h-1b-bf16.gguf -c 128000 -ngl 99 -fa off -np 2` gives:

`llama-server -m granite-4.0-h-1b-bf16.gguf -c 128000 -ngl 99 -fa off -np 8` gives:

So, judging by the memory consumed for the context, `-c` would seem to denote the context for a single request after all? But the server does not actually seem to offer that much context per request.
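One rough way to read those logs is to compare them against a back-of-envelope KV-cache estimate under each interpretation of `-c`. The layer/head numbers below are purely illustrative, not granite-4.0-h-1b's real config (which is a hybrid model with recurrent layers, so the real numbers will differ):

```python
# Hedged back-of-envelope KV-cache estimate for a plain transformer.
# ILLUSTRATIVE hyperparameters only -- not granite-4.0-h-1b's actual config.
def kv_cache_bytes(n_tokens, n_layers=24, n_kv_heads=8, head_dim=64,
                   bytes_per_elem=2):  # 2 bytes per element for an f16 cache
    # one K and one V vector per token, per layer, per KV head
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

c = 128000
for n_par in (1, 2, 8):
    if_c_is_total = kv_cache_bytes(c)              # constant across -np
    if_c_is_per_slot = kv_cache_bytes(c * n_par)   # grows linearly with -np
    print(n_par, if_c_is_total, if_c_is_per_slot)
```

If the logged cache size grows with `-np` at fixed `-c`, that would be consistent with the per-slot reading; if it stays constant, with the total reading.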
What is actually going on here? How do I properly set the context per request and the number of parallel requests? I would very much appreciate some enlightenment.
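For what it's worth, under the total-context reading the sizing I was hoping for would be computed like this (again a hypothesis, not confirmed behavior):

```python
# Hypothetical sizing helper, assuming -c is the total context that
# llama-server divides evenly across the -np slots.
def required_total_ctx(ctx_per_request: int, n_parallel: int) -> int:
    return ctx_per_request * n_parallel

# e.g. to guarantee ~32k tokens to each of 4 concurrent requests:
print(required_total_ctx(32768, 4))  # 131072, i.e. pass: -c 131072 -np 4
```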