-
Thanks for your help!
-
Hello, good question!
The batch size is the logical batch at the application level, while the ubatch size is the physical micro-batch actually submitted to the device, so `batch_size >= ubatch_size` always holds. You can find some references here:
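To make the relationship concrete, here is a small conceptual sketch (not llama.cpp's actual code; the default values of 2048/512 are illustrative): a logical batch of prompt tokens is split into physical micro-batches that the device processes one at a time.

```python
# Conceptual sketch only: illustrates how a logical batch (batch_size)
# is chopped into physical micro-batches (ubatch_size) for the device.
def split_into_ubatches(tokens, batch_size=2048, ubatch_size=512):
    assert ubatch_size <= batch_size, "ubatch_size must not exceed batch_size"
    batch = tokens[:batch_size]  # one logical batch of tokens
    # Split the logical batch into device-sized chunks.
    return [batch[i:i + ubatch_size] for i in range(0, len(batch), ubatch_size)]

prompt = list(range(1300))  # 1300 dummy token ids
ubatches = split_into_ubatches(prompt)
print([len(u) for u in ubatches])  # → [512, 512, 276]
```

So a 1300-token prompt fits in one logical batch but is executed as three device passes; that is why the two knobs can affect performance independently.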
-
@phymbert
-
Excuse me, I'm using 4x Tesla T4 GPUs for computation and ran experiments testing various KV cache type settings. I found that the `--batch-size` setting has almost no impact on time to first token or on inference time.

My question: in which use scenarios should I adjust `--batch-size` or `--ubatch-size`?

Test setup: 4x Tesla T4; batch-size changes had negligible impact.

I would appreciate guidance on the specific scenarios where these parameters provide meaningful performance improvements.
The default values are here:
https://github.com/ggerganov/llama.cpp/blob/557410b8f06380560155ac7fcb8316d71ddc9837/common/common.h#L57
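For reference, both values can be overridden on the command line; this is a sketch of a typical `llama-server` invocation using the `-b`/`-ub` flags (the model path and the specific values shown are placeholders, not recommendations):

```shell
# Override the logical batch size (-b) and the physical micro-batch
# size (-ub); the tools reject configurations where ub would exceed b.
./llama-server -m ./model.gguf -b 2048 -ub 512
```

Larger `-b` mainly helps when many prompt tokens (or many parallel requests) can be queued at once, while `-ub` trades peak device memory against per-pass throughput, which is consistent with single-request tests showing little sensitivity to `-b` alone.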