llama : auto-batch preparation #13845

Merged: 2 commits merged into master from gg/auto-batch on May 31, 2025

Conversation

ggerganov (Member) commented on May 28, 2025

target #13746

The memory implementations now implement their respective optimal strategies for splitting the input batch into ubatches (see llama_memory_i::init_batch()). For example, the iSWA KV cache currently attempts a simple split of the input batch. In the future, it will be updated to try different splitting strategies when the simple split fails; for instance, the batch-splitting logic from llama-server should be applied as a fallback. But to implement this, we first need to refactor the llama_sbatch and llama_ubatch implementations.

Outdated (original proposal):

This change adds logic to llama_decode() for retrying with a smaller n_ubatch if the input batch fails to fit. This logic is usually implemented in the user code, but it now comes integrated into libllama.

Note that the user code can still continue to do its own batching of the input; it is just no longer really needed.

As an example, llama-server is updated to no longer perform this process manually. The rest of the examples will be updated in follow-up PRs.
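For reference, a minimal sketch of the manual pattern being referred to. It is illustrative only, not the actual llama-server code, and the helper name is made up; it assumes the public llama_decode()/llama_batch_get_one() API and its return convention (0 = success, positive = no KV-cache slot found, negative = fatal error):

#include <algorithm>

#include "llama.h"

// Manual batching with retry: split the input into chunks and, whenever the
// KV cache cannot fit a chunk, halve the chunk size and resubmit it.
static bool decode_with_manual_retry(llama_context * ctx, llama_token * tokens, int32_t n_tokens, int32_t n_batch) {
    int32_t n_try = n_batch;
    for (int32_t i = 0; i < n_tokens; ) {
        const int32_t n_cur = std::min(n_try, n_tokens - i);
        const int32_t ret   = llama_decode(ctx, llama_batch_get_one(tokens + i, n_cur));

        if (ret == 0)              { i += n_cur; continue; } // chunk accepted
        if (ret  > 0 && n_try > 1) { n_try /= 2; continue; } // no KV slot: halve and retry
        return false;                                        // fatal error, or cannot shrink further
    }
    return true;
}

With the retry folded into llama_decode() as proposed here, the user-side loop reduces to submitting the input and treating any non-zero return as a hard error.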

This PR will be merged after #13746

Next PRs

  • Remove the notion of n_batch. We can now always work with the full n_ctx and simplify the logic of splitting the input into n_batch-sized chunks. (low prio; not ready for this yet)

ggerganov force-pushed the gg/kv-cache-simplify-part3 branch 6 times, most recently from 9d05381 to 2b984f4 on May 30, 2025 08:29
ggerganov marked this pull request as ready for review on May 30, 2025 14:35
ggerganov requested a review from ngxson as a code owner on May 30, 2025 14:35
compilade (Collaborator)

Remove the notion of n_batch. We can now always work with the full n_ctx and simplify the logic of splitting the input into n_batch sizes

Depending on how this is implemented, it can lead to much bigger buffers than necessary for very large n_ctx (e.g. 1M tokens for a recurrent model, where the KV cache doesn't scale with n_ctx). Batches help keep that small.

ggerganov (Member, Author)

Remove the notion of n_batch. We can now always work with the full n_ctx and simplify the logic of splitting the input into n_batch sizes

Depending on how this is implemented, it can lead to much bigger buffers than necessary for very large n_ctx (e.g. 1M tokens for a recurrent model, where the KV cache doesn't scale with n_ctx). Batches help keep that small.

Yes, in some situations n_batch still makes sense in the user code. But for the llama_context I think this parameter is redundant. Referring to this n_batch:

struct llama_cparams {
    uint32_t n_ctx;    // context size used during inference
    uint32_t n_batch;  // logical maximum batch size that can be submitted to llama_decode
    uint32_t n_ubatch; // physical maximum batch size (the micro-batch actually processed at once)
slaren (Member) commented on May 30, 2025

n_batch still puts a limit on the size of the output buffer. That's why it was kept when n_ubatch was added; the original implementation removed or ignored n_batch. However, since then the output buffers are dynamically reallocated as needed, so it is not as much of a problem anymore.
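For a sense of scale (illustrative numbers, not taken from this discussion): with fp32 logits and a 128k-token vocabulary, an output buffer sized for an n_batch of 2048 tokens is roughly 1 GB, while sizing it up front for a 1M-token n_ctx would be on the order of 500 GB, which is why bounding it by n_batch or reallocating it dynamically matters.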

ggerganov force-pushed the gg/kv-cache-simplify-part3 branch from f23e4cc to 71619f2 on May 31, 2025 07:05
Base automatically changed from gg/kv-cache-simplify-part3 to master on May 31, 2025 07:24
ggerganov changed the title from "llama : auto-batch" to "llama : auto-batch preparation" on May 31, 2025
ggerganov (Member, Author) commented on May 31, 2025

I realized that halving the n_ubatch size and retrying (as proposed originally in this PR) is not equivalent to what the server batching actually does. The n_ubatch-halving strategy is not very smart, because there are degenerate cases that can affect performance significantly. For example, a large batch with mixed sequences:

# prompt input for sequence 0 and generation tokens for sequences 1 and 2:
00000000[large prompt for seq 0]00012

In some cases, when the SWA cache is full but still has room because of masked tokens that remain in the cells, this logic ends up reducing n_ubatch all the way down to 2 in order to fit the multi-sequence tail of the input batch. But processing thousands of tokens with n_ubatch == 2 is very slow.

So what we want to do instead is split into the following ubatches for example:

# ubatch 0
00000000[large prompt for seq 0]00
# ubatch 1
012

This is the split_equal strategy. So the goal would be to implement this kind of logic in each init_batch(). This PR is just a small preparation for that.
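A rough sketch of the grouping idea (assumed semantics only, not the actual llama_sbatch/llama_ubatch code; the tok_ref and split_equal_sketch names are made up): every ubatch takes the same number of tokens from each sequence that still has tokens left, so the single-token generation tails end up together in one small ubatch instead of dragging n_ubatch down for the whole batch.

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// one token of the input batch together with the sequence it belongs to
struct tok_ref { int32_t token; int32_t seq_id; };

// Bucket the batch per sequence, then repeatedly emit a ubatch in which every
// still-active sequence contributes the same number of tokens (roughly capped
// by n_ubatch, assuming n_ubatch >= number of active sequences).
static std::vector<std::vector<tok_ref>> split_equal_sketch(const std::vector<tok_ref> & batch, size_t n_ubatch) {
    std::map<int32_t, std::vector<tok_ref>> per_seq;
    for (const auto & t : batch) {
        per_seq[t.seq_id].push_back(t);
    }

    std::map<int32_t, size_t> pos; // next unconsumed token per sequence
    std::vector<std::vector<tok_ref>> ubatches;

    for (;;) {
        // count the active sequences and find the shortest remaining tail
        size_t n_seqs = 0;
        size_t n_min  = SIZE_MAX;
        for (const auto & [sid, toks] : per_seq) {
            const size_t rem = toks.size() - pos[sid];
            if (rem > 0) {
                n_seqs++;
                n_min = std::min(n_min, rem);
            }
        }
        if (n_seqs == 0) {
            break; // the whole batch has been split
        }

        // equal contribution from each active sequence
        const size_t n_take = std::min(n_min, std::max<size_t>(1, n_ubatch/n_seqs));

        std::vector<tok_ref> ub;
        for (const auto & [sid, toks] : per_seq) {
            if (toks.size() - pos[sid] == 0) {
                continue;
            }
            ub.insert(ub.end(), toks.begin() + pos[sid], toks.begin() + pos[sid] + n_take);
            pos[sid] += n_take;
        }
        ubatches.push_back(std::move(ub));
    }

    return ubatches;
}

The ubatch ordering here differs from the example above, but the grouping is the same idea: the long seq-0 run gets its own full-sized ubatches, while the single-token tails of all sequences share one small ubatch.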

ggerganov merged commit 3f55f78 into master on May 31, 2025 (6 checks passed)
ggerganov deleted the gg/auto-batch branch on May 31, 2025 09:56