llama : auto-batch preparation #13845
> Depending on how this will be implemented, this can lead to much bigger buffers than necessary for very big …

Yes, in some situations … (see lines 9 to 12 in 2921e85)
I realized that halving the … In some cases, when the SWA cache is full, but has room because of masked tokens still present in the cells, the logic will end up reducing …

```
# prompt input for sequence 0 and generation tokens for sequences 1 and 2:
00000000[large prompt for seq 0]00012
```

So what we want to do instead is split it into the following ubatches, for example:

```
# ubatch 0
00000000[large prompt for seq 0]00

# ubatch 1
012
```

This is the …
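For illustration, here is a rough C++ sketch of the grouping described above. It is not the actual llama.cpp splitting code and the helper name `split_example` is made up; it only shows the idea of putting the bulk of the prompt into one ubatch and the trailing token of each sequence into another.

```cpp
// Rough illustration of the split described above, not the actual
// llama.cpp implementation. Assumes one seq_id per token and that the
// batch is ordered as in the example.
#include <map>
#include <vector>

#include "llama.h"

static void split_example(const llama_batch & batch,
                          std::vector<int32_t> & ubatch0,   // bulk of the prompt (seq 0)
                          std::vector<int32_t> & ubatch1) { // one trailing token per sequence
    // index of the last token of each sequence
    std::map<llama_seq_id, int32_t> last;
    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        last[batch.seq_id[i][0]] = i;
    }

    // last token of each sequence -> ubatch 1, everything else -> ubatch 0
    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        const bool is_last = (last[batch.seq_id[i][0]] == i);
        (is_last ? ubatch1 : ubatch0).push_back(i);
    }
}
```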
target #13746
The memory implementations now implement their respective optimal strategy for splitting the input batch into ubatches (see `llama_memory_i::init_batch()`). For example, the iSWA KV cache currently attempts to do a simple split of the input batch. In the future, it will be updated to try different splitting strategies if the simple split fails. For example, the batch splitting logic from `llama-server` should be done as a fallback. But to implement this, we first need to refactor the `llama_sbatch` and `llama_ubatch` implementation. *(outdated)*
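To make the fallback idea concrete, here is a hypothetical sketch. The helper names (`try_simple_split`, `try_equal_split`) and the `memory_state` placeholder are invented for illustration and are not the actual `llama_memory_i` interface; only the try-then-fall-back shape is the point.

```cpp
// Hypothetical sketch of the fallback idea; the types and helpers below
// are placeholders, not the real llama.cpp interfaces.
#include "llama.h"

struct memory_state;                                    // stands in for the result of init_batch()

memory_state * try_simple_split(const llama_batch &);   // placeholder splitting strategies
memory_state * try_equal_split (const llama_batch &);

static memory_state * init_batch_with_fallback(const llama_batch & batch) {
    // first attempt: a simple split of the input batch
    if (memory_state * res = try_simple_split(batch)) {
        return res;
    }

    // fallback: a different splitting strategy, similar to the batch
    // splitting that llama-server used to do on the user side
    if (memory_state * res = try_equal_split(batch)) {
        return res;
    }

    // nothing fits: the caller (llama_decode) can retry with a smaller n_ubatch
    return nullptr;
}
```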
This change adds logic to `llama_decode()` for retrying with a smaller `n_ubatch` if we fail to fit the input batch. This logic is usually implemented in the user code, but it now comes integrated into `libllama`. Note that the user code can still continue to do its own batching of the input; it's just no longer really needed.
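For context, below is a sketch of roughly the kind of retry loop that user code had to implement before this change. The helper name `decode_with_retry` is made up and the real `llama-server` logic differs in the details; on a non-fatal failure it simply retries the remaining tokens with smaller chunks.

```cpp
// Sketch of the user-side retry logic that libllama now handles internally.
// Assumes a token batch (batch.embd == NULL) with per-token pos/seq_id/logits
// arrays, as produced by llama_batch_init().
#include <algorithm>

#include "llama.h"

static int32_t decode_with_retry(llama_context * ctx, const llama_batch & batch) {
    int32_t n_chunk = batch.n_tokens; // current chunk size
    int32_t i       = 0;              // next token to submit

    while (i < batch.n_tokens) {
        const int32_t n = std::min(n_chunk, batch.n_tokens - i);

        // view into the original batch, no copies
        llama_batch view = {
            n,
            batch.token    + i,
            nullptr,
            batch.pos      + i,
            batch.n_seq_id + i,
            batch.seq_id   + i,
            batch.logits   + i,
        };

        const int32_t ret = llama_decode(ctx, view);

        if (ret == 0) {
            i += n;               // chunk fits, advance
        } else if (ret > 0 && n_chunk > 1) {
            n_chunk /= 2;         // did not fit, retry this part with a smaller chunk
        } else {
            return ret;           // fatal error, or cannot shrink any further
        }
    }

    return 0;
}
```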
As an example, the `llama-server` is updated to no longer perform this process manually. The rest of the examples will be updated in follow-up PRs.

This PR will be merged after #13746.
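With the retry integrated into `libllama`, the calling side can be reduced to roughly the following (a minimal illustration, not the actual `llama-server` code):

```cpp
#include "llama.h"

// submit the whole batch and let llama_decode() handle the ubatch splitting
// and the retries internally
static bool decode_simple(llama_context * ctx, llama_batch batch) {
    const int32_t ret = llama_decode(ctx, batch);

    if (ret < 0) {
        return false; // fatal error, e.g. invalid input batch
    }
    if (ret > 0) {
        return false; // did not fit even after the internal retries, e.g. the KV cache is full
    }

    return true;
}
```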
Next PRs

- Remove the notion of `n_batch`. We can now always work with the full `n_ctx` and simplify the logic of splitting the input into `n_batch`-sized chunks. (not ready for this yet, low prio)
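For reference, a minimal sketch of the context parameters involved today (illustrative values only; the removal itself is not part of this PR): with the auto-batching in place, `n_batch` could in principle simply follow `n_ctx`.

```cpp
#include "llama.h"

static llama_context_params make_ctx_params() {
    llama_context_params params = llama_context_default_params();

    params.n_ctx    = 8192;         // total context size
    params.n_batch  = params.n_ctx; // logical batch size, the parameter this item proposes to remove
    params.n_ubatch = 512;          // physical micro-batch size used by the compute graph

    return params;
}
```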