Description
We want to be able to generate multiple sequences sharing the same context (a.k.a. prompt) in parallel.
Demonstrated in one of the examples by @xaedes:
Should become part of the official `llama.cpp` API.
ref: #2789
Implementation details
Regarding the API for the batched inference functionality, one way is to add a function:
// TODO: better name?
void llama_context_set_parallel(struct llama_context * ctx, int n_batches);
This would reallocate the `kv_self` cache to fit `n_batches` batches.
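For illustration, here is how a caller might use the proposed function. This is only a sketch: `llama_context_set_parallel` does not exist yet, and the surrounding calls follow the `llama.h` API around the referenced revision, so exact signatures may differ in other versions:

```c
#include "llama.h"

int main(void) {
    llama_backend_init(false); // numa = false

    struct llama_context_params params = llama_context_default_params();

    // example model path - adjust to your setup
    struct llama_model   * model = llama_load_model_from_file("models/7B/ggml-model.gguf", params);
    struct llama_context * ctx   = llama_new_context_with_model(model, params);

    // proposed call: reallocate the kv_self cache to hold 4 parallel batches
    llama_context_set_parallel(ctx, 4);

    // ... tokenize, eval and sample as usual (see the sketches below) ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();

    return 0;
}
```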
During `llama_eval`, we do what we normally do, with the extra step of batching the input as demonstrated in the example. We can probably avoid changing the `eval` API by adding the implicit assumption that `tokens` will contain the tokens for `n_batches` batches:
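As a sketch of that assumption (not of existing API behavior), the caller could pack the tokens of all batches into one flat, batch-major array and pass it through the existing `llama_eval` call. The helper below is illustrative and its name is made up:

```c
#include <stdlib.h>
#include <string.h>

#include "llama.h"

// Sketch only: pack n_batches equal-length sequences into one flat array and
// run them through the unchanged llama_eval() call. The batch-major layout is
// the implicit assumption described above - it is not something llama.h defines.
static int eval_parallel(struct llama_context * ctx,
                         const llama_token * const * seqs, // seqs[b][t] = token t of sequence b
                         int n_batches, int n_tokens, int n_past, int n_threads) {
    llama_token * tokens = malloc(n_batches*n_tokens*sizeof(llama_token));

    for (int b = 0; b < n_batches; ++b) {
        memcpy(tokens + b*n_tokens, seqs[b], n_tokens*sizeof(llama_token));
    }

    // whether n_tokens should count one batch or all batches together is one
    // of the details the final API would have to pin down
    const int ret = llama_eval(ctx, tokens, n_tokens, n_past, n_threads);

    free(tokens);
    return ret;
}
```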
In the end, we just need to update the API for accessing the logits of all the batches, or once again - without changing the API, have an implicit assumption that the results will be for `n_batches` batches:
https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/llama.h#L341
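Continuing the sketch above (and reusing `ctx` and `n_batches` from it), reading the per-batch results under that implicit convention could look like this. The `[n_batches, n_vocab]` layout of the returned buffer is an assumption of the proposal, not something the linked API guarantees today:

```c
// Sketch: with the implicit convention above, llama_get_logits() would return
// the logits of the last token of every batch, laid out back to back as
// [n_batches, n_vocab]. This layout is an assumption, not current behavior.
const int     n_vocab = llama_n_vocab(ctx);
const float * logits  = llama_get_logits(ctx);

for (int b = 0; b < n_batches; ++b) {
    const float * batch_logits = logits + b*n_vocab;

    // e.g. greedy sampling per batch
    int best = 0;
    for (int i = 1; i < n_vocab; ++i) {
        if (batch_logits[i] > batch_logits[best]) {
            best = i;
        }
    }
    // `best` is the next token for sequence b
}
```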
So on first thought, we would just need a single new function added to `llama.h` - `llama_context_set_parallel()`.
I think this should be enough, but I could be missing something.
One optimization to consider is whether we can avoid having separate KV caches for the common prefix of the parallel runs. The straightforward implementation would create a copy of the prefix for each batch, while in theory we need just one. Not sure how complicated it would be to handle this. It might require implementing Paged Attention, which is probably a job for another time.
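For a rough sense of what is at stake, here is a back-of-the-envelope calculation with made-up 7B-like dimensions (n_layer = 32, n_embd = 4096, F16 cache, n_ctx = 2048, 8 batches sharing a 1536-token prefix - all of these numbers are assumptions for illustration):

```c
#include <stdio.h>

int main(void) {
    // illustrative 7B-like dimensions (assumptions, not measured values)
    const long long n_layer  = 32;
    const long long n_embd   = 4096;
    const long long f16_size = 2;                         // bytes per element
    const long long per_tok  = 2*n_layer*n_embd*f16_size; // K + V bytes per token

    const long long n_ctx     = 2048;
    const long long n_batches = 8;
    const long long n_prefix  = 1536;                     // tokens shared by all batches

    const long long naive  = n_batches*n_ctx*per_tok;                           // prefix copied per batch
    const long long shared = (n_prefix + n_batches*(n_ctx - n_prefix))*per_tok; // prefix stored once

    printf("naive : %.2f GiB\n", naive  / (1024.0*1024.0*1024.0));
    printf("shared: %.2f GiB\n", shared / (1024.0*1024.0*1024.0));

    return 0;
}
```

With these made-up numbers, the shared-prefix layout would need roughly 2.75 GiB instead of 8 GiB for the 8 parallel sequences, which is why the optimization seems worth keeping in mind even if it lands later.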