Description
We want to be able to generate multiple sequences sharing the same context (a.k.a. prompt) in parallel.
Demonstrated in one of the examples by @xaedes:
Should become part of the official `llama.cpp` API.
ref: #2789
Implementation details
Regarding the API for the batched inference functionality, one way is to add a function:
// TODO: better name?
void llama_context_set_parallel(struct llama_context * ctx, int n_batches);
This would reallocate the `kv_self` cache to fit `n_batches` batches.
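For illustration, here is how a caller might use the proposed function. This is only a sketch: `llama_context_set_parallel` does not exist yet, and the surrounding calls follow the `llama.h` API around the referenced revision, so exact signatures may differ in other versions:

```c
#include "llama.h"

int main(void) {
    llama_backend_init(false); // numa = false

    struct llama_context_params params = llama_context_default_params();

    // example model path - adjust to your setup
    struct llama_model   * model = llama_load_model_from_file("models/7B/ggml-model.gguf", params);
    struct llama_context * ctx   = llama_new_context_with_model(model, params);

    // proposed call: reallocate the kv_self cache to hold 4 parallel batches
    llama_context_set_parallel(ctx, 4);

    // ... tokenize, eval and sample as usual (see the sketches below) ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();

    return 0;
}
```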
During `llama_eval`, we do what we normally do, with the extra step of batching the input as demonstrated in the example. We can probably avoid changing the `eval` API by adding the implicit assumption that `tokens` will contain the tokens for `n_batches` batches:
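As a sketch of that assumption (not of existing API behavior), the caller could pack the tokens of all batches into one flat, batch-major array and pass it through the existing `llama_eval` call. The helper below is illustrative and its name is made up:

```c
#include <stdlib.h>
#include <string.h>

#include "llama.h"

// Sketch only: pack n_batches equal-length sequences into one flat array and
// run them through the unchanged llama_eval() call. The batch-major layout is
// the implicit assumption described above - it is not something llama.h defines.
static int eval_parallel(struct llama_context * ctx,
                         const llama_token * const * seqs, // seqs[b][t] = token t of sequence b
                         int n_batches, int n_tokens, int n_past, int n_threads) {
    llama_token * tokens = malloc(n_batches*n_tokens*sizeof(llama_token));

    for (int b = 0; b < n_batches; ++b) {
        memcpy(tokens + b*n_tokens, seqs[b], n_tokens*sizeof(llama_token));
    }

    // whether n_tokens should count one batch or all batches together is one
    // of the details the final API would have to pin down
    const int ret = llama_eval(ctx, tokens, n_tokens, n_past, n_threads);

    free(tokens);
    return ret;
}
```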
In the end, we just need to update the API for accessing the logits of all the batches, or once again - without changing the API, have an implicit assumption that the results will be for `n_batches` batches:
https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/llama.h#L341
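Continuing the sketch above (and reusing `ctx` and `n_batches` from it), reading the per-batch results under that implicit convention could look like this. The `[n_batches, n_vocab]` layout of the returned buffer is an assumption of the proposal, not something the linked API guarantees today:

```c
// Sketch: with the implicit convention above, llama_get_logits() would return
// the logits of the last token of every batch, laid out back to back as
// [n_batches, n_vocab]. This layout is an assumption, not current behavior.
const int     n_vocab = llama_n_vocab(ctx);
const float * logits  = llama_get_logits(ctx);

for (int b = 0; b < n_batches; ++b) {
    const float * batch_logits = logits + b*n_vocab;

    // e.g. greedy sampling per batch
    int best = 0;
    for (int i = 1; i < n_vocab; ++i) {
        if (batch_logits[i] > batch_logits[best]) {
            best = i;
        }
    }
    // `best` is the next token for sequence b
}
```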
So on first thought, we would just need a single new function added to `llama.h` - `llama_context_set_parallel()`.
I think this should be enough, but I could be missing something.
One optimization to consider is whether we can avoid having separate KV caches for the common prefix of the parallel runs. The straightforward implementation would create a copy of the prefix for each batch, while in theory we need just one. Not sure how complicated it would be to handle this. It might require implementing Paged Attention, which is probably a job for another time.
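For a rough sense of what is at stake, here is a back-of-the-envelope calculation with made-up 7B-like dimensions (n_layer = 32, n_embd = 4096, F16 cache, n_ctx = 2048, 8 batches sharing a 1536-token prefix - all of these numbers are assumptions for illustration):

```c
#include <stdio.h>

int main(void) {
    // illustrative 7B-like dimensions (assumptions, not measured values)
    const long long n_layer  = 32;
    const long long n_embd   = 4096;
    const long long f16_size = 2;                         // bytes per element
    const long long per_tok  = 2*n_layer*n_embd*f16_size; // K + V bytes per token

    const long long n_ctx     = 2048;
    const long long n_batches = 8;
    const long long n_prefix  = 1536;                     // tokens shared by all batches

    const long long naive  = n_batches*n_ctx*per_tok;                           // prefix copied per batch
    const long long shared = (n_prefix + n_batches*(n_ctx - n_prefix))*per_tok; // prefix stored once

    printf("naive : %.2f GiB\n", naive  / (1024.0*1024.0*1024.0));
    printf("shared: %.2f GiB\n", shared / (1024.0*1024.0*1024.0));

    return 0;
}
```

With these made-up numbers, the shared-prefix layout would need roughly 2.75 GiB instead of 8 GiB for the 8 parallel sequences, which is why the optimization seems worth keeping in mind even if it lands later.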