llama : add support for batched inference #2813

Closed
@ggerganov

Description

We want to be able to generate multiple sequences sharing the same context (a.k.a. prompt) in parallel.

Demonstrated in one of the examples by @xaedes:

https://github.com/ggerganov/llama.cpp/blob/eff86d4f1334c08300d3cb1110dbac3c8e26286c/examples/baby-llama/baby-llama.cpp#L785-L794

This should become part of the official llama.cpp API.

ref: #2789

Implementation details

Regarding the API for the batched inference functionality, one way is to add a function:

// TODO: better name?
void llama_context_set_parallel(struct llama_context * ctx, int n_batches);

This would reallocate the kv_self cache to fit n_batches batches.
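
For a rough sense of the memory cost of that reallocation, here is a back-of-the-envelope sketch. The fp16 K/V entries and the one-entry-per-token-per-layer accounting are assumptions for illustration, not the actual kv_self layout:

#include <cstddef>

// Approximate KV cache size after the proposed
// llama_context_set_parallel(ctx, n_batches).
// Assumes fp16 K and V (2 bytes each), one entry per token per layer.
size_t kv_cache_bytes(int n_layer, int n_ctx, int n_embd, int n_batches) {
    return (size_t) 2 /* K and V */ * 2 /* fp16 */ * n_layer * n_ctx * n_embd * n_batches;
}

// e.g. 7B LLaMA (n_layer = 32, n_embd = 4096) at n_ctx = 512:
//   n_batches = 1 -> 256 MiB, n_batches = 4 -> 1 GiB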

During llama_eval, we do what we normally do, with the extra step of batching the input as demonstrated in the example. We can probably avoid changing the eval API by adding the implicit assumption that tokens will contain the tokens for n_batches batches:

https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/llama.h#L315-L320
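
To make the implicit assumption concrete, here is a caller-side sketch. The layout (equal-length sequences concatenated back to back) and the meaning of n_tokens (total across all batches) are illustrative choices only; the issue leaves both open:

#include "llama.h"

#include <vector>

// Sketch only: evaluate n_batches equal-length sequences in one llama_eval
// call. Assumes seqs is non-empty and all sequences have the same length.
void eval_batched(llama_context * ctx,
                  const std::vector<std::vector<llama_token>> & seqs,
                  int n_past, int n_threads) {
    const int n_batches = (int) seqs.size();
    const int n_tok     = (int) seqs[0].size();

    // concatenate the sequences back to back (assumed layout)
    std::vector<llama_token> flat;
    flat.reserve((size_t) n_batches * n_tok);
    for (const auto & s : seqs) {
        flat.insert(flat.end(), s.begin(), s.end());
    }

    // ctx is assumed to have been configured beforehand with the proposed
    // llama_context_set_parallel(ctx, n_batches)
    llama_eval(ctx, flat.data(), n_batches * n_tok, n_past, n_threads);
}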

In the end, we just need to update the API for accessing the logits of all the batches, or, once again without changing the API, rely on the implicit assumption that the results will be for n_batches batches:

https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/llama.h#L341
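
And a matching sketch for reading the results, assuming the concatenated layout above and that logits come back row-major with one row of n_vocab floats per evaluated token (the current logits_all convention, extended to batches):

#include "llama.h"

#include <cstddef>

// Sketch only: pick out the last-token logits of each sequence after the
// batched eval above. Row indexing follows the assumed concatenated layout.
void sample_each_batch(llama_context * ctx, int n_batches, int n_tok) {
    const int     n_vocab = llama_n_vocab(ctx);
    const float * logits  = llama_get_logits(ctx);

    for (int b = 0; b < n_batches; ++b) {
        // last row of sequence b's slice of the logits
        const float * row = logits + ((size_t) (b + 1) * n_tok - 1) * n_vocab;
        // ... sample the next token for sequence b from row[0..n_vocab-1] ...
        (void) row;
    }
}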


So on first thought, we would just need a single new function added to llama.h: llama_context_set_parallel().
I think this should be enough, but I could be missing something.

One optimization to consider is whether we can avoid keeping separate KV cache copies of the common prefix of the parallel runs. The straightforward implementation would store a copy of the prefix for each batch, while in theory we need just one. Not sure how complicated it would be to handle this; it might require implementing Paged Attention, which is probably a job for another time.
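
For intuition, a sketch of the two cache layouts, plus the duplicated memory that prefix sharing would reclaim under the same fp16 estimate as above:

// Naive per-batch copies of the shared prompt:
//
//   batch 0 cache: [ prompt | completion 0 ]
//   batch 1 cache: [ prompt | completion 1 ]   <- prompt KV duplicated
//   batch 2 cache: [ prompt | completion 2 ]   <- prompt KV duplicated
//
// Shared-prefix layout (Paged-Attention style): one prompt slab plus
// per-batch tails:
//
//   shared:        [ prompt ]
//   batch 0 cache: [ completion 0 ]
//   batch 1 cache: [ completion 1 ]
//   batch 2 cache: [ completion 2 ]

#include <cstddef>

// Duplicated prompt KV that prefix sharing would eliminate
// (fp16 K and V assumed, as in the earlier estimate).
size_t duplicated_prefix_bytes(int n_layer, int n_embd, int n_prompt, int n_batches) {
    return (size_t) 2 * 2 * n_layer * n_embd * n_prompt * (n_batches - 1);
}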
