
llama : add example for tree-based parallel decoding #3137

Closed
@ggerganov

Description


Refs:

In simple terms, after implementing batched decoding (a.k.a. parallel decoding), we can extend the inference functionality to support applying a custom attention mask to the batch. This can be used to create a causal tree mask that allows evaluating a whole tree of continuations in a single pass, instead of a large batch of independent sequences.

This is useful for implementing advanced speculative strategies such as SpecInfer's token tree verification and Medusa heads.
