
llama : add example for tree-based parallel decoding #3137

Closed
@ggerganov

Description


Refs:

In simple terms, after implementing batched decoding (a.k.a. parallel decoding), we can extend the inference functionality to support applying a custom attention mask to the batch. This can be used to create a causal tree mask that allows evaluating a whole tree of continuations in a single pass, instead of a large batch of independent sequences.

This is useful for implementing advanced speculative strategies such as SpecInfer's token tree verification and Medusa heads.
