Closed
Description
Refs:
In simple terms, after implementing batched decoding (a.k.a. parallel decoding) we can extend the inference functionality to support applying a custom attention mask to the batch. This can be used to create a causal tree mask that allows to evaluate a tree of continuations in a single pass, instead of a large batch of independent sequences.
This is useful for implementing advanced speculative strategies such as SpecInfer's token tree verification and Medusa heads