[RFC]: [Spec Decode] Combine Ngram and EAGLE

### Motivation.

EAGLE and Ngram give speedups in different scenarios.
* Ngram is beneficial when the prompt contains n-grams that are useful for creating drafts, e.g., editing tasks.
* EAGLE is a drafting strategy that can propose draft tokens in any general situation.

Right now, we can start `vllm serve` with either ngram OR EAGLE, so we need to decide beforehand which draft strategy to use. This means we have to deploy separate model instances in production if we want to support both editing requests and general requests. This leads to suboptimal GPU resource utilization, since the mix of editing and non-editing traffic is not static.

### Proposed Change.

If we combine ngram and EAGLE, we don't need separate ngram and EAGLE deployments, and a single instance can pick which strategy to use at any given timestep for a given request. Ngram is a model-free drafter whereas EAGLE needs a draft model, so running both should add little overhead beyond what EAGLE alone already costs. It could also improve the overall acceptance length (AL) compared to using only EAGLE.

#### Draft merge strategy 
For any given timestep and for a given seq in a batch:
* check whether there is any ngram match
  * If yes, use the matched tokens as the draft tokens and skip EAGLE for this seq
  * If no, add the seq to a batch to be processed by EAGLE
* run EAGLE if any seq needs it
* verify the proposed tokens as usual using the target model and the Rejection Sampler
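
The steps above could be sketched roughly as follows. This is only an illustration of the routing logic, not vLLM code: `propose_drafts`, `ngram_propose`, and `eagle_propose` are hypothetical names, and the two proposers are passed in as plain callables so the sketch stays self-contained. Verification by the target model is left out.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical aliases for illustration only; these are not vLLM types.
TokenIds = List[int]
NgramProposer = Callable[[Sequence[int]], Optional[TokenIds]]
EagleProposer = Callable[[List[Sequence[int]]], List[TokenIds]]


def propose_drafts(
    batch: List[Sequence[int]],
    ngram_propose: NgramProposer,
    eagle_propose: EagleProposer,
) -> List[TokenIds]:
    """Return one draft per seq, preferring an ngram match over EAGLE."""
    drafts: Dict[int, TokenIds] = {}
    eagle_indices: List[int] = []

    for i, seq in enumerate(batch):
        match = ngram_propose(seq)
        if match:
            # Ngram match found: use it and skip EAGLE for this seq.
            drafts[i] = list(match)
        else:
            # No match: defer this seq to the EAGLE sub-batch.
            eagle_indices.append(i)

    # Run EAGLE once, and only if at least one seq needs it.
    if eagle_indices:
        eagle_out = eagle_propose([batch[i] for i in eagle_indices])
        for i, draft in zip(eagle_indices, eagle_out):
            drafts[i] = list(draft)

    # Drafts may have different lengths per seq (e.g., 5 vs. 3 tokens).
    return [drafts[i] for i in range(len(batch))]
```

Note that the returned drafts are ragged: seqs served by ngram can carry more draft tokens than seqs served by EAGLE, which is what the verification step has to cope with.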

The key insight is that the optimal K for EAGLE found in practice is ~3 ([src1](https://github.com/vllm-project/vllm/issues/17812), [src2](https://developer.nvidia.com/blog/blackwell-breaks-the-1000-tps-user-barrier-with-metas-llama-4-maverick/)), i.e., EAGLE proposes at most 3 tokens at a time. Ngram is usually run with K=5, but this can be set higher. If `prompt_lookup_max` and `prompt_lookup_min` are set correctly, an ngram match signals that the current sequence/dataset at the current step has a good chance of benefiting from ngram lookup. If no match is found, we defer to EAGLE to produce the draft. The merged strategy will get a separate method name like `ngram-eagle` and will not change the default behavior of `ngram` or `eagle`.
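
To make the role of `prompt_lookup_min` and `prompt_lookup_max` concrete, here is a minimal sketch of a prompt-lookup match. It is an assumption-laden illustration, not the vLLM ngram proposer: the function name `ngram_lookup`, the parameter `k`, the longest-n-gram-first order, and the choice of the most recent earlier occurrence are all made up for this example.

```python
from typing import List, Optional, Sequence


def ngram_lookup(
    token_ids: Sequence[int],
    prompt_lookup_min: int,
    prompt_lookup_max: int,
    k: int,
) -> Optional[List[int]]:
    """Return up to k draft tokens that followed an earlier copy of the suffix.

    Returns None when no n-gram of length in
    [prompt_lookup_min, prompt_lookup_max] repeats earlier in token_ids.
    """
    if prompt_lookup_min < 1 or k < 1:
        raise ValueError("prompt_lookup_min and k must be >= 1")

    n_tokens = len(token_ids)
    longest = min(prompt_lookup_max, n_tokens - 1)

    # Try the longest n-gram first: a longer match is a stronger signal.
    for n in range(longest, prompt_lookup_min - 1, -1):
        suffix = list(token_ids[n_tokens - n:])
        # Scan earlier windows, most recent first (arbitrary choice here).
        for start in range(n_tokens - n - 1, -1, -1):
            if list(token_ids[start:start + n]) == suffix:
                # The tokens that followed the earlier occurrence are the draft.
                return list(token_ids[start + n:start + n + k])
    return None
```

With this sketch, a larger `prompt_lookup_min` makes a match rarer but more trustworthy, which is the property the merged strategy relies on when it treats "match found" as the signal to skip EAGLE.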

### Feedback Period.

1 week

### CC List.

@LiuXiaoxuanPKU @WoosukKwon

### Any Other Things.

One of the challenges is the impact on torch.compile and CUDA graphs, since the number of draft tokens can differ per seq during the verification step (K=~3 for EAGLE and K>=5 for Ngram). I will see how it works out, but feel free to share if anyone has any thoughts.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
