[SpecDecode] Support EAGLE in V1 #15901

Open
@WoosukKwon

Description

@WoosukKwon
  • 1. Correctly initialize and load the EAGLE draft model
  • 2. Account for the lookahead slots in the KV cache manager
  • 3. Cache draft_probs inside the model runner and correctly feed them to the rejection sampler in the next step (temporary workaround: [V1][Spec Decode] Always use argmax for sampling draft tokens #16899)
  • 4. Handle edge cases, such as when the draft model generates beyond max_pos_embeddings
  • 5. Handle the seeds correctly
  • 6. Run E2E correctness and performance tests
  • 7. Support prefix caching. Eagle requires special handling because Eagle's i-th KV cache entry is coupled with the (i+1)-th token ID. (@LiuXiaoxuanPKU)
  • 8. Properly handle the sampling parameters that are not (currently) compatible with spec decoding (e.g., min_p).
  • 9. Use CUDA graphs for the draft model. (@luyuzhe111)
  • 10. Support Eagle 3 ([V1][Spec Decode] EAGLE-3 Support #16937)
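Item 2 comes down to reserving KV-cache slots for the tokens the draft model will write in the next step, not only for the target model's newly scheduled tokens. A minimal sketch of that arithmetic, assuming a paged KV cache with a fixed `block_size`; the function and parameter names are hypothetical and not vLLM's API:

```python
import math


def num_blocks_needed(num_computed_tokens: int,
                      num_new_tokens: int,
                      num_lookahead_tokens: int,
                      block_size: int) -> int:
    """Hypothetical helper: KV-cache blocks a request must own so that the
    tokens scheduled this step *and* the draft model's lookahead tokens all
    have a slot."""
    total_slots = num_computed_tokens + num_new_tokens + num_lookahead_tokens
    # Ceiling division: a partially filled block still has to be allocated.
    return math.ceil(total_slots / block_size)


# 30 cached + 1 new token fit in 2 blocks of 16, but 3 lookahead slots
# push the request into a third block.
print(num_blocks_needed(30, 1, 0, 16))  # 2
print(num_blocks_needed(30, 1, 3, 16))  # 3
```

If the lookahead slots are ignored, the draft model can end up writing KV entries past the last block the request actually owns.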
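For item 3, the argmax workaround makes the draft distribution q one-hot at the proposed token. Under the standard speculative-sampling rule (accept token x with probability min(1, p(x)/q(x)), otherwise resample from max(p - q, 0) renormalized), this collapses to: accept with probability p(x), and on rejection sample from p with x zeroed out. The pure-Python sketch below illustrates that rule under those assumptions; the function name and the list-of-lists layout of `target_probs` are hypothetical, and this is not vLLM's rejection sampler.

```python
import random
from typing import List, Sequence


def verify_argmax_drafts(draft_tokens: Sequence[int],
                         target_probs: Sequence[Sequence[float]],
                         rng: random.Random) -> List[int]:
    """Hypothetical sketch. target_probs has len(draft_tokens) + 1 rows:
    row i is the target distribution at draft token i's position, and the
    last row is used for the bonus token when every draft is accepted."""
    output: List[int] = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i]
        # q is one-hot at tok, so min(1, p[tok] / q[tok]) == p[tok].
        if rng.random() < p[tok]:
            output.append(tok)
            continue
        # Residual max(p - q, 0) is p with tok zeroed out; choices()
        # renormalizes the weights internally.
        residual = list(p)
        residual[tok] = 0.0
        output.append(rng.choices(range(len(residual)), weights=residual)[0])
        return output
    # All drafts accepted: sample one bonus token from the final target row.
    last = target_probs[-1]
    output.append(rng.choices(range(len(last)), weights=last)[0])
    return output
```

Passing an explicit `random.Random` keeps the draws reproducible for a given seed.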
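Item 4 arises because drafting several tokens ahead can push positions past the model's maximum length. One simple policy, shown here purely as an assumption (the issue does not say how this will be handled), is to shrink the number of draft tokens so that no draft position falls outside the limit; `max_model_len` and the helper name are illustrative.

```python
def clamp_num_draft_tokens(num_tokens: int,
                           num_speculative_tokens: int,
                           max_model_len: int) -> int:
    """Hypothetical helper. A sequence holding `num_tokens` tokens occupies
    positions 0..num_tokens-1, so draft token j would land at position
    num_tokens + j. Keep only drafts whose position is < max_model_len."""
    remaining = max_model_len - num_tokens
    return max(0, min(num_speculative_tokens, remaining))


print(clamp_num_draft_tokens(100, 5, 103))  # 3
print(clamp_num_draft_tokens(103, 5, 103))  # 0
```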
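For item 5, one property worth preserving is that a request with a fixed seed produces the same tokens regardless of which other requests share its batch. A common way to get that, sketched here as an assumption with Python's `random` module standing in for GPU generators, is a private generator per seeded request. The class and method names are hypothetical, and the sketch does not address how many random draws the rejection step consumes per step.

```python
import random
from typing import Dict, Optional


class PerRequestGenerators:
    """Hypothetical helper: seeded requests own a private RNG, unseeded
    requests share one global RNG."""

    def __init__(self, global_seed: int = 0) -> None:
        self._global = random.Random(global_seed)
        self._per_request: Dict[str, random.Random] = {}

    def add_request(self, req_id: str, seed: Optional[int]) -> None:
        if seed is not None:
            self._per_request[req_id] = random.Random(seed)

    def get(self, req_id: str) -> random.Random:
        return self._per_request.get(req_id, self._global)

    def remove_request(self, req_id: str) -> None:
        self._per_request.pop(req_id, None)


# A seeded request sees the same stream whether or not other requests
# share its batch.
solo = PerRequestGenerators()
solo.add_request("a", seed=42)
batched = PerRequestGenerators()
batched.add_request("a", seed=42)
batched.add_request("b", seed=None)
batched.get("b").random()  # an unseeded draw does not disturb request "a"
assert [solo.get("a").random() for _ in range(3)] == \
       [batched.get("a").random() for _ in range(3)]
```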
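Item 7's coupling can be made concrete: EAGLE's draft input at position i combines the target model's hidden state at i with token i+1, so the draft KV entry for the last position of a cached prefix depends on a token outside that prefix. The sketch below shows that input pairing plus one conservative, hypothetical rule (recompute the last cached position for the draft model). It is an illustration only, not the design settled in this issue; both function names are made up.

```python
from typing import List, Sequence, Tuple, TypeVar

H = TypeVar("H")  # whatever type represents a hidden state


def eagle_input_pairs(token_ids: Sequence[int],
                      hidden_states: Sequence[H]) -> List[Tuple[H, int]]:
    """Hypothetical sketch: pair target hidden state h_i with the *next*
    token t_{i+1}, which is what couples draft KV entry i to token i+1."""
    return [(hidden_states[i], token_ids[i + 1])
            for i in range(len(token_ids) - 1)]


def reusable_draft_positions(num_cached_tokens: int) -> int:
    """Hypothetical conservative rule: draft KV at position i depends on
    t_{i+1}. For the last cached position that token lies outside the cached
    prefix, so treat it as stale and let the draft model recompute it."""
    return max(0, num_cached_tokens - 1)


print(eagle_input_pairs([10, 11, 12], ["h0", "h1", "h2"]))
# [('h0', 11), ('h1', 12)]
```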
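For item 8, one way to "properly handle" unsupported sampling options is to let the affected requests fall back to normal decoding while the rest of the batch keeps speculating. A minimal sketch of that routing, where the dataclass is a hypothetical stand-in (not vLLM's `SamplingParams`) and treating min_p as the only unsupported option is an assumption for illustration:

```python
from dataclasses import dataclass


@dataclass
class RequestSamplingParams:
    """Hypothetical stand-in for a request's sampling settings; field names
    are illustrative only."""
    temperature: float = 1.0
    top_p: float = 1.0
    min_p: float = 0.0


def supports_spec_decode(params: RequestSamplingParams) -> bool:
    # Assumption: min_p is the only unsupported option, and its default
    # value 0.0 means "disabled".
    return params.min_p == 0.0


def num_draft_tokens_for(params: RequestSamplingParams,
                         num_speculative_tokens: int) -> int:
    # Unsupported requests receive zero draft tokens, i.e. they decode one
    # token per step, while other requests in the batch still speculate.
    return num_speculative_tokens if supports_spec_decode(params) else 0


print(num_draft_tokens_for(RequestSamplingParams(), 3))           # 3
print(num_draft_tokens_for(RequestSamplingParams(min_p=0.1), 3))  # 0
```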

Originally posted by @WoosukKwon in #15729 (comment)

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    Status

    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests