Open
Description
- 1. Correctly initialize and load the EAGLE draft model
- 2. Consider the lookahead slots in the KV cache manager
- 3. Cache `draft_probs` inside the model runner and correctly feed it to the rejection sampler in the next step (temporary workaround: [V1][Spec Decode] Always use argmax for sampling draft tokens #16899)
- 4. Handle edge cases, such as when the draft model generates beyond `max_pos_embeddings`
- 5. Handle the seeds correctly
- 6. Do E2E correctness and performance tests
- 7. Support prefix caching. Eagle requires special handling because Eagle's i-th KV cache is coupled with the i+1-th token ID. (@LiuXiaoxuanPKU)
- 8. Properly handle the sampling parameters that are not (currently) compatible with spec decoding (e.g., min_p).
- 9. Use CUDA graphs for the draft model. (@luyuzhe111)
- 10. Support Eagle 3 ([V1][Spec Decode] EAGLE-3 Support #16937)
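
For item 3, a rough sketch of why `draft_probs` has to survive until the next step: the standard speculative-sampling acceptance rule compares the target probability of each drafted token against the probability the draft model assigned to it. The code below is plain Python for illustration only, not vLLM code. `DraftProbsCache`, `rejection_sample`, and the request-ID keying are hypothetical names and assumptions, not the model runner's actual API.

```python
import random


class DraftProbsCache:
    """Hypothetical per-request cache: stored when drafting, consumed at verification."""

    def __init__(self):
        self._probs = {}

    def put(self, req_id, draft_probs):
        # Called in step N, right after the draft model samples its tokens.
        self._probs[req_id] = draft_probs

    def pop(self, req_id):
        # Called in step N+1, when the target model verifies the drafts.
        return self._probs.pop(req_id)


def rejection_sample(draft_token, draft_probs, target_probs, rng):
    """Verify one drafted position. Returns (accepted, token)."""
    q = draft_probs[draft_token]
    p = target_probs[draft_token]
    # Accept the drafted token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return True, draft_token
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs, draft_probs)]
    # If the residual is all zeros (p equals q), fall back to the target distribution.
    weights = residual if sum(residual) > 0 else target_probs
    token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return False, token
```

A toy usage, assuming a two-token vocabulary:

```python
cache = DraftProbsCache()
cache.put("req-0", [0.5, 0.5])            # step N: draft model proposes token 1
draft_probs = cache.pop("req-0")          # step N+1: verification
accepted, token = rejection_sample(1, draft_probs, [1.0, 0.0], random.Random(0))
```

If the drafts are always chosen by argmax, the draft distribution is effectively one-hot, so the acceptance test no longer needs the cached probabilities.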
Originally posted by @WoosukKwon in #15729 (comment)