Closed
Labels: feature request, performance
Description
Dear vLLM maintainers @WoosukKwon and @zhuohan123 (@Yard1),
DeepSpeed has released its serving framework, which it claims is faster than vLLM. The main speedup comes from Dynamic SplitFuse, a technique that does the following:
- Long prompts are decomposed into much smaller chunks and scheduled across multiple forward passes (iterations), with only the final pass performing any generation.
- Short prompts are composed to exactly fill a target token budget. Even short prompts may be decomposed to ensure the budget is precisely met and the forward sizes are well-aligned.
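
To make the two points above concrete, here is a rough sketch of how such a token-budget scheduler could look. This is only my reading of the description, not DeepSpeed's or vLLM's actual code: `Request`, `schedule_step`, and the `token_budget` parameter are hypothetical names, and KV-cache management, preemption, and the model call itself are left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; not a DeepSpeed or vLLM type."""
    req_id: str
    prompt_len: int      # total number of prompt tokens
    processed: int = 0   # prompt tokens already run through a forward pass

    @property
    def remaining(self) -> int:
        return self.prompt_len - self.processed


def schedule_step(prefilling, decoding_ids, token_budget):
    """Plan one forward pass as a list of (req_id, n_tokens, phase).

    prefilling:   deque of Requests whose prompts are not fully processed
    decoding_ids: list of request ids that are already generating
    token_budget: assumed target number of tokens per forward pass
    """
    batch = []
    budget = token_budget

    # Each request that is already generating contributes one token.
    for req_id in decoding_ids:
        if budget == 0:
            break
        batch.append((req_id, 1, "decode"))
        budget -= 1

    # Fill what is left of the budget with prompt chunks. A prompt is
    # split whenever it does not fit, so a long prompt spans several
    # passes and a short one may be cut to hit the budget exactly.
    while budget > 0 and prefilling:
        req = prefilling[0]
        chunk = min(req.remaining, budget)
        req.processed += chunk
        budget -= chunk
        batch.append((req.req_id, chunk, "prefill"))
        if req.remaining == 0:
            # The pass holding the last chunk is the one that starts generation.
            prefilling.popleft()
            decoding_ids.append(req.req_id)

    return batch
```

With a budget of 8 and prompts of 20, 3, and 4 tokens, calling `schedule_step` repeatedly spreads the 20-token prompt over three passes, and the third pass is topped up with the whole 3-token prompt plus a single token of the 4-token one so that it lands exactly on the budget.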
Code: https://github.com/microsoft/DeepSpeed-MII
Background: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen