Dear vLLM maintainers @WoosukKwon and @zhuohan123 (@Yard1),
DeepSpeed has released its serving framework, which claims to be faster than vLLM. The main speedup comes from Dynamic SplitFuse, a technique that does the following:
- Long prompts are decomposed into much smaller chunks and scheduled across multiple forward passes (iterations), with only the final pass performing any generation.
- Short prompts are composed together to exactly fill a target token budget. Even short prompts may be decomposed to ensure the budget is precisely met and the forward sizes are well-aligned.
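The packing described above can be sketched roughly as follows. This is a minimal illustration of the idea, not DeepSpeed's actual scheduler; the function name, the flat token-budget parameter, and the `(prompt_id, start, end)` chunk representation are all assumptions made for the example:

```python
def split_fuse_schedule(prompt_lengths, budget):
    """Pack prompt tokens into fixed-size forward passes.

    prompt_lengths: number of tokens in each pending prompt.
    budget: target number of tokens per forward pass.
    Returns a list of passes; each pass is a list of
    (prompt_id, start, end) chunks whose sizes sum to at most `budget`.
    """
    passes = []
    current, used = [], 0
    for pid, length in enumerate(prompt_lengths):
        start = 0
        while start < length:
            # Take as much of this prompt as the remaining budget allows;
            # long prompts therefore spill across multiple passes, and
            # short prompts may be split so the budget is met exactly.
            take = min(length - start, budget - used)
            current.append((pid, start, start + take))
            start += take
            used += take
            if used == budget:  # budget exactly filled: emit this pass
                passes.append(current)
                current, used = [], 0
    if current:  # final, possibly underfull pass
        passes.append(current)
    return passes
```

For example, with a budget of 8 tokens, prompts of lengths `[10, 3, 3]` become two passes: the first consumes 8 tokens of the long prompt, and the second fuses its 2-token remainder with the two short prompts to fill the budget exactly.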
Code: https://github.com/microsoft/DeepSpeed-MII
Background: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen