Skip to content

[FEATURE] Implement Dynamic SplitFuse #1562

Closed
@casper-hansen

Description

@casper-hansen

Dear vLLM maintainers @WoosukKwon and @zhuohan123 (@Yard1),

DeepSpeed has released its serving framework which claims to be faster than vLLM. The main speedup comes from Dynamic SplitFuse which is a technique that does the following:

  • Long prompts are decomposed into much smaller chunks and scheduled across multiple forward passes (iterations) with only the final pass performing any generation.

  • Short prompts will be composed to exactly fill a target token budget. Even short prompts may be decomposed to ensure the budget is precisely met and the forward sizes are well-aligned.

Code: https://github.com/microsoft/DeepSpeed-MII
Background: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen

Llama 13B (1x A100-80GB):
image

Llama 70B (4x A100x80GB with TP):
image

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions