We have built the system described in http://aka.ms/splitwise.
Splitwise splits the prompt (prefill) and token-generation (decode) phases of inference so that they run on different servers.
Because the two phases have very different compute and memory characteristics, separating them improves throughput.
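For illustration, here is a minimal, purely schematic sketch of that split: a prompt server runs prefill over the whole prompt and produces the KV cache, which is handed off to a token server that generates the remaining tokens. The names used here (`PromptServer`, `TokenServer`, `transfer_kv_cache`) are hypothetical placeholders, not APIs from vLLM or the prototype.

```python
# Illustrative sketch only: PromptServer, TokenServer, and transfer_kv_cache
# are hypothetical placeholders, not vLLM or prototype APIs. The point is the
# request flow: prefill on one server, KV-cache handoff, decode on another.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    # One entry per transformer layer; a real cache holds GPU tensors.
    layers: list = field(default_factory=list)


class PromptServer:
    """Compute-heavy prefill: process the full prompt, emit the first token."""

    def prefill(self, prompt_tokens):
        kv = KVCache(layers=[f"kv(layer={i})" for i in range(4)])
        first_token = "<tok0>"  # the prompt phase also produces the first output token
        return first_token, kv


class TokenServer:
    """Memory-bound decode: generate the remaining tokens from the KV cache."""

    def decode(self, kv, first_token, max_new_tokens=4):
        tokens = [first_token]
        for i in range(1, max_new_tokens):
            tokens.append(f"<tok{i}>")  # placeholder for a real decode step
        return tokens


def transfer_kv_cache(kv):
    # Stand-in for the GPU-to-GPU cache transfer (e.g. over MSCCL++).
    return kv


def serve(prompt_tokens):
    prompt_srv, token_srv = PromptServer(), TokenServer()
    first_token, kv = prompt_srv.prefill(prompt_tokens)
    kv = transfer_kv_cache(kv)  # hand the cache from prompt server to token server
    return token_srv.decode(kv, first_token)


print(serve([101, 2023, 2003, 102]))
```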
We have an internal prototype built on top of an internal vLLM branch.
This issue tracks the effort to open-source that prototype and make it part of official vLLM.
This includes:
- Add MSCCL++ support (https://github.com/microsoft/mscclpp)
- Add per-layer KV-cache transfer (a sketch of the idea follows this list)
- Add coordination across prompt and token servers
- Add documentation
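On the per-layer KV-cache transfer item: the idea is to start shipping each layer's cache as soon as prefill finishes computing that layer, so the transfer overlaps with the remaining prefill compute instead of happening in one bulk copy at the end. The toy sketch below, assuming a simple producer/consumer setup, illustrates that pipelining with a background thread; `send_layer` is a hypothetical stand-in for the asynchronous MSCCL++-based copy, not a real MSCCL++ call.

```python
# Hypothetical sketch of per-layer KV-cache transfer. Instead of one bulk copy
# after prefill finishes, each layer's cache is shipped as soon as that layer
# is computed, overlapping transfer with the remaining prefill compute.
# send_layer() is a stand-in for an asynchronous MSCCL++ copy, not a real API.
import queue
import threading
import time

NUM_LAYERS = 8
outbox = queue.Queue()


def send_layer(layer_idx, kv_block):
    time.sleep(0.01)  # pretend this is an async GPU-to-GPU copy
    print(f"transferred KV for layer {layer_idx}: {kv_block}")


def transfer_worker():
    while True:
        item = outbox.get()
        if item is None:  # sentinel: prefill finished, nothing more to send
            break
        send_layer(*item)


def prefill_with_streaming_kv():
    worker = threading.Thread(target=transfer_worker)
    worker.start()
    for layer in range(NUM_LAYERS):
        time.sleep(0.02)  # stand-in for computing this layer during prefill
        outbox.put((layer, f"kv(layer={layer})"))  # start the transfer right away
    outbox.put(None)
    worker.join()


prefill_with_streaming_kv()
```

In the real system the per-layer copies would target the token server's GPU memory directly, which is where the MSCCL++ work item above comes in.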