Description
Motivation.
TL;DR: Introduce an SPMD-style control plane to improve the control plane architecture and optimize performance.
For distributed inference, vLLM currently uses a “driver-worker” alongside the other workers. As shown in the diagram below, this driver-worker runs in the same process as the driver. It prepares the arguments and then broadcasts them to all other workers so they can execute the sharded model, with NCCL serving as the control plane.

This architecture has a few drawbacks. First, the driver-worker must participate in the NCCL group and execute the model. Because the NCCL broadcast is a synchronous operation, it interferes with other driver functionality such as scheduling, which hurts performance.
Moreover, this architecture makes it difficult to support speculative decoding. Specifically,
- The speculative decoding framework may not run the draft model if Dynamic Speculative Decoding (DSD) or another policy is enabled. In this case, the decision of whether to run the draft model must be communicated to the other ranks, so DSD cannot work with TP>1 without additional communication (which incurs latency overhead).
- Pipeline parallelism can be composed within the speculative decoding framework. However, the speculative tokens must be sent to all workers, e.g. across nodes. With SPMD, all PP ranks have access to the same information, so no communication is needed on top of normal PP. This is important for latency.
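The first point can be illustrated with a minimal sketch. It assumes a hypothetical policy (`should_run_draft_model`, with a made-up batch-size threshold) that is a pure function of the scheduler state; neither the name nor the rule comes from vLLM. When every rank receives the same input and evaluates the same deterministic policy, all ranks reach the same decision without an extra broadcast.

```python
from typing import List


def should_run_draft_model(num_running_requests: int,
                           max_batch_for_spec: int = 4) -> bool:
    """Hypothetical DSD-like policy: skip speculation when the batch is large.

    The threshold and the rule are illustrative assumptions only.
    """
    return num_running_requests <= max_batch_for_spec


def decide_on_all_ranks(world_size: int, num_running_requests: int) -> List[bool]:
    # In SPMD, each rank sees the same input and evaluates the policy locally,
    # so no rank has to tell the others what it decided.
    return [should_run_draft_model(num_running_requests)
            for _rank in range(world_size)]


small_batch = decide_on_all_ranks(world_size=4, num_running_requests=2)
large_batch = decide_on_all_ranks(world_size=4, num_running_requests=16)
```

This only holds if the policy is deterministic and depends solely on the shared input; any rank-local state would bring back the need for communication.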
Proposed Change.
We propose an architecture change to support an SPMD-style control plane, as shown in the diagram below.

Specifically, we remove the argument preparation and model execution functionality from the driver and make all workers SPMD-style: the LLMEngine/driver now passes the input to all the SPMD workers via a Ray DAG channel (shared memory), and each worker prepares its arguments and executes its model shard. The results are passed back to the driver over a Ray DAG channel as well.
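The control flow described above can be sketched in plain Python. This is a toy model under stated assumptions, not the vLLM implementation: `StepInput`, `ToySPMDWorker`, and `driver_step` are hypothetical names, ordinary method calls stand in for the Ray DAG channels, and the "model" is a sharded integer dot product. The final sum in the driver is only there to make the result checkable; it does not model how tensor-parallel ranks exchange data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StepInput:
    """Hypothetical stand-in for the per-step input the driver sends out."""
    token_ids: Dict[str, List[int]]  # request id -> token ids


class ToySPMDWorker:
    def __init__(self, rank: int, world_size: int, weights: List[int]):
        shard = len(weights) // world_size
        self.rank = rank
        # Each worker owns one contiguous shard of the toy "model".
        self.weights = weights[rank * shard:(rank + 1) * shard]

    def prepare_args(self, step: StepInput) -> Dict[str, int]:
        # Every rank prepares its own arguments from the same shared input.
        return {rid: sum(toks) for rid, toks in step.token_ids.items()}

    def execute(self, step: StepInput) -> Dict[str, int]:
        args = self.prepare_args(step)
        shard_weight = sum(self.weights)
        return {rid: value * shard_weight for rid, value in args.items()}


def driver_step(workers: List[ToySPMDWorker], step: StepInput) -> Dict[str, int]:
    # The driver only fans out the same input and collects the results;
    # it neither prepares arguments nor executes a model shard.
    partials = [worker.execute(step) for worker in workers]
    return {rid: sum(p[rid] for p in partials) for rid in step.token_ids}


workers = [ToySPMDWorker(rank, 2, [1, 2, 3, 4]) for rank in range(2)]
result = driver_step(workers, StepInput({"a": [1, 2, 3], "b": [4]}))
```

The point of the sketch is the division of labor: the driver holds no model state, and every worker runs the same program on the same input.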
Roadmap
SPMD functionality and optimizations:
- SPMD functionality is implemented in [Core] Introduce SPMD worker execution using Ray accelerated DAG #6032
- Delta input optimization (sending delta input, as opposed to full input) will be implemented and benchmarked
- Serialization optimization (e.g., using a different serialization format like msgspec)
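The delta input idea can be sketched as follows. This is a minimal illustration under assumptions, not the planned implementation: `make_delta` and `WorkerInputCache` are hypothetical names, and the sketch assumes that a request's token list only grows by appending between steps. The driver sends the full input once and only the new tokens afterwards, while each worker rebuilds the full input from its local cache.

```python
from typing import Dict, List


def make_delta(previous: List[int], current: List[int]) -> List[int]:
    """Driver side: return only the tokens appended since the last step."""
    if current[:len(previous)] != previous:
        # Assumption of this sketch: inputs are append-only between steps.
        raise ValueError("current input does not extend the previous input")
    return current[len(previous):]


class WorkerInputCache:
    """Worker side: keeps the last full input per request (hypothetical)."""

    def __init__(self) -> None:
        self._tokens: Dict[str, List[int]] = {}

    def apply_full(self, request_id: str, token_ids: List[int]) -> None:
        self._tokens[request_id] = list(token_ids)

    def apply_delta(self, request_id: str, new_token_ids: List[int]) -> None:
        self._tokens[request_id].extend(new_token_ids)

    def get(self, request_id: str) -> List[int]:
        return list(self._tokens[request_id])


cache = WorkerInputCache()
cache.apply_full("req-0", [11, 12, 13])          # first step: full input
delta = make_delta([11, 12, 13], [11, 12, 13, 14])
cache.apply_delta("req-0", delta)                # later steps: delta only
```

The trade-off is that workers become stateful, so the cache has to be dropped or reset when a request finishes or is preempted.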
Features to build on top of SPMD:
- Pipeline parallelism with Ray accelerated DAG
- Speculative decoding
After comprehensive benchmarking and optimization, SPMD will become the default, and the NCCL-based control plane code path will be cleaned up.
Feedback Period.
No response
CC List.
@youkaichao @stephanie-wang @rkooo567 @cadedaniel
Any Other Things.
No response