Motivation.
TL;DR: This RFC proposes optimizing weight synchronization between training processes and vLLM workers for efficient RLHF implementation.
We are pleased to see vLLM collaborating with the reinforcement learning (RL) community and working on introducing useful APIs such as `update_weight`, as proposed in #5723. Building on that, a natural reference point is the implementation in OpenRLHF, which can use NCCL (previously Gloo) as the backend to collect weights for the vLLM workers to update. Below is a simplified implementation derived from the ongoing discussion in this comment:
```python
# In the main training process (rank 0):
# broadcast the parameters on rank 0 to the other ranks.
torch.distributed.broadcast(param.data, 0, group=model_update_group)

# In the vLLM engine driver process:
# trigger the vLLM workers to run update_weight.
self.llm.llm_engine.model_executor._run_workers("update_weight", name, dtype, shape, empty_cache)

# In the vLLM SPMD workers:
# create an empty tensor, then receive param.data from rank 0 and load it.
weight = torch.empty(shape, dtype=dtype, device="cuda")
torch.distributed.broadcast(weight, 0, group=self._model_update_group)
self.model_runner.model.load_weights(weights=[(name, weight)])
```
Using the above approach (depicted in Figure 1), weights on each vLLM worker are successfully updated. However, we have identified two significant inefficiencies, even after the implementation of #5723:
1. High weight synchronization overhead
- The centralization of the rank 0/driver process as a parameter server creates a bottleneck (see the sketch after this list):
- Limited device memory on rank 0 necessitates layer-by-layer parameter loading.
- Training processes may need to dump parameters to disk as HF checkpoints before broadcasting them to vLLM processes.
2. Inflexibility
- vLLM workers are constrained by their inability to load weights directly from external processes. Currently, each vLLM worker creates an NCCL process group (`model_update_group`) during initialization and must allocate and manage separate buffers for receiving weights.
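For concreteness, the rank-0 relay in issue 1 looks roughly like the sketch below. It is not taken from any particular codebase: it assumes the training rank 0 and the vLLM engine driver are colocated in the same process (as in the snippet above), and `train_model`, `llm`, and `empty_cache` are placeholder names.

```python
# Illustrative sketch of the current rank-0 relay (placeholder names).
import torch.distributed as dist

def sync_weights_via_rank0(train_model, llm, model_update_group, empty_cache=False):
    # Rank 0 walks the model layer by layer because it cannot hold an extra
    # full copy of the parameters in device memory at once.
    for name, param in train_model.named_parameters():
        # Ask the vLLM workers to post a matching receive: each worker
        # allocates an empty buffer, joins the broadcast, then calls
        # load_weights (assumed to be dispatched asynchronously, e.g. via a
        # Ray remote call, so it does not block before the broadcast below).
        llm.llm_engine.model_executor._run_workers(
            "update_weight", name, param.dtype, param.shape, empty_cache
        )
        # Rank 0 is the source; the vLLM workers in model_update_group
        # participate in the same broadcast from inside update_weight.
        dist.broadcast(param.data, 0, group=model_update_group)
```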
To address these issues, veRL utilizes weight resharding, where training worker processes directly transmit weight tensors to vLLM worker processes. This eliminates the need for rank 0 to act as a relay point (see Figure 2 for reference).

Figure 1. The driver process on rank 0 gathers the weights from the training processes and broadcasts them to the vLLM workers.
Proposed Change.

Figure 2. Each vLLM worker directly all-gathers the weights from GPU1 and GPU2.
We use the scenario in Figure 2(a) as an illustration, where each vLLM worker process requires weight updates from multiple training processes. To facilitate efficient weight synchronization, we propose two potential solutions:
1. Weight Synchronization Communication Group
- Create a weight synchronization communication group (e.g., with NCCL as the backend) for vLLM workers (see the group-based sketch after this list).
- The driver process triggers customized weight resharding operations, enabling vLLM worker processes to share the same communication groups as training processes (see Figure 2(b)).
- Concerns:
- Compatibility with different training frameworks may be challenging.
- Maintaining these communication groups could increase complexity.
2. Direct Weight Access by vLLM Workers
- Enable vLLM worker processes to directly access weights in training processes, eliminating the need for intermediate weight buffers.
- Instead of passing tensors to `model_runner.model.load_weights(weights=[(name, weight)])`, weights would be represented as handles to tensors shared across processes. For example, when process V1 requires weights, it fetches them directly from training processes T1 and T2.
- This approach could leverage mechanisms such as CUDA IPC shared memory or CUDA unified memory (see the CUDA IPC sketch after this list).
- Required features:
- Implement handle parsing logic within the `load_weights` function.
- Automate tensor fetching across processes.
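To make solution 1 concrete, here is a minimal sketch of how the dedicated group could be used once it exists. It assumes a single NCCL group (`weight_sync_group`) that already contains both the training ranks and the vLLM workers (constructing and maintaining that group is exactly the concern noted above), and that each training rank holds one shard of the parameter along `shard_dim`; the function names are illustrative, not existing vLLM APIs.

```python
# Group-based sketch for solution 1 (illustrative names, not a vLLM API).
import torch
import torch.distributed as dist

def send_weight_shard(local_shard, train_ranks, weight_sync_group):
    """Training side: every group member joins one broadcast per training rank."""
    for src in train_ranks:
        # The source rank contributes its real shard; the other training
        # ranks pass a scratch buffer so the collective can complete.
        buf = local_shard if dist.get_rank() == src else torch.empty_like(local_shard)
        dist.broadcast(buf, src=src, group=weight_sync_group)

def recv_full_weight(self, name, shard_shape, shard_dim, dtype,
                     train_ranks, weight_sync_group):
    """vLLM worker side (e.g. a custom method triggered via _run_workers):
    reassemble the full tensor from the training-side shards."""
    shards = []
    for src in train_ranks:
        buf = torch.empty(shard_shape, dtype=dtype, device="cuda")
        dist.broadcast(buf, src=src, group=weight_sync_group)
        shards.append(buf)
    # Undo the training-side sharding, then hand the full tensor to vLLM.
    full_weight = torch.cat(shards, dim=shard_dim)
    self.model_runner.model.load_weights(weights=[(name, full_weight)])
```

Compared with Figure 1, rank 0 never materializes the full parameter; each vLLM worker gathers the shards it needs directly from the training ranks, as in Figure 2.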
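For solution 2, one candidate mechanism is CUDA IPC as exposed through `torch.multiprocessing.reductions.reduce_tensor`, which serializes a CUDA tensor into a rebuild function plus an IPC handle. The sketch below is an assumption about how a handle-based update path could look rather than existing behavior, and it assumes the training and vLLM worker processes are colocated on the same node, which CUDA IPC requires; `export_weight_handle` and `update_weight_from_handle` are hypothetical names.

```python
# CUDA IPC sketch for solution 2 (hypothetical handle-based update path).
from torch.multiprocessing.reductions import reduce_tensor

# Training process: export a handle instead of sending the tensor itself.
def export_weight_handle(param):
    # reduce_tensor returns (rebuild_fn, args); for CUDA tensors the args
    # embed a CUDA IPC memory handle, so the tuple is cheap to ship over
    # whatever RPC channel already drives the vLLM workers.
    return reduce_tensor(param.data)

# vLLM worker: rebuild a zero-copy view of the trainer's tensor and load it.
def update_weight_from_handle(self, name, handle):
    rebuild_fn, args = handle
    weight = rebuild_fn(*args)  # maps the trainer's memory into this process
    # load_weights copies from the shared memory into the vLLM parameter,
    # so no intermediate receive buffer is allocated on the worker.
    self.model_runner.model.load_weights(weights=[(name, weight)])
```

In this view, the "handle parsing logic within `load_weights`" amounts to detecting that a handle rather than a tensor was passed and rebuilding the shared view before copying.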
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.