Motivation.
TL;DR: This RFC proposes optimizing weight synchronization between training processes and vLLM workers for efficient RLHF implementation.
We are pleased to see vLLM collaborating with the reinforcement learning (RL) community and working on introducing useful APIs such as `update_weight`, as proposed in #5723. Building on that, a natural reference point is the implementation in OpenRLHF, which can use NCCL (previously Gloo) as the backend to collect weights for the vLLM workers to update. Below is a simplified implementation derived from the ongoing discussion in this comment:
```python
# In the main training process (rank 0):
# broadcast the parameters on rank 0 to the other ranks.
torch.distributed.broadcast(param.data, 0, group=model_update_group)

# In the vLLM engine driver process:
# trigger the vLLM workers to run update_weight.
self.llm.llm_engine.model_executor._run_workers("update_weight", name, dtype, shape, empty_cache)

# In the vLLM SPMD workers:
# create an empty tensor, then receive param.data from rank 0 and load it.
weight = torch.empty(shape, dtype=dtype, device="cuda")
torch.distributed.broadcast(weight, 0, group=self._model_update_group)
self.model_runner.model.load_weights(weights=[(name, weight)])
```
Using the above approach (depicted in Figure 1), weights on each vLLM worker are successfully updated. However, we have identified two significant inefficiencies, even after the implementation of #5723:
1. High weight synchronization overhead
- The centralization of the rank 0/driver process as a parameter server creates a bottleneck (see the sketch after this list):
- Limited device memory on rank 0 necessitates layer-by-layer parameter loading.
- Training processes may need to dump parameters to disk as HF checkpoints before broadcasting them to vLLM processes.
2. Inflexibility
- vLLM workers are constrained by their inability to load weights directly from external processes. Currently, each vLLM worker creates an NCCL process group (`model_update_group`) during initialization and must allocate and manage separate buffers for receiving weights.
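For concreteness, the rank-0 relay in issue 1 looks roughly like the sketch below. It is not taken from any particular codebase: it assumes the training rank 0 and the vLLM engine driver are colocated in the same process (as in the snippet above), and `train_model`, `llm`, and `empty_cache` are placeholder names.

```python
# Illustrative sketch of the current rank-0 relay (placeholder names).
import torch.distributed as dist

def sync_weights_via_rank0(train_model, llm, model_update_group, empty_cache=False):
    # Rank 0 walks the model layer by layer because it cannot hold an extra
    # full copy of the parameters in device memory at once.
    for name, param in train_model.named_parameters():
        # Ask the vLLM workers to post a matching receive: each worker
        # allocates an empty buffer, joins the broadcast, then calls
        # load_weights (assumed to be dispatched asynchronously, e.g. via a
        # Ray remote call, so it does not block before the broadcast below).
        llm.llm_engine.model_executor._run_workers(
            "update_weight", name, param.dtype, param.shape, empty_cache
        )
        # Rank 0 is the source; the vLLM workers in model_update_group
        # participate in the same broadcast from inside update_weight.
        dist.broadcast(param.data, 0, group=model_update_group)
```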
To address these issues, veRL utilizes weight resharding, where training worker processes directly transmit weight tensors to vLLM worker processes. This eliminates the need for rank 0 to act as a relay point (see Figure 2 for reference).

Figure 1. The driver process on rank 0 gathers the weights from the training processes and broadcasts them to the vLLM workers.
Proposed Change.

Figure 2. Each vLLM worker directly all-gathers the weights from GPU1 and GPU2.
We use the scenario in Figure 2(a) as an illustration, where each vLLM worker process requires weight updates from multiple training processes. To facilitate efficient weight synchronization, we propose two potential solutions:
1. Weight Synchronization Communication Group
- Create a weight synchronization communication group (e.g., with NCCL as the backend) for vLLM workers (see the group-based sketch after this list).
- The driver process triggers customized weight resharding operations, enabling vLLM worker processes to share the same communication groups as training processes (see Figure 2(b)).
- Concerns:
- Compatibility with different training frameworks may be challenging.
- Maintaining these communication groups could increase complexity.
2. Direct Weight Access by vLLM Workers
- Enable vLLM worker processes to directly access weights in training processes, eliminating the need for intermediate weight buffers.
- Instead of passing tensors to `model_runner.model.load_weights(weights=[(name, weight)])`, weights would be represented as handles to tensors shared across processes. For example, when process V1 requires weights, it fetches them directly from training processes T1 and T2.
- This approach could leverage mechanisms such as CUDA IPC shared memory or CUDA unified memory (see the CUDA IPC sketch after this list).
- Required features:
- Implement handle parsing logic within the `load_weights` function.
- Automate tensor fetching across processes.
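To make solution 1 concrete, here is a minimal sketch of how the dedicated group could be used once it exists. It assumes a single NCCL group (`weight_sync_group`) that already contains both the training ranks and the vLLM workers (constructing and maintaining that group is exactly the concern noted above), and that each training rank holds one shard of the parameter along `shard_dim`; the function names are illustrative, not existing vLLM APIs.

```python
# Group-based sketch for solution 1 (illustrative names, not a vLLM API).
import torch
import torch.distributed as dist

def send_weight_shard(local_shard, train_ranks, weight_sync_group):
    """Training side: every group member joins one broadcast per training rank."""
    for src in train_ranks:
        # The source rank contributes its real shard; the other training
        # ranks pass a scratch buffer so the collective can complete.
        buf = local_shard if dist.get_rank() == src else torch.empty_like(local_shard)
        dist.broadcast(buf, src=src, group=weight_sync_group)

def recv_full_weight(self, name, shard_shape, shard_dim, dtype,
                     train_ranks, weight_sync_group):
    """vLLM worker side (e.g. a custom method triggered via _run_workers):
    reassemble the full tensor from the training-side shards."""
    shards = []
    for src in train_ranks:
        buf = torch.empty(shard_shape, dtype=dtype, device="cuda")
        dist.broadcast(buf, src=src, group=weight_sync_group)
        shards.append(buf)
    # Undo the training-side sharding, then hand the full tensor to vLLM.
    full_weight = torch.cat(shards, dim=shard_dim)
    self.model_runner.model.load_weights(weights=[(name, full_weight)])
```

Compared with Figure 1, rank 0 never materializes the full parameter; each vLLM worker gathers the shards it needs directly from the training ranks, as in Figure 2.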
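For solution 2, one candidate mechanism is CUDA IPC as exposed through `torch.multiprocessing.reductions.reduce_tensor`, which serializes a CUDA tensor into a rebuild function plus an IPC handle. The sketch below is an assumption about how a handle-based update path could look rather than existing behavior, and it assumes the training and vLLM worker processes are colocated on the same node, which CUDA IPC requires; `export_weight_handle` and `update_weight_from_handle` are hypothetical names.

```python
# CUDA IPC sketch for solution 2 (hypothetical handle-based update path).
from torch.multiprocessing.reductions import reduce_tensor

# Training process: export a handle instead of sending the tensor itself.
def export_weight_handle(param):
    # reduce_tensor returns (rebuild_fn, args); for CUDA tensors the args
    # embed a CUDA IPC memory handle, so the tuple is cheap to ship over
    # whatever RPC channel already drives the vLLM workers.
    return reduce_tensor(param.data)

# vLLM worker: rebuild a zero-copy view of the trainer's tensor and load it.
def update_weight_from_handle(self, name, handle):
    rebuild_fn, args = handle
    weight = rebuild_fn(*args)  # maps the trainer's memory into this process
    # load_weights copies from the shared memory into the vLLM parameter,
    # so no intermediate receive buffer is allocated on the worker.
    self.model_runner.model.load_weights(weights=[(name, weight)])
```

In this view, the "handle parsing logic within `load_weights`" amounts to detecting that a handle rather than a tensor was passed and rebuilding the shared view before copying.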
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.