[Usage]: OpenRLHF: How can I create a second NCCL Group in a vLLM v0.4.3+ Ray worker? #5477
Comments
@njhill Do you have any insights? Thanks.
@hijkzzz I don't have any immediate insight. I can take a closer look but can't promise how soon. We could also consider adding a flag to disable the behaviour introduced in #4894, in particular to make the remote worker "loop" always exit after a single iteration. That would have a performance cost, but it may help in cases like yours.
Actually, I'm quite surprised that it worked previously. vLLM should take control of all distributed initialization and destruction. How can you add another process to the group?
We hacked the
This is quite hacky. If possible, I suggest sharing CUDA tensors across processes. For example, if vLLM has TP processes and your DeepSpeed process group also has TP processes, they can share CUDA tensors without copying them around. This requires the two groups to own the same set of tensors, though.
This cannot meet the requirements of multi-machine distributed training in RLHF.
This bug can be solved by
Hi there, has there been an update or workaround for this issue? Thanks!
Your current environment
We are working on accelerating RLHF algorithms and need to broadcast the weights of the DeepSpeed engine to the vLLM Ray worker. In v0.4.2, we were able to create an additional NCCL group to achieve this. However, after updating to v0.4.3 and incorporating the changes from this MR, we found that doing so causes NCCL errors during broadcast.
Our weight synchronization code is located at https://github.com/OpenLLMAI/OpenRLHF/blob/main/openrlhf/trainer/ray/vllm_engine.py and https://github.com/OpenLLMAI/OpenRLHF/blob/main/openrlhf/trainer/ray/vllm_worker_wrap.py.
See init_process_group (builds the NCCL group between vLLM and DeepSpeed, named self._model_update_group) and update_weight (broadcasts weights from DeepSpeed to vLLM with torch.distributed.broadcast(weight, 0, group=self._model_update_group)).
We temporarily replaced the NCCL backend with GLOO to make it work, but the performance was poor.
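For readers who have not opened the linked files, the pattern described above (a second process group next to the default one, with rank 0 as the broadcast source) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the OpenRLHF implementation: the helper names model_update_ranks, free_port, and broadcast_demo are hypothetical, the rank layout (trainer at rank 0, followed by the tensor-parallel ranks of each vLLM engine) is an assumption, and the demo runs in a single process on the gloo backend so that it works without GPUs. It also uses the standard torch.distributed.new_group, which can only split ranks of an already initialized default group; joining processes from two separate jobs, as in this issue, needs more than that.

```python
import socket
from typing import List


def model_update_ranks(num_engines: int, tp_size: int) -> List[List[int]]:
    """Hypothetical rank layout for the extra group.

    Assumption: rank 0 is the trainer (the broadcast source) and each
    vLLM engine contributes tp_size consecutive ranks after it.
    """
    return [
        [1 + engine * tp_size + local for local in range(tp_size)]
        for engine in range(num_engines)
    ]


def free_port() -> int:
    """Ask the OS for an unused TCP port for the rendezvous."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def broadcast_demo(backend: str = "gloo") -> float:
    """Create a default group, then a second group, then broadcast on it."""
    # Imported lazily so the rank helper above works without torch installed.
    import torch
    import torch.distributed as dist

    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://127.0.0.1:{free_port()}",
        rank=0,
        world_size=1,
    )
    try:
        # Second group, separate from the default one. In a real setup the
        # backend would be "nccl" and the group would span several ranks.
        update_group = dist.new_group(ranks=[0], backend=backend)
        weight = torch.ones(4)
        # Rank 0 is the source, matching broadcast(weight, 0, group=...).
        dist.broadcast(weight, src=0, group=update_group)
        return float(weight.sum())
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(model_update_ranks(num_engines=2, tp_size=2))
    print(broadcast_demo())
```

Passing the backend as a parameter mirrors the temporary NCCL-to-GLOO swap mentioned above: only the backend string changes, while the group creation and the broadcast call stay the same.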
The error message is:
Even if we call self.llm.llm_engine.model_executor.stop_remote_worker_execution_loop() before the broadcast, a different NCCL error still occurs.
I think our call torch.distributed.broadcast(weight, 0, group=self._model_update_group) may conflict with this MR, but I'm not sure how to fix it.