[Core][Distributed] fix pynccl del error #4508
TL;DR: destroying an NCCL communicator is a collective call, like a broadcast or allreduce, and is therefore blocking. However, the destruction calls are not guaranteed to happen in the same order on every process, because Python's garbage collection system destructs objects in no guaranteed order.

We cannot rely on Python's garbage collection system to work here, because the driver process holds a communicator and a handle to the Ray actor, while the worker process holds a communicator.

If the driver process calls

If the driver process calls

Things can go crazy when we have multiple communicators (e.g. PyTorch

One possible solution is to add cleanup logic in

The ultimate solution might be to provide some context manager like
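To illustrate the ordering problem described above, here is a minimal pure-Python simulation. It is a sketch only: it does not call NCCL, the function `destroy_all` and the communicator names `comm_a` and `comm_b` are hypothetical, and it assumes every rank participates in every communicator. It models a blocking collective destroy as one that completes only when all ranks are waiting on the same communicator.

```python
from collections import deque


def destroy_all(orders):
    """Simulate blocking, collective destruction of communicators.

    `orders` maps rank -> list of communicator names, in the order that
    rank destroys them. A destroy completes only when every rank is
    currently blocked on the same communicator.
    Returns True if all ranks finish, False if they deadlock.
    """
    queues = {rank: deque(order) for rank, order in orders.items()}
    while any(queues.values()):
        # The communicator each rank is currently blocked on (None if done).
        heads = {q[0] if q else None for q in queues.values()}
        if len(heads) != 1:
            return False  # ranks wait on different communicators: deadlock
        for q in queues.values():
            q.popleft()
    return True


# Every rank destroys in the same order: the collectives match up.
same_order = {0: ["comm_a", "comm_b"], 1: ["comm_a", "comm_b"]}
# Garbage collection picked a different order on rank 1.
gc_order = {0: ["comm_a", "comm_b"], 1: ["comm_b", "comm_a"]}

print(destroy_all(same_order))  # True
print(destroy_all(gc_order))    # False
```

With more communicators per process, the number of possible mismatched orders grows, which is why relying on garbage collection for teardown is fragile.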
Per our offline discussion with @zhuohan123 @WoosukKwon @simon-mo @LiuXiaoxuanPKU, we can just skip the destruction to avoid deadlocks.
It is observed from #4488 that CI actually has errors, although they are ignored. Therefore, essentially the ncclCommDestroy in NCCLCommunicator.__del__ is never called. We can remove the code to avoid the CI error bothering users.
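The idea of skipping destruction in the finalizer can be sketched as follows. This is not the actual vLLM code: `FakeNCCLLib` and `Communicator` are hypothetical stand-ins that only record calls, and the explicit `destroy()` method is an assumption added for illustration. The point is that the class defines no `__del__`, so garbage collection never triggers a collective call.

```python
class FakeNCCLLib:
    """Hypothetical stand-in for the NCCL bindings; it only records calls."""

    def __init__(self):
        self.destroyed = []

    def ncclCommDestroy(self, comm):
        self.destroyed.append(comm)


class Communicator:
    """Hypothetical wrapper that never destroys its handle on finalization."""

    def __init__(self, lib, comm):
        self.lib = lib
        self.comm = comm

    def destroy(self):
        """Explicit teardown; the caller must invoke it in the same order on all ranks."""
        if self.comm is not None:
            self.lib.ncclCommDestroy(self.comm)
            self.comm = None

    # Intentionally no __del__: finalization order is not under our control,
    # so the collective destroy is not tied to garbage collection.


lib = FakeNCCLLib()
first = Communicator(lib, "comm0")
del first  # garbage collected without any destroy call

second = Communicator(lib, "comm1")
second.destroy()
second.destroy()  # idempotent: the second call is a no-op
```

The trade-off is that the communicator's resources are left for the process exit to reclaim unless the caller tears them down explicitly.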