Fix the bug of unregistered workers in worker pool #7343
Can one of the admins verify this patch?
Test FAILed.
Test FAILed.
b8ffa9e to 26dedc2
Test FAILed.
Test FAILed.
418dfc6 to fcd1a06
Test PASSed.
Test FAILed.
Test PASSed.
0181a65 to abb77fe
Test FAILed.
Test FAILed.
Test PASSed.
Test FAILed.
Thanks for the fix! LGTM. Please fix the checkstyle warnings though.
LGTM. Thanks!
Test PASSed.
Also, consider changing the title to something like
Looks good to me. Please fix the small issues before merging.
Co-Authored-By: Hao Chen <chenh1024@gmail.com>
Test FAILed.
Test PASSed.
Test FAILed.
Test FAILed.
* Fix
* Fix
* Fix complie
* Fix lint
* Fix linting
* Fix testDeleteObject
* Fix linting
* Update src/ray/raylet/worker_pool.cc
  Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Update src/ray/raylet/worker_pool.cc
  Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Update src/ray/raylet/worker_pool.h
  Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Update src/ray/raylet/worker_pool.cc
  Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Address comments.
* FIx linting

Co-authored-by: Hao Chen <chenh1024@gmail.com>
What's the issue
The cause of the Java CI failures has been identified:
In CheckpointableTest and ReconstructionTest, we kill a worker process to trigger actor failover. A worker process contains multiple worker threads. If we kill a worker process in which some worker threads have not yet registered with the raylet, those worker threads become zombie workers. StartWorkerProcess will then return early at https://github.com/ray-project/ray/blob/master/src/ray/raylet/worker_pool.cc#L137
This also fixes another case, testDeleteObject in direct call; otherwise the CI could not pass.
How to Fix
Add a timer for each worker process that checks whether its workers have timed out while registering with the raylet.
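The idea of a registration timeout can be sketched roughly as below. This is a minimal, standalone illustration and not the code in worker_pool.cc. Every name in it (WorkerPoolSketch, OnProcessStarted, OnWorkerRegistered, ExpireTimedOut, NumPendingWorkers) is hypothetical, and the current time is passed in explicitly instead of using a real timer so the behavior is easy to follow. It assumes the pool keeps a per-process count of worker threads that have not registered yet, and that a stale count is what blocks new processes from starting.

```cpp
#include <cassert>
#include <chrono>
#include <unordered_map>
#include <vector>

// Hypothetical names throughout; this is not Ray's actual WorkerPool API.
using Clock = std::chrono::steady_clock;
using ProcessId = int;

struct StartingProcess {
  int unregistered_workers;    // Worker threads that have not registered yet.
  Clock::time_point deadline;  // Registration must complete before this time.
};

class WorkerPoolSketch {
 public:
  explicit WorkerPoolSketch(std::chrono::milliseconds register_timeout)
      : register_timeout_(register_timeout) {}

  // Record a newly started process and arm its registration deadline.
  void OnProcessStarted(ProcessId pid, int num_workers, Clock::time_point now) {
    assert(num_workers > 0);
    starting_[pid] = StartingProcess{num_workers, now + register_timeout_};
  }

  // Called when one worker thread of `pid` registers with the pool.
  void OnWorkerRegistered(ProcessId pid) {
    auto it = starting_.find(pid);
    if (it == starting_.end()) return;  // Unknown or already expired process.
    if (--it->second.unregistered_workers == 0) starting_.erase(it);
  }

  // Timer callback: forget processes whose workers missed the deadline, so
  // they no longer count as "starting". Returns the expired process ids.
  std::vector<ProcessId> ExpireTimedOut(Clock::time_point now) {
    std::vector<ProcessId> expired;
    for (auto it = starting_.begin(); it != starting_.end();) {
      if (now >= it->second.deadline) {
        expired.push_back(it->first);
        it = starting_.erase(it);
      } else {
        ++it;
      }
    }
    return expired;
  }

  // Total number of worker threads still pending registration.
  int NumPendingWorkers() const {
    int n = 0;
    for (const auto &entry : starting_) n += entry.second.unregistered_workers;
    return n;
  }

 private:
  std::chrono::milliseconds register_timeout_;
  std::unordered_map<ProcessId, StartingProcess> starting_;
};
```

In this sketch, a killed process that never finishes registering is dropped once its deadline passes, so its pending workers stop counting against the pool. A real implementation would presumably drive ExpireTimedOut from the event loop's timer rather than from explicit timestamps.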