Fix the bug of unregistered workers in worker pool #7343

Merged
merged 14 commits into from
Mar 2, 2020

Conversation

jovany-wang
Contributor

@jovany-wang jovany-wang commented Feb 27, 2020

What's the issue

The cause of the Java CI failures has been identified:

In CheckpointableTest and ReconstructionTest, we kill a worker process to trigger actor failover. A worker process hosts multiple worker threads. If we kill a worker process while some of its worker threads have not yet registered with the raylet, those threads become zombie workers that will never register. StartWorkerProcess then returns early at https://github.com/ray-project/ray/blob/master/src/ray/raylet/worker_pool.cc#L137

This also fixes another case, testDeleteObject in direct call, without which CI could not pass.

How to Fix

Add a timer for each worker process that checks whether its workers have timed out while registering with the raylet.
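
The idea can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code merged in this PR: `WorkerPoolSketch`, `ExpireUnregistered`, `register_timeout`, and `max_startup_concurrency` are hypothetical names, and the sketch assumes the early return in StartWorkerProcess is caused by a cap on processes that are still waiting for their workers to register. The timer is modeled as a plain function that the caller invokes with the current time.

```cpp
#include <chrono>
#include <cstddef>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical, simplified stand-in for a worker pool (not Ray's actual class).
class WorkerPoolSketch {
 public:
  WorkerPoolSketch(std::chrono::milliseconds register_timeout,
                   std::size_t max_startup_concurrency)
      : register_timeout_(register_timeout),
        max_startup_concurrency_(max_startup_concurrency) {}

  // Records a new worker process hosting `num_workers` worker threads.
  // Returns false (the early-return path) when too many processes are
  // still waiting for their workers to register. Assumed behavior.
  bool StartWorkerProcess(int pid, int num_workers, Clock::time_point now) {
    if (starting_.size() >= max_startup_concurrency_) {
      return false;
    }
    starting_[pid] = Pending{num_workers, now + register_timeout_};
    return true;
  }

  // Called when one worker thread of process `pid` registers.
  void OnWorkerRegistered(int pid) {
    auto it = starting_.find(pid);
    if (it == starting_.end()) {
      return;
    }
    if (--it->second.unregistered == 0) {
      starting_.erase(it);  // All workers registered; no longer "starting".
    }
  }

  // Timer callback: forget processes whose workers did not all register
  // before the deadline, so they stop blocking new process startups.
  // Returns the pids that were dropped.
  std::vector<int> ExpireUnregistered(Clock::time_point now) {
    std::vector<int> expired;
    for (auto it = starting_.begin(); it != starting_.end();) {
      if (now >= it->second.deadline) {
        expired.push_back(it->first);
        it = starting_.erase(it);
      } else {
        ++it;
      }
    }
    return expired;
  }

  std::size_t NumStarting() const { return starting_.size(); }

 private:
  struct Pending {
    int unregistered;            // Worker threads that have not registered yet.
    Clock::time_point deadline;  // Registration deadline for this process.
  };

  std::chrono::milliseconds register_timeout_;
  std::size_t max_startup_concurrency_;
  std::unordered_map<int, Pending> starting_;
};
```

In this sketch, a process that is killed after only some of its workers registered stays in `starting_` and makes `StartWorkerProcess` return false until `ExpireUnregistered` is called with a time past its deadline. A real implementation would presumably drive that call from an event-loop timer rather than passing the time in by hand.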

@jovany-wang jovany-wang mentioned this pull request Feb 27, 2020
3 tasks
@AmplabJenkins

Can one of the admins verify this patch?

@raulchen raulchen requested review from kfstorm and raulchen February 27, 2020 01:53
Review thread on src/ray/raylet/worker_pool.cc (outdated, resolved)
Review thread on src/ray/raylet/worker_pool.h (outdated, resolved)
Review thread on src/ray/raylet/worker_pool.cc (outdated, resolved)
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22482/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22475/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22483/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22484/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22486/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22534/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22530/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22541/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22540/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22546/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22551/
Test FAILed.

Member

@kfstorm kfstorm left a comment

Thanks for the fix! LGTM. Please fix the checkstyle warnings though.

Member

@kfstorm kfstorm left a comment

LGTM. Thanks!

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22557/
Test PASSed.

Review thread on src/ray/raylet/worker_pool_test.cc (outdated, resolved)
Review thread on src/ray/raylet/worker_pool.h (outdated, resolved)
Review thread on src/ray/raylet/worker_pool.h (outdated, resolved)
Review thread on src/ray/raylet/worker_pool.cc (outdated, resolved)
Review thread on src/ray/raylet/worker_pool.cc (outdated, resolved)
Review thread on src/ray/raylet/worker_pool.cc (outdated, resolved)
@raulchen
Contributor

Also, consider changing the title to something like "Fix the bug of unregistered workers in worker pool", because this is not a Java-specific bug.

Contributor

@raulchen raulchen left a comment

Looks good to me. Please fix the small issues before merging.

jovany-wang and others added 4 commits February 28, 2020 22:52
Co-Authored-By: Hao Chen <chenh1024@gmail.com>
Co-Authored-By: Hao Chen <chenh1024@gmail.com>
Co-Authored-By: Hao Chen <chenh1024@gmail.com>
Co-Authored-By: Hao Chen <chenh1024@gmail.com>
@jovany-wang jovany-wang changed the title from "[Fix Java CI] Add a timer for zombie worker process" to "Fix the bug of unregistered workers in worker pool" Feb 28, 2020
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22560/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22562/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22580/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22600/
Test FAILed.

@jovany-wang jovany-wang merged commit 2771af1 into ray-project:master Mar 2, 2020
@jovany-wang jovany-wang deleted the ci_fixing branch March 2, 2020 08:30
ffbin pushed a commit to antgroup/ant-ray that referenced this pull request Mar 20, 2020
* Fix

* Fix

* Fix complie

* Fix lint

* Fix linting

* Fix testDeleteObject

* Fix linting

* Update src/ray/raylet/worker_pool.cc

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update src/ray/raylet/worker_pool.cc

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update src/ray/raylet/worker_pool.h

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update src/ray/raylet/worker_pool.cc

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Address comments.

* FIx linting

Co-authored-by: Hao Chen <chenh1024@gmail.com>