[serve] Detect node updates #9828

Merged 6 commits on Aug 4, 2020

edoakes (Collaborator) commented Jul 30, 2020

Why are these changes needed?

Adds an initial control loop that runs every 1s to detect new nodes in the cluster and add routers to them. It also detects removed nodes and removes the corresponding actors.
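For readers skimming the description, one pass of such a loop can be sketched as a simple set reconciliation. This is a minimal illustration, not the code in this PR: `reconcile_routers`, `start_router`, and `stop_router` are hypothetical names, and router handles are plain strings standing in for actor handles.

```python
from typing import Callable, Dict, Set, Tuple


def reconcile_routers(
    routers: Dict[str, str],
    live_nodes: Set[str],
    start_router: Callable[[str], str],
    stop_router: Callable[[str], None],
) -> Tuple[Set[str], Set[str]]:
    """Run one pass of a hypothetical node-update check.

    `routers` maps node ID -> router handle and is updated in place.
    Returns the sets of node IDs that were added and removed.
    """
    known_nodes = set(routers)
    added = live_nodes - known_nodes
    removed = known_nodes - live_nodes

    # New nodes get a router.
    for node_id in sorted(added):
        routers[node_id] = start_router(node_id)

    # Routers on nodes that left the cluster are stopped and forgotten.
    for node_id in sorted(removed):
        stop_router(routers.pop(node_id))

    return added, removed
```

A controller could call this once per second with the current set of live node IDs. How the real controller obtains that set, and how it starts actors, is not shown here.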

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@AmplabJenkins

Can one of the admins verify this patch?

@edoakes edoakes changed the title [serve] Detect new nodes and add routers to them [serve] Detect node updates Jul 30, 2020
@edoakes edoakes requested a review from simon-mo July 30, 2020 21:26
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29190/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29193/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29194/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29192/

simon-mo (Contributor) left a comment

LGTM. Two items we need to do for future work:

  1. We should begin to add typing to improve readability. For example, self.routers can be typed as Dict[str, ActorHandle]
  2. We should start abstracting out the checkpoint->start_if_needed->stop_if_needed->checkpoint procedure, as it is now used in several places.
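A rough sketch of what both follow-ups could look like together is given below. It is an assumption about the shape of the abstraction, not code from this PR: `apply_goal_state` and its callback parameters are hypothetical, and `ActorHandle` is aliased to `str` so the example runs without Ray installed.

```python
from typing import Callable, Dict, Set

# Stand-in for Ray's actor handle type, so the sketch is self-contained.
ActorHandle = str


def apply_goal_state(
    current: Dict[str, ActorHandle],
    goal: Set[str],
    checkpoint: Callable[[], None],
    start: Callable[[str], ActorHandle],
    stop: Callable[[ActorHandle], None],
) -> None:
    """Hypothetical helper for checkpoint -> start -> stop -> checkpoint.

    `current` maps a key (for example a node ID) to its actor handle and
    is updated in place until its keys match `goal`.
    """
    checkpoint()  # checkpoint before any actors are started or stopped

    # Start anything in the goal state that is not running yet.
    for key in sorted(goal - set(current)):
        current[key] = start(key)

    # Stop anything still running that is no longer in the goal state.
    for key in sorted(set(current) - goal):
        stop(current.pop(key))

    checkpoint()  # checkpoint again once the changes have been applied
```

With a helper like this, the typed `Dict[str, ActorHandle]` signature documents what `self.routers` holds, and each call site only supplies its own start and stop callbacks.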

simon-mo (Contributor) commented Aug 3, 2020

Lint is failing on Tune code. I think those are fixed by a471214. Can you merge master?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29322/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29378/

@edoakes edoakes merged commit 55146d2 into ray-project:master Aug 4, 2020