
Autoscaler node counting #2320

Merged · 1 commit merged into ray-project:master on Jun 28, 2018

Conversation

AdamGleave
Contributor

What do these changes do?

  • Fixes an off-by-one error in the node count caused by the head node not being counted as a node. Specifically, I subtract 1 from the target number of workers (but never below min_workers), since the head node itself can have tasks scheduled on it. Indeed, if min_workers is zero, the head node must allow at least one task to be scheduled on it, otherwise we never scale up; see #2192 (Remote calls hang when requesting GPU resources from an autoscaling EC2 cluster) and #2106 (Autoscaler never scales up initially-zero resource).
  • Only terminate idle workers if doing so does not drop us below the target number of workers; previously the check was only against the minimum number of workers. At particular resource utilization levels this led to cyclically killing idle nodes and then immediately relaunching them, causing unnecessary worker churn. Both changes are sketched below.
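
To make the two changes concrete, here is a minimal Python sketch. It is not the actual autoscaler code; the signatures and the `nodes_needed_for_load` argument are illustrative assumptions (only the name `target_num_workers` appears in the real code, as noted below).

```python
def target_num_workers(nodes_needed_for_load, min_workers, max_workers):
    # The head node can also run tasks, so it covers one of the nodes the
    # current load requires; subtract it from the worker target, but never
    # go below min_workers (and, as an assumption here, clamp to max_workers).
    target = max(nodes_needed_for_load - 1, min_workers)
    return min(target, max_workers)


def may_terminate_idle_worker(current_num_workers, target_workers):
    # Old check: current_num_workers > min_workers. Checking against the
    # current target instead avoids killing an idle node that the load
    # would force us to relaunch immediately.
    return current_num_workers > target_workers
```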

Caveat: I'm a little uneasy about counting the head node as equivalent to a worker node. In large clusters one might want the head node dedicated to managing the cluster, with no tasks scheduled on it. However, I expect the impact of these changes on large clusters to be minimal -- it will just mean a slightly slower initial scale-up, and it won't affect steady-state performance. Moreover, the example configs do allow work to happen on the head node and have min_workers set to 0, so I think it's important we properly support this config out of the box. A further tweak might be to add a check in target_num_workers for whether the head node has non-zero static CPU resources, and only subtract 1 if that is the case (sketched below).
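
A sketch of that further tweak, under the same illustrative assumptions as above; the `head_node_cpus` parameter is hypothetical, standing in for however the head node's static CPU resources would be looked up:

```python
def target_num_workers(nodes_needed_for_load, min_workers, max_workers,
                       head_node_cpus=0):
    # Only count the head node toward the target if it actually advertises
    # CPU resources; a dedicated head node with 0 CPUs is not subtracted.
    head_offset = 1 if head_node_cpus > 0 else 0
    target = max(nodes_needed_for_load - head_offset, min_workers)
    return min(target, max_workers)
```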

Related issue number

Fixes #2317.

…ct#2317)

Specifically, subtracts 1 from the target number of workers, taking into
account that the head node has some computational resources.

Do not kill an idle node if it would drop us below the target number of
nodes (in which case we just immediately relaunch).
@AdamGleave
Contributor Author

Request review from @ericl


@ericl ericl left a comment


Looks good!

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6371/

@ericl ericl merged commit 89460b8 into ray-project:master Jun 28, 2018