
Autoscaler node counting #2320

Merged · 1 commit merged into ray-project:master on Jun 28, 2018

Conversation

AdamGleave
Contributor

What do these changes do?

  • Fixes an off-by-one error in the node count caused by the head node not being counted as a node. Specifically, I subtract 1 from the target number of workers (but never below min_workers), since the head node itself can have tasks scheduled on it. Indeed, if min_workers is zero, the head node must allow at least one task to be scheduled on it, otherwise we never scale up; see #2192 (Remote calls hang when requesting GPU resources from an autoscaling EC2 cluster) and #2106 (Autoscaler never scales up initially-zero resource).
  • Only terminate idle workers if doing so does not drop us below the target number of workers; previously the check was only against the minimum number of workers. At particular resource utilization levels this led to cyclically killing idle nodes and then immediately relaunching them, causing unnecessary worker churn. Both changes are sketched below.
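
To make the two changes concrete, here is a minimal Python sketch. It is not the actual autoscaler code; the signatures and the `nodes_needed_for_load` argument are illustrative assumptions (only the name `target_num_workers` appears in the real code, as noted below).

```python
def target_num_workers(nodes_needed_for_load, min_workers, max_workers):
    # The head node can also run tasks, so it covers one of the nodes the
    # current load requires; subtract it from the worker target, but never
    # go below min_workers (and, as an assumption here, clamp to max_workers).
    target = max(nodes_needed_for_load - 1, min_workers)
    return min(target, max_workers)


def may_terminate_idle_worker(current_num_workers, target_workers):
    # Old check: current_num_workers > min_workers. Checking against the
    # current target instead avoids killing an idle node that the load
    # would force us to relaunch immediately.
    return current_num_workers > target_workers
```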

Caveat: I'm a little uneasy about counting the head node as equivalent to a worker node. In large clusters one might want the head node dedicated to managing the cluster, with no tasks scheduled on it. However, I expect the impact of these changes on large clusters to be minimal -- it will just mean a slightly slower initial scale-up, and it won't affect steady-state performance. Moreover, the example configs do allow work to happen on the head node and have min_workers set to 0, so I think it's important we properly support this config out of the box. A further tweak might be to add a check in target_num_workers for whether the head node has non-zero static CPU resources, and only subtract 1 if that is the case (sketched below).
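
A sketch of that further tweak, under the same illustrative assumptions as above; the `head_node_cpus` parameter is hypothetical, standing in for however the head node's static CPU resources would be looked up:

```python
def target_num_workers(nodes_needed_for_load, min_workers, max_workers,
                       head_node_cpus=0):
    # Only count the head node toward the target if it actually advertises
    # CPU resources; a dedicated head node with 0 CPUs is not subtracted.
    head_offset = 1 if head_node_cpus > 0 else 0
    target = max(nodes_needed_for_load - head_offset, min_workers)
    return min(target, max_workers)
```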

Related issue number

Fixes #2317.

…ct#2317)

Specifically, subtracts 1 from the target number of workers, taking into
account that the head node has some computational resources.

Do not kill an idle node if it would drop us below the target number of
nodes (in which case we just immediately relaunch).
@AdamGleave
Contributor Author

Request review from @ericl


@ericl ericl left a comment


Looks good!

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6371/

@ericl ericl merged commit 89460b8 into ray-project:master Jun 28, 2018