
[autoscaler] Change the get behavior of node providers' _get_node #4132

Merged

Conversation

@hartikainen (Contributor) commented Feb 22, 2019

What do these changes do?

Change the get behavior of node providers' _get_node such that terminated nodes do not get filtered out.
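
As a rough illustration of the intended behavior (the attribute and helper names below are hypothetical, not the actual Ray node provider code): `_get_node` now refreshes its cache without dropping terminated instances, so callers can still inspect a terminated node's state.

```python
# Hypothetical sketch of the intended _get_node behavior; cached_nodes and
# _list_all_nodes are illustrative names, not the actual Ray provider API.
def _get_node(self, node_id):
    """Return the node with the given id, even if it has been terminated."""
    if node_id not in self.cached_nodes:
        # Refresh the cache from the cloud provider. Crucially, do NOT filter
        # on instance state here, so terminated nodes remain visible to
        # callers that need to check whether a node is still alive.
        self.cached_nodes = {node["id"]: node for node in self._list_all_nodes()}
    return self.cached_nodes[node_id]
```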

Related issue number

#4072

@ericl @richardliaw

@hartikainen (Contributor, Author)

Noting that this still needs to be tested on GCP.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12231/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12229/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12232/

@hartikainen (Contributor, Author) commented Feb 22, 2019

I don't know if this fully solves the problem. I ran an experiment with a config where min_workers=0, max_workers=50, initial_workers=10. Here are some issues:

  1. `ray attach` hangs:
     • ray attach /home/kristian/github/haarnoja/softqlearning-private/config/ray-autoscaler-gce.yaml --cluster-name=20190222t142143-109935-halfcheetah-v2 --tmux
     • This might be due to the recent stdout changes and unrelated to our changes.
  2. I don't think the initial_workers setting ever took effect. My cluster never got scaled beyond 8 nodes.
  3. I manually preempted one of my nodes. I think it somehow messed up the scaling, since my node count constantly fluctuates between 7 and 8 nodes. Looking at the monitor logs I see:
     2019-02-22 23:34:43,996 INFO autoscaler.py:610 -- LoadMetrics: NodeIdleSeconds=Min=0 Mean=156 Max=1409, NumNodesConnected=9, NumNodesUsed=8.0, ResourceUsage=64.0/72.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=151 Max=1503
     • The TimeSinceLastHeartbeat Max keeps increasing. I guess this is due to the preempted node. Is this expected?
     • I don't see anything obviously wrong in the logs. I wonder why the scaling up doesn't happen, since I still have 156 trials pending and max_workers would allow another 42 nodes to be launched.

@ericl (Contributor) commented Feb 22, 2019

Hm. One thing to try is to completely disable the caching and see if that matters.

Regarding "The TimeSinceLastHeartbeat Max keeps increasing": I think this is expected.

Can you post your logs? It probably won't scale up if your utilization isn't high enough.
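
A minimal sketch of what completely disabling the caching could look like for such a test, assuming a hypothetical provider-level listing helper (not the actual Ray API):

```python
# Illustrative only: bypass the node cache so every _get_node call re-queries
# the cloud provider. Useful for ruling out stale-cache issues while debugging.
def _get_node(self, node_id):
    nodes = self._list_all_nodes()  # hypothetical helper that hits the cloud API
    return next(node for node in nodes if node["id"] == node_id)
```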

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12246/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12247/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12248/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12249/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12272/

@ericl ericl merged commit 524e69a into ray-project:master Feb 25, 2019
@hartikainen hartikainen deleted the fix/node-provider-_get_node-caching branch April 10, 2019 17:28