
[autoscaler] Change the get behavior of node providers' _get_node #4132

Merged

Conversation

@hartikainen (Contributor) commented Feb 22, 2019

What do these changes do?

Change the get behavior of node providers' _get_node such that terminated nodes do not get filtered out.
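
As a rough illustration of the intended behavior (the attribute and helper names below are hypothetical, not the actual Ray node provider code): `_get_node` now refreshes its cache without dropping terminated instances, so callers can still inspect a terminated node's state.

```python
# Hypothetical sketch of the intended _get_node behavior; cached_nodes and
# _list_all_nodes are illustrative names, not the actual Ray provider API.
def _get_node(self, node_id):
    """Return the node with the given id, even if it has been terminated."""
    if node_id not in self.cached_nodes:
        # Refresh the cache from the cloud provider. Crucially, do NOT filter
        # on instance state here, so terminated nodes remain visible to
        # callers that need to check whether a node is still alive.
        self.cached_nodes = {node["id"]: node for node in self._list_all_nodes()}
    return self.cached_nodes[node_id]
```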

Related issue number

#4072

@ericl @richardliaw

@hartikainen (Contributor, Author)

Noting that this still needs to be tested on GCP.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12231/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12229/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12232/

@hartikainen (Contributor, Author) commented Feb 22, 2019

I don't know if this fully solves the problem. I ran an experiment with a config where min_workers=0, max_workers=50, initial_workers=10. Here are some issues:

  1. `ray attach` hangs:
     • ray attach /home/kristian/github/haarnoja/softqlearning-private/config/ray-autoscaler-gce.yaml --cluster-name=20190222t142143-109935-halfcheetah-v2 --tmux
     • This might be due to the recent stdout changes and unrelated to our changes.
  2. I don't think the initial_workers setting ever took effect. My cluster never got scaled beyond 8 nodes.
  3. I manually preempted one of my nodes. I think it somehow messed up the scaling, since my node count constantly fluctuates between 7 and 8 nodes. Looking at the monitor logs I see:
     2019-02-22 23:34:43,996 INFO autoscaler.py:610 -- LoadMetrics: NodeIdleSeconds=Min=0 Mean=156 Max=1409, NumNodesConnected=9, NumNodesUsed=8.0, ResourceUsage=64.0/72.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=151 Max=1503
     • The TimeSinceLastHeartbeat Max keeps increasing. I guess this is due to the preempted node. Is this expected?
     • I don't see anything obviously wrong in the logs. I wonder why the scaling up doesn't happen, since I still have 156 trials pending and max_workers would allow another 42 nodes to be launched.

@ericl (Contributor) commented Feb 22, 2019

Hm. One thing to try is to completely disable the caching and see if that matters.

Regarding "The TimeSinceLastHeartbeat Max keeps increasing": I think this is expected.

Can you post your logs? It probably won't scale up if your utilization isn't high enough.
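
A minimal sketch of what completely disabling the caching could look like for such a test, assuming a hypothetical provider-level listing helper (not the actual Ray API):

```python
# Illustrative only: bypass the node cache so every _get_node call re-queries
# the cloud provider. Useful for ruling out stale-cache issues while debugging.
def _get_node(self, node_id):
    nodes = self._list_all_nodes()  # hypothetical helper that hits the cloud API
    return next(node for node in nodes if node["id"] == node_id)
```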

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12246/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12247/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12248/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12249/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12272/

@ericl ericl merged commit 524e69a into ray-project:master Feb 25, 2019
@hartikainen hartikainen deleted the fix/node-provider-_get_node-caching branch April 10, 2019 17:28