[tune/core] Use Global State API for resources #3004
Conversation
# TODO(rliaw): Remove once raylet flag is swapped
num_cpus = sum(cl['Resources']['CPU'] for cl in clients)
num_gpus = sum(cl['Resources'].get('GPU', 0) for cl in clients)
resources = ray.global_state.available_resources()
Note that this is a somewhat expensive call since it waits for heartbeats from each node (e.g., could take 100s of milliseconds).
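Given that latency, a caller that polls resources frequently may want to reuse a recent result instead of hitting the heartbeat path on every check. A minimal sketch of such a cache, assuming a simple time-based refresh (the helper name and the 1-second interval are illustrative, not part of this PR):

```python
import time

import ray

_cached_resources = None
_cached_at = 0.0
_REFRESH_INTERVAL_S = 1.0  # illustrative refresh interval, not part of this PR


def cached_available_resources():
    """Return cluster-wide available resources, refreshing at most
    once per _REFRESH_INTERVAL_S to avoid repeated heartbeat waits."""
    global _cached_resources, _cached_at
    now = time.time()
    if _cached_resources is None or now - _cached_at > _REFRESH_INTERVAL_S:
        # This call waits for a heartbeat from every node, so it can
        # take on the order of 100s of milliseconds.
        _cached_resources = ray.global_state.available_resources()
        _cached_at = now
    return _cached_resources
```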
Test FAILed.

Test PASSed.
return dict(resources)

def _live_client_ids(self):
    """Returns a set of client IDs corresponding to clients still alive."""
Does this actually work? I thought you can get an insertion and then a deletion.
we removed that behavior in #2880
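For reference, a minimal sketch of what a helper like _live_client_ids could look like under the post-#2880 behavior, assuming each client_table() entry exposes 'ClientID' and 'IsInsertion' fields (the exact field names are an assumption here, not confirmed by this thread):

```python
import ray


def _live_client_ids():
    """Return the set of client IDs for nodes still marked alive.

    Assumes each client_table() entry carries a 'ClientID' and an
    'IsInsertion' flag (True for additions, False for removals).
    """
    return {
        client["ClientID"]
        for client in ray.global_state.client_table()
        if client["IsInsertion"]
    }
```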
Test FAILed.
if local_scheduler_id not in local_scheduler_ids:
    del available_resources_by_id[local_scheduler_id]
else:
    # TODO(rliaw): Is this a fair assumption?
Yes, this is a safe assumption. self.redis_clients has one client per shard, and the number of shards doesn't change.
I'd remove this comment.
# TODO(rliaw): Remove once raylet flag is swapped
num_cpus = sum(cl['Resources']['CPU'] for cl in clients)
num_gpus = sum(cl['Resources'].get('GPU', 0) for cl in clients)
resources = ray.global_state.cluster_resources()
You're using cluster_resources but you modified available_resources. Doesn't it make sense to make the same change in cluster_resources?
Oh, the Tune and core changes are sort of orthogonal now that you've pointed out I didn't need available_resources... cluster_resources doesn't have the same problem that available_resources has.
Oh, I see, but cluster_resources does count resources from dead nodes and probably shouldn't, right?
Also, available_resources could still hang if one of the nodes dies at a very unfortunate time, right?
cluster_resources gets its list of resources from the client_table, which clears out the resources dict for a dead node.
The one case I can think of where available_resources might hang is if one of the Redis clients dies in the middle... are there others?
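To make the dead-node point concrete, here is roughly how a client_table()-based summation behaves if a dead entry's Resources dict has been cleared. This is a sketch of the idea, not the actual cluster_resources implementation:

```python
from collections import defaultdict

import ray


def summed_cluster_resources():
    """Sum per-node resource totals from the client table.

    If a dead node's Resources dict has been cleared (as described
    above), that entry simply contributes nothing to the totals.
    """
    totals = defaultdict(float)
    for client in ray.global_state.client_table():
        for name, quantity in client.get("Resources", {}).items():
            totals[name] += quantity
    return dict(totals)
```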
If a node dies within a 10-second window of the call to client_table(), then it won't have been marked as dead yet in the client table, so the condition while set(available_resources_by_id.keys()) != client_ids: may not be met and we'll hang there.
It probably makes sense to break out if it hasn't returned within e.g. 200ms and log a warning.
Alternatively, we don't even need to call client_table(); we can just listen for e.g. 200ms and then return the info from whatever heartbeats we collected.
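A rough sketch of the bounded-wait idea from the last two comments: collect heartbeats until either every expected client has reported or a deadline passes, then return whatever was collected. Here next_heartbeat is a hypothetical stand-in for reading one message from the heartbeat channel, not an actual Ray API, and the 200ms figure comes from the discussion above:

```python
import logging
import time

logger = logging.getLogger(__name__)

HEARTBEAT_DEADLINE_S = 0.2  # 200ms, as suggested above


def collect_available_resources(expected_client_ids, next_heartbeat):
    """Collect (client_id, resources) heartbeats with a bounded wait.

    `next_heartbeat` is assumed to return one (client_id, resources)
    pair per call, or None if no message arrived within a short
    internal timeout; it is a placeholder, not a real Ray API.
    """
    available_by_id = {}
    deadline = time.time() + HEARTBEAT_DEADLINE_S
    while set(available_by_id) != set(expected_client_ids):
        if time.time() > deadline:
            missing = set(expected_client_ids) - set(available_by_id)
            logger.warning(
                "Timed out waiting for heartbeats from %d node(s); "
                "returning partial resource information.", len(missing))
            break
        message = next_heartbeat()
        if message is not None:
            client_id, resources = message
            available_by_id[client_id] = resources
    return available_by_id
```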
I'm going to merge this and make some changes in a different PR.
TODO:
- Write multi-node tests to verify this works (after Cluster Utilities for Fault Tolerance Tests #3008).

Relevant Issues:
#2875, #2840, #2851
cc @pschafhalter