Conversation

@richardliaw (Contributor) commented Oct 1, 2018

TODO:

Relevant Issues:

#2875, #2840, #2851

cc @pschafhalter

# TODO(rliaw): Remove once raylet flag is swapped
num_cpus = sum(cl['Resources']['CPU'] for cl in clients)
num_gpus = sum(cl['Resources'].get('GPU', 0) for cl in clients)
resources = ray.global_state.available_resources()
Collaborator

Note that this is a somewhat expensive call since it waits for heartbeats from each node (e.g., could take 100s of milliseconds).
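As a rough illustration of that cost (a sketch assuming the ray.global_state API from the era of this PR; the timing printed is machine-dependent):

import time

import ray

ray.init()

start = time.time()
# Blocks until a heartbeat has been received from every live node.
available = ray.global_state.available_resources()
print("available_resources() took %.3f s" % (time.time() - start))
print(available)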

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8499/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8500/

@pschafhalter
Contributor

The changes for available_resources look good to me.

return dict(resources)

def _live_client_ids(self):
"""Returns a set of client IDs corresponding to clients still alive."""
Contributor

Does this actually work? I thought you could get an insertion and then a deletion.

Contributor Author

we removed that behavior in #2880
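For readers outside the diff, a minimal sketch of what a helper like this can look like; the ClientID and IsInsertion field names are assumptions about the client_table() entries of that era, not quoted from the PR:

def _live_client_ids(self):
    """Returns a set of client IDs corresponding to clients still alive."""
    # Keep only entries flagged as live; "ClientID" and "IsInsertion"
    # are assumed field names for the client table entries of this era.
    return {
        client["ClientID"]
        for client in self.client_table()
        if client["IsInsertion"]
    }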

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8536/

if local_scheduler_id not in local_scheduler_ids:
del available_resources_by_id[local_scheduler_id]
else:
# TODO(rliaw): Is this a fair assumption?
Collaborator

Yes, this is a safe assumption. self.redis_clients has one client per shard, and the number of shards doesn't change.

Collaborator

I'd remove this comment.

# TODO(rliaw): Remove once raylet flag is swapped
num_cpus = sum(cl['Resources']['CPU'] for cl in clients)
num_gpus = sum(cl['Resources'].get('GPU', 0) for cl in clients)
resources = ray.global_state.cluster_resources()
Collaborator

You're using cluster_resources but you modified available_resources. Doesn't it make sense to make the same change in cluster_resources?

Contributor Author

Oh, the Tune and core changes are sort of orthogonal now that you've pointed out I didn't need available_resources... cluster_resources doesn't have the same problem that available_resources has.

Collaborator

Oh, I see, but cluster_resources does count resources from dead nodes and probably shouldn't, right?

Also, available_resources could still hang if one of the nodes dies at a very unfortunate time, right?

Contributor Author

cluster_resources gets its list of resources from the client_table, which clears out the resources dict for a dead node.

The one case I can think of where available_resources might hang is if one of the Redis clients dies in the middle... are there others?
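Regarding the first point, a minimal sketch of the aggregation described above, assuming client_table() entries carry a Resources dict that is cleared for dead nodes (field name assumed, not quoted from the PR):

from collections import defaultdict

def cluster_resources(self):
    """Sums the static resources reported in the client table."""
    resources = defaultdict(int)
    for client in self.client_table():
        # A dead node's Resources dict has been cleared out, so it
        # contributes nothing to the totals.
        for key, value in client["Resources"].items():
            resources[key] += value
    return dict(resources)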

Collaborator

If a node dies within a 10-second window of the call to client_table(), it won't have been marked as dead in the client table yet, so the exit condition for while set(available_resources_by_id.keys()) != client_ids: may never be satisfied and we'll hang there.

It probably makes sense to break out if it hasn't returned within e.g. 200 ms and log a warning.

Collaborator

Alternatively, we don't even need to call client_table(); we can just listen for e.g. 200 ms and then return the info from whatever heartbeats we collected.
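A sketch of what that escape hatch could look like; subscribe_clients, _parse_heartbeat, and the 200 ms budget are illustrative placeholders rather than the PR's actual code:

import logging
import time

def _collect_available_resources(self, subscribe_clients, timeout_s=0.2):
    """Collects per-node heartbeats until every live client has reported
    or the time budget runs out, whichever comes first."""
    available_resources_by_id = {}
    deadline = time.time() + timeout_s
    client_ids = self._live_client_ids()
    while set(available_resources_by_id.keys()) != client_ids:
        if time.time() > deadline:
            logging.warning("Timed out collecting heartbeats; "
                            "returning partial resource info.")
            break
        for subscribe_client in subscribe_clients:
            message = subscribe_client.get_message()
            if message is None:
                continue
            client_id, resources = _parse_heartbeat(message)  # placeholder
            available_resources_by_id[client_id] = resources
        # Refresh the live set and drop nodes that died mid-collection,
        # so a stale key can't keep the loop from terminating.
        client_ids = self._live_client_ids()
        for client_id in list(available_resources_by_id):
            if client_id not in client_ids:
                del available_resources_by_id[client_id]
    return available_resources_by_id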

@robertnishihara
Collaborator

I'm going to merge this and make some changes in a different PR.

@robertnishihara merged commit 0651d3b into ray-project:master Oct 5, 2018
@robertnishihara deleted the fix_available_resources branch October 5, 2018 00:23