[tune/core] Use Global State API for resources #3004
Conversation
# TODO(rliaw): Remove once raylet flag is swapped
num_cpus = sum(cl['Resources']['CPU'] for cl in clients)
num_gpus = sum(cl['Resources'].get('GPU', 0) for cl in clients)
resources = ray.global_state.available_resources()
Note that this is a somewhat expensive call since it waits for heartbeats from each node (e.g., could take 100s of milliseconds).
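Given that latency, a caller that polls resources frequently may want to reuse a recent result instead of hitting the heartbeat path on every check. A minimal sketch of such a cache, assuming a simple time-based refresh (the helper name and the 1-second interval are illustrative, not part of this PR):

```python
import time

import ray

_cached_resources = None
_cached_at = 0.0
_REFRESH_INTERVAL_S = 1.0  # illustrative refresh interval, not part of this PR


def cached_available_resources():
    """Return cluster-wide available resources, refreshing at most
    once per _REFRESH_INTERVAL_S to avoid repeated heartbeat waits."""
    global _cached_resources, _cached_at
    now = time.time()
    if _cached_resources is None or now - _cached_at > _REFRESH_INTERVAL_S:
        # This call waits for a heartbeat from every node, so it can
        # take on the order of 100s of milliseconds.
        _cached_resources = ray.global_state.available_resources()
        _cached_at = now
    return _cached_resources
```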
Test FAILed.

Test PASSed.
return dict(resources)

def _live_client_ids(self):
    """Returns a set of client IDs corresponding to clients still alive."""
Does this actually work? I thought you can get an insertion and then a deletion.
we removed that behavior in #2880
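For reference, a minimal sketch of what a helper like _live_client_ids could look like under the post-#2880 behavior, assuming each client_table() entry exposes 'ClientID' and 'IsInsertion' fields (the exact field names are an assumption here, not confirmed by this thread):

```python
import ray


def _live_client_ids():
    """Return the set of client IDs for nodes still marked alive.

    Assumes each client_table() entry carries a 'ClientID' and an
    'IsInsertion' flag (True for additions, False for removals).
    """
    return {
        client["ClientID"]
        for client in ray.global_state.client_table()
        if client["IsInsertion"]
    }
```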
Test FAILed.
if local_scheduler_id not in local_scheduler_ids:
    del available_resources_by_id[local_scheduler_id]
else:
    # TODO(rliaw): Is this a fair assumption?
Yes, this is a safe assumption. self.redis_clients has one client per shard, and the number of shards doesn't change.
I'd remove this comment.
# TODO(rliaw): Remove once raylet flag is swapped
num_cpus = sum(cl['Resources']['CPU'] for cl in clients)
num_gpus = sum(cl['Resources'].get('GPU', 0) for cl in clients)
resources = ray.global_state.cluster_resources()
You're using cluster_resources but you modified available_resources. Doesn't it make sense to make the same change in cluster_resources?
Oh, the Tune and core changes are sort of orthogonal now that you've pointed out I didn't need available_resources... cluster_resources doesn't have the same problem that available_resources has.
Oh, I see, but cluster_resources does count resources from dead nodes and probably shouldn't, right?
Also, available_resources could still hang if one of the nodes dies at a very unfortunate time, right?
cluster_resources gets its list of resources from the client_table, which clears out the resources dict for a dead node.
The one case I can think of where available_resources might hang is if one of the Redis clients dies in the middle... are there others?
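To make the dead-node point concrete, here is roughly how a client_table()-based summation behaves if a dead entry's Resources dict has been cleared. This is a sketch of the idea, not the actual cluster_resources implementation:

```python
from collections import defaultdict

import ray


def summed_cluster_resources():
    """Sum per-node resource totals from the client table.

    If a dead node's Resources dict has been cleared (as described
    above), that entry simply contributes nothing to the totals.
    """
    totals = defaultdict(float)
    for client in ray.global_state.client_table():
        for name, quantity in client.get("Resources", {}).items():
            totals[name] += quantity
    return dict(totals)
```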
If a node dies within a 10-second window of the call to client_table(), then it won't have been marked as dead yet in the client table, so the condition while set(available_resources_by_id.keys()) != client_ids: may not be met and we'll hang there.
It probably makes sense to break out if it hasn't returned within e.g. 200ms and log a warning.
Alternatively, we don't even need to call client_table(); we can just listen for e.g. 200ms and then return the info from whatever heartbeats we collected.
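A rough sketch of the bounded-wait idea from the last two comments: collect heartbeats until either every expected client has reported or a deadline passes, then return whatever was collected. Here next_heartbeat is a hypothetical stand-in for reading one message from the heartbeat channel, not an actual Ray API, and the 200ms figure comes from the discussion above:

```python
import logging
import time

logger = logging.getLogger(__name__)

HEARTBEAT_DEADLINE_S = 0.2  # 200ms, as suggested above


def collect_available_resources(expected_client_ids, next_heartbeat):
    """Collect (client_id, resources) heartbeats with a bounded wait.

    `next_heartbeat` is assumed to return one (client_id, resources)
    pair per call, or None if no message arrived within a short
    internal timeout; it is a placeholder, not a real Ray API.
    """
    available_by_id = {}
    deadline = time.time() + HEARTBEAT_DEADLINE_S
    while set(available_by_id) != set(expected_client_ids):
        if time.time() > deadline:
            missing = set(expected_client_ids) - set(available_by_id)
            logger.warning(
                "Timed out waiting for heartbeats from %d node(s); "
                "returning partial resource information.", len(missing))
            break
        message = next_heartbeat()
        if message is not None:
            client_id, resources = message
            available_by_id[client_id] = resources
    return available_by_id
```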
I'm going to merge this and make some changes in a different PR.
TODO:
- Write multi-node tests to verify this works (after Cluster Utilities for Fault Tolerance Tests #3008).

Relevant Issues:
#2875, #2840, #2851
cc @pschafhalter