
[Tune] Raise Error when there are insufficient resources. #17957

Merged
merged 7 commits into ray-project:master from the tmp branch
Sep 3, 2021

Conversation


@xwjiang2010 xwjiang2010 commented Aug 19, 2021

When autoscaler is disabled, raise Error when there are insufficient resources.

Why are these changes needed?

When autoscaler is disabled, raise Error when there are insufficient resources.
Tested by running on a local cluster on my laptop.

Sample output:

Traceback (most recent call last):
  File "single_process_example.py", line 51, in <module>
    main()
  File "single_process_example.py", line 37, in main
    result = tune.run(
  File "/Users/xwjiang/ray/python/ray/tune/tune.py", line 535, in run
    runner.step()
  File "/Users/xwjiang/ray/python/ray/tune/trial_runner.py", line 593, in step
    self._run_and_catch(self.trial_executor.on_no_available_trials)
  File "/Users/xwjiang/ray/python/ray/tune/trial_runner.py", line 373, in _run_and_catch
    func(self.get_trials())
  File "/Users/xwjiang/ray/python/ray/tune/trial_executor.py", line 276, in on_no_available_trials
    self._may_warn_insufficient_resources(trials)
  File "/Users/xwjiang/ray/python/ray/tune/trial_executor.py", line 259, in _may_warn_insufficient_resources
    raise TuneError(_get_warning_msg())
ray.tune.error.TuneError: No trial is running and no new trial has been started within at least the last 5.0 seconds. This could be due to the cluster not having enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.
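For context, here is a minimal sketch (not code from this PR) of how a driver script could hit and handle this new failure mode; the objective function and the oversized resources_per_trial below are made up for illustration:

from ray import tune
from ray.tune.error import TuneError

def objective(config):
    # Hypothetical trainable; reports a trivial metric.
    tune.report(score=config["x"] ** 2)

try:
    tune.run(
        objective,
        config={"x": tune.grid_search([1, 2, 3])},
        # Deliberately request far more CPUs than a laptop has so that,
        # with the autoscaler disabled, the new TuneError is raised
        # instead of the job hanging on a PENDING trial.
        resources_per_trial={"cpu": 1024},
    )
except TuneError as e:
    print(f"Tune aborted: {e}")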

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@xwjiang2010 xwjiang2010 requested a review from krfricke August 19, 2021 18:22
@xwjiang2010 xwjiang2010 force-pushed the tmp branch 4 times, most recently from 7610244 to 9411116 Compare August 20, 2021 04:10
Comment on lines 282 to 286
raise TuneError(
    _get_insufficient_resources_warning_msg())
# TODO(xwjiang): Output a more helpful msg for autoscaler case.
# https://github.com/ray-project/ray/issues/17799
Contributor:
Add a test for both these cases?

Contributor Author:
Updated the tests. One specifically verifies that the TuneError is raised.
Not sure how to cover the autoscaler case, though. There doesn't seem to be a good way of mocking it.
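For reference, a minimal pytest-style sketch of the kind of test described above (the test name and setup here are illustrative, not the actual test added in this PR):

import pytest
import ray
from ray import tune
from ray.tune.error import TuneError

def test_tune_error_on_insufficient_resources():
    # Tiny local cluster with a single CPU and no autoscaler.
    ray.init(num_cpus=1)
    try:
        with pytest.raises(TuneError):
            tune.run(
                lambda config: None,  # trivial trainable
                # Ask for more CPUs than the cluster has so the
                # insufficient-resources error path is triggered.
                resources_per_trial={"cpu": 4},
            )
    finally:
        ray.shutdown()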

@xwjiang2010 xwjiang2010 force-pushed the tmp branch 3 times, most recently from 6a33ee4 to 24c3585 Compare September 2, 2021 00:49
f"This could be due to the cluster not having enough "
f"resources available to start the next trial. Please stop "
f"the tuning job and readjust resources_per_trial argument "
f"passed into tune.run() and/or start a cluster with more "
f"resources.")


# A beefed up version when Tune Error is raised.
def _get_insufficient_resources_error_msg(trial_resources: Resources) -> str:
return (f"You ask for {trial_resources.cpu} cpu and {trial_resources.gpu} "
Contributor:
does this actually work (does it capture the distributed placement group resources too)? @krfricke

Contributor:
So the way to test this is just to try using an RLlib agent and make the number of workers greater than the number of CPUs on the machine. I suspect that trial_resources.cpu is not actually the total sum of resources requested. You might need to try something else (like checking the placement group and summing the resources on the placement group).
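For illustration, a rough sketch of what "checking the placement group" could look like, assuming the trial exposes a placement group factory with a bundles list of resource dicts (the attribute name is an assumption, not necessarily the exact Tune API):

def total_requested_resources(placement_group_factory):
    # Sum CPU/GPU over all bundles, e.g. the trainer bundle plus one bundle
    # per rollout worker for an RLlib trial, to get the true per-trial demand.
    bundles = placement_group_factory.bundles  # assumed attribute
    total_cpu = sum(b.get("CPU", 0) for b in bundles)
    total_gpu = sum(b.get("GPU", 0) for b in bundles)
    return total_cpu, total_gpu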

Contributor:
Suggested change
return (f"You ask for {trial_resources.cpu} cpu and {trial_resources.gpu} "
return (f"You asked for {trial_resources.cpu} cpu and {trial_resources.gpu} "

Contributor Author (@xwjiang2010, Sep 3, 2021):
Thanks for the pointer, Richard. You are right; for that case I need to use information from the PlacementGroup.
So I used coin_game_env.py as my training job and set num_workers = 17 (I have 16 CPU cores).
On clean master (which currently has the bug and is thus spammy):

(ray)  xwjiang@xw  ~/ray   master ±  python rllib/examples/coin_game_env.py
2021-09-02 21:13:16,008	INFO services.py:1263 -- View the Ray dashboard at http://127.0.0.1:8266
== Status ==
Memory usage on this node: 37.3/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/19.07 GiB heap, 0.0/9.54 GiB objects
Result logdir: /Users/xwjiang/ray_results/PPO_AsymCG
Number of trials: 1/1 (1 PENDING)
+--------------------------+----------+-------+--------+
| Trial name               | status   | loc   |   seed |
|--------------------------+----------+-------+--------|
| PPO_CoinGame_44a68_00000 | PENDING  |       |      0 |
+--------------------------+----------+-------+--------+


2021-09-02 21:13:19,617	WARNING trial_executor.py:247 -- No trial is running and no new trial has been started within at least the last 1.0 seconds. This could be due to the cluster not having enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.
2021-09-02 21:13:20,618	WARNING trial_executor.py:247 -- No trial is running and no new trial has been started within at least the last 1.0 seconds. This could be due to the cluster not having enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.
2021-09-02 21:13:21,618	WARNING trial_executor.py:247 -- No trial is running and no new trial has been started within at least the last 1.0 seconds. This could be due to the cluster not having enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.
2021-09-02 21:13:22,618	WARNING trial_executor.py:247 -- No trial is running and no new trial has been started within at least the last 1.0 seconds. This could be due to the cluster not having enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.
2021-09-02 21:13:23,619	WARNING trial_executor.py:247 -- No trial is running and no new trial has been started within at least the last 1.0 seconds. This could be due to the cluster not having enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.
== Status ==
Memory usage on this node: 37.3/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/19.07 GiB heap, 0.0/9.54 GiB objects
Result logdir: /Users/xwjiang/ray_results/PPO_AsymCG
Number of trials: 1/1 (1 PENDING)
+--------------------------+----------+-------+--------+
| Trial name               | status   | loc   |   seed |
|--------------------------+----------+-------+--------|
| PPO_CoinGame_44a68_00000 | PENDING  |       |      0 |
+--------------------------+----------+-------+--------+

With the change (no spam, and a TuneError is raised to the user with information):

(ray)  ✘ xwjiang@xw  ~/ray   tmp ±  python rllib/examples/coin_game_env.py
2021-09-02 21:54:02,425	INFO services.py:1263 -- View the Ray dashboard at http://127.0.0.1:8266
== Status ==
Memory usage on this node: 37.3/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/18.95 GiB heap, 0.0/9.48 GiB objects
Result logdir: /Users/xwjiang/ray_results/PPO_AsymCG
Number of trials: 1/1 (1 PENDING)
+--------------------------+----------+-------+--------+
| Trial name               | status   | loc   |   seed |
|--------------------------+----------+-------+--------|
| PPO_CoinGame_f6a18_00000 | PENDING  |       |      0 |
+--------------------------+----------+-------+--------+


== Status ==
Memory usage on this node: 37.3/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/18.95 GiB heap, 0.0/9.48 GiB objects
Result logdir: /Users/xwjiang/ray_results/PPO_AsymCG
Number of trials: 1/1 (1 PENDING)
+--------------------------+----------+-------+--------+
| Trial name               | status   | loc   |   seed |
|--------------------------+----------+-------+--------|
| PPO_CoinGame_f6a18_00000 | PENDING  |       |      0 |
+--------------------------+----------+-------+--------+


== Status ==
Memory usage on this node: 37.3/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/18.95 GiB heap, 0.0/9.48 GiB objects
Result logdir: /Users/xwjiang/ray_results/PPO_AsymCG
Number of trials: 1/1 (1 PENDING)
+--------------------------+----------+-------+--------+
| Trial name               | status   | loc   |   seed |
|--------------------------+----------+-------+--------|
| PPO_CoinGame_f6a18_00000 | PENDING  |       |      0 |
+--------------------------+----------+-------+--------+


Traceback (most recent call last):
  File "rllib/examples/coin_game_env.py", line 83, in <module>
    main(debug_mode, args.stop_iters, args.tf, use_asymmetric_env)
  File "rllib/examples/coin_game_env.py", line 68, in main
    tune_analysis = tune.run(
  File "/Users/xwjiang/ray/python/ray/tune/tune.py", line 585, in run
    runner.step()
  File "/Users/xwjiang/ray/python/ray/tune/trial_runner.py", line 627, in step
    self._run_and_catch(self.trial_executor.on_no_available_trials)
  File "/Users/xwjiang/ray/python/ray/tune/trial_runner.py", line 394, in _run_and_catch
    func(self.get_trials())
  File "/Users/xwjiang/ray/python/ray/tune/trial_executor.py", line 328, in on_no_available_trials
    self._may_warn_insufficient_resources(trials)
  File "/Users/xwjiang/ray/python/ray/tune/trial_executor.py", line 307, in _may_warn_insufficient_resources
    raise TuneError(
ray.tune.error.TuneError: You asked for 18.0 cpu and 0.0 gpu per trial, but the cluster only has 16.0 cpu and 0 gpu.The cluster does not have enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.

Richard, my error message currently directs users to resources_per_trial, while for this case we should direct users to num_workers.

Contributor Author:
I added both hints in the message. It may look a bit bloated, let me know what you think!

Contributor:
Yeah, I really like the new messaging/warning UI.

return float(
os.environ.get("TUNE_WARN_INSUFFICENT_RESOURCE_THRESHOLD_S", "1"))
os.environ.get("TUNE_WARN_INSUFFICENT_RESOURCE_THRESHOLD_S", "10"))
Contributor:
It seems like on the product, this is being raised too frequently?

Contributor Author:
The problem in the current version of the code is less about the frequency and more that the assumption behind the condition check is off. See this line. It incorrectly concludes that whenever we arrive at this condition check with no running trial, we are not making progress, which is wrong.
E.g. [trial1(PENDING)] --> [trial1(TERMINATED), trial2(PENDING)] --> [trial1(TERMINATED), trial2(TERMINATED), trial3(PENDING)] etc. Each time there is no running trial, yet progress is being made.

This PR addresses the issue by changing the condition check to the length of all_trials, which corrects the bug.

As for the frequency: in the autoscaler case, a warning message is printed every 60s. In the no-autoscaler case, we throw an error after 10s if the cluster doesn't have enough resources to fulfill a trial, which is a hard failure (no repeated printing).

If 60s and 10s are too short, we can certainly lengthen them. We just need to trade that off against how long users wait, wondering what is happening with their pending trials, before being informed.
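For illustration, a simplified sketch (not the PR's exact code) of the corrected check: progress is measured by growth of all_trials rather than by the absence of running trials, and the no-autoscaler threshold is read from the environment variable shown in the diff above (tying that variable to the error path is an assumption based on this thread):

import os
import time

WARN_EVERY_S = 60.0  # autoscaler case: repeated, non-fatal warning interval
ERROR_AFTER_S = float(
    os.environ.get("TUNE_WARN_INSUFFICENT_RESOURCE_THRESHOLD_S", "10"))

class InsufficientResourcesChecker:
    def __init__(self):
        self._last_trial_count = 0
        self._stalled_since = None

    def check(self, all_trials, autoscaler_enabled):
        if len(all_trials) != self._last_trial_count:
            # New trials appeared since the last check: progress is being
            # made even if no trial is RUNNING at this instant.
            self._last_trial_count = len(all_trials)
            self._stalled_since = None
            return
        if self._stalled_since is None:
            self._stalled_since = time.monotonic()
            return
        stalled_for = time.monotonic() - self._stalled_since
        msg = "Cluster may not have enough resources to start the next trial."
        if autoscaler_enabled:
            if stalled_for > WARN_EVERY_S:
                print(msg)  # warn and keep waiting; autoscaler may still scale up
                self._stalled_since = time.monotonic()
        elif stalled_for > ERROR_AFTER_S:
            raise RuntimeError(msg)  # hard failure when there is no autoscaler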

Contributor:
We should probably test this in the product.

Contributor Author:
Sure. Although the nature of the bug is more within the core Tune code; see the response to Richard's comment above, where I pasted a side-by-side comparison with and without the change. Let me know if you want more coverage.

Also, Richard and Ameer: if we need more time to settle this PR, should we first revert on master?

Contributor:
Actually, I think this PR is good (with a small adjustment to the error message).

@richardliaw richardliaw left a comment (Contributor):
I think this PR is good (with a small adjustment to the error message). Recommend whoever is able to merge to merge at will upon adjustment.

@krfricke krfricke left a comment (Contributor):
Generally looks good and seems to work well. Just a couple of nits.

@AmeerHajAli (Contributor):

@xwjiang2010, I think the assumption here is that the autoscaler starts with the cluster, but the autoscaler can sometimes take a few minutes to start. Wouldn't this crash in that case?

@xwjiang2010 (Contributor Author):

> @xwjiang2010, I think the assumption here is that the autoscaler starts with the cluster, but the autoscaler can sometimes take a few minutes to start. Wouldn't this crash in that case?

@AmeerHajAli A hard crash will only happen in the "no autoscaler" case. For the autoscaler case, since we don't know definitively what is specified in the cluster.yaml file, we only print the following message every 60s by default:

If autoscaler is still scaling up, ignore this message.  No trial is running and no new trial has been started within at least the last 60s. This could be due to the cluster not having enough resources available to start the next trial. Stop the tuning job and adjust the resources requested per trial (possibly via `resources_per_trial` or via `num_workers` for rllib) and/or add more resources to your Ray runtime.

We can certainly lengthen the 60s; let me know what a good number is from your experience with the autoscaler.
We can also remove this case entirely if you prefer not to print anything. Let me know!
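As a usage note (an assumption based on the diff in this PR, not a documented guarantee), the wait before the no-autoscaler hard error appears to be controlled by the environment variable from the diff above and could be relaxed before starting the script:

import os

# Wait 30 seconds (instead of the default 10) before Tune raises the
# insufficient-resources error when no autoscaler is present.
os.environ["TUNE_WARN_INSUFFICENT_RESOURCE_THRESHOLD_S"] = "30"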

@richardliaw richardliaw merged commit 01adf03 into ray-project:master Sep 3, 2021
@xwjiang2010 xwjiang2010 deleted the tmp branch July 26, 2023 19:55