[Tune] Raise Error when there are insufficient resources. #17957
Conversation
Force-pushed from 7610244 to 9411116 (compare)
python/ray/tune/trial_executor.py
Outdated
raise TuneError(
    _get_insufficient_resources_warning_msg())
# TODO(xwjiang): Output a more helpful msg for autoscaler case.
# https://github.com/ray-project/ray/issues/17799
Add a test for both these cases?
Updated the tests, including one specifically verifying the raised TuneError.
Not sure how to cover the autoscaler case though; there doesn't seem to be a good way of mocking it.
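One possible approach, as a sketch only: it assumes the autoscaler branch is gated on a helper like `is_ray_cluster` in `trial_executor.py`, which is an assumption about the internals rather than a confirmed patch target.

```python
from unittest import mock

import ray
from ray import tune


def test_autoscaler_branch_warns_instead_of_raising():
    # Hypothetical patch target: whatever predicate the trial executor
    # uses to decide an autoscaler is present. The exact module path is
    # an assumption, not a confirmed API.
    ray.init(num_cpus=1)
    try:
        with mock.patch(
                "ray.tune.trial_executor.is_ray_cluster",
                return_value=True):
            # Ask for more CPUs than exist; with the autoscaler branch
            # active this should only warn periodically, not raise.
            tune.run(
                lambda config: None,
                resources_per_trial={"cpu": 2},
                time_budget_s=5,
            )
    finally:
        ray.shutdown()
```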
Force-pushed from 6a33ee4 to 24c3585 (compare)
python/ray/tune/trial_executor.py
Outdated
f"This could be due to the cluster not having enough " | ||
f"resources available to start the next trial. Please stop " | ||
f"the tuning job and readjust resources_per_trial argument " | ||
f"passed into tune.run() and/or start a cluster with more " | ||
f"resources.") | ||
|
||
|
||
# A beefed up version when Tune Error is raised. | ||
def _get_insufficient_resources_error_msg(trial_resources: Resources) -> str: | ||
return (f"You ask for {trial_resources.cpu} cpu and {trial_resources.gpu} " |
does this actually work (does it capture the distributed placement group resources too)? @krfricke
So the way to test this is just to try using an RLlib agent and make the number of workers > num cpus on the machine. I suspect that trial_resources.cpu is not actually the total sum of resources requested. You might need to try something else (like checking the placement group and summing the resources on the placement group).
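To illustrate the point, summing over placement group bundles would look roughly like this (a plain-dict sketch under assumed bundle shapes, not Tune's actual implementation):

```python
from typing import Dict, List


def sum_bundle_resources(bundles: List[Dict[str, float]]) -> Dict[str, float]:
    """Total the resources requested across all placement group bundles."""
    total: Dict[str, float] = {}
    for bundle in bundles:
        for resource, amount in bundle.items():
            total[resource] = total.get(resource, 0.0) + amount
    return total


# E.g. an RLlib PPO trial: one 1-CPU bundle for the driver plus one
# 1-CPU bundle per rollout worker. With 17 workers this sums to 18
# CPUs, which exceeds a 16-CPU machine.
bundles = [{"CPU": 1.0}] + [{"CPU": 1.0}] * 17
assert sum_bundle_resources(bundles)["CPU"] == 18.0
```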
return (f"You ask for {trial_resources.cpu} cpu and {trial_resources.gpu} " | |
return (f"You asked for {trial_resources.cpu} cpu and {trial_resources.gpu} " |
Thanks for the pointer, Richard. You are right; for that case, I need to use information from the PlacementGroup.
So I used coin_game_env.py as my training job and updated num_workers = 17 (I have 16 CPU cores).
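For reference, the repro boils down to something like the following (a sketch; the config keys are standard RLlib options of that era, but the exact settings in coin_game_env.py differ):

```python
from ray import tune

# PPO requests 1 driver CPU plus 1 CPU per rollout worker, so
# num_workers=17 needs 18 CPUs on this 16-CPU machine and the single
# trial can never be scheduled.
tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",  # stand-in for the CoinGame env
        "num_workers": 17,
        "framework": "tf",
    },
    stop={"training_iteration": 1},
)
```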
On clean master (which currently has the bug and is thus spammy):
(ray) xwjiang@xw ~/ray master ± python rllib/examples/coin_game_env.py
2021-09-02 21:13:16,008 INFO services.py:1263 -- View the Ray dashboard at http://127.0.0.1:8266
== Status ==
Memory usage on this node: 37.3/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/19.07 GiB heap, 0.0/9.54 GiB objects
Result logdir: /Users/xwjiang/ray_results/PPO_AsymCG
Number of trials: 1/1 (1 PENDING)
+--------------------------+----------+-------+--------+
| Trial name | status | loc | seed |
|--------------------------+----------+-------+--------|
| PPO_CoinGame_44a68_00000 | PENDING | | 0 |
+--------------------------+----------+-------+--------+
2021-09-02 21:13:19,617 WARNING trial_executor.py:247 -- No trial is running and no new trial has been started within at least the last 1.0 seconds. This could be due to the cluster not having enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.
2021-09-02 21:13:20,618 WARNING trial_executor.py:247 -- No trial is running and no new trial has been started within at least the last 1.0 seconds. This could be due to the cluster not having enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.
2021-09-02 21:13:21,618 WARNING trial_executor.py:247 -- No trial is running and no new trial has been started within at least the last 1.0 seconds. This could be due to the cluster not having enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.
2021-09-02 21:13:22,618 WARNING trial_executor.py:247 -- No trial is running and no new trial has been started within at least the last 1.0 seconds. This could be due to the cluster not having enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.
2021-09-02 21:13:23,619 WARNING trial_executor.py:247 -- No trial is running and no new trial has been started within at least the last 1.0 seconds. This could be due to the cluster not having enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.
== Status ==
Memory usage on this node: 37.3/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/19.07 GiB heap, 0.0/9.54 GiB objects
Result logdir: /Users/xwjiang/ray_results/PPO_AsymCG
Number of trials: 1/1 (1 PENDING)
+--------------------------+----------+-------+--------+
| Trial name | status | loc | seed |
|--------------------------+----------+-------+--------|
| PPO_CoinGame_44a68_00000 | PENDING | | 0 |
+--------------------------+----------+-------+--------+
With the change (not spammy, and a TuneError with information is raised to the user):
(ray) ✘ xwjiang@xw ~/ray tmp ± python rllib/examples/coin_game_env.py
2021-09-02 21:54:02,425 INFO services.py:1263 -- View the Ray dashboard at http://127.0.0.1:8266
== Status ==
Memory usage on this node: 37.3/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/18.95 GiB heap, 0.0/9.48 GiB objects
Result logdir: /Users/xwjiang/ray_results/PPO_AsymCG
Number of trials: 1/1 (1 PENDING)
+--------------------------+----------+-------+--------+
| Trial name | status | loc | seed |
|--------------------------+----------+-------+--------|
| PPO_CoinGame_f6a18_00000 | PENDING | | 0 |
+--------------------------+----------+-------+--------+
== Status ==
Memory usage on this node: 37.3/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/18.95 GiB heap, 0.0/9.48 GiB objects
Result logdir: /Users/xwjiang/ray_results/PPO_AsymCG
Number of trials: 1/1 (1 PENDING)
+--------------------------+----------+-------+--------+
| Trial name | status | loc | seed |
|--------------------------+----------+-------+--------|
| PPO_CoinGame_f6a18_00000 | PENDING | | 0 |
+--------------------------+----------+-------+--------+
== Status ==
Memory usage on this node: 37.3/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/18.95 GiB heap, 0.0/9.48 GiB objects
Result logdir: /Users/xwjiang/ray_results/PPO_AsymCG
Number of trials: 1/1 (1 PENDING)
+--------------------------+----------+-------+--------+
| Trial name | status | loc | seed |
|--------------------------+----------+-------+--------|
| PPO_CoinGame_f6a18_00000 | PENDING | | 0 |
+--------------------------+----------+-------+--------+
Traceback (most recent call last):
File "rllib/examples/coin_game_env.py", line 83, in <module>
main(debug_mode, args.stop_iters, args.tf, use_asymmetric_env)
File "rllib/examples/coin_game_env.py", line 68, in main
tune_analysis = tune.run(
File "/Users/xwjiang/ray/python/ray/tune/tune.py", line 585, in run
runner.step()
File "/Users/xwjiang/ray/python/ray/tune/trial_runner.py", line 627, in step
self._run_and_catch(self.trial_executor.on_no_available_trials)
File "/Users/xwjiang/ray/python/ray/tune/trial_runner.py", line 394, in _run_and_catch
func(self.get_trials())
File "/Users/xwjiang/ray/python/ray/tune/trial_executor.py", line 328, in on_no_available_trials
self._may_warn_insufficient_resources(trials)
File "/Users/xwjiang/ray/python/ray/tune/trial_executor.py", line 307, in _may_warn_insufficient_resources
raise TuneError(
ray.tune.error.TuneError: You asked for 18.0 cpu and 0.0 gpu per trial, but the cluster only has 16.0 cpu and 0 gpu.The cluster does not have enough resources available to start the next trial. Please stop the tuning job and readjust resources_per_trial argument passed into tune.run() and/or start a cluster with more resources.
Richard, so my error information is directing users to resources_per_trial, while for this case, we should direct users to num_workers.
I added both hints in the message. It may look a bit bloated, let me know what you think!
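Roughly, the combined hint reads like this (a sketch of the shape of the message, not the exact final wording in the PR):

```python
def _insufficient_resources_error_msg(asked_cpu: float, asked_gpu: float,
                                      total_cpu: float,
                                      total_gpu: float) -> str:
    # Mention both knobs: resources_per_trial for plain Tune jobs and
    # num_workers for RLlib agents backed by placement groups.
    return (
        f"You asked for {asked_cpu} cpu and {asked_gpu} gpu per trial, "
        f"but the cluster only has {total_cpu} cpu and {total_gpu} gpu. "
        "Please stop the tuning job and adjust the resources requested "
        "per trial (possibly via `resources_per_trial` passed into "
        "`tune.run()`, or via `num_workers` for RLlib agents) and/or "
        "start a cluster with more resources.")
```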
Yeah, I really like the new messaging/warning UI.
 return float(
-    os.environ.get("TUNE_WARN_INSUFFICENT_RESOURCE_THRESHOLD_S", "1"))
+    os.environ.get("TUNE_WARN_INSUFFICENT_RESOURCE_THRESHOLD_S", "10"))
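For reference, the threshold can be overridden through that environment variable before Tune starts (note the variable name keeps its existing `INSUFFICENT` spelling, copied verbatim from the diff above):

```python
import os

# Wait 60s instead of the default 10s before the hard failure fires.
os.environ["TUNE_WARN_INSUFFICENT_RESOURCE_THRESHOLD_S"] = "60"
```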
It seems like on the product, this is being raised too frequently?
The problem in the current version of the code is less about the frequency and more that the assumption behind the condition check is off. See this line. It incorrectly concludes that every time we arrive at this condition check with no running trial, we are not making progress, which is wrong.
E.g. [trial1(PENDING)] --> [trial1(TERMINATED), trial2(PENDING)] --> [trial1(TERMINATED), trial2(TERMINATED), trial3(PENDING)] etc. At each step there is no running trial, yet progress is being made.
The current PR addresses this by changing the condition check to the length of all_trials, which corrects the bug.
As for the frequency: a warning message is printed for the autoscaler case every 60s. For the no-autoscaler case, we throw an error after 10s if the cluster doesn't have enough resources to fulfill a trial, which is a hard fail (no repeated printing).
If 60s and 10s are too short, we can certainly lengthen them. We just need to trade that off against how long users are left wondering what is happening with their pending trials without being informed.
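In sketch form, the corrected check looks something like this (assumed names and structure; the real trial_executor code differs):

```python
import time


class InsufficientResourcesManager:
    """Flag 'no progress' only when no *new* trials appear, not merely
    when no trial is RUNNING at the moment of the check."""

    def __init__(self, threshold_s: float = 10.0):
        self._threshold_s = threshold_s
        self._last_trial_count = 0
        self._no_progress_since = None

    def on_no_available_trials(self, all_trials) -> None:
        if len(all_trials) > self._last_trial_count:
            # A new trial was created since the last check, so the run
            # is making progress even if nothing is RUNNING right now.
            self._last_trial_count = len(all_trials)
            self._no_progress_since = None
            return
        now = time.monotonic()
        if self._no_progress_since is None:
            self._no_progress_since = now
        elif now - self._no_progress_since >= self._threshold_s:
            raise RuntimeError(
                "Insufficient cluster resources to start the next trial.")
```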
We should probably test this in the product.
Sure, although the nature of the bug is more within the core Tune code. See the response to Richard's comment above; I pasted a side-by-side comparison with and without the change. Let me know if you want some more coverage.
Also, Richard and Ameer: if we need more time to settle this PR, should we first revert it on master?
Actually, I think this PR is good (with a small adjustment to the error message).
I think this PR is good (with a small adjustment to the error message). I recommend that whoever is able to merge do so once the adjustment is made.
Generally looks good and seems to work well. Just a couple of nits.
@xwjiang2010, I think the assumption here is that the autoscaler starts with the cluster, but the autoscaler can sometimes take a few minutes to start (in some cases); wouldn't this crash?
@AmeerHajAli A hard crash will only happen in the "no autoscaler" case. For the autoscaler case, since we don't know definitively what is specified in the cluster.yaml file, we only print a warning message every 60s by default.
We can lengthen the 60s for sure. Let me know what a good number is from your experience with the autoscaler.
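The 60s cadence is just a rate limit on the warning; in sketch form (assumed names, not the actual Tune code):

```python
import logging
import time

logger = logging.getLogger(__name__)

_WARN_INTERVAL_S = 60.0  # the default discussed above
_last_warn_time = 0.0


def maybe_warn_autoscaler_pending() -> None:
    """Emit the insufficient-resources hint at most once per interval,
    since the autoscaler may legitimately still be adding nodes."""
    global _last_warn_time
    now = time.monotonic()
    if now - _last_warn_time >= _WARN_INTERVAL_S:
        _last_warn_time = now
        logger.warning(
            "No trial is running. The autoscaler may still be "
            "provisioning nodes; will check again in %.0f seconds.",
            _WARN_INTERVAL_S)
```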
When the autoscaler is disabled, raise an error when there are insufficient resources.
Why are these changes needed?
When the autoscaler is disabled, raise an error when there are insufficient resources.
Tested by running on a local cluster on my laptop.
Sample output:
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.