[train] Fix broken tune tests and support ray storage #38950

justinvyu · 2023-08-26T19:39:03Z

Why are these changes needed?

This PR re-introduces support for Ray storage (ray.init(storage="s3://...")) and fixes a broken Tune controller test.

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu · 2023-08-26T19:42:00Z

python/ray/tune/execution/tune_controller.py

+            # If a decision is already cached, don't override it for CONTINUE/NOOP
+            # decisions. Only escalate the cached decision to a STOP/PAUSE if requested.
+            # This is only very relevant for pausing trials, since we cache a PAUSE
+            # decision to happen after a save finishes.
+            # We need to make sure that we don't override it in the
+            # time between the save operation starting and finishing.
+            if decision in [TrialScheduler.STOP, TrialScheduler.PAUSE]:
+                self._cached_trial_decisions[trial.trial_id] = decision


This is pretty confusing. We do pause + save checkpoint as 2 steps:

The first time we try to schedule a pause, we schedule a save instead, then we CACHE a pause decision.

Then, once the save finishes, we pop the cached decision and execute it. It should be a PAUSE again, in which case we enter the should_checkpoint=False condition, stop the trial, and set its status to PAUSED.

In between these 2 steps, if the scheduler outputs a decision other than PAUSE (e.g. NOOP), the trial will hang forever. So we need to make sure that the cached PAUSE decision is not overridden by an unrelated decision while the save is happening.

I allow STOP decisions to override, but it's unclear if this ever happens.
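The caching rule described in this comment can be sketched in isolation. This is a minimal illustration, not the controller's actual code: the string constants are assumed stand-ins for the TrialScheduler decision values, and cache_decision is a hypothetical helper.

```python
# Stand-ins for the TrialScheduler decision constants (assumed values,
# used here only for illustration).
CONTINUE, NOOP, PAUSE, STOP = "CONTINUE", "NOOP", "PAUSE", "STOP"


def cache_decision(cached: dict, trial_id: str, decision: str) -> None:
    """Hypothetical helper: record a decision while a save is in flight.

    Only STOP/PAUSE are written, so a CONTINUE/NOOP that arrives while the
    save is running cannot replace a PAUSE that was cached earlier.
    """
    if decision in (STOP, PAUSE):
        cached[trial_id] = decision


cached_decisions = {}
cache_decision(cached_decisions, "trial_1", PAUSE)  # pause requested, save starts
cache_decision(cached_decisions, "trial_1", NOOP)   # scheduler result mid-save
cache_decision(cached_decisions, "trial_2", NOOP)   # a NOOP alone caches nothing
```

After these calls, only trial_1 has a cached decision, and it is still the PAUSE.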

Hm, _cached_trial_decisions is only updated on saves. The scheduler actions only affect _queued_trial_decisions. (I know the naming is confusing... happy to rename it.)

Does this still come up? Is this like a double-triggered save? In that case, should we have this check:

    if trial.temporary_state.saving_to:
        # If a save is already in progress, don't schedule another one.
        return trial.temporary_state.saving_to

in the new storage path _schedule_trial_save as well?
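To make the suggested check concrete, here is a small sketch of a save scheduler that reuses an in-flight save. Everything in it is a toy stand-in: ToyTrial, TemporaryState, schedule_trial_save, and the checkpoint naming are assumptions for illustration, not Ray's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TemporaryState:
    # Handle of the save that is currently in flight, if any.
    saving_to: Optional[str] = None


@dataclass
class ToyTrial:
    trial_id: str
    temporary_state: TemporaryState = field(default_factory=TemporaryState)


def schedule_trial_save(trial: ToyTrial, started: List[str]) -> str:
    """Hypothetical stand-in for a save scheduler (not Ray's code)."""
    if trial.temporary_state.saving_to:
        # A save is already in progress, so reuse it instead of scheduling another.
        return trial.temporary_state.saving_to
    started.append(trial.trial_id)  # record that a real save was started
    trial.temporary_state.saving_to = f"checkpoint_{len(started):06d}"
    return trial.temporary_state.saving_to


started_saves: List[str] = []
trial = ToyTrial("trial_1")
first = schedule_trial_save(trial, started_saves)
second = schedule_trial_save(trial, started_saves)  # returns the in-flight save
```

The second call returns the same handle as the first, and only one save is recorded as started.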

I see, you're right. The problem is a little different from what I described above. Here's what's actually happening to cause that unit test to fail:

The test defines a scheduler that manually calls tune_controller.pause_trial(trial) during the on_trial_result hook

should_checkpoint=True by default, so this will schedule a SAVE and set the cached trial decision here:

ray/python/ray/tune/execution/tune_controller.py

Line 1631 in 3984b85

self._cached_trial_decisions[trial.trial_id] = TrialScheduler.PAUSE

At this point, we return a NOOP trial scheduler decision (since we paused manually) and end up here:

ray/python/ray/tune/execution/tune_controller.py

Lines 1767 to 1768 in 3984b85

    decision = self._scheduler_alg.on_trial_result(
        self._wrapped(), trial, flat_result
    )

The trial IS SAVING at this point, so we enter this block:

ray/python/ray/tune/execution/tune_controller.py

Lines 1806 to 1820 in 3984b85

    if trial.is_saving:
        logger.debug(f"Caching trial decision for trial {trial}: {decision}")
        # Cache decision to execute on after the save is processed.
        # This prevents changing the trial's state or kicking off
        # another training step prematurely.

        # If a decision is already cached, don't override it for CONTINUE/NOOP
        # decisions. Only escalate the cached decision to a STOP/PAUSE if requested.
        # This is only very relevant for pausing trials, since we cache a PAUSE
        # decision to happen after a save finishes.
        # We need to make sure that we don't override it in the
        # time between the save operation starting and finishing.
        if decision in [TrialScheduler.STOP, TrialScheduler.PAUSE]:
            self._cached_trial_decisions[trial.trial_id] = decision
        return None

Without the STOP/PAUSE check above, the NOOP overwrites the cached PAUSE decision, leading to an infinite hang.

The conclusion: Calling pause_trial(should_checkpoint=True) directly inside a scheduler's on_trial_result leads to a hang.

PBT doesn't run into this problem because it calls pause_trial(should_checkpoint=False).
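The sequence above can be replayed on a toy model to see why the trial gets stuck. This is a simplified sketch under stated assumptions: run_sequence, the string constants, and the single-trial cache are hypothetical, and a trial that stays RUNNING here stands in for the real hang.

```python
NOOP, PAUSE, STOP = "NOOP", "PAUSE", "STOP"  # stand-ins for TrialScheduler constants


def run_sequence(guard: bool) -> str:
    """Replay the steps on a toy model and return the final trial status."""
    status = "RUNNING"
    cached = {}

    # 1. pause_trial(should_checkpoint=True): a save starts and PAUSE is cached.
    is_saving = True
    cached["t"] = PAUSE

    # 2. on_trial_result returns NOOP while the save is still in flight.
    decision = NOOP
    if is_saving and (not guard or decision in (STOP, PAUSE)):
        cached["t"] = decision

    # 3. The save finishes; the cached decision is popped and executed.
    is_saving = False
    if cached.pop("t", None) == PAUSE:
        status = "PAUSED"
    return status


without_guard = run_sequence(guard=False)  # cached PAUSE was replaced by NOOP
with_guard = run_sequence(guard=True)      # cached PAUSE survives the NOOP
```

In this model the unguarded run never reaches PAUSED, while the guarded run does.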

justinvyu · 2023-08-26T19:42:50Z

python/ray/tune/tests/execution/test_controller_resources_integration.py

+                trial.set_status(Trial.PAUSED)
                trial.update_resources(dict(cpu=4, gpu=0))
+                trial.set_status(orig_status)
+


update_resources errors since the trial is still RUNNING (even if you put it after pause_trial).

justinvyu · 2023-08-26T23:52:00Z

python/ray/tune/tests/execution/test_controller_resources_integration.py

+                # NOTE: This is a hack to get around the new pausing logic,
+                # which doesn't set the trial status to PAUSED immediately.
+                orig_status = trial.status
+                trial.set_status(Trial.PAUSED)
                trial.update_resources(dict(cpu=4, gpu=0))
+                trial.set_status(orig_status)
+                return TrialScheduler.PAUSE


I've reverted the tune controller changes and updated the test to pass, but calling pause_trial within on_trial_result is still an issue.
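The status-swapping hack in the test could also be written as a context manager so the original status is always restored. This is only a sketch: ToyTrial is a toy stand-in rather than the real Trial class, and temporarily_paused is a hypothetical helper that is not part of this PR.

```python
from contextlib import contextmanager


class ToyTrial:
    """Toy stand-in for a Tune trial; not the real ray.tune Trial class."""

    def __init__(self):
        self.status = "RUNNING"
        self.resources = {"cpu": 2, "gpu": 0}

    def set_status(self, status):
        self.status = status

    def update_resources(self, resources):
        # Models the constraint hit in the test: no updates while RUNNING.
        if self.status == "RUNNING":
            raise ValueError("Cannot update resources while the trial is running.")
        self.resources = dict(resources)


@contextmanager
def temporarily_paused(trial):
    """Hypothetical helper: mark the trial PAUSED, then restore its status."""
    orig_status = trial.status
    trial.set_status("PAUSED")
    try:
        yield trial
    finally:
        trial.set_status(orig_status)


trial = ToyTrial()
with temporarily_paused(trial):
    trial.update_resources({"cpu": 4, "gpu": 0})
```

The try/finally means the original status comes back even if update_resources raises.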


justinvyu added 5 commits August 26, 2023 11:15

ray storage

b7fabe4

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

fix tune controller resources test

792dcb3

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

stop decisions should probably overrule pause... unclear how this happens

8e48a18

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

skip ray storage for now

bd1e7f1

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

fix lint

3984b85

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu requested review from ericl, matthewdeng and krfricke August 26, 2023 19:39

justinvyu assigned matthewdeng and krfricke Aug 26, 2023

justinvyu commented Aug 26, 2023

justinvyu added 2 commits August 26, 2023 15:50

fix test more minimally

88d4e8f

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into fix_broken_ci

720666b

justinvyu commented Aug 26, 2023

fix lint

c1c9e0c

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) Aug 27, 2023

matthewdeng added the v2.7.0-pick label Aug 27, 2023

krfricke approved these changes Aug 27, 2023

krfricke merged commit b19bcab into ray-project:master Aug 27, 2023
