[rllib] Fix storage-path related tests #38947

krfricke · 2023-08-26T14:33:43Z

Why are these changes needed?

This PR fixes RLlib-related tests that no longer passed after the changes related to the new storage context.

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Kai Fricke <kai@anyscale.com>

Signed-off-by: matthewdeng <matt@anyscale.com>

Signed-off-by: Kai Fricke <kai@anyscale.com>

justinvyu

thanks for the fix!

justinvyu · 2023-08-26T17:16:55Z

rllib/algorithms/algorithm.py

@@ -262,7 +263,7 @@ class Algorithm(Trainable, AlgorithmBase):

    @staticmethod
    def from_checkpoint(
-        checkpoint: Union[str, Checkpoint, NewCheckpoint],
+        checkpoint: Union[str, Checkpoint, NewCheckpoint, _TrainingResult],


Can we change the call sites that pass the direct output of save so that they pass the checkpoint path instead? I feel like Algorithm.restore is the only thing that should take in a training result, since it's paired with save.

Makes sense, updated
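
A minimal sketch of the pattern agreed on above, not the actual RLlib code: the loader accepts only a path or a checkpoint object, and callers unpack the path from the save output themselves. `Checkpoint`, `TrainingResult`, and `checkpoint_path_for_loading` are hypothetical stand-ins defined here for illustration, and the `.checkpoint.path` attribute chain on the save output is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union


@dataclass
class Checkpoint:
    """Hypothetical stand-in for a checkpoint object; only `path` is modeled."""

    path: str


@dataclass
class TrainingResult:
    """Hypothetical stand-in for the result object returned by save()."""

    checkpoint: Checkpoint
    metrics: Dict[str, Any] = field(default_factory=dict)


def checkpoint_path_for_loading(checkpoint: Union[str, Checkpoint]) -> str:
    """Normalize what a from_checkpoint-style loader accepts to a path.

    Training results are rejected on purpose, mirroring the review
    suggestion that only restore() should accept the output of save().
    """
    if isinstance(checkpoint, str):
        return checkpoint
    if isinstance(checkpoint, Checkpoint):
        return checkpoint.path
    raise TypeError(
        f"Expected a path or Checkpoint, got {type(checkpoint).__name__}; "
        "pass result.checkpoint.path instead of the raw save() output."
    )


# Caller side: unpack the path from the save() result before loading.
save_result = TrainingResult(Checkpoint("/tmp/algo/checkpoint_000001"), {"iter": 1})
loaded_from = checkpoint_path_for_loading(save_result.checkpoint.path)
print(loaded_from)
```

Keeping the loader's accepted types narrow means a caller that forgets to unpack the result fails immediately with a readable error rather than later inside the loading logic.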

justinvyu · 2023-08-26T17:17:26Z

rllib/train.py

+        if trial.checkpoint.path:
+            checkpoints.append(trial.checkpoint.path)


Suggested change

-        if trial.checkpoint.path:
-            checkpoints.append(trial.checkpoint.path)
+        if trial.checkpoint:
+            checkpoints.append(trial.checkpoint)

All we use this for is a print later on, so it's nicer to include the filesystem (fs) in the string.

Also not sure if this is necessary, but this suggestion would technically allow this to be backwards compatible (though I'm not sure what the output representation ends up looking like)
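
To illustrate the difference between the two variants, here is a small sketch with a stand-in class. `FakeCheckpoint` and its `__repr__` format are assumptions made up for this example; the real checkpoint class may print differently, which is exactly the open question in the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeCheckpoint:
    """Hypothetical stand-in for a checkpoint that knows its filesystem."""

    filesystem: str  # e.g. "local" or "s3"; a plain string here for simplicity
    path: str

    def __repr__(self) -> str:
        # Assumed format, chosen only to show the filesystem in the output.
        return f"Checkpoint(filesystem={self.filesystem}, path={self.path})"


checkpoints_as_paths = []
checkpoints_as_objects = []
for ckpt in [FakeCheckpoint("s3", "bucket/exp/trial_0/checkpoint_000003"), None]:
    # A trial without a checkpoint (None) is skipped in both variants.
    if ckpt:
        checkpoints_as_paths.append(ckpt.path)
        checkpoints_as_objects.append(ckpt)

# Only the second list tells the reader which filesystem the path lives on.
print(checkpoints_as_paths)
print(checkpoints_as_objects)
```

Checking `if ckpt:` rather than `if ckpt.path:` also avoids an `AttributeError` in the case where the checkpoint itself is `None`.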

Signed-off-by: Kai Fricke <kai@anyscale.com>

matthewdeng · 2023-08-28T03:21:05Z

Leaving a reminder here to re-enable the new persistence path for relevant tests when merging master, since I disabled the flag in #38965

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke · 2023-08-28T09:12:21Z

rllib/tests/test_ray_client.py

            tune.Tuner(
                tune.with_resources(
-                    experiment, ppo.PPO.default_resource_request(config)
+                    wrapped_experiment, ppo.PPO.default_resource_request(config)


Btw, I have no idea why this test started failing in this PR.

krfricke · 2023-08-28T09:12:40Z

> Leaving a reminder here to re-enable the new persistence path for relevant tests when merging master, since I disabled the flag in #38965

Removed the flag in this PR, ptal

Signed-off-by: Kai Fricke <kai@anyscale.com>

# Conflicts: # python/ray/tune/tests/test_api.py

Signed-off-by: Kai Fricke <kai@anyscale.com>

sven1977

LGTM, just one question: Did we change/fix the RLlib docs occurrences, where algo.save() is used (and assumed to return a path as a str)?

krfricke · 2023-08-29T15:13:25Z

> LGTM, just one question: Did we change/fix the RLlib docs occurrences, where algo.save() is used (and assumed to return a path as a str)?

Good point. I've just taken a look and there's at least one occurrence where we haven't. Will update now!
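
For doc snippets that still treat the output of algo.save() as a string, a compatibility helper along these lines is one possible approach. This is only a sketch: `checkpoint_path_from_save` is a hypothetical helper, the objects below are stand-ins built with `SimpleNamespace` rather than real RLlib results, and the `.checkpoint.path` attribute chain is an assumption.

```python
from types import SimpleNamespace
from typing import Any


def checkpoint_path_from_save(save_output: Any) -> str:
    """Return a checkpoint path from either style of save() output.

    Assumption: older code received a plain path string, while the newer
    result object exposes the path as `.checkpoint.path`.
    """
    if isinstance(save_output, str):
        return save_output
    return save_output.checkpoint.path


# Stand-ins for the two styles of return value.
old_style = "/tmp/algo/checkpoint_000001"
new_style = SimpleNamespace(
    checkpoint=SimpleNamespace(path="/tmp/algo/checkpoint_000001"),
    metrics={},
)

print(checkpoint_path_from_save(old_style))
print(checkpoint_path_from_save(new_style))
```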

Signed-off-by: Kai Fricke <kai@anyscale.com>

This PR fixes rllib-related tests that didn't pass changes related to the new storage context. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: matthewdeng <matt@anyscale.com> Co-authored-by: matthewdeng <matt@anyscale.com>

@justinvyu

* [train] enable new persistence mode for core and serve tests (#38938)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [train] New persistence mode: Update 🐠 `ML Libraries w/ Ray Client Examples (Python 3.7)` (#38923)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [train] remove non-URI assertion (#38944)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [train] New persistence mode: Update 📖 `Doc tests and examples (excluding Ray AIR examples)` (#38940)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
  Signed-off-by: Matthew Deng <matt@anyscale.com>
  Co-authored-by: Matthew Deng <matt@anyscale.com>
* disable legacy sync config logic in trainable (#38952)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [2.7 CI][New Persistent Mode][6/n] 📖 ✈️ Ray AIR examples (#38918)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [2.7 CI][New Persistent Mode][2/n] 📺 📖 Doc GPU tests and examples (#38905)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [2.7 CI][New Persistent Mode][4/n] 📺 🚂 Train GPU tests & 🚂 Datasets Train Integration GPU Tests and Examples (#38910)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
  Co-authored-by: Justin Yu <justinvyu@anyscale.com>
* [2.7 CI][New Persistent Mode][1/n] 📺 ✈️ AIR GPU tests (ray/air) & ⚡ :python: Lightning 2.0 Train GPU tests (#38903)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
  Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
* [train] Fix broken tune tests and support ray storage (#38950)
  This PR re-introduces support for ray storage ray.init(storage="s3://...") and fixes a broken tune controller test.
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [train] New persistence mode: Finish migrating `xgb`, `lgbm` and `sklearn` trainers, checkpoints + tests (#38959)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [2.7 CI][New Persistent Mode][5/n] 📖 Doc examples for external code (#38915)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [train][rllib] temporarily disable new persistence mode for rllib tests (#38965)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [2.7 CI][New Persistent Mode][8/n] ✈️ AIR tests (ray/air) (#38932)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [tune] Storage: 🐙 🧠 Tune tests and examples {using RLlib} migration (#38895)
  Signed-off-by: Kai Fricke <kai@anyscale.com>
  Co-authored-by: matthewdeng <matt@anyscale.com>
* [train] Fix MosaicTrainer example and unit test (#38970)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [air/release] Fix dreambooth example image preprocessing logic (#39020)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [train] clean up ray.train._checkpoint imports (#38951)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [train] high level cleanup of Ray Train docs (#38971)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [wip][docs] update FrameworkPredictor examples (#38634)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
  Signed-off-by: matthewdeng <matt@anyscale.com>
* [train] Add documentation for using metadata argument to save preprocessors (#38701)
* [Train] Restructure Ray Train Example Page (#38814)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [air] Deprecate some fields/classes that are supposed to be gone in 2.6. (#38794)
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* [tune/storage] Fix Tune multinode tests (#39050)
  Fixes multinode tests by using the new train.report() API.
  Signed-off-by: Kai Fricke <kai@anyscale.com>
* [tune] Fix BOHB example for new storage (#38983)
  The new storage path does not create "empty" checkpoints per default anymore.

  Previously, when no checkpoint is saved, PAUSEing a trial would create a dummy checkpoint that only contains trial metadata (such as the iteration number). This is not the case anymore. Examples now have to implement checkpointing to properly restore previous state. This was also true previously - but some of our simple examples (e.g. the one in this PR) didn't implement it and still "worked". I think it's fine to keep the functionality as is and require our examples to show checkpointing implementations. This will ensure that users don't shoot their feet trying to use e.g. BOHB.

  Separately, BOHB was malfunctioning as trials were repeatedly PAUSED and restarted as they've never been removed from `bracket.trials_to_unpause`. @justinvyu mentioned this in the review where it was introduced and I believed at the time it wasn't necessary - turns out it is, as we can end up in a situation where a bracket is never finished because trials are constantly running. This was not caught by any tests. We should add one in a follow-up - for now we can proceed with this PR to pick onto Ray 2.7.
  Signed-off-by: Kai Fricke <kai@anyscale.com>
* [Release Test] Fix `long_running_horovod_tune_test`. (#39012)
  Signed-off-by: Yunxuan Xiao <yunxuanx@anyscale.com>
  Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
* [train] New persistence mode: `StorageContext` unit tests (#39023)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [train] enable train + tune tests and examples (#39021)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [rllib] Fix storage-path related tests (#38947)
  This PR fixes rllib-related tests that didn't pass changes related to the new storage context.
  Signed-off-by: Kai Fricke <kai@anyscale.com>
  Signed-off-by: matthewdeng <matt@anyscale.com>
  Co-authored-by: matthewdeng <matt@anyscale.com>
* [train] New persistence mode: Migrate 🐙 `Tune tests and examples (medium)` (#39081)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>

---------

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: matthewdeng <matt@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>


Kai Fricke and others added 3 commits August 25, 2023 13:59

test_api, Durable

c12b5ef

Signed-off-by: Kai Fricke <kai@anyscale.com>

[rllib] Fix storage path tests

8643515

Signed-off-by: Kai Fricke <kai@anyscale.com>

Merge branch 'master' into rllib/storage-tests

44de41f

Signed-off-by: matthewdeng <matt@anyscale.com>

matthewdeng added the v2.7.0-pick label Aug 26, 2023

Update tests

15cd1e3

Signed-off-by: Kai Fricke <kai@anyscale.com>

justinvyu reviewed Aug 26, 2023


Kai Fricke added 7 commits August 26, 2023 20:41

evaluate

91dcd19

Signed-off-by: Kai Fricke <kai@anyscale.com>

Merge remote-tracking branch 'upstream/master' into rllib/storage-tests

627a4a5

restore from checkpoint

85b45b3

Signed-off-by: Kai Fricke <kai@anyscale.com>

eval

6995aad

Signed-off-by: Kai Fricke <kai@anyscale.com>

custom experiment

d5b2d58

Signed-off-by: Kai Fricke <kai@anyscale.com>

eval

23ba3ff

Signed-off-by: Kai Fricke <kai@anyscale.com>

eval from checkpoint path

dd2a315

Signed-off-by: Kai Fricke <kai@anyscale.com>

matthewdeng mentioned this pull request Aug 27, 2023

[train][rllib] temporarily disable new persistence mode for rllib tests #38965

Merged

8 tasks

krfricke marked this pull request as ready for review August 28, 2023 07:59

krfricke requested review from sven1977, gjoliver, avnishn, ArturNiederfahrenhorst, smorad, maxpumperla and kouroshHakha as code owners August 28, 2023 07:59

Kai Fricke added 2 commits August 28, 2023 10:00

Merge remote-tracking branch 'upstream/master' into rllib/storage-tests

7dd026e

Fix ray client test

ee6e8f9

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke commented Aug 28, 2023


Kai Fricke added 3 commits August 28, 2023 14:44

restore from path

8276bcd

Signed-off-by: Kai Fricke <kai@anyscale.com>

checkpoint dir

0d4353c

Signed-off-by: Kai Fricke <kai@anyscale.com>

Merge remote-tracking branch 'upstream/master' into rllib/storage-tests

c6be114

# Conflicts: # python/ray/tune/tests/test_api.py

fix merge conflict

a7e2ea2

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke requested review from justinvyu and matthewdeng August 29, 2023 08:21

sven1977 approved these changes Aug 29, 2023


doc updates

9a49f3d

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke requested a review from a team as a code owner August 29, 2023 15:17

krfricke merged commit 2ffd7e4 into ray-project:master Aug 29, 2023

krfricke deleted the rllib/storage-tests branch August 29, 2023 18:51
