-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tune] Storage: 🐙 🧠 Tune tests and examples {using RLlib} migration #38895
Conversation
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
if not self.storage_filesystem or isinstance( | ||
self.storage_filesystem, pyarrow.fs.LocalFileSystem | ||
): | ||
self.storage_local_path = self.storage_path | ||
else: | ||
self.storage_local_path = default_results_dir | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinvyu lmk what you think. Previously the storage_local_path
was always the ray_results directory. But I think we should actually set it to the local path if a local path was set. This also matches previous behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with this. @ericl did not want ANY parsing of the storage path (i.e detecting if it's remote or not).
But, because this is AFTER pyrarrow.fs.from_uri
, and we're just doing a typecheck of the pyarrow fs, I think this is ok.
The difference is that previously, we did some checks on the raw uri that was being passed by the user.
This will also remove the need for us to either re-introduce some kind of local_dir
param or make the environment variable more visible.
storage_path
is the 1 way to set the experiment path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! One fix, and some other comments
if not self.storage_filesystem or isinstance( | ||
self.storage_filesystem, pyarrow.fs.LocalFileSystem | ||
): | ||
self.storage_local_path = self.storage_path | ||
else: | ||
self.storage_local_path = default_results_dir | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with this. @ericl did not want ANY parsing of the storage path (i.e detecting if it's remote or not).
But, because this is AFTER pyrarrow.fs.from_uri
, and we're just doing a typecheck of the pyarrow fs, I think this is ok.
The difference is that previously, we did some checks on the raw uri that was being passed by the user.
This will also remove the need for us to either re-introduce some kind of local_dir
param or make the environment variable more visible.
storage_path
is the 1 way to set the experiment path.
python/ray/tune/trainable/session.py
Outdated
# Backwards compatibility | ||
from ray import train | ||
|
||
report_dict = kwargs | ||
if _metric is not None: | ||
report_dict["_metric"] = _metric | ||
|
||
train.report(report_dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we just force users to switch to train.report? otherwise we'll have to force users to do another migration post 2.7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, at minimum add a warning here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a deprecation warning for now. Let's check if we can hard deprecate in a follow-up
if not self.storage_filesystem or isinstance( | ||
self.storage_filesystem, pyarrow.fs.LocalFileSystem | ||
): | ||
self.storage_local_path = self.storage_path | ||
else: | ||
self.storage_local_path = default_results_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not self.storage_filesystem or isinstance( | |
self.storage_filesystem, pyarrow.fs.LocalFileSystem | |
): | |
self.storage_local_path = self.storage_path | |
else: | |
self.storage_local_path = default_results_dir | |
if isinstance( | |
self.storage_filesystem, pyarrow.fs.LocalFileSystem | |
): | |
self.storage_local_path = self.storage_path | |
else: | |
self.storage_local_path = default_results_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise this will set local path to a s3 uri in the case of
storage_path="s3://...", storage_filesystem=None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, wouldn't pyarrow resolve the filesystem to s3 filesystem anyway? this is checking self.storage_filesystem
after parsing, not storage_filesystem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[flagging as a major API change]
It's super late in the release process, and we cannot make breaking API changes like this that go against choices previously made in the REP.
Let's revert the behavior changes from this PR and we can discuss them separately.
Signed-off-by: Kai Fricke <kai@anyscale.com>
# Conflicts: # python/ray/tune/BUILD # python/ray/tune/tests/test_api.py # python/ray/tune/tests/test_cluster.py
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Ok, reverted the changes for now. It also looks like test_cluster was fixed in another PR, so this leaves only two test suites in this PR. |
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
python/ray/tune/BUILD
Outdated
@@ -59,7 +59,7 @@ py_test( | |||
size = "large", | |||
srcs = ["tests/test_api.py"], | |||
deps = [":tune_lib"], | |||
tags = ["team:ml", "exclusive", "rllib"], | |||
tags = ["team:ml", "exclusive", "rllib", "new_storage"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit/for future reference - we don't need this if we're enabling the FF (or is this test run in multiple suites?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are disabled in the old path but enabled in the new path. When the FF is enabled for everything (and not overwritten) we don't need it anymore.
Let's clean those up in one go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, just to confirm my understanding - these tests are run in other suites in addition to ":octopus: :brain: Tune tests and examples {using RLlib}"
?
|
||
register_trainable("f1", train_fn) | ||
with unittest.mock.patch.dict(os.environ, {"TEST_TMPDIR": logdir}): | ||
tune.run("f1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this test this supposed to test run_experiments
specifically? Or does it need fail_fast="raise"
in order for the test failure to be propagated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just testing the logdir location
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh woops, for some reason I was under the impression that Tune would swallow the error and tune.run
would finish "successfully" with an error file.
@@ -64,7 +67,7 @@ def testTuneRestore(self): | |||
name="TuneRestoreTest", | |||
stop={"training_iteration": 2}, # train one more iteration. | |||
checkpoint_config=CheckpointConfig(checkpoint_frequency=1), | |||
restore=self.checkpoint_path, # Restore the checkpoint | |||
restore=self.checkpoint_parent, # Restore the checkpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinvyu to confirm that this is the intended behavior - restore
takes in the directory instead of the file? Seems logical, but are there backwards compatibility issues with this?
@krfricke I'm going to merge in master to re-trigger the tests. (I took off the |
…ay-project#38895) Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: matthewdeng <matt@anyscale.com>
* [train] enable new persistence mode for core and serve tests (#38938) Signed-off-by: Matthew Deng <matt@anyscale.com> * [train] New persistence mode: Update 🐠 `ML Libraries w/ Ray Client Examples (Python 3.7)` (#38923) Signed-off-by: Justin Yu <justinvyu@anyscale.com> * [train] remove non-URI assertion (#38944) Signed-off-by: Matthew Deng <matt@anyscale.com> * [train] New persistence mode: Update 📖 `Doc tests and examples (excluding Ray AIR examples)` (#38940) Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Matthew Deng <matt@anyscale.com> Co-authored-by: Matthew Deng <matt@anyscale.com> * disable legacy sync config logic in trainable (#38952) Signed-off-by: Justin Yu <justinvyu@anyscale.com> * [2.7 CI][New Persistent Mode][6/n] 📖✈️ Ray AIR examples (#38918) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> * [2.7 CI][New Persistent Mode][2/n] 📺 📖 Doc GPU tests and examples (#38905) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> * [2.7 CI][New Persistent Mode][4/n] 📺 🚂 Train GPU tests & 🚂 Datasets Train Integration GPU Tests and Examples (#38910) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> * [2.7 CI][New Persistent Mode][1/n] 📺✈️ AIR GPU tests (ray/air) & ⚡ :python: Lightning 2.0 Train GPU tests (#38903) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> * [train] Fix broken tune tests and support ray storage (#38950) This PR re-introduces support for ray storage ray.init(storage="s3://...") and fixes a broken tune controller test. Signed-off-by: Justin Yu <justinvyu@anyscale.com> * [train] New persistence mode: Finish migrating `xgb`, `lgbm` and `sklearn` trainers, checkpoints + tests (#38959) Signed-off-by: Justin Yu <justinvyu@anyscale.com> * [2.7 CI][New Persistent Mode][5/n] 📖 Doc examples for external code (#38915) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> * [train][rllib] temporarily disable new persistence mode for rllib tests (#38965) Signed-off-by: Matthew Deng <matt@anyscale.com> * [2.7 CI][New Persistent Mode][8/n]✈️ AIR tests (ray/air) (#38932) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> * [tune] Storage: 🐙 🧠 Tune tests and examples {using RLlib} migration (#38895) Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: matthewdeng <matt@anyscale.com> * [train] Fix MosaicTrainer example and unit test (#38970) Signed-off-by: Justin Yu <justinvyu@anyscale.com> * [air/release] Fix dreambooth example image preprocessing logic (#39020) Signed-off-by: Justin Yu <justinvyu@anyscale.com> * [train] clean up ray.train._checkpoint imports (#38951) Signed-off-by: Matthew Deng <matt@anyscale.com> * [train] high level cleanup of Ray Train docs (#38971) Signed-off-by: Matthew Deng <matt@anyscale.com> * [wip][docs] update FrameworkPredictor examples (#38634) Signed-off-by: Matthew Deng <matt@anyscale.com> Signed-off-by: matthewdeng <matt@anyscale.com> * [train] Add documentation for using metadata argument to save preprocessors (#38701) * [Train] Restructure Ray Train Example Page (#38814) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> * [air] Deprecate some fields/classes that are supposed to be gone in 2.6. (#38794) Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> * [tune/storage] Fix Tune multinode tests (#39050) Fixes multinode tests by using the new train.report() API. Signed-off-by: Kai Fricke <kai@anyscale.com> * [tune] Fix BOHB example for new storage (#38983) The new storage path does not create "empty" checkpoints per default anymore. Previously, when no checkpoint is saved, PAUSEing a trial would create a dummy checkpoint that only contains trial metadata (such as the iteration number). This is not the case anymore. Examples now have to implement checkpointing to properly restore previous state. This was also true previously - but some of our simple examples (e.g. the one in this PR) didn't implement it and still "worked". I think it's fine to keep the functionality as is and require our examples to show checkpointing implementations. This will ensure that users don't shoot their feet trying to use e.g. BOHB. Separately, BOHB was malfunctioning as trials were repeatedly PAUSED and restarted as they've never been removed from `bracket.trials_to_unpause`. @justinvyu mentioned this in the review where it was introduced and I believed at the time it wasn't necessary - turns out it is, as we can end up in a situation where a bracket is never finished because trials are constantly running. This was not caught by any tests. We should add one in a follow-up - for now we can proceed with this PR to pick onto Ray 2.7. Signed-off-by: Kai Fricke <kai@anyscale.com> * [Release Test] Fix `long_running_horovod_tune_test`. (#39012) Signed-off-by: Yunxuan Xiao <yunxuanx@anyscale.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> * [train] New persistence mode: `StorageContext` unit tests (#39023) Signed-off-by: Justin Yu <justinvyu@anyscale.com> * [train] enable train + tune tests and examples (#39021) Signed-off-by: Matthew Deng <matt@anyscale.com> * [rllib] Fix storage-path related tests (#38947) This PR fixes rllib-related tests that didn't pass changes related to the new storage context. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: matthewdeng <matt@anyscale.com> Co-authored-by: matthewdeng <matt@anyscale.com> * [train] New persistence mode: Migrate 🐙 `Tune tests and examples (medium)` (#39081) Signed-off-by: Justin Yu <justinvyu@anyscale.com> --------- Signed-off-by: Matthew Deng <matt@anyscale.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: matthewdeng <matt@anyscale.com> Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Signed-off-by: Yunxuan Xiao <yunxuanx@anyscale.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: Yunxuan Xiao <yunxuanx@anyscale.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
…ay-project#38895) Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: matthewdeng <matt@anyscale.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
…ay-project#38895) Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: matthewdeng <matt@anyscale.com>
…ay-project#38895) Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: matthewdeng <matt@anyscale.com> Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
…ay-project#38895) Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: matthewdeng <matt@anyscale.com> Signed-off-by: Victor <vctr.y.m@example.com>
Why are these changes needed?
Migrates failing tests of this suite: https://buildkite.com/ray-project/oss-ci-build-pr/builds/33691#018a2b92-326f-4336-9c75-4f358721c816
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.