[train] enable train + tune tests and examples #39021

matthewdeng · 2023-08-28T20:10:44Z

Why are these changes needed?

- Combines the `:steam_locomotive: :octopus: Train + Tune tests and examples` and `:steam_locomotive: :octopus: :floppy_disk: New persistence mode: Train + Tune tests and examples` suites back into a single suite that always uses the new persistence path.
- Enables the tests.
- Fixes the test to use `RAY_AIR_LOCAL_CACHE_DIR` instead of `RunConfig.local_dir`.
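To illustrate the last point, here is a minimal, stdlib-only sketch of how a test could point `RAY_AIR_LOCAL_CACHE_DIR` at a temporary directory instead of passing a local directory through `RunConfig`. The helper name `local_cache_dir` is hypothetical and not part of Ray, and the sketch assumes Ray reads the environment variable when the run is set up; the actual test changed in this PR may do this differently.

```python
import os
import tempfile
from contextlib import contextmanager

# Environment variable named in the PR description.
ENV_VAR = "RAY_AIR_LOCAL_CACHE_DIR"


@contextmanager
def local_cache_dir(path):
    """Temporarily set RAY_AIR_LOCAL_CACHE_DIR (hypothetical test helper)."""
    previous = os.environ.get(ENV_VAR)
    os.environ[ENV_VAR] = str(path)
    try:
        yield str(path)
    finally:
        # Restore the previous value so other tests are unaffected.
        if previous is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = previous


def demo():
    before = os.environ.get(ENV_VAR)
    with tempfile.TemporaryDirectory() as tmp:
        with local_cache_dir(tmp) as cache_dir:
            # A real test would construct and run its Trainer or Tuner here
            # (assumption: Ray picks up the variable at this point).
            inside = os.environ[ENV_VAR]
    after = os.environ.get(ENV_VAR)
    return before, inside == cache_dir, after
```

Scoping the override to a context manager keeps the setting from leaking into later tests in the same process.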

Related issue number

Checks

Successful run here.

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Matthew Deng <matt@anyscale.com>

justinvyu

Thanks! lgtm

* [train] enable new persistence mode for core and serve tests (#38938)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [train] New persistence mode: Update 🐠 `ML Libraries w/ Ray Client Examples (Python 3.7)` (#38923)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [train] remove non-URI assertion (#38944)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [train] New persistence mode: Update 📖 `Doc tests and examples (excluding Ray AIR examples)` (#38940)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
  Signed-off-by: Matthew Deng <matt@anyscale.com>
  Co-authored-by: Matthew Deng <matt@anyscale.com>
* disable legacy sync config logic in trainable (#38952)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [2.7 CI][New Persistent Mode][6/n] 📖 ✈️ Ray AIR examples (#38918)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [2.7 CI][New Persistent Mode][2/n] 📺 📖 Doc GPU tests and examples (#38905)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [2.7 CI][New Persistent Mode][4/n] 📺 🚂 Train GPU tests & 🚂 Datasets Train Integration GPU Tests and Examples (#38910)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
  Co-authored-by: Justin Yu <justinvyu@anyscale.com>
* [2.7 CI][New Persistent Mode][1/n] 📺 ✈️ AIR GPU tests (ray/air) & ⚡ :python: Lightning 2.0 Train GPU tests (#38903)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
  Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
* [train] Fix broken tune tests and support ray storage (#38950)
  This PR re-introduces support for ray storage ray.init(storage="s3://...") and fixes a broken tune controller test.
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [train] New persistence mode: Finish migrating `xgb`, `lgbm` and `sklearn` trainers, checkpoints + tests (#38959)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [2.7 CI][New Persistent Mode][5/n] 📖 Doc examples for external code (#38915)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [train][rllib] temporarily disable new persistence mode for rllib tests (#38965)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [2.7 CI][New Persistent Mode][8/n] ✈️ AIR tests (ray/air) (#38932)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [tune] Storage: 🐙 🧠 Tune tests and examples {using RLlib} migration (#38895)
  Signed-off-by: Kai Fricke <kai@anyscale.com>
  Co-authored-by: matthewdeng <matt@anyscale.com>
* [train] Fix MosaicTrainer example and unit test (#38970)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [air/release] Fix dreambooth example image preprocessing logic (#39020)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [train] clean up ray.train._checkpoint imports (#38951)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [train] high level cleanup of Ray Train docs (#38971)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [wip][docs] update FrameworkPredictor examples (#38634)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
  Signed-off-by: matthewdeng <matt@anyscale.com>
* [train] Add documentation for using metadata argument to save preprocessors (#38701)
* [Train] Restructure Ray Train Example Page (#38814)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [air] Deprecate some fields/classes that are supposed to be gone in 2.6. (#38794)
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* [tune/storage] Fix Tune multinode tests (#39050)
  Fixes multinode tests by using the new train.report() API.
  Signed-off-by: Kai Fricke <kai@anyscale.com>
* [tune] Fix BOHB example for new storage (#38983)
  The new storage path does not create "empty" checkpoints per default anymore.
  Previously, when no checkpoint is saved, PAUSEing a trial would create a dummy checkpoint that only contains trial metadata (such as the iteration number). This is not the case anymore. Examples now have to implement checkpointing to properly restore previous state. This was also true previously - but some of our simple examples (e.g. the one in this PR) didn't implement it and still "worked". I think it's fine to keep the functionality as is and require our examples to show checkpointing implementations. This will ensure that users don't shoot their feet trying to use e.g. BOHB.
  Separately, BOHB was malfunctioning as trials were repeatedly PAUSED and restarted as they've never been removed from `bracket.trials_to_unpause`. @justinvyu mentioned this in the review where it was introduced and I believed at the time it wasn't necessary - turns out it is, as we can end up in a situation where a bracket is never finished because trials are constantly running. This was not caught by any tests. We should add one in a follow-up - for now we can proceed with this PR to pick onto Ray 2.7.
  Signed-off-by: Kai Fricke <kai@anyscale.com>
* [Release Test] Fix `long_running_horovod_tune_test`. (#39012)
  Signed-off-by: Yunxuan Xiao <yunxuanx@anyscale.com>
  Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
* [train] New persistence mode: `StorageContext` unit tests (#39023)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [train] enable train + tune tests and examples (#39021)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [rllib] Fix storage-path related tests (#38947)
  This PR fixes rllib-related tests that didn't pass changes related to the new storage context.
  Signed-off-by: Kai Fricke <kai@anyscale.com>
  Signed-off-by: matthewdeng <matt@anyscale.com>
  Co-authored-by: matthewdeng <matt@anyscale.com>
* [train] New persistence mode: Migrate 🐙 `Tune tests and examples (medium)` (#39081)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>

---------

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: matthewdeng <matt@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
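
The BOHB note in the commit message above says that, without dummy checkpoints, an example has to save and restore its own state to survive a pause. The following is a toy, stdlib-only sketch of that pattern and does not use Ray: `run_trial`, `state.json`, and the `pause_after` parameter are all hypothetical names chosen for illustration. In a real function trainable the load and save steps would presumably map to `train.get_checkpoint()` and `train.report(..., checkpoint=...)`.

```python
import json
import os
import tempfile


def run_trial(checkpoint_dir, total_steps, pause_after=None):
    """Run or resume a toy trial, writing its state to checkpoint_dir every step."""
    state_path = os.path.join(checkpoint_dir, "state.json")
    step, value = 0, 0

    # Restore previous progress if a checkpoint exists; otherwise start fresh.
    if os.path.exists(state_path):
        with open(state_path) as f:
            saved = json.load(f)
        step, value = saved["step"], saved["value"]

    steps_run = 0
    while step < total_steps:
        value += step  # stand-in for one training iteration
        step += 1
        steps_run += 1
        # Persist everything needed to continue from this point.
        with open(state_path, "w") as f:
            json.dump({"step": step, "value": value}, f)
        # Simulate the scheduler pausing the trial.
        if pause_after is not None and steps_run >= pause_after:
            break
    return step, value


def demo():
    with tempfile.TemporaryDirectory() as ckpt:
        paused = run_trial(ckpt, total_steps=5, pause_after=2)
        resumed = run_trial(ckpt, total_steps=5)
    return paused, resumed
```

Calling `demo()` pauses the trial after two steps and then resumes it from the saved state, so the second call continues at step 2 rather than starting over.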

matthewdeng added 4 commits August 28, 2023 13:10

[train] enable train + tune tests and examples

a8a2229

Signed-off-by: Matthew Deng <matt@anyscale.com>

format

04bd93c

Signed-off-by: Matthew Deng <matt@anyscale.com>

newline to trigger tests

e5f578e

Signed-off-by: Matthew Deng <matt@anyscale.com>

use RAY_AIR_LOCAL_CACHE_DIR

7cd783b

Signed-off-by: Matthew Deng <matt@anyscale.com>

matthewdeng assigned justinvyu and krfricke Aug 29, 2023

matthewdeng added the v2.7.0-pick and tests-ok ("The tagger certifies test failures are unrelated and assumes personal liability.") labels Aug 29, 2023

matthewdeng marked this pull request as ready for review August 29, 2023 02:32

justinvyu approved these changes Aug 29, 2023

matthewdeng merged commit 3e5c61a into ray-project:master Aug 29, 2023

matthewdeng added a commit to matthewdeng/ray that referenced this pull request Aug 30, 2023

[train] enable train + tune tests and examples (ray-project#39021)

840986d

Signed-off-by: Matthew Deng <matt@anyscale.com>

arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023

[train] enable train + tune tests and examples (ray-project#39021)

9980670

Signed-off-by: Matthew Deng <matt@anyscale.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>

LeonLuttenberger pushed a commit to jaidisido/ray that referenced this pull request Sep 5, 2023

[train] enable train + tune tests and examples (ray-project#39021)

702e4e5

Signed-off-by: Matthew Deng <matt@anyscale.com>

jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023

[train] enable train + tune tests and examples (ray-project#39021)

4bd62dc

Signed-off-by: Matthew Deng <matt@anyscale.com> Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>

vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023

[train] enable train + tune tests and examples (ray-project#39021)

65b289b

Signed-off-by: Matthew Deng <matt@anyscale.com> Signed-off-by: Victor <vctr.y.m@example.com>
