[train/docs] Extend resource guide (training backend + choosing resources) #39202
Conversation
Signed-off-by: Kai Fricke <kai@anyscale.com>
.. _train-resource-guide:

How many nodes, workers, and resources should I use?
This is good information, but it feels really scattered right now, and as a user it would be very hard to figure out what to concretely do. It's also not clear to me whether this is intended more for people who are starting distributed training, or for people who already have a job and are facing concrete bottlenecks (maybe we want to cater to both categories).
Sorry if this comment isn't super concrete; maybe we can separate this from the PR for now and think more about how to architect this information.
Signed-off-by: Kai Fricke <kai@anyscale.com>
…oc/train/resource-guide
The :class:`Trainer <ray.train.trainer.BaseTrainer>` object you instantiate in the
training script contains the settings to run your training. When you call
:meth:`Trainer.fit() <ray.train.trainer.BaseTrainer.fit>`, it is scheduled
as a :ref:`Ray Actor <actor-key-concept>`, and this actor can itself consume resources.
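To make the quoted paragraph concrete, here is a minimal, hypothetical sketch of such a Trainer resource configuration (this is not the snippet from the PR; names and values are illustrative Ray Train 2.x usage):

```python
# Minimal sketch, not from this PR: a TorchTrainer whose ScalingConfig
# determines the resources each training worker requests. Calling
# trainer.fit() schedules the Trainer itself as a Ray actor.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # Placeholder training loop; a real script would train a model here.
    pass


trainer = TorchTrainer(
    train_loop_per_worker,
    # Two CPU training workers; set use_gpu=True to request GPUs instead.
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```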
Not able to leave the comment on the exact lines since they were not edited in this PR, but I feel like the code snippet should just be moved to an example in the API reference. 1. I feel like we're missing an example in the API reference. 2. I don't think we should include more advanced configs like placement_strategy in the introduction of this guide.
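For readers unfamiliar with the option the comment refers to, here is a hedged sketch of what placement_strategy looks like on ScalingConfig (values are illustrative):

```python
# Illustrative only: placement_strategy is an advanced ScalingConfig
# option that controls how worker resource bundles are placed on nodes.
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=4,
    # "SPREAD" asks Ray to place the 4 workers on different nodes where
    # possible; the default, "PACK", co-locates workers instead.
    placement_strategy="SPREAD",
)
```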
Signed-off-by: Kai Fricke <kai@anyscale.com>
Scaling/GPUs guide looks good after the suggested changes. Will let @justinvyu take another pass at the other pages!
…e-guide
# Conflicts:
#	doc/source/train/api/api.rst
#	python/ray/air/config.py
…rces) (ray-project#39202)
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
…39468)
* [train] Fix issues in migration of tune_cifar_torch_pbt_example (#39158)
  Resolves three issues that come up when migrating the `tune_cifar_torch_pbt_example` from Ray 2.6 to Ray 2.7:
  1. There is a warning message because PBT uses the `_schedule_trial_save` interface. This is added to the whitelisted attributes so it doesn't come up anymore.
  2. PBT malfunctions in Ray 2.7, so instead of silently failing, we raise an error and ask users to migrate.
  3. When users use old `ray.air.Checkpoint` APIs on `ray.train.Checkpoint`, we should raise an actionable error message.
  Signed-off-by: Kai Fricke <kai@anyscale.com>
* [tune] Make Trainable.save/restore developer APIs (#39391)
  Signed-off-by: Kai Fricke <kai@anyscale.com>
* [Telemetry] Add Telemetry for Ray Train Utilities (#39363)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [train] update Train API references & annotations (#39294)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [2.7] Cleanup all LightningTrainer Mentions in Ray Doc (#39406)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [train] remove _max_cpu_fraction_per_node (#39412)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [train] Legacy interface cleanup (`air.Checkpoint`, `LegacyExperimentAnalysis`) (#39289)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
  Co-authored-by: matthewdeng <matt@anyscale.com>
* [Train][Telemetry] Limit the usage of `ray.train.torch.get_device`. (#39432)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [train-ci] Fix Train examples with authentication buildkite commands. (#39387)
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* [train][doc] Remove preprocessor reference in tune+train user guide (#39442)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [train/docs] Extend resource guide (training backend + choosing resources) (#39202)
  Signed-off-by: Kai Fricke <kai@anyscale.com>
  Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* fix docs
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [Minor] Remove remaining LightningTrainer Mentions (#39441)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
---------
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
…rces) (ray-project#39202)
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
…rces) (ray-project#39202)
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Why are these changes needed?
This docs update adds a comprehensive guide to choosing resources for distributed training. It also touches on setting the distributed communication backend in torch and configuring persistent storage via an environment variable, and it expands on checkpoint restoration. The PR also fixes a few references within the existing checkpointing docs.
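As a hedged illustration of the communication-backend piece mentioned above (values are illustrative; the guide added in this PR is the authoritative reference):

```python
# Minimal sketch of selecting the torch distributed communication
# backend via TorchConfig. "gloo" works for CPU-only training; "nccl"
# is the usual choice for GPU training.
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer


def train_loop_per_worker():
    # Placeholder training loop for the sketch.
    pass


trainer = TorchTrainer(
    train_loop_per_worker,
    torch_config=TorchConfig(backend="gloo"),
    scaling_config=ScalingConfig(num_workers=2),
)
```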
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.