[train/docs] Extend resource guide (training backend + choosing resources) #39202
Conversation
Signed-off-by: Kai Fricke <kai@anyscale.com>
.. _train-resource-guide:

How many nodes, workers, and resources should I use?
This is good information, but it feels really scattered right now, and as a user it would be very hard to figure out what to concretely do. It's also not clear to me whether this is intended more for people who are starting distributed training, or for people who already have a job and are facing concrete bottlenecks (maybe we want to cater to both categories).
Sorry if this comment isn't super concrete; maybe we can separate this from the PR for now and think more about how to architect this information.
Signed-off-by: Kai Fricke <kai@anyscale.com>
…oc/train/resource-guide
The :class:`Trainer <ray.train.trainer.BaseTrainer>` object you instantiate in the
training script contains the settings to run your training. When you call
:meth:`Trainer.fit() <ray.train.trainer.BaseTrainer.fit>`, it is scheduled
as a :ref:`Ray Actor <actor-key-concept>`, and this actor can itself consume resources.
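To make the quoted paragraph concrete, here is a minimal, hypothetical sketch of such a Trainer resource configuration (this is not the snippet from the PR; names and values are illustrative Ray Train 2.x usage):

```python
# Minimal sketch, not from this PR: a TorchTrainer whose ScalingConfig
# determines the resources each training worker requests. Calling
# trainer.fit() schedules the Trainer itself as a Ray actor.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # Placeholder training loop; a real script would train a model here.
    pass


trainer = TorchTrainer(
    train_loop_per_worker,
    # Two CPU training workers; set use_gpu=True to request GPUs instead.
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```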
Not able to leave the comment on the exact lines since they were not edited in this PR, but I feel like the code snippet should just be moved to an example in the API reference. 1. I feel like we're missing an example in the API reference. 2. I don't think we should include more advanced configs like placement_strategy in the introduction of this guide.
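For readers unfamiliar with the option the comment refers to, here is a hedged sketch of what placement_strategy looks like on ScalingConfig (values are illustrative):

```python
# Illustrative only: placement_strategy is an advanced ScalingConfig
# option that controls how worker resource bundles are placed on nodes.
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=4,
    # "SPREAD" asks Ray to place the 4 workers on different nodes where
    # possible; the default, "PACK", co-locates workers instead.
    placement_strategy="SPREAD",
)
```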
Signed-off-by: Kai Fricke <kai@anyscale.com>
Scaling/GPUs guide looks good after the suggested changes. Will let @justinvyu take another pass at the other pages!
…e-guide
# Conflicts:
#	doc/source/train/api/api.rst
#	python/ray/air/config.py
…rces) (ray-project#39202)
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
…39468)
* [train] Fix issues in migration of tune_cifar_torch_pbt_example (#39158)
  Resolves three issues that come up when migrating the `tune_cifar_torch_pbt_example` from Ray 2.6 to Ray 2.7:
  1. There is a warning message because PBT uses the `_schedule_trial_save` interface. This is added to the whitelisted attributes so it doesn't come up anymore.
  2. PBT malfunctions in Ray 2.7, so instead of silently failing, we raise an error and ask users to migrate.
  3. When users use old `ray.air.Checkpoint` APIs on `ray.train.Checkpoint`, we should raise an actionable error message.
  Signed-off-by: Kai Fricke <kai@anyscale.com>
* [tune] Make Trainable.save/restore developer APIs (#39391)
  Signed-off-by: Kai Fricke <kai@anyscale.com>
* [Telemetry] Add Telemetry for Ray Train Utilities (#39363)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [train] update Train API references & annotations (#39294)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [2.7] Cleanup all LightningTrainer Mentions in Ray Doc (#39406)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [train] remove _max_cpu_fraction_per_node (#39412)
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [train] Legacy interface cleanup (`air.Checkpoint`, `LegacyExperimentAnalysis`) (#39289)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
  Co-authored-by: matthewdeng <matt@anyscale.com>
* [Train][Telemetry] Limit the usage of `ray.train.torch.get_device`. (#39432)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
* [train-ci] Fix Train examples with authentication buildkite commands. (#39387)
  Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
* [train][doc] Remove preprocessor reference in tune+train user guide (#39442)
  Signed-off-by: Justin Yu <justinvyu@anyscale.com>
* [train/docs] Extend resource guide (training backend + choosing resources) (#39202)
  Signed-off-by: Kai Fricke <kai@anyscale.com>
  Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* fix docs
  Signed-off-by: Matthew Deng <matt@anyscale.com>
* [Minor] Remove remaining LightningTrainer Mentions (#39441)
  Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
---------
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
…rces) (ray-project#39202)
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
…rces) (ray-project#39202)
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Why are these changes needed?
This docs update adds a comprehensive guide to choosing resources for distributed training. It also touches on setting the distributed communication backend in torch and configuring persistent storage via an environment variable, and it expands on checkpoint restoration. The PR also fixes a few references within the existing checkpointing docs.
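As a hedged illustration of the communication-backend piece mentioned above (values are illustrative; the guide added in this PR is the authoritative reference):

```python
# Minimal sketch of selecting the torch distributed communication
# backend via TorchConfig. "gloo" works for CPU-only training; "nccl"
# is the usual choice for GPU training.
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer


def train_loop_per_worker():
    # Placeholder training loop for the sketch.
    pass


trainer = TorchTrainer(
    train_loop_per_worker,
    torch_config=TorchConfig(backend="gloo"),
    scaling_config=ScalingConfig(num_workers=2),
)
```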
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.