[train][doc] Configuring persistent storage user guide #39428
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: matthewdeng <matthew.j.deng@gmail.com>
If you save checkpoints with :meth:`ray.train.report(..., checkpoint=...) <ray.train.report>`
and run on a multi-node cluster, Ray Train will raise an error if NFS or cloud storage is not set up.
This is because Ray Train expects all workers to be able to write the checkpoint to
the same persistent storage location.
I like the idea that "Ray Train expects all workers to be able to write the checkpoint to the same persistent storage location". That should be the core concept of persistent storage. What about moving it to the very top of this user guide?
But logically, using the head node as persistent storage also counts as "the same persistent storage location". Is the reason we no longer support it the network communication bottleneck?
Yeah, the deprecation of the head node as persistent storage is described in the issue.
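To make the requirement discussed in this thread concrete, here is a minimal standard-library simulation of several workers writing their checkpoint shards to one shared location. It is only a sketch and does not use Ray: the `save_shard` helper, the `checkpoint_demo` directory name, and the use of threads in place of cluster workers are all assumptions made for illustration, and a temporary directory stands in for NFS or cloud storage.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def save_shard(storage_path: Path, rank: int) -> Path:
    """Hypothetical helper: write one worker's shard under the shared path."""
    shard = storage_path / "checkpoint_demo" / f"rank_{rank}.txt"
    # Every worker may try to create the directory; exist_ok avoids a race error.
    shard.parent.mkdir(parents=True, exist_ok=True)
    shard.write_text(f"state from worker {rank}")
    return shard


# A temporary directory stands in for the shared persistent storage location.
with tempfile.TemporaryDirectory() as tmp:
    storage_path = Path(tmp)
    # Threads stand in for training workers on different nodes.
    with ThreadPoolExecutor(max_workers=4) as pool:
        written = list(pool.map(lambda r: save_shard(storage_path, r), range(4)))
    shard_names = sorted(p.name for p in written)

print(shard_names)
```

The point of the sketch is that every worker resolves the same `storage_path`; on a real multi-node cluster a node-local directory would not satisfy that, which is why NFS or cloud storage is needed.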
When providing a custom filesystem, the associated ``storage_path`` is expected
to be a qualified filesystem path *without the protocol prefix*.
Did we check this and raise an error in the code?
yep, pyarrow will raise an error directly
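As a small illustration of what "without the protocol prefix" means, the sketch below strips the scheme from a storage URI using only the standard library. The `strip_protocol` helper is hypothetical and is not part of Ray or pyarrow; it only shows the shape of the path that the quoted doc text says should accompany a custom filesystem.

```python
from urllib.parse import urlsplit


def strip_protocol(uri: str) -> str:
    """Hypothetical helper: drop the scheme prefix (e.g. "s3://") from a URI."""
    parts = urlsplit(uri)
    if not parts.scheme:
        # Already a bare path, such as a path on an NFS mount.
        return uri
    # For "s3://bucket-name/sub-path": netloc is the bucket, path is "/sub-path".
    return parts.netloc + parts.path


bare_path = strip_protocol("s3://bucket-name/sub-path")
print(bare_path)
```

Under that assumption, the bare value (here `bucket-name/sub-path`) is what would be supplied as ``storage_path`` alongside the custom filesystem object, rather than the full `s3://` URI.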
network device, such as NFS.

.. code-block:: text

    s3://bucket-name/sub-path (RunConfig.storage_path)
This is so nice!
…persistence Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…9428) Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com>
…39515)

- [docs] update doc links in Use Cases (#39445). Signed-off-by: Matthew Deng <matt@anyscale.com>
- [train][release] Fix `tune_worker_fault_tolerance` release test node killing (#39233). Signed-off-by: Justin Yu <justinvyu@anyscale.com>
- [train] Do not raise warning when no results were reported (#39454). Signed-off-by: Kai Fricke <kai@anyscale.com>
- [train] add diagram in overview page (#39512). Signed-off-by: Matthew Deng <matt@anyscale.com>
- [train] New persistence mode: Sanity-check release test (#39354). Signed-off-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: matthewdeng <matt@anyscale.com>
- [train][doc] Configuring persistent storage user guide (#39428). Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com>
- [train][doc] New checkpointing user guide (#39505). Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Signed-off-by: Matthew Deng <matt@anyscale.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Why are these changes needed?
This PR updates the persistent storage user guide to reflect the improvements in 2.7.
Related issue number
Checks
- I've signed off every commit (``git commit -s``) in this PR.
- I've run ``scripts/format.sh`` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in ``doc/source/tune/api/`` under the corresponding ``.rst`` file.