Add a tutorial for ProcessGroup extensions #1798
Conversation
Customize Process Group Backends Using Cpp Extensions
=====================================================

**Author**: `Feng Tian <https://github.com/ftian1>`__, `Shen Li <https://mrshenli.github.io/>`__
Hey @ftian1, do you mind if I add you as the first author of this tutorial? BTW, thanks a lot for contributing this feature!
no problem :) It's my pleasure to be here
✔️ Deploy Preview for pytorch-tutorials-preview ready!
🔨 Explore the source changes: be4f14f
🔍 Inspect the deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/61f97d8ed1647e00088a89a1
😎 Browse the preview: https://deploy-preview-1798--pytorch-tutorials-preview.netlify.app
Force-pushed from f5ae99d to 689a59e (Compare)
communication algorithms (e.g.,
`Herring <https://www.amazon.science/publications/herring-rethinking-the-parameter-server-at-scale-for-the-cloud>`__,
`Reduction Server <https://cloud.google.com/blog/topics/developers-practitioners/optimize-training-performance-reduction-server-vertex-ai>`__).
Therefore, the distributed package exposed extension APIs to allow customizing
exposed -> exposes?
future_(std::move(future)) {}
bool isCompleted() override;
bool isSuccess() const override;
bool wait(std::chrono::milliseconds timeout = kUnsetTimeout) override;
Is it fine for tutorial purposes for these to not be implemented anywhere in the tutorial?
Good point, let me remove those and add a comment to mention that the full implementation is in the repo.
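For context, these Work methods are what a Python caller ends up hitting once a collective is issued with async_op=True. A minimal sketch of that path, assuming the dummy backend from this tutorial is built and importable as dummy_collectives; the single-process rendezvous values below are placeholders, not part of the diff:

import os
import torch
import torch.distributed as dist
import dummy_collectives  # assumed built from this tutorial's example repo

# Illustrative single-process rendezvous; the address/port values are placeholders.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group("dummy", rank=0, world_size=1)

x = torch.ones(2, 2)
work = dist.all_reduce(x, async_op=True)  # returns the Work handle backed by the C++ subclass
work.wait()                 # maps to the wait() override declared in the hunk above
print(work.is_completed())  # maps to the isCompleted() override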
**Author**: `Feng Tian <https://github.com/ftian1>`__, `Shen Li <https://mrshenli.github.io/>`__

Prerequisites:
Should we also add "cpp extensions" as a prerequisite?
ah, yes, let me add that
import os

import torch
import dummy_collectives
Add a comment specifying that this is what imports the "dummy collectives" cpp extension and makes the "dummy" backend available? I missed it at first and was wondering how the "dummy" name gets recognized, but it seems to be through this.
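For reference, a quick way to see that side effect, assuming the extension is importable as dummy_collectives and its module init performs the registration shown in the pybind11 snippet further down:

import torch
import torch.distributed as dist

# Before the import, "dummy" is not a recognized backend name;
# dist.Backend("dummy") would raise ValueError here.

import dummy_collectives  # runs the extension's module init, which calls
                          # Backend.register_backend("dummy", ...)

print(dist.Backend("dummy"))  # now resolves to "dummy"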
LGTM, thanks for putting this together! A couple minor suggestions/comments
py::object module = py::module::import("torch.distributed");
py::object register_backend =
    module.attr("Backend").attr("register_backend");
register_backend("dummy", py::cpp_function(createProcessGroupDummy));
Mention that this calls torch.distributed.Backend.register_backend, which is how torch.distributed will recognize it as a valid backend?
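As a rough Python-side illustration of what that pybind11 call boils down to; the constructor name and signature below merely mirror the tutorial's createProcessGroupDummy and are placeholders:

import torch.distributed as dist

# Placeholder standing in for the C++ createProcessGroupDummy factory
# that the extension exposes through py::cpp_function.
def create_process_group_dummy(store, rank, size, timeout):
    ...

# Same effect as register_backend("dummy", py::cpp_function(createProcessGroupDummy))
dist.Backend.register_backend("dummy", create_process_group_dummy)

print(dist.Backend("dummy"))  # "dummy" is now accepted as a backend name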
(e.g., `TPU <https://cloud.google.com/tpu>`__,
`Trainium <https://aws.amazon.com/machine-learning/trainium/>`__), and emerging
communication algorithms (e.g.,
`Herring <https://www.amazon.science/publications/herring-rethinking-the-parameter-server-at-scale-for-the-cloud>`__,
For Herring and Reduction Server, is the best way to implement them in PyTorch through a custom cpp extension, or do we want to encourage users to build on top of the existing torch.distributed collectives, which should be able to enable these algorithms on top of NCCL or Gloo?
The most efficient way would be doing it directly in the communication layer, i.e., through a c10d extension. This is also how Fairring (not sure about Herring, as it doesn't seem to be open source) and Reduction Server are implemented today (Fairring uses a c10d extension, and Reduction Server is an NCCL plugin). Since the goal of those algorithms is to bump up comm efficiency, I would assume future users would follow similar paths, unless the algorithm is powerful enough to shine even with an inefficient implementation.
training features, including
`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__,
`ZeroRedundancyOptimizer <https://pytorch.org/docs/stable/distributed.optim.html#torch.distributed.optim.ZeroRedundancyOptimizer>`__,
`FullyShardedDataParallel <https://github.com/pytorch/pytorch/blob/master/torch/distributed/_fsdp/fully_sharded_data_parallel.py>`__,.
the ",." at the end of the link -> "."
Force-pushed from 689a59e to 17f05f7 (Compare)
Hey @brianjo, the content for this tutorial is ready to be merged. The failure on "pytorch_tutorial_pr_build_manager" looks unrelated? If so, shall we merge? Thanks!
It needs to pass tests or it will break the build. I'll take a look today. Thanks!
No description provided.