Add a tutorial for ProcessGroup extensions #1798
Conversation
Customize Process Group Backends Using Cpp Extensions
=====================================================

**Author**: `Feng Tian <https://github.com/ftian1>`__, `Shen Li <https://mrshenli.github.io/>`__
Hey @ftian1, do you mind if I add you as the first author of this tutorial? BTW, thanks a lot for contributing this feature!
no problem :) It's my pleasure to be here
✔️ Deploy Preview for pytorch-tutorials-preview ready!
🔨 Explore the source changes: be4f14f
🔍 Inspect the deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/61f97d8ed1647e00088a89a1
😎 Browse the preview: https://deploy-preview-1798--pytorch-tutorials-preview.netlify.app
Force-pushed from f5ae99d to 689a59e (Compare)
communication algorithms (e.g.,
`Herring <https://www.amazon.science/publications/herring-rethinking-the-parameter-server-at-scale-for-the-cloud>`__,
`Reduction Server <https://cloud.google.com/blog/topics/developers-practitioners/optimize-training-performance-reduction-server-vertex-ai>`__).
Therefore, the distributed package exposed extension APIs to allow customizing
exposed -> exposes?
future_(std::move(future)) {}
bool isCompleted() override;
bool isSuccess() const override;
bool wait(std::chrono::milliseconds timeout = kUnsetTimeout) override;
Is it fine for tutorial purposes for these to not be implemented anywhere in the tutorial?
Good point, let me remove those and add a comment to mention that the full implementation is in the repo.
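For context, these Work methods are what a Python caller ends up hitting once a collective is issued with async_op=True. A minimal sketch of that path, assuming the dummy backend from this tutorial is built and importable as dummy_collectives; the single-process rendezvous values below are placeholders, not part of the diff:

import os
import torch
import torch.distributed as dist
import dummy_collectives  # assumed built from this tutorial's example repo

# Illustrative single-process rendezvous; the address/port values are placeholders.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group("dummy", rank=0, world_size=1)

x = torch.ones(2, 2)
work = dist.all_reduce(x, async_op=True)  # returns the Work handle backed by the C++ subclass
work.wait()                 # maps to the wait() override declared in the hunk above
print(work.is_completed())  # maps to the isCompleted() override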
**Author**: `Feng Tian <https://github.com/ftian1>`__, `Shen Li <https://mrshenli.github.io/>`__

Prerequisites:
Should we also add "cpp extensions" as a prerequisite?
ah, yes, let me add that
import os

import torch
import dummy_collectives
Add a comment specifying that this is what imports the "dummy collectives" cpp extension and makes the "dummy" backend available? I missed it at first and was wondering how the "dummy" name gets recognized, but it seems to be through this.
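For reference, a quick way to see that side effect, assuming the extension is importable as dummy_collectives and its module init performs the registration shown in the pybind11 snippet further down:

import torch
import torch.distributed as dist

# Before the import, "dummy" is not a recognized backend name;
# dist.Backend("dummy") would raise ValueError here.

import dummy_collectives  # runs the extension's module init, which calls
                          # Backend.register_backend("dummy", ...)

print(dist.Backend("dummy"))  # now resolves to "dummy"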
LGTM, thanks for putting this together! A couple minor suggestions/comments
py::object module = py::module::import("torch.distributed");
py::object register_backend =
    module.attr("Backend").attr("register_backend");
register_backend("dummy", py::cpp_function(createProcessGroupDummy));
Mention that this calls torch.distributed.Backend.register_backend, which is how torch.distributed will recognize it as a valid backend?
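As a rough Python-side illustration of what that pybind11 call boils down to; the constructor name and signature below merely mirror the tutorial's createProcessGroupDummy and are placeholders:

import torch.distributed as dist

# Placeholder standing in for the C++ createProcessGroupDummy factory
# that the extension exposes through py::cpp_function.
def create_process_group_dummy(store, rank, size, timeout):
    ...

# Same effect as register_backend("dummy", py::cpp_function(createProcessGroupDummy))
dist.Backend.register_backend("dummy", create_process_group_dummy)

print(dist.Backend("dummy"))  # "dummy" is now accepted as a backend name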
(e.g., `TPU <https://cloud.google.com/tpu>`__,
`Trainium <https://aws.amazon.com/machine-learning/trainium/>`__), and emerging
communication algorithms (e.g.,
`Herring <https://www.amazon.science/publications/herring-rethinking-the-parameter-server-at-scale-for-the-cloud>`__,
For Herring and Reduction Server, is the best way to implement them in PyTorch through a custom cpp extension, or do we want to encourage users to build on top of the existing torch.distributed collectives, which should be able to enable these algorithms on top of NCCL or Gloo?
The most efficient way would be doing it directly in the communication layer, i.e., through a c10d extension. This is also how Fairring (not sure about Herring, as it doesn't seem to be open source) and Reduction Server are implemented today (Fairring uses a c10d extension, and Reduction Server is an NCCL plugin). Since the goal of those algorithms is to bump up comm efficiency, I would assume future users would follow similar paths, unless the algorithm is powerful enough to shine even with an inefficient implementation.
training features, including
`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__,
`ZeroRedundancyOptimizer <https://pytorch.org/docs/stable/distributed.optim.html#torch.distributed.optim.ZeroRedundancyOptimizer>`__,
`FullyShardedDataParallel <https://github.com/pytorch/pytorch/blob/master/torch/distributed/_fsdp/fully_sharded_data_parallel.py>`__,.
the ",." at the end of the link -> "."
Force-pushed from 689a59e to 17f05f7 (Compare)
Hey @brianjo, the content for this tutorial is ready to be merged. The failure on "pytorch_tutorial_pr_build_manager" looks unrelated? If so, shall we merge? Thanks!
It needs to pass tests or it will break the build. I'll take a look today. Thanks!
No description provided.