Customize Process Group Backends Using Cpp Extensions
=====================================================

**Author**: `Feng Tian <https://github.com/ftian1>`__, `Shen Li <https://mrshenli.github.io/>`__, `Min Si <https://minsii.github.io/>`__

**Translator**: `๋ฐ์ฌ์ค <https://github.com/jenner9212>`_

.. note::
   |edit| View and edit this tutorial in `github <https://github.com/pytorch/tutorials/blob/main/intermediate_source/process_group_cpp_extension_tutorial.rst>`__.

Prerequisites:

- `PyTorch Distributed Overview <../beginner/dist_overview.html>`__
- `PyTorch Collective Communication Package <https://pytorch.org/docs/stable/distributed.html>`__
- `PyTorch Cpp Extension <https://pytorch.org/docs/stable/cpp_extension.html>`__
- `Writing Distributed Applications with PyTorch <https://tutorials.pytorch.kr/intermediate/dist_tuto.html>`__

This tutorial demonstrates how to implement a custom ``ProcessGroup``
backend and plug that into the
`PyTorch distributed package <https://pytorch.org/docs/stable/distributed.html>`__ using
`cpp extensions <https://pytorch.org/docs/stable/cpp_extension.html>`__. This is helpful when you need a specialized software
stack for your hardware, or when you would like to experiment with new
collective communication algorithms.


Basics
------

PyTorch collective communications power several widely adopted distributed
training features, including
`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__,
`ZeroRedundancyOptimizer <https://pytorch.org/docs/stable/distributed.optim.html#torch.distributed.optim.ZeroRedundancyOptimizer>`__, and
`FullyShardedDataParallel <https://github.com/pytorch/pytorch/blob/master/torch/distributed/_fsdp/fully_sharded_data_parallel.py>`__.
In order to make the same collective communication API work with
different communication backends, the distributed package abstracts collective
communication operations into a
`ProcessGroup <https://github.com/pytorch/pytorch/blob/release/1.10/torch/csrc/distributed/c10d/ProcessGroup.hpp>`__
class. Different backends can
then be implemented as subclasses of ``ProcessGroup`` using preferred
third-party libraries. PyTorch distributed comes with three default backends,
``ProcessGroupNCCL``, ``ProcessGroupGloo``, and ``ProcessGroupMPI``. However,
beyond these three backends, there are also other communication libraries
(e.g., `UCC <https://github.com/openucx/ucc>`__,
`OneCCL <https://github.com/oneapi-src/oneCCL>`__), different types of hardware
(e.g., `TPU <https://cloud.google.com/tpu>`__,
`Trainium <https://aws.amazon.com/machine-learning/trainium/>`__), and emerging
communication algorithms (e.g.,
`Herring <https://www.amazon.science/publications/herring-rethinking-the-parameter-server-at-scale-for-the-cloud>`__,
`Reduction Server <https://cloud.google.com/blog/topics/developers-practitioners/optimize-training-performance-reduction-server-vertex-ai>`__).
Therefore, the distributed package exposes extension APIs to allow customizing
collective communication backends.
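
From application code, the backend is just a string passed to
``init_process_group``; the same collective call then runs on whichever
``ProcessGroup`` subclass that string selects. A minimal sketch of this,
assuming a single local process and the builtin ``gloo`` backend:

.. code-block:: python

    import os
    import torch
    import torch.distributed as dist

    # rendezvous configuration for a single local process
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"

    # swapping "gloo" for "nccl", "mpi", or a registered custom backend
    # leaves the collective calls below unchanged
    dist.init_process_group("gloo", rank=0, world_size=1)

    t = torch.ones(4)
    dist.all_reduce(t)  # dispatched to the selected ProcessGroup subclass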


The 4 steps below show how to implement a dummy ``ProcessGroup`` backend
and use that in Python application code. Please note that this tutorial focuses
on demonstrating the extension APIs, instead of developing a functioning
communication backend. Hence, the ``dummy`` backend just covers a subset of the
APIs (``all_reduce`` and ``all_gather``), and simply sets the values of tensors
to 0.


Step 1: Implement a Subclass of ``ProcessGroup``
------------------------------------------------

The first step is to implement a ``ProcessGroup`` subclass that overrides
target collective communication APIs and runs the custom communication algorithm.
The extension also needs to implement a ``Work`` subclass, which
serves as a future of communication results and allows asynchronous execution in
application code. If the extension uses third-party libraries, it can
include the headers and call into the library APIs from the ``ProcessGroupDummy``
subclass. The two code snippets below present the implementation of ``dummy.h`` and
``dummy.cpp``. See the `dummy collectives <https://github.com/mrshenli/dummy_collectives>`__
repository for the full implementation.

.. code-block:: cpp

    // file name: dummy.hpp
    #include <torch/python.h>

    #include <torch/csrc/distributed/c10d/ProcessGroup.hpp>

    // ... (remaining includes and the beginning of the ProcessGroupDummy
    // class declaration are elided here; see the repository linked above)

          std::vector<at::Tensor>& tensors,
          const AllreduceOptions& opts = AllreduceOptions()) override;

      // The collective communication APIs without a custom implementation
      // will error out if invoked by application code.
    };

    class WorkDummy : public Work {
     public:
      WorkDummy(
          OpType opType,
          c10::intrusive_ptr<c10::ivalue::Future> future) // future of the output
          : Work(
                -1, // rank, only used by recvAnySource, irrelevant in this demo
                opType),
            future_(std::move(future)) {}
      // There are several additional helper functions that need to be
      // implemented. Please refer to https://github.com/mrshenli/dummy_collectives
      // for the full implementation.

     private:
      c10::intrusive_ptr<c10::ivalue::Future> future_;
    };

    } // namespace c10d

.. code-block:: cpp

    // file name: dummy.cpp
    #include "dummy.hpp"

    namespace c10d {

    // This is a dummy allgather that sets all output tensors to zero
    // Modify the implementation to conduct real communication asynchronously
    c10::intrusive_ptr<Work> ProcessGroupDummy::allgather(
        std::vector<std::vector<at::Tensor>>& outputTensors,
        std::vector<at::Tensor>& inputTensors,
        // ... (the rest of the signature and the body, which zero the output
        // tensors and create a completed future, are elided; see the repository)
        return c10::make_intrusive<WorkDummy>(OpType::ALLGATHER, std::move(future));
    }

    // This is a dummy allreduce that sets all output tensors to zero
    // Modify the implementation to conduct real communication asynchronously
    c10::intrusive_ptr<Work> ProcessGroupDummy::allreduce(
        std::vector<at::Tensor>& tensors,
        const AllreduceOptions& opts) {
      // ... (the body, which zeroes each tensor and returns a WorkDummy
      // holding a completed future, is elided; see the repository)
    }
    } // namespace c10d
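
On the application side, this ``Work`` subclass is what surfaces as the
asynchronous handle of a collective. A minimal sketch of that pattern with the
builtin Python API (standard ``torch.distributed`` usage, not specific to the
dummy backend; the rendezvous values are placeholders):

.. code-block:: python

    import os
    import torch
    import torch.distributed as dist

    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29502"
    dist.init_process_group("gloo", rank=0, world_size=1)

    t = torch.ones(4)
    work = dist.all_reduce(t, async_op=True)  # returns a Work handle
    # ... other computation can overlap with the communication here ...
    work.wait()  # blocks until the collective's future completes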


Step 2: Expose The Extension Python APIs
----------------------------------------

The backend constructors are called
`from Python side <https://github.com/pytorch/pytorch/blob/v1.9.0/torch/distributed/distributed_c10d.py#L643-L650>`__,
so the extension also needs to expose the constructor APIs to Python. This can
be done by adding the following methods. In this example, ``store`` and
``timeout`` are ignored by the ``ProcessGroupDummy`` instantiation method, as
those are not used in this dummy implementation. However, real-world extensions
should consider using the ``store`` to perform rendezvous and supporting the
``timeout`` argument.

.. code-block:: cpp

    // ... (the surrounding declarations, including the createProcessGroupDummy
    // factory function, are elided; see the repository linked above)
      py::object module = py::module::import("torch.distributed");
      py::object register_backend =
          module.attr("Backend").attr("register_backend");
      // torch.distributed.Backend.register_backend will add `dummy` as a
      // new valid backend.
      register_backend("dummy", py::cpp_function(createProcessGroupDummy));
    }
    }
    // ... (additional binding and helper code is elided; see the repository)
    }

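
The ``store`` passed to the constructor is a key-value interface that every
rank can reach before any process group exists, which is what makes it suitable
for rendezvous. A minimal sketch of that pattern using the builtin ``TCPStore``
(hypothetical single-process values, independent of the dummy backend):

.. code-block:: python

    import datetime
    import torch.distributed as dist

    rank, world_size = 0, 1  # placeholder values for illustration
    # rank 0 hosts the store; other ranks would connect with is_master=False
    store = dist.TCPStore("127.0.0.1", 29501, world_size, rank == 0,
                          timeout=datetime.timedelta(seconds=30))

    # each rank publishes a value, then reads its peers' entries;
    # get() blocks until the matching set() has happened
    store.set(f"addr_{rank}", "tcp://127.0.0.1:29500")
    addrs = [store.get(f"addr_{r}") for r in range(world_size)]
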
Step 3: Build The Custom Extension
----------------------------------

Now, the extension source code files are ready. We can then use
`cpp extensions <https://pytorch.org/docs/stable/cpp_extension.html>`__
to build it. To do that, create a ``setup.py`` file that prepares the paths and
commands. Then call ``python setup.py install`` to install the extension.

If the extension depends on third-party libraries, you can also specify
``library_dirs`` and ``libraries`` to the cpp extension APIs. See the
`torch ucc <https://github.com/openucx/torch-ucc>`__
project as a real-world example.

.. code-block:: python

    # file name: setup.py
    import os
    import sys
    import torch
    # ... (the extension module definition and the other setup() arguments
    # are elided; see the repository linked above)
        cmdclass={'build_ext': cpp_extension.BuildExtension}
    )

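
For orientation, a hypothetical minimal ``setup.py`` for this extension could
look like the sketch below (the module and file names are assumptions, not the
repository's exact contents):

.. code-block:: python

    from setuptools import setup
    from torch.utils import cpp_extension

    setup(
        name="dummy_collectives",
        ext_modules=[
            cpp_extension.CppExtension(
                name="dummy_collectives",   # the importable module name
                sources=["dummy.cpp"],      # assumed source layout
            )
        ],
        cmdclass={"build_ext": cpp_extension.BuildExtension},
    )
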
Step 4: Use The Extension in Application
----------------------------------------

After installation, you can conveniently use the ``dummy`` backend when calling
`init_process_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group>`__
as if it is a builtin backend.

.. code-block:: python

    import os

    import torch
    # importing dummy_collectives makes torch.distributed recognize `dummy`
    # as a valid backend.
    import dummy_collectives

    import torch.distributed as dist
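
    # a minimal usage sketch continues below (rendezvous values are
    # placeholders; the full runnable example lives in the dummy
    # collectives repository)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("dummy", rank=0, world_size=1)

    x = torch.ones(6)
    dist.all_reduce(x)
    print(f"allreduce result: {x}")  # the dummy backend zeroes the tensor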