`minGPT Training <../intermediate/ddp_series_minGPT.html>`__


Multi-GPU training with DDP
==============================

Authors: `Suraj Subramanian <https://github.com/suraj813>`__
Translated by: `Nathan Kim <https://github.com/NK590>`__

.. grid:: 2

   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn

      - How to migrate a single-GPU training script to multi-GPU via DDP
      - Setting up the distributed process group
      - Saving and loading models in a distributed setup

      .. grid:: 1

         .. grid-item::

            :octicon:`code-square;1.0em;` View the code used in this tutorial on `GitHub <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu.py>`__

   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

      * High-level overview of `how DDP works <ddp_series_theory.html>`__
      * A machine with multiple GPUs (this tutorial uses an AWS p3.8xlarge instance)
      * PyTorch `installed <https://pytorch.org/get-started/locally/>`__ with CUDA

Follow along with the video below or on `youtube <https://www.youtube.com/watch/-LAtx9Q6DA8>`__.

.. raw:: html

   <div style="margin-top:10px; margin-bottom:10px;">
     <iframe width="560" height="315" src="https://www.youtube.com/embed/-LAtx9Q6DA8" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>

In the `previous tutorial <ddp_series_theory.html>`__, we got a high-level overview of how DDP works; now we see how to use DDP in code.
In this tutorial, we start with a single-GPU training script and migrate it to run on 4 GPUs on a single node.
Along the way, we will talk through important concepts in distributed training while implementing them in our code.

.. note::
   If your model contains any ``BatchNorm`` layers, they need to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
   layers across replicas.

   Use the helper function
   `torch.nn.SyncBatchNorm.convert_sync_batchnorm(model) <https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html#torch.nn.SyncBatchNorm.convert_sync_batchnorm>`__ to convert all ``BatchNorm`` layers in the model to ``SyncBatchNorm``.

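   For example, a minimal sketch of the conversion (the layer sizes here are made up for illustration):

   .. code-block:: python

      import torch.nn as nn

      model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
      # Replace every BatchNorm layer with a SyncBatchNorm layer so that running
      # stats are synchronized across replicas; do this before wrapping with DDP.
      model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
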
Diff for `single_gpu.py <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/single_gpu.py>`__ vs. `multigpu.py <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu.py>`__

These are the changes you typically make to a single-GPU training script to enable DDP.

Imports
~~~~~~~
- ``torch.multiprocessing`` is a PyTorch wrapper around Python's native
  multiprocessing module.
- The distributed process group contains all the processes that can
  communicate and synchronize with each other.

.. code-block:: diff

    + import os

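The diff above is abridged. A sketch of the imports the multi-GPU script typically needs, based on the components used in this tutorial (not necessarily the exact lines in ``multigpu.py``):

.. code-block:: python

   import os

   import torch
   import torch.multiprocessing as mp                       # spawns one process per GPU
   from torch.utils.data.distributed import DistributedSampler
   from torch.nn.parallel import DistributedDataParallel as DDP
   from torch.distributed import init_process_group, destroy_process_group
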
Constructing the process group
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- First, before initializing the process group, call `set_device <https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html?highlight=set_device#torch.cuda.set_device>`__,
  which sets the default GPU for each process. This is important to prevent hangs or excessive memory utilization on ``GPU:0``.
- The process group can be initialized by TCP (default) or from a
  shared file-system. Read more on `process group
  initialization <https://pytorch.org/docs/stable/distributed.html#tcp-initialization>`__.
- `init_process_group <https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group>`__
  initializes the distributed process group; a setup function along these lines is sketched below.
- Read more about `choosing a DDP
  backend <https://pytorch.org/docs/stable/distributed.html#which-backend-to-use>`__.

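A minimal sketch of such a setup function, assuming single-node training with the NCCL backend and an arbitrary rendezvous address/port:

.. code-block:: python

   import os

   import torch
   from torch.distributed import init_process_group


   def ddp_setup(rank: int, world_size: int):
       """
       Args:
           rank: unique identifier of each process
           world_size: total number of processes
       """
       # Assumed rendezvous settings for a single machine; adjust as needed.
       os.environ["MASTER_ADDR"] = "localhost"
       os.environ["MASTER_PORT"] = "12355"
       # Bind this process to its own GPU before creating the process group.
       torch.cuda.set_device(rank)
       init_process_group(backend="nccl", rank=rank, world_size=world_size)
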
Constructing the DDP model
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: diff

    - self.model = model.to(gpu_id)
    + self.model = DDP(model, device_ids=[gpu_id])

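In isolation, the wrapping looks roughly like the sketch below, assuming the process group from the previous section has already been initialized; note that the unwrapped network stays reachable through the wrapper's ``module`` attribute:

.. code-block:: python

   import torch.nn as nn
   from torch.nn.parallel import DistributedDataParallel as DDP

   gpu_id = 0                                  # the GPU assigned to this process
   model = nn.Linear(20, 10).to(gpu_id)        # any torch.nn.Module works here
   model = DDP(model, device_ids=[gpu_id])     # wrap it for gradient synchronization
   # The original network is available as model.module,
   # e.g. model.module.state_dict() for checkpointing.
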
Distributing input data
~~~~~~~~~~~~~~~~~~~~~~~

- `DistributedSampler <https://pytorch.org/docs/stable/data.html?highlight=distributedsampler#torch.utils.data.distributed.DistributedSampler>`__
  chunks the input data across all distributed processes (a usage sketch follows the diff below).
- Each process will receive an input batch of 32 samples; the effective
  batch size is ``32 * nprocs``, or 128 when using 4 GPUs.

.. code-block:: diff

    + sampler=DistributedSampler(train_dataset),
      )

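The diff is abridged; a sketch of the full ``DataLoader`` construction under the assumptions above (batch size 32 per process, a dataset object named ``train_dataset``):

.. code-block:: python

   from torch.utils.data import DataLoader
   from torch.utils.data.distributed import DistributedSampler

   train_data = DataLoader(
       train_dataset,
       batch_size=32,      # per-process batch size; effective batch size is 32 * nprocs
       pin_memory=True,
       shuffle=False,      # shuffling is delegated to the DistributedSampler
       sampler=DistributedSampler(train_dataset),
   )
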
- Calling the ``set_epoch()`` method on the ``DistributedSampler`` at the beginning of each epoch is necessary to make shuffling work
  properly across multiple epochs. Otherwise, the same ordering will be used in each epoch; a sketch of where the call goes is shown below.

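A sketch of the per-epoch loop, assuming the ``_run_epoch``/``_run_batch`` structure used by the ``Trainer`` class in this series:

.. code-block:: python

   def _run_epoch(self, epoch):
       # Tell the sampler which epoch this is so it reshuffles the data differently
       # each epoch; without this call, every epoch would see the same ordering.
       self.train_data.sampler.set_epoch(epoch)
       for source, targets in self.train_data:
           source = source.to(self.gpu_id)
           targets = targets.to(self.gpu_id)
           self._run_batch(source, targets)
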
Saving model checkpoints
~~~~~~~~~~~~~~~~~~~~~~~~
- We only need to save model checkpoints from one process. Without this
  condition, each process would save its copy of the identical model; a sketch is shown below. Read
  more about saving and loading models with
  DDP `here <https://tutorials.pytorch.kr/intermediate/ddp_tutorial.html#save-and-load-checkpoints>`__.

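A sketch of rank-0-only checkpointing, assuming the ``Trainer`` attributes used in this series (``self.model`` wrapped in DDP, ``self.gpu_id``, and a hypothetical ``self.save_every`` interval) and a hypothetical ``checkpoint.pt`` path:

.. code-block:: python

   def _save_checkpoint(self, epoch):
       # Unwrap the DDP container so the checkpoint can be loaded without DDP.
       ckp = self.model.module.state_dict()
       torch.save(ckp, "checkpoint.pt")
       print(f"Epoch {epoch} | Training checkpoint saved at checkpoint.pt")

   def train(self, max_epochs: int):
       for epoch in range(max_epochs):
           self._run_epoch(epoch)
           # Only the rank:0 process writes the checkpoint.
           if self.gpu_id == 0 and epoch % self.save_every == 0:
               self._save_checkpoint(epoch)
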
.. warning::
   `Collective calls <https://pytorch.org/docs/stable/distributed.html#collective-functions>`__ are functions that run on all the distributed processes,
   and they are used to gather certain states or values to a specific process. Collective calls require all ranks to run the collective code.
   In this example, ``_save_checkpoint`` should not have any collective calls because it is only run on the ``rank:0`` process.
   If you need to make any collective calls, they should come before the ``if self.gpu_id == 0`` check.

Running the distributed training job
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Include new arguments ``rank`` (replacing ``device``) and
  ``world_size``.
- ``rank`` is auto-allocated by DDP when calling
  `mp.spawn <https://pytorch.org/docs/stable/multiprocessing.html#spawning-subprocesses>`__.
- ``world_size`` is the number of processes across the training job. For GPU training,
  this corresponds to the number of GPUs in use, and each process works on a dedicated GPU.
  A sketch of the launch code follows this list.

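A sketch of the launch code under these assumptions: the ``ddp_setup`` helper sketched above, the series' ``Trainer`` class, and hypothetical ``load_train_objs``/``prepare_dataloader`` helpers that build the dataset, model, optimizer, and DataLoader:

.. code-block:: python

   import torch
   import torch.multiprocessing as mp
   from torch.distributed import destroy_process_group


   def main(rank: int, world_size: int, save_every: int, total_epochs: int, batch_size: int):
       ddp_setup(rank, world_size)
       dataset, model, optimizer = load_train_objs()          # hypothetical helper
       train_data = prepare_dataloader(dataset, batch_size)   # hypothetical helper
       trainer = Trainer(model, train_data, optimizer, rank, save_every)
       trainer.train(total_epochs)
       destroy_process_group()


   if __name__ == "__main__":
       world_size = torch.cuda.device_count()  # one process per GPU
       # mp.spawn calls main(i, *args) for i in range(nprocs); i becomes the rank.
       mp.spawn(main, args=(world_size, 2, 10, 32), nprocs=world_size)
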
Further Reading
---------------

- `Fault Tolerant distributed training <ddp_series_fault_tolerance.html>`__ (next tutorial in this series)
- `Intro to DDP <ddp_series_theory.html>`__ (previous tutorial in this series)
- `Getting Started with DDP <https://tutorials.pytorch.kr/intermediate/ddp_tutorial.html>`__
- `Process Group initialization <https://pytorch.org/docs/stable/distributed.html#tcp-initialization>`__