11"""
2- Training Transformer models using Distributed Data Parallel and Pipeline Parallelism
2+ λΆμ° λ°μ΄ν° λ³λ ¬ μ²λ¦¬μ νμ΄νλΌμΈ λ³λ ¬νλ₯Ό μ¬μ©ν νΈλμ€ν¬λ¨Έ λͺ¨λΈ νμ΅
33====================================================================================
44
55**Author**: `Pritam Damania <https://github.com/pritamdamania87>`_
6+ **λ²μ**: `λ°±μ ν¬ <https://github.com/spongebob03>`_
7+
8+ μ΄ νν 리μΌμ `λΆμ° λ°μ΄ν° λ³λ ¬μ²λ¦¬(Distributed Data Parallel) <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__ μ
9+ `νμ΄νλΌμΈ λ³λ ¬ν <https://pytorch.org/docs/stable/pipeline.html>`__
10+ λ₯Ό μ¬μ©νμ¬ μ¬λ¬ GPUμ κ±ΈμΉ κ±°λν νΈλμ€ν¬λ¨Έ(transformer) λͺ¨λΈμ μ΄λ»κ² νμ΅μν€λμ§ λ³΄μ¬μ€λλ€.
11+ μ΄λ² νν 리μΌμ `NN.TRANSFORMER μ TORCHTEXT λ‘ μνμ€-ν¬-μνμ€(SEQUENCE-TO-SEQUENCE) λͺ¨λΈλ§νκΈ° <https://tutorials.pytorch.kr/beginner/transformer_tutorial.html>`__ μ
12+ νμ₯νμ΄λ©° νμ΄νλΌμΈ λ³λ ¬νκ° μ΄λ»κ² νΈλμ€ν¬λ¨Έ λͺ¨λΈ νμ΅μ μ°μ΄λμ§ μ¦λͺ
νκΈ° μν΄ μ΄μ νν 리μΌμμμ
13+ λͺ¨λΈ κ·λͺ¨λ₯Ό μ¦κ°μμΌ°μ΅λλ€.
614
7- This tutorial demonstrates how to train a large Transformer model across
8- multiple GPUs using `Distributed Data Parallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__ and
9- `Pipeline Parallelism <https://pytorch.org/docs/stable/pipeline.html>`__. This tutorial is an extension of the
10- `Sequence-to-Sequence Modeling with nn.Transformer and TorchText <https://tutorials.pytorch.kr/beginner/transformer_tutorial.html>`__ tutorial
11- and scales up the same model to demonstrate how Distributed Data Parallel and
12- Pipeline Parallelism can be used to train Transformer models.
13-
14- Prerequisites:
15+ μ μκ³Όλͺ©(Prerequisites):
1516
1617 * `Pipeline Parallelism <https://pytorch.org/docs/stable/pipeline.html>`__
17- * `Sequence-to-Sequence Modeling with nn.Transformer and TorchText <https://tutorials.pytorch.kr/beginner/transformer_tutorial.html>`__
18- * `Getting Started with Distributed Data Parallel <https://tutorials.pytorch.kr/intermediate/ddp_tutorial.html>`__
18+ * `NN.TRANSFORMER μ TORCHTEXT λ‘ μνμ€-ν¬-μνμ€(SEQUENCE-TO-SEQUENCE) λͺ¨λΈλ§νκΈ° <https://tutorials.pytorch.kr/beginner/transformer_tutorial.html>`__
19+ * `λΆμ° λ°μ΄ν° λ³λ ¬ μ²λ¦¬ μμνκΈ° <https://tutorials.pytorch.kr/intermediate/ddp_tutorial.html>`__
1920"""
2021
2122
######################################################################
# Define the model
# ----------------
#

######################################################################
# The ``PositionalEncoding`` module injects some information about the
# relative or absolute position of the tokens in the sequence. The
# positional encodings have the same dimension as the embeddings so that
# the two can be summed. Here, we use ``sine`` and ``cosine`` functions of
# different frequencies.

import sys
import os
@@ -60,23 +61,23 @@ def forward(self, x):


######################################################################
# In this tutorial, we will split a Transformer model across two GPUs and use
# pipeline parallelism to train the model. In addition to this, we use
# `Distributed Data Parallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
# to train two replicas of this pipeline. We have one process driving a pipe across
# GPUs 0 and 1 and another process driving a pipe across GPUs 2 and 3. Both these
# processes then use Distributed Data Parallel to train the two replicas. The
# model is exactly the same model used in the `Sequence-to-Sequence Modeling with nn.Transformer and TorchText
# <https://tutorials.pytorch.kr/beginner/transformer_tutorial.html>`__ tutorial,
# but is split into two stages. The largest number of parameters belongs to the
# `nn.TransformerEncoder <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html>`__ layer.
# The `nn.TransformerEncoder <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html>`__
# itself consists of ``nlayers`` of `nn.TransformerEncoderLayer <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html>`__.
# As a result, our focus is on ``nn.TransformerEncoder`` and we split the model
# such that half of the ``nn.TransformerEncoderLayer`` are on one GPU and the
# other half are on another. To do this, we pull out the ``Encoder`` and
# ``Decoder`` sections into separate modules and then build an ``nn.Sequential``
# representing the original Transformer module.
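#
# As a rough sketch of that arrangement (illustrative only; the actual
# ``Encoder``/``Decoder`` classes and the partitioning logic appear in the
# full code below, and names such as ``encoder_layers``, ``dev0`` and ``dev1``
# are placeholders)::
#
#     # first stage: embedding + first half of the encoder layers on one GPU,
#     # second stage: remaining encoder layers + linear decoder on the other GPU
#     first_half = nn.Sequential(encoder, *encoder_layers[:nlayers // 2]).to(dev0)
#     second_half = nn.Sequential(*encoder_layers[nlayers // 2:], decoder).to(dev1)
#     module_list = [first_half, second_half]  # later joined into a single Pipe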


if sys.platform == 'win32':
@@ -120,33 +121,31 @@ def forward(self, inp):
    return self.decoder(inp).permute(1, 0, 2)

######################################################################
# Start multiple processes for training
# -------------------------------------
#


######################################################################
# We start two processes where each process drives its own pipeline across two
# GPUs. ``run_worker`` is executed for each process.
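#
# A minimal sketch of how two such workers can be launched (illustrative; the
# launch code itself is not shown in this excerpt)::
#
#     import torch.multiprocessing as mp
#
#     if __name__ == "__main__":
#         world_size = 2  # two pipelines, each spanning two GPUs
#         mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)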

def run_worker(rank, world_size):


######################################################################
# Load and batch data
# -------------------
#


######################################################################
# The training process uses the Wikitext-2 dataset from ``torchtext``. The
# vocab object is built from the training dataset and is used to numericalize
# tokens into tensors. Starting from sequential data, the ``batchify()``
# function arranges the dataset into columns, trimming off any tokens remaining
# after the data has been divided into batches of size ``batch_size``.
# For instance, with the alphabet as the sequence (total length of 26)
# and a batch size of 4, we would divide the alphabet into 4 sequences of
# length 6:
#
# .. math::
#     \begin{bmatrix}
@@ -160,9 +159,9 @@ def run_worker(rank, world_size):
#     \begin{bmatrix}\text{S} \\ \text{T} \\ \text{U} \\ \text{V} \\ \text{W} \\ \text{X}\end{bmatrix}
#     \end{bmatrix}
#
# These columns are treated as independent by the model, which means that
# the dependence of ``G`` and ``F`` cannot be learned, but allows more
# efficient batch processing.
#
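# A minimal sketch of this batching step (the tutorial's ``batchify()`` below
# also takes ``rank`` and ``world_size`` so that each Distributed Data Parallel
# replica can work on its own shard of the training data; that part is omitted
# here)::
#
#     def batchify_sketch(data, bsz):
#         # keep only as many tokens as fit into full batches of size bsz
#         nbatch = data.size(0) // bsz
#         data = data.narrow(0, 0, nbatch * bsz)
#         # reshape so that each column is one independent sequence
#         return data.view(bsz, -1).t().contiguous()
#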

# In 'run_worker'
@@ -210,23 +209,23 @@ def batchify(data, bsz, rank, world_size, is_train=False):


######################################################################
# Functions to generate input and target sequence
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#


######################################################################
# The ``get_batch()`` function generates the input and target sequence for
# the transformer model. It subdivides the source data into chunks of
# length ``bptt``. For the language modeling task, the model needs the
# following words as ``Target``. For example, with a ``bptt`` value of 2,
# we'd get the following two Variables for ``i`` = 0:
#
# .. image:: ../_static/img/transformer_input_target.png
#
# It should be noted that the chunks are along dimension 0, consistent
# with the ``S`` dimension in the Transformer model. The batch dimension
# ``N`` is along dimension 1.
#
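# A rough sketch of the slicing logic (the tutorial's ``get_batch()`` below
# additionally transposes ``data`` so that the batch dimension comes first, as
# the pipeline expects; ``bptt`` is assumed to be defined in the surrounding
# code)::
#
#     def get_batch_sketch(source, i, bptt):
#         seq_len = min(bptt, len(source) - 1 - i)
#         data = source[i:i + seq_len]                     # current chunk
#         target = source[i + 1:i + 1 + seq_len].view(-1)  # next tokens, flattened
#         return data, target
#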

# In 'run_worker'
@@ -239,27 +238,27 @@ def get_batch(source, i):
    return data.t(), target

######################################################################
# Model scale and Pipe initialization
# -----------------------------------
#


######################################################################
# To demonstrate training large Transformer models using pipeline parallelism,
# we scale up the Transformer layers appropriately. We use an embedding
# dimension of 4096, hidden size of 4096, 16 attention heads and 8 total
# transformer layers (``nn.TransformerEncoderLayer``). This creates a model with
# **~1 billion** parameters.
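#
# As a rough sanity check on that figure (approximate, ignoring biases and
# layer norms): each ``nn.TransformerEncoderLayer`` holds about 4 * 4096 * 4096
# attention projection weights plus 2 * 4096 * 4096 feedforward weights, i.e.
# roughly 100M parameters per layer, so 8 layers contribute about 800M; the
# input embedding and the final linear decoder (each around vocab_size * 4096)
# account for most of the remainder.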
#
# We need to initialize the `RPC Framework <https://pytorch.org/docs/stable/rpc.html>`__
# since Pipe depends on the RPC framework via `RRef <https://pytorch.org/docs/stable/rpc.html#rref>`__
# which allows for future expansion to cross-host pipelining. We need to
# initialize the RPC framework with only a single worker since we're using a
# single process to drive multiple GPUs.
#
# The pipeline is then initialized with half of the transformer layers (4) on
# one GPU and the other half (4) on the other GPU. One pipe is set up across
# GPUs 0 and 1 and another across GPUs 2 and 3. Both pipes are then replicated
# using DistributedDataParallel.
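#
# A condensed sketch of this initialization (illustrative; the full code below
# builds ``module_list`` from the ``Encoder``, ``Decoder`` and
# ``nn.TransformerEncoderLayer`` stages and picks devices based on the process
# rank, and the ``chunks`` value here is only an example)::
#
#     import tempfile
#     import torch
#     import torch.distributed.rpc as rpc
#     from torch.distributed.pipeline.sync import Pipe
#     from torch.nn.parallel import DistributedDataParallel
#
#     # single-worker RPC setup; Pipe only needs it for RRef bookkeeping
#     tmpfile = tempfile.NamedTemporaryFile()
#     rpc.init_rpc(
#         "worker",
#         rank=0,
#         world_size=1,
#         rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
#             init_method="file://{}".format(tmpfile.name)
#         ),
#     )
#
#     chunks = 8  # number of micro-batches pushed through the pipe
#     model = Pipe(torch.nn.Sequential(*module_list), chunks=chunks)
#     model = DistributedDataParallel(model)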

# In 'run_worker'
ntokens = len(vocab)  # the size of vocabulary
@@ -331,21 +330,20 @@ def get_total_params(module: torch.nn.Module):
print_with_rank('Total parameters in model: {:,}'.format(get_total_params(model)))

######################################################################
# Run the model
# -------------
#


######################################################################
# `CrossEntropyLoss <https://pytorch.org/docs/master/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss>`__
# is applied to track the loss and
# `SGD <https://pytorch.org/docs/master/optim.html?highlight=sgd#torch.optim.SGD>`__
# implements stochastic gradient descent as the optimizer. The initial
# learning rate is set to 5.0. `StepLR <https://pytorch.org/docs/master/optim.html?highlight=steplr#torch.optim.lr_scheduler.StepLR>`__ is
# applied to adjust the learning rate through epochs. During
# training, we use the
# `nn.utils.clip_grad_norm\_ <https://pytorch.org/docs/master/nn.html?highlight=nn%20utils%20clip_grad_norm#torch.nn.utils.clip_grad_norm_>`__
# function to scale all the gradients together to prevent them from exploding.
#
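# A minimal sketch of that setup (the learning rate comes from the text above;
# the ``gamma`` and gradient-clipping values shown here are illustrative)::
#
#     criterion = nn.CrossEntropyLoss()
#     lr = 5.0  # initial learning rate
#     optimizer = torch.optim.SGD(model.parameters(), lr=lr)
#     scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)
#
#     # inside the training loop, after loss.backward():
#     torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
#     optimizer.step()
#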

# In 'run_worker'
@@ -409,8 +407,8 @@ def evaluate(eval_model, data_source):
    return total_loss / (len(data_source) - 1)

######################################################################
# Loop over epochs. Save the model if the validation loss is the best
# we've seen so far. Adjust the learning rate after each epoch.
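#
# In outline (a sketch only; ``epochs``, ``train()`` and ``best_model`` come
# from parts of the script elided in this excerpt)::
#
#     for epoch in range(1, epochs + 1):
#         train()
#         val_loss = evaluate(model, val_data)
#         if val_loss < best_val_loss:
#             best_val_loss = val_loss
#             best_model = model
#         scheduler.step()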

# In 'run_worker'
best_val_loss = float("inf")
@@ -435,10 +433,10 @@ def evaluate(eval_model, data_source):


######################################################################
# Evaluate the model with the test dataset
# ----------------------------------------
#
# Apply the best model to check the result with the test dataset.

# In 'run_worker'
test_loss = evaluate(best_model, test_data)