
Commit b908557

advanced_source/ddp_pipeline translation
1 parent efe11eb commit b908557

1 file changed (+84, -86 lines)

advanced_source/ddp_pipeline.py

Lines changed: 84 additions & 86 deletions
@@ -1,35 +1,36 @@
 """
-Training Transformer models using Distributed Data Parallel and Pipeline Parallelism
+Training Transformer models using Distributed Data Parallel and Pipeline Parallelism
 ====================================================================================

 **Author**: `Pritam Damania <https://github.com/pritamdamania87>`_
+**Translation**: `백선희 <https://github.com/spongebob03>`_
+
+This tutorial shows how to use `Distributed Data Parallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__ and
+`Pipeline Parallelism <https://pytorch.org/docs/stable/pipeline.html>`__
+to train a large Transformer model across multiple GPUs.
+It is an extension of the `Sequence-to-Sequence Modeling with nn.Transformer and TorchText <https://tutorials.pytorch.kr/beginner/transformer_tutorial.html>`__ tutorial
+and scales up the model from that tutorial to demonstrate how pipeline parallelism
+can be used to train Transformer models.

-This tutorial demonstrates how to train a large Transformer model across
-multiple GPUs using `Distributed Data Parallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__ and
-`Pipeline Parallelism <https://pytorch.org/docs/stable/pipeline.html>`__. This tutorial is an extension of the
-`Sequence-to-Sequence Modeling with nn.Transformer and TorchText <https://tutorials.pytorch.kr/beginner/transformer_tutorial.html>`__ tutorial
-and scales up the same model to demonstrate how Distributed Data Parallel and
-Pipeline Parallelism can be used to train Transformer models.
-
-Prerequisites:
+Prerequisites:

 * `Pipeline Parallelism <https://pytorch.org/docs/stable/pipeline.html>`__
-* `Sequence-to-Sequence Modeling with nn.Transformer and TorchText <https://tutorials.pytorch.kr/beginner/transformer_tutorial.html>`__
-* `Getting Started with Distributed Data Parallel <https://tutorials.pytorch.kr/intermediate/ddp_tutorial.html>`__
+* `Sequence-to-Sequence Modeling with nn.Transformer and TorchText <https://tutorials.pytorch.kr/beginner/transformer_tutorial.html>`__
+* `Getting Started with Distributed Data Parallel <https://tutorials.pytorch.kr/intermediate/ddp_tutorial.html>`__

 """


 ######################################################################
-# Define the model
+# Define the model
 # ----------------
 #

 ######################################################################
-# ``PositionalEncoding`` module injects some information about the
-# relative or absolute position of the tokens in the sequence. The
-# positional encodings have the same dimension as the embeddings so that
-# the two can be summed. Here, we use ``sine`` and ``cosine`` functions of
-# different frequencies.
+# The ``PositionalEncoding`` module injects some information about the
+# relative or absolute position of the tokens in the sequence.
+# The positional encodings have the same dimension as the embeddings,
+# so the two can be summed. Here, we use ``sine`` and ``cosine`` functions
+# of different frequencies.

 import sys
 import os
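Note: as a companion to the ``PositionalEncoding`` description in this hunk, here is a minimal sketch of the standard sine/cosine formulation; the file's actual class may differ in detail.

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Frequencies decay geometrically, so each embedding dimension encodes a different wavelength.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0).transpose(0, 1))

    def forward(self, x):
        # The encodings have the same dimension as the embeddings, so they can simply be added.
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)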
@@ -60,23 +61,23 @@ def forward(self, x):


 ######################################################################
-# In this tutorial, we will split a Transformer model across two GPUs and use
-# pipeline parallelism to train the model. In addition to this, we use
-# `Distributed Data Parallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
-# to train two replicas of this pipeline. We have one process driving a pipe across
-# GPUs 0 and 1 and another process driving a pipe across GPUs 2 and 3. Both these
-# processes then use Distributed Data Parallel to train the two replicas. The
-# model is exactly the same model used in the `Sequence-to-Sequence Modeling with nn.Transformer and TorchText
-# <https://tutorials.pytorch.kr/beginner/transformer_tutorial.html>`__ tutorial,
-# but is split into two stages. The largest number of parameters belong to the
-# `nn.TransformerEncoder <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html>`__ layer.
-# The `nn.TransformerEncoder <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html>`__
-# itself consists of ``nlayers`` of `nn.TransformerEncoderLayer <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html>`__.
-# As a result, our focus is on ``nn.TransformerEncoder`` and we split the model
-# such that half of the ``nn.TransformerEncoderLayer`` are on one GPU and the
-# other half are on another. To do this, we pull out the ``Encoder`` and
-# ``Decoder`` sections into seperate modules and then build an nn.Sequential
-# representing the original Transformer module.
+# In this tutorial, we split a Transformer model across two GPUs and use
+# pipeline parallelism to train the model. In addition, we use
+# `Distributed Data Parallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
+# to train two replicas of this pipeline. One process drives a pipe across
+# GPUs 0 and 1 and another process drives a pipe across GPUs 2 and 3. These two
+# processes then use Distributed Data Parallel to train the two replicas.
+# The model is exactly the same model used in the `Sequence-to-Sequence Modeling with nn.Transformer and TorchText
+# <https://tutorials.pytorch.kr/beginner/transformer_tutorial.html>`__ tutorial,
+# but is split into two stages. Most of the parameters belong to the
+# `nn.TransformerEncoder <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html>`__ layer.
+# The `nn.TransformerEncoder <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html>`__
+# itself consists of ``nlayers`` of `nn.TransformerEncoderLayer <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html>`__.
+# As a result, our focus is on ``nn.TransformerEncoder`` and we split the model
+# so that half of the ``nn.TransformerEncoderLayer`` are on one GPU and the
+# other half are on another. To do this, we pull out the ``Encoder`` and
+# ``Decoder`` sections into separate modules and then build an nn.Sequential
+# representing the original Transformer module.


 if sys.platform == 'win32':
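Note: a rough sketch, with toy sizes rather than the tutorial's, of the split described above; the embedding front end and the linear decoder are pulled into their own modules and chained with the encoder layers in an ``nn.Sequential``, so the flat sequence can later be cut into two pipeline stages.

import torch.nn as nn

# Toy sizes for illustration only; the tutorial scales these up considerably.
ntokens, emsize, nhead, nhid, nlayers = 1000, 64, 2, 64, 8

encoder = nn.Embedding(ntokens, emsize)  # stands in for the tutorial's ``Encoder`` module
layers = [nn.TransformerEncoderLayer(emsize, nhead, nhid) for _ in range(nlayers)]
decoder = nn.Linear(emsize, ntokens)     # stands in for the tutorial's ``Decoder`` module

# One flat nn.Sequential: the first half of the encoder layers can then be placed
# on one GPU and the second half on another before wrapping the whole thing in Pipe.
model = nn.Sequential(encoder, *layers, decoder)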
@@ -120,33 +121,31 @@ def forward(self, inp):
 return self.decoder(inp).permute(1, 0, 2)

 ######################################################################
-# Start multiple processes for training
+# Start multiple processes for training
 # -------------------------------------
 #


 ######################################################################
-# We start two processes where each process drives its own pipeline across two
-# GPUs. ``run_worker`` is executed for each process.
+# We start two processes, each of which drives its own pipeline across two GPUs.
+# ``run_worker`` is executed in each process.

 def run_worker(rank, world_size):


 ######################################################################
-# Load and batch data
+# Load and batch data
 # -------------------
 #


 ######################################################################
-# The training process uses Wikitext-2 dataset from ``torchtext``. The
-# vocab object is built based on the train dataset and is used to numericalize
-# tokens into tensors. Starting from sequential data, the ``batchify()``
-# function arranges the dataset into columns, trimming off any tokens remaining
-# after the data has been divided into batches of size ``batch_size``.
-# For instance, with the alphabet as the sequence (total length of 26)
-# and a batch size of 4, we would divide the alphabet into 4 sequences of
-# length 6:
+# The training process uses the Wikitext-2 dataset from ``torchtext``.
+# The vocab object is built from the training dataset and is used to numericalize tokens into tensors.
+# Starting from sequential data, the ``batchify()`` function arranges the dataset into columns,
+# trimming off any tokens remaining after the data has been divided into batches of size ``batch_size``.
+# For instance, with the alphabet as the sequence (total length of 26) and a batch size of 4,
+# we would divide the alphabet into 4 sequences of length 6:
 #
 # .. math::
 # \begin{bmatrix}
@@ -160,9 +159,9 @@ def run_worker(rank, world_size):
 # \begin{bmatrix}\text{S} \\ \text{T} \\ \text{U} \\ \text{V} \\ \text{W} \\ \text{X}\end{bmatrix}
 # \end{bmatrix}
 #
-# These columns are treated as independent by the model, which means that
-# the dependence of ``G`` and ``F`` can not be learned, but allows more
-# efficient batch processing.
+# These columns are treated as independent by the model, which means that
+# the dependence of ``G`` and ``F`` cannot be learned, but this allows more
+# efficient batch processing.
 #

 # In 'run_worker'
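Note: a small, self-contained illustration (assumed shapes, not the tutorial's exact helper) of what ``batchify()`` does: trim the sequence so it divides evenly by the batch size, then reshape it into ``batch_size`` independent columns.

import torch

def batchify(data, bsz):
    nbatch = data.size(0) // bsz            # number of full rows per column
    data = data.narrow(0, 0, nbatch * bsz)  # trim off tokens that do not fit
    return data.view(bsz, -1).t().contiguous()  # shape: (nbatch, bsz)

alphabet = torch.arange(26)                 # stands in for the tokenized A..Z sequence
print(batchify(alphabet, 4).shape)          # torch.Size([6, 4]): 4 sequences of length 6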
@@ -210,23 +209,23 @@ def batchify(data, bsz, rank, world_size, is_train=False):


 ######################################################################
-# Functions to generate input and target sequence
+# Functions to generate input and target sequences
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 #


 ######################################################################
-# ``get_batch()`` function generates the input and target sequence for
-# the transformer model. It subdivides the source data into chunks of
-# length ``bptt``. For the language modeling task, the model needs the
-# following words as ``Target``. For example, with a ``bptt`` value of 2,
-# we’d get the following two Variables for ``i`` = 0:
+# The ``get_batch()`` function generates the input and target sequence for
+# the transformer model. It subdivides the source data into chunks of length ``bptt``.
+# For the language modeling task, the model needs the
+# following words as ``Target``. For example, with a ``bptt`` value of 2,
+# we'd get the following two Variables for ``i`` = 0:
 #
 # .. image:: ../_static/img/transformer_input_target.png
-#
-# It should be noted that the chunks are along dimension 0, consistent
-# with the ``S`` dimension in the Transformer model. The batch dimension
-# ``N`` is along dimension 1.
+#
+# It should be noted that the chunks are along dimension 0, consistent
+# with the ``S`` dimension in the Transformer model.
+# The batch dimension ``N`` is along dimension 1.
 #

 # In 'run_worker'
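Note: a hedged sketch of the ``get_batch()`` idea (the tutorial's version also transposes the data for the pipelined model): for offset ``i``, the input is a chunk of length ``bptt`` along dimension 0 and the target is the same chunk shifted by one token.

import torch

bptt = 2

def get_batch(source, i):
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]                         # chunk along the ``S`` dimension (dim 0)
    target = source[i + 1:i + 1 + seq_len].reshape(-1)   # the "next words" for language modeling
    return data, target

source = torch.arange(10).unsqueeze(1)  # toy (S, N=1) batch of token ids
data, target = get_batch(source, 0)
print(data.squeeze(1).tolist(), target.tolist())  # [0, 1] [1, 2]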
@@ -239,27 +238,27 @@ def get_batch(source, i):
 return data.t(), target

 ######################################################################
-# Model scale and Pipe initialization
+# Model scale and Pipe initialization
 # -----------------------------------
 #


 ######################################################################
-# To demonstrate training large Transformer models using pipeline parallelism,
-# we scale up the Transformer layers appropriately. We use an embedding
-# dimension of 4096, hidden size of 4096, 16 attention heads and 8 total
-# transformer layers (``nn.TransformerEncoderLayer``). This creates a model with
-# **~1 billion** parameters.
+# To demonstrate training large Transformer models using pipeline parallelism,
+# we scale up the Transformer layers appropriately. We use an embedding
+# dimension of 4096, a hidden size of 4096, 16 attention heads and 8 total
+# transformer layers (``nn.TransformerEncoderLayer``). This creates a model with
+# **~1 billion** parameters.
 #
-# We need to initialize the `RPC Framework <https://pytorch.org/docs/stable/rpc.html>`__
-# since Pipe depends on the RPC framework via `RRef <https://pytorch.org/docs/stable/rpc.html#rref>`__
-# which allows for future expansion to cross host pipelining. We need to
-# initialize the RPC framework with only a single worker since we're using a
-# single process to drive multiple GPUs.
+# We need to initialize the `RPC Framework <https://pytorch.org/docs/stable/rpc.html>`__,
+# since Pipe depends on the RPC framework via `RRef <https://pytorch.org/docs/stable/rpc.html#rref>`__,
+# which allows for future expansion to cross-host pipelining.
+# The RPC framework needs to be initialized with only a single worker,
+# since we're using a single process to drive multiple GPUs.
 #
-# The pipeline is then initialized with 8 transformer layers on one GPU and 8
-# transformer layers on the other GPU. One pipe is setup across GPUs 0 and 1 and
-# another across GPUs 2 and 3. Both pipes are then replicated using DistributedDataParallel.
+# The pipeline is then initialized with 8 transformer layers on one GPU and
+# 8 transformer layers on the other GPU. One pipe is set up across GPUs 0 and 1 and
+# the other across GPUs 2 and 3. Both pipes are then replicated using DistributedDataParallel.

 # In 'run_worker'
 ntokens = len(vocab) # the size of vocabulary
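Note: a condensed sketch of the initialization described in this hunk, assuming the ``torch.distributed.pipeline.sync.Pipe`` API available in the PyTorch versions this tutorial targets, at least two visible GPUs, and toy layer sizes rather than the tutorial's 4096-dimensional setup.

import os
import torch.nn as nn
import torch.distributed.rpc as rpc
from torch.distributed.pipeline.sync import Pipe

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)  # single-worker RPC: one process drives both GPUs of its pipe

nlayers, emsize, nhead, nhid = 8, 64, 2, 64   # toy sizes for illustration
# First half of the encoder layers on one GPU, second half on the other.
stage0 = nn.Sequential(*[nn.TransformerEncoderLayer(emsize, nhead, nhid) for _ in range(nlayers // 2)]).to("cuda:0")
stage1 = nn.Sequential(*[nn.TransformerEncoderLayer(emsize, nhead, nhid) for _ in range(nlayers // 2)]).to("cuda:1")
model = Pipe(nn.Sequential(stage0, stage1), chunks=8)  # 8 micro-batches per mini-batch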
@@ -331,21 +330,20 @@ def get_total_params(module: torch.nn.Module):
 print_with_rank('Total parameters in model: {:,}'.format(get_total_params(model)))

 ######################################################################
-# Run the model
+# Run the model
 # -------------
 #


 ######################################################################
-# `CrossEntropyLoss <https://pytorch.org/docs/master/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss>`__
-# is applied to track the loss and
-# `SGD <https://pytorch.org/docs/master/optim.html?highlight=sgd#torch.optim.SGD>`__
-# implements stochastic gradient descent method as the optimizer. The initial
-# learning rate is set to 5.0. `StepLR <https://pytorch.org/docs/master/optim.html?highlight=steplr#torch.optim.lr_scheduler.StepLR>`__ is
-# applied to adjust the learn rate through epochs. During the
-# training, we use
+# `CrossEntropyLoss <https://pytorch.org/docs/master/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss>`__
+# is applied to track the loss, and as the optimizer, `SGD <https://pytorch.org/docs/master/optim.html?highlight=sgd#torch.optim.SGD>`__
+# implements the stochastic gradient descent method. The initial
+# learning rate is set to 5.0. `StepLR <https://pytorch.org/docs/master/optim.html?highlight=steplr#torch.optim.lr_scheduler.StepLR>`__ is
+# used to adjust the learning rate through epochs. During training,
+# to prevent gradients from exploding, we use the
 # `nn.utils.clip_grad_norm\_ <https://pytorch.org/docs/master/nn.html?highlight=nn%20utils%20clip_grad_norm#torch.nn.utils.clip_grad_norm_>`__
-# function to scale all the gradient together to prevent exploding.
+# function, which scales all the gradients together.
 #

 # In 'run_worker'
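Note: a minimal, self-contained sketch of the training pieces named above, applied to a placeholder model; the tutorial wires the same objects around its pipelined, DDP-wrapped model.

import torch
import torch.nn as nn

model = nn.Linear(10, 5)                        # placeholder for the real model
criterion = nn.CrossEntropyLoss()               # tracks the loss
lr = 5.0                                        # initial learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

data, targets = torch.randn(32, 10), torch.randint(0, 5, (32,))
optimizer.zero_grad()
loss = criterion(model(data), targets)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)  # scale all gradients together to prevent exploding
optimizer.step()
scheduler.step()                                # adjust the learning rate once per epoch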
@@ -409,8 +407,8 @@ def evaluate(eval_model, data_source):
 return total_loss / (len(data_source) - 1)

 ######################################################################
-# Loop over epochs. Save the model if the validation loss is the best
-# we've seen so far. Adjust the learning rate after each epoch.
+# Loop over epochs. Save the model if the validation loss is the best
+# we have seen so far. Adjust the learning rate after each epoch.

 # In 'run_worker'
 best_val_loss = float("inf")
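Note: a schematic, runnable version of the loop described above, with hypothetical stand-ins for the tutorial's model, data, and helpers.

import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                         # stand-in for the pipelined model
optimizer = torch.optim.SGD(model.parameters(), lr=5.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)
criterion = nn.CrossEntropyLoss()
data, targets = torch.randn(16, 4), torch.randint(0, 2, (16,))

def train_one_epoch():
    optimizer.zero_grad()
    loss = criterion(model(data), targets)
    loss.backward()
    optimizer.step()

def evaluate(m, eval_data):
    with torch.no_grad():
        return criterion(m(eval_data), targets).item()

best_val_loss, best_model = float("inf"), None
for epoch in range(1, 4):
    train_one_epoch()
    val_loss = evaluate(model, data)
    if val_loss < best_val_loss:                # keep the best model seen so far
        best_val_loss, best_model = val_loss, copy.deepcopy(model)
    scheduler.step()                            # adjust the learning rate after each epoch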
@@ -435,10 +433,10 @@ def evaluate(eval_model, data_source):


 ######################################################################
-# Evaluate the model with the test dataset
+# Evaluate the model with the test dataset
 # -------------------------------------
 #
-# Apply the best model to check the result with the test dataset.
+# Apply the best model to check the result with the test dataset.

 # In 'run_worker'
 test_loss = evaluate(best_model, test_data)
