11 | 11 | as long as you do not need gradient accumulation). |
12 | 12 | This tutorial covers the following: |
13 | 13 | |
14 | | -1. What takes up memory during your training or finetuning loop, |
| 14 | +1. What takes up memory during your training or finetuning step, |
15 | 15 | 2. How to capture and visualize memory snapshots to identify the bottleneck, |
16 | 16 | 3. The new ``Tensor.register_post_accumulate_grad_hook(hook)`` API (previewed in the sketch just after this list), and |
17 | 17 | 4. How to save memory with just 10 lines of code, putting it all together. |
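As a preview of the hook API listed in item 3, here is a minimal sketch (not part of this diff) of how per-parameter optimizer steps can be fused into the backward pass. The tiny ``nn.Linear`` model, the learning rate, and the per-parameter ``Adam`` instances are illustrative assumptions only:

    import torch

    # Toy model purely for illustration; any nn.Module works the same way.
    model = torch.nn.Linear(10, 10)

    # One optimizer per parameter, so each parameter can be stepped on its own
    # as soon as its gradient has been accumulated during backward.
    optimizer_dict = {p: torch.optim.Adam([p], lr=1e-3) for p in model.parameters()}

    def optimizer_hook(param) -> None:
        # Runs right after ``param.grad`` has been fully accumulated.
        optimizer_dict[param].step()
        optimizer_dict[param].zero_grad()

    for p in model.parameters():
        p.register_post_accumulate_grad_hook(optimizer_hook)

With hooks like these, calling ``loss.backward()`` both computes each gradient and immediately applies (then clears) that parameter's update, which is what makes separate ``optimizer.step()`` and ``zero_grad()`` calls unnecessary.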
@@ -97,35 +97,35 @@ def train(model, optimizer): |
97 | 97 | # .. figure:: /_static/img/optim_step_in_bwd/snapshot.jpg |
98 | 98 | # :alt: snapshot.png loaded into CUDA Memory Visualizer |
99 | 99 | # |
100 | | -# The model parameters have already been loaded into memory before the training |
101 | | -# loop, so we see a chunk of memory devoted to the weights right off the bat. |
102 | | -# As we start the forward pass, memory is gradually allocated for the activations. |
103 | | -# These activations are the tensors we save in order to compute gradients in the backward pass. |
104 | | -# Once we start the backward pass, the activations are gradually freed while the |
105 | | -# memory taken by the gradients starts building up. |
106 | | -# |
107 | | -# Lastly, once the optimizer kicks in, its state is lazily initialized, so we |
108 | | -# can see the optimizer state memory grow gradually only during the optimizer |
109 | | -# step of the first training loop. In subsequent loops, the optimizer |
110 | | -# memory remains and is updated in-place. The memory taken by the gradients is |
111 | | -# freed accordingly whenever ``zero_grad`` is called at the end of each training loop. |
| 100 | +# The model parameters have already been loaded in memory before the training |
| 101 | +# step, so we see a chunk of memory devoted to the weights right off the bat. |
| 102 | +# As we start our forward pass, memory is allocated gradually for the activations, |
| 103 | +# or the tensors we are saving to be able to compute gradients in the backward pass. |
| 104 | +# Once we start the backward pass, the activations are gradually freed while memory |
| 105 | +# of the gradients starts building up. |
| 106 | +# |
| 107 | +# Lastly, as the optimizer kicks in, its state will be lazily initialized, so we |
| 108 | +# should see the optimizer state memory gradually increase during the optimizer |
| 109 | +# step of the first training loop only. In future loops, the optimizer memory |
| 110 | +# will remain and be updated in-place. The memory for the gradients is then |
| 111 | +# freed accordingly at the end of every training loop when ``zero_grad`` is called. |
112 | 112 | # |
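In case you want to reproduce a snapshot like the one above on your own model, a rough sketch using PyTorch's CUDA memory-history utilities follows; the number of iterations and the output file name are arbitrary choices, and ``train(model, optimizer)`` stands for a training step function like the one defined earlier in this file:

    import torch

    # Begin recording allocator events, including stack traces for each allocation.
    torch.cuda.memory._record_memory_history(enabled="all")

    # Run a few iterations so the snapshot covers forward, backward, and optimizer steps.
    for _ in range(3):
        train(model, optimizer)

    # Dump the recorded history; drag the file into https://pytorch.org/memory_viz to view it.
    torch.cuda.memory._dump_snapshot("snapshot.pickle")

    # Stop recording when finished.
    torch.cuda.memory._record_memory_history(enabled=None)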
113 | 113 | # Where does the memory bottleneck occur in this training loop? In other words, |
114 | 114 | # where is the peak memory usage? |
115 | 115 | # |
116 | | -# The peak memory usage is during the optimizer step! As expected, the memory there |
117 | | -# consists of ~1.2GB of parameters, ~1.2GB of gradients, and ~2.4GB=2*1.2GB of |
118 | | -# optimizer state. The last ~1.2GB is memory the Adam optimizer needs for its intermediates, |
119 | | -# bringing the total to ~6GB. |
120 | | -# In fact, setting ``Adam(model.parameters(), foreach=False)`` lets you remove the last |
121 | | -# 1.2GB of optimizer intermediates memory, at the cost of trading runtime for |
122 | | -# memory. If this ``foreach`` optimization alone saves as much memory as you need, |
123 | | -# great, but keep reading this tutorial if you want to learn how to do even better! |
124 | | -# |
125 | | -# With the technique we are about to introduce, the ~1.2GB of **gradients memory** and |
126 | | -# **optimizer intermediates memory** are no longer needed, lowering peak memory usage. |
127 | | -# So, what would the new peak memory usage be? |
128 | | -# The answer will be revealed in the `next` snapshot. |
| 116 | +# The peak memory usage is during the optimizer step! Note the memory then |
| 117 | +# consists of ~1.2GB of parameters, ~1.2GB of gradients, and ~2.4GB=2*1.2GB of |
| 118 | +# the optimizer state as expected. The last ~1.2GB comes from Adam optimizer |
| 119 | +# requiring memory for intermediates, totaling to ~6GB of peak memory. |
| 120 | +# Technically, you can remove the need for the last 1.2GB for optimizer |
| 121 | +# intermediates if you set ``Adam(model.parameters(), foreach=False)`` which |
| 122 | +# would trade off runtime for memory. If switching off the ``foreach`` runtime |
| 123 | +# optimization is sufficient in memory savings for you, nice, but please |
| 124 | +# read on if you're curious how this tutorial can help you do better! |
| 125 | +# With the technique we will soon introduce, we will reduce peak memory by |
| 126 | +# removing the need for the ~1.2GB of **gradients memory** as well as **optimizer |
| 127 | +# intermediates memory**. Now, what would you expect the new peak memory to be? |
| 128 | +# The answer will be revealed in the `next` snapshot. |
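To make the ``foreach`` trade-off above concrete, here is a small sketch; the ``nn.Linear`` stand-in replaces the tutorial's actual model and its size is only there to mimic a large parameter count:

    import torch

    # Stand-in for a large model; substitute your own nn.Module here.
    model = torch.nn.Linear(4096, 4096, device="cuda")

    # On CUDA, Adam typically defaults to the foreach implementation, which
    # batches the update across parameters for speed but allocates extra
    # intermediate tensors (the last ~1.2GB in the snapshot discussed above
    # came from these intermediates for the tutorial's model).
    fast_optimizer = torch.optim.Adam(model.parameters())

    # foreach=False updates one parameter at a time: slower, but it avoids
    # the optimizer-intermediates memory.
    lean_optimizer = torch.optim.Adam(model.parameters(), foreach=False)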
129 | 129 | # |
130 | 130 | # A word of caution: this technique is **not** suitable for every case |
131 | 131 | # """"""""""""""""""""""""""""""""""""""""""""" |