1 | 1 | """
2 | 2 |
3 | | -How to save memory by fusing the optimizer step into the backward pass
| 3 | +Saving memory by fusing the optimizer step into the backward pass
4 | 4 | ======================================================================
5 | 5 |
6 | | -Hello there! This tutorial aims to showcase one way of reducing the
7 | | -memory footprint of a training loop by reducing the memory taken by
8 | | -the *gradients*. Say you have a model and you're interested in ways to
9 | | -optimize memory to avoid ``Out of Memory`` (OOM) errors or simply to ooze
10 | | -more out of your GPU. Well, you _might_ be in luck (if gradients take up
11 | | -a portion of your memory and you do not need to do gradient accumulation).
12 | | -We will explore the following:
13 | | -
14 | | -1. What takes up memory during your training or finetuning loop,
15 | | -2. How to capture and visualize memory snapshots to determine the bottleneck,
16 | | -3. The new ``Tensor.register_post_accumulate_grad_hook(hook)`` API, and finally,
17 | | -4. How everything fits together in 10 lines to achieve memory savings.
18 | | -
19 | | -To run this tutorial, you will need:
20 | | -
21 | | -* PyTorch 2.1.0 or newer with ``torchvision``
22 | | -* 1 CUDA GPU if you'd like to run the memory visualizations locally.
23 | | - Otherwise, this technique would benefit similarly on any device.
24 | | -
25 | | -Let us start by importing the required modules and models. We will use a
26 | | -vision transformer model from torchvision, but feel free to substitute
27 | | -with your own model. We will also use ``torch.optim.Adam`` as our optimizer,
28 | | -but, again, feel free to substitute with your own optimizer.
| 6 | +Hello there! This tutorial introduces one way of reducing the memory footprint
| 7 | +of a training loop by reducing the memory taken up by *gradients*.
| 8 | +If you have a model and want to avoid Out of Memory (OOM) errors,
| 9 | +or simply want to get the most out of your GPU, this approach may help
| 10 | +(provided gradients take up a portion of your memory and you do not need gradient accumulation).
| 11 | +
| 12 | +We will cover the following:
| 13 | +
| 14 | +1. What takes up memory during your training or finetuning loop,
| 15 | +2. How to capture and visualize memory snapshots to identify the bottleneck,
| 16 | +3. The new ``Tensor.register_post_accumulate_grad_hook(hook)`` API, and finally,
| 17 | +4. How to achieve the memory savings in just 10 lines of code.
| 18 | +
| 19 | +To run this tutorial, you will need:
| 20 | +
| 21 | +* PyTorch 2.1.0 or newer with ``torchvision``
| 22 | +* 1 CUDA GPU if you'd like to run the memory visualizations locally.
| 23 | +  Aside from the visualizations, this technique provides similar benefits on any device.
| 24 | +
| 25 | +Let us start by importing the required modules and models.
| 26 | +This example uses a vision transformer model from torchvision, but feel free to substitute your own model.
| 27 | +It also uses ``torch.optim.Adam`` as the optimizer but, again, feel free to substitute your own optimizer.
| 28 | +
29 | 29 |
30 | 30 | """
31 | 31 |
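
For orientation, here is a minimal sketch of the pattern the docstring previews:
per-parameter optimizer steps fused into the backward pass via
``Tensor.register_post_accumulate_grad_hook``. The specific model (``vit_l_16``),
the per-parameter ``torch.optim.Adam`` instances, and the ``optimizer_dict`` /
``optimizer_hook`` names are illustrative assumptions rather than the tutorial's
exact code::

    import torch
    from torchvision import models

    # Any model works; a torchvision vision transformer is assumed here.
    model = models.vit_l_16(weights=None)

    # One small optimizer per parameter, so each parameter can be updated on its
    # own as soon as its gradient has been accumulated.
    # foreach=False: with a single parameter there is nothing to batch.
    optimizer_dict = {
        p: torch.optim.Adam([p], foreach=False) for p in model.parameters()
    }

    def optimizer_hook(param) -> None:
        # Called right after ``param.grad`` is fully accumulated during backward:
        # apply this parameter's update and free its gradient immediately.
        optimizer_dict[param].step()
        optimizer_dict[param].zero_grad()

    for p in model.parameters():
        p.register_post_accumulate_grad_hook(optimizer_hook)

    # Training then only calls ``loss.backward()``; no separate
    # ``optimizer.step()`` is needed, since the steps run inside backward.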