|
1 | 1 | """ |
2 | 2 |
|
3 | | -Saving memory by fusing the optimizer step into the backpropagation process |
| 3 | +How to save memory by fusing the optimizer step into the backward pass |
4 | 4 | ====================================================================== |
5 | 5 |
|
6 | | -Hello! This tutorial introduces one way to reduce memory usage in the |
7 | | -training loop by reducing the memory taken up by the *gradients*. This |
8 | | -can help if you want to avoid Out of Memory (OOM) errors when you have |
9 | | -a model, or if you want to make the most of your GPU's performance. |
10 | | -(That is, provided the gradients take up a portion of your memory and |
11 | | -you do not need gradient accumulation.) |
12 | | -This tutorial covers the following: |
13 | | -
|
14 | | -1. The things that take up memory during the training or finetuning steps, |
15 | | -2. How to capture and visualize memory snapshots to identify bottlenecks, |
16 | | -3. The new ``Tensor.register_post_accumulate_grad_hook(hook)`` API, and |
17 | | -4. How to save memory with just 10 lines of code that bring all of this together. |
18 | | -
|
19 | | -What you need to run this tutorial: |
20 | | -
|
21 | | -* PyTorch 2.1.0 or later and ``torchvision`` |
22 | | -* 1 CUDA GPU, if you want to run the memory visualizations locally. |
23 | | -  Apart from the memory visualizations, this technique provides similar benefits on any device. |
24 | | -
|
25 | | -First, let's import the required modules and the model. Here we use a |
26 | | -vision transformer model from torchvision, but feel free to substitute |
27 | | -it with your own model. We will use ``torch.optim.Adam`` as the optimizer, |
28 | | -but, likewise, feel free to substitute it with your own optimizer. |
| 6 | +Hello there! This tutorial aims to showcase one way of reducing the |
| 7 | +memory footprint of a training loop by reducing the memory taken by |
| 8 | +the *gradients*. Say you have a model and you're interested in ways to |
| 9 | +optimize memory to avoid ``Out of Memory`` (OOM) errors or simply to ooze |
| 10 | +more performance out of your GPU. Well, you _might_ be in luck (if gradients take up |
| 11 | +a portion of your memory and you do not need to do gradient accumulation). |
| 12 | +We will explore the following: |
| 13 | +
|
| 14 | +1. What takes up memory during your training or finetuning loop, |
| 15 | +2. How to capture and visualize memory snapshots to determine the bottleneck (see the sketch right after this list), |
| 16 | +3. The new ``Tensor.register_post_accumulate_grad_hook(hook)`` API (sketched briefly after the requirements list below), and finally, |
| 17 | +4. How everything fits together in 10 lines to achieve memory savings. |
| 18 | +
|
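As item 2 of the list above mentions, the tutorial leans on capturing CUDA memory snapshots. Here is a minimal sketch of how such a capture can look with the ``torch.cuda.memory`` recording helpers; the small model, the three iterations, and the ``snapshot.pickle`` file name are arbitrary choices for illustration, and the exact calls used later in the tutorial may differ::

    import torch
    from torchvision import models

    # Any CUDA workload will do; a small model keeps the sketch quick.
    model = models.resnet18().cuda()
    inputs = torch.randn(8, 3, 224, 224, device="cuda")

    # Start recording allocator history (requires a CUDA build of PyTorch).
    torch.cuda.memory._record_memory_history(enabled="all")

    # Run a few iterations so allocations and frees show up in the snapshot.
    for _ in range(3):
        model(inputs).sum().backward()

    # Dump the snapshot to disk, then stop recording. The resulting file can be
    # dragged onto https://pytorch.org/memory_viz for visualization.
    torch.cuda.memory._dump_snapshot("snapshot.pickle")
    torch.cuda.memory._record_memory_history(enabled=None)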
| 19 | +To run this tutorial, you will need: |
| 20 | +
|
| 21 | +* PyTorch 2.1.0 or newer with ``torchvision`` |
| 22 | +* 1 CUDA GPU if you'd like to run the memory visualizations locally. |
| 23 | + Otherwise, this technique would benefit similarly on any device. |
| 24 | +
|
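Before going further, here is a condensed sketch of the idea behind items 3 and 4 of the list above: give every parameter its own small optimizer and drive it from a post-accumulate-grad hook, so each gradient is applied and released as soon as it has finished accumulating. The toy model and the ``optimizer_dict``/``optimizer_hook`` names are illustrative only; the tutorial develops the full version step by step later::

    import torch
    from torch import nn

    # A toy model stands in for any nn.Module.
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

    # One single-parameter Adam per parameter; foreach=False because each
    # optimizer only ever sees one tensor.
    optimizer_dict = {p: torch.optim.Adam([p], foreach=False) for p in model.parameters()}

    def optimizer_hook(parameter) -> None:
        # Runs once this parameter's gradient has finished accumulating:
        # apply the update, then drop the gradient so its memory can be reused.
        optimizer_dict[parameter].step()
        optimizer_dict[parameter].zero_grad()

    for p in model.parameters():
        p.register_post_accumulate_grad_hook(optimizer_hook)

    # With the hooks in place, a training iteration only calls loss.backward();
    # there is no separate optimizer.step() or optimizer.zero_grad() afterwards.
    loss = model(torch.randn(4, 512)).sum()
    loss.backward()

Note that this is also why the technique assumes you do not need gradient accumulation: each gradient is consumed and cleared immediately, so it can no longer be accumulated across micro-batches.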
| 25 | +Let us start by importing the required modules and models. We will use a |
| 26 | +vision transformer model from torchvision, but feel free to substitute |
| 27 | +with your own model. We will also use ``torch.optim.Adam`` as our optimizer, |
| 28 | +but, again, feel free to substitute with your own optimizer. |
29 | 29 |
|
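A rough sketch of what that setup could look like is below; the particular ``vit_l_16`` builder, the pretrained weights, and the move to CUDA are example choices rather than requirements::

    import torch
    from torchvision import models

    # A reasonably large vision transformer makes the memory effects easy to see.
    model = models.vit_l_16(weights="DEFAULT").cuda()
    optimizer = torch.optim.Adam(model.parameters())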
30 | 30 | """ |
31 | 31 |
|