
Commit 75317ae

MKhalusova and stevhliu authored
[docs] Performance docs tidy up, part 1 (#23963)
* first pass at the single gpu doc
* overview: improved clarity and navigation
* WIP
* updated intro and deepspeed sections
* improved torch.compile section
* more improvements
* minor improvements
* make style
* Apply suggestions from code review

  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* feedback addressed
* mdx -> md
* link fix
* feedback addressed

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent 54ba860 commit 75317ae

File tree

4 files changed: +606 -594 lines changed


docs/source/en/_toctree.yml

Lines changed: 33 additions & 27 deletions

```diff
@@ -111,36 +111,40 @@
 - sections:
   - local: performance
     title: Overview
-  - local: perf_train_gpu_one
-    title: Training on one GPU
-  - local: perf_train_gpu_many
-    title: Training on many GPUs
-  - local: perf_train_cpu
-    title: Training on CPU
-  - local: perf_train_cpu_many
-    title: Training on many CPUs
-  - local: perf_train_tpu
-    title: Training on TPUs
-  - local: perf_train_tpu_tf
-    title: Training on TPU with TensorFlow
-  - local: perf_train_special
-    title: Training on Specialized Hardware
-  - local: perf_infer_cpu
-    title: Inference on CPU
-  - local: perf_infer_gpu_one
-    title: Inference on one GPU
-  - local: perf_infer_gpu_many
-    title: Inference on many GPUs
-  - local: perf_infer_special
-    title: Inference on Specialized Hardware
-  - local: perf_hardware
-    title: Custom hardware for training
+  - sections:
+    - local: perf_train_gpu_one
+      title: Methods and tools for efficient training on a single GPU
+    - local: perf_train_gpu_many
+      title: Multiple GPUs and parallelism
+    - local: perf_train_cpu
+      title: Efficient training on CPU
+    - local: perf_train_cpu_many
+      title: Distributed CPU training
+    - local: perf_train_tpu
+      title: Training on TPUs
+    - local: perf_train_tpu_tf
+      title: Training on TPU with TensorFlow
+    - local: perf_train_special
+      title: Training on Specialized Hardware
+    - local: perf_hardware
+      title: Custom hardware for training
+    - local: hpo_train
+      title: Hyperparameter Search using Trainer API
+    title: Efficient training techniques
+  - sections:
+    - local: perf_infer_cpu
+      title: Inference on CPU
+    - local: perf_infer_gpu_one
+      title: Inference on one GPU
+    - local: perf_infer_gpu_many
+      title: Inference on many GPUs
+    - local: perf_infer_special
+      title: Inference on Specialized Hardware
+    title: Optimizing inference
   - local: big_models
     title: Instantiating a big model
   - local: debugging
-    title: Debugging
-  - local: hpo_train
-    title: Hyperparameter Search using Trainer API
+    title: Troubleshooting
   - local: tf_xla
     title: XLA Integration for TensorFlow Models
   title: Performance and scalability
@@ -182,6 +186,8 @@
     title: Perplexity of fixed-length models
   - local: pipeline_webserver
     title: Pipelines for webserver inference
+  - local: model_memory_anatomy
+    title: Model training anatomy
   title: Conceptual guides
 - sections:
   - sections:
```

docs/source/en/model_memory_anatomy.md

Lines changed: 272 additions & 0 deletions

<!---
Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Model training anatomy

To understand the performance optimization techniques you can apply to improve the speed and memory utilization of model training, it's helpful to get familiar with how the GPU is utilized during training and how compute intensity varies with the operation being performed.

Let's start by exploring a motivating example of GPU utilization during a model's training run. For the demonstration, we'll need to install a few libraries:

```bash
pip install transformers datasets accelerate nvidia-ml-py3
```

The `nvidia-ml-py3` library allows us to monitor the memory usage of the models from within Python. You might be familiar with the `nvidia-smi` command in the terminal - this library allows us to access the same information directly in Python.

Then, we create some dummy data: random token IDs between 100 and 30000 and binary labels for a classifier. In total, we get 512 sequences, each of length 512, and store them in a [`~datasets.Dataset`] with PyTorch format.

```py
>>> import numpy as np
>>> from datasets import Dataset


>>> seq_len, dataset_size = 512, 512
>>> dummy_data = {
...     "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
...     "labels": np.random.randint(0, 2, (dataset_size)),
... }
>>> ds = Dataset.from_dict(dummy_data)
>>> ds.set_format("pt")
```

To print summary statistics for the GPU utilization and the training run with the [`Trainer`], we define two helper functions:

```py
>>> from pynvml import *


>>> def print_gpu_utilization():
...     nvmlInit()
...     handle = nvmlDeviceGetHandleByIndex(0)
...     info = nvmlDeviceGetMemoryInfo(handle)
...     print(f"GPU memory occupied: {info.used//1024**2} MB.")


>>> def print_summary(result):
...     print(f"Time: {result.metrics['train_runtime']:.2f}")
...     print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
...     print_gpu_utilization()
```

Let's verify that we start with free GPU memory:

```py
>>> print_gpu_utilization()
GPU memory occupied: 0 MB.
```

That looks good: the GPU memory is not occupied, as we would expect before we load any models. If that's not the case on your machine, make sure to stop all processes that are using GPU memory. However, not all free GPU memory can be used by the user. When a model is loaded to the GPU, the kernels are also loaded, which can take up 1-2GB of memory. To see how much it is, we load a tiny tensor into the GPU, which triggers the kernels to be loaded as well.

```py
>>> import torch


>>> torch.ones((1, 1)).to("cuda")
>>> print_gpu_utilization()
GPU memory occupied: 1343 MB.
```

We see that the kernels alone take up 1.3GB of GPU memory. Now let's see how much space the model uses.

## Load Model

First, we load the `bert-large-uncased` model. We load the model weights directly to the GPU so that we can check how much space just the weights use.

```py
>>> from transformers import AutoModelForSequenceClassification


>>> model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to("cuda")
>>> print_gpu_utilization()
GPU memory occupied: 2631 MB.
```

We can see that the model weights alone take up 1.3 GB of GPU memory. The exact number depends on the specific GPU you are using. Note that on newer GPUs a model can sometimes take up more space since the weights are loaded in an optimized fashion that speeds up the usage of the model.
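
As a quick cross-check (a sketch added here, not part of the original walkthrough), the weights footprint can be estimated straight from the parameter count: roughly 4 bytes per parameter in fp32. `bert-large-uncased` has roughly 336M parameters, so this should report about 1.25 GB, which lines up with the jump from 1343 MB to 2631 MB above.

```py
>>> num_params = sum(p.numel() for p in model.parameters())
>>> print(f"{num_params / 1e6:.0f}M parameters, ~{num_params * 4 / 1024**3:.2f} GB in fp32")
```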

Now we can also quickly check if we get the same result as with the `nvidia-smi` CLI:

```bash
nvidia-smi
```

```bash
Tue Jan 11 08:58:05 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    39W / 300W |   2631MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3721      C   ...nvs/codeparrot/bin/python     2629MiB |
+-----------------------------------------------------------------------------+
```

We get the same number as before, and you can also see that we are using a V100 GPU with 16GB of memory. So now we can start training the model and see how the GPU memory consumption changes. First, we set up a few standard training arguments:

```py
default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}
```

<Tip>

If you plan to run multiple experiments, restart the Python kernel between experiments to properly clear the memory.

</Tip>

## Memory utilization at vanilla training

Let's use the [`Trainer`] and train the model without any GPU performance optimization techniques and with a batch size of 4:

```py
>>> from transformers import TrainingArguments, Trainer, logging

>>> logging.set_verbosity_error()


>>> training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
>>> trainer = Trainer(model=model, args=training_args, train_dataset=ds)
>>> result = trainer.train()
>>> print_summary(result)
```

```
Time: 57.82
Samples/second: 8.86
GPU memory occupied: 14949 MB.
```

We see that already a relatively small batch size almost fills up our GPU's entire memory. However, a larger batch size can often result in faster model convergence or better end performance. So ideally we want to tune the batch size to our model's needs and not to GPU limitations. Interestingly, we use much more memory than just the size of the model. To understand a bit better why this is the case, let's have a look at a model's operations and memory needs.

## Anatomy of Model's Operations

The Transformer architecture includes 3 main groups of operations, grouped below by compute intensity.

1. **Tensor Contractions**

    Linear layers and components of Multi-Head Attention all do batched **matrix-matrix multiplications**. These operations are the most compute-intensive part of training a transformer.

2. **Statistical Normalizations**

    Softmax and layer normalization are less compute-intensive than tensor contractions, and involve one or more **reduction operations**, the result of which is then applied via a map.

3. **Element-wise Operators**

    These are the remaining operators: **biases, dropout, activations, and residual connections**. These are the least compute-intensive operations.

This knowledge can be helpful when analyzing performance bottlenecks.

This summary is derived from [Data Movement Is All You Need: A Case Study on Optimizing Transformers 2020](https://arxiv.org/abs/2007.00072).
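
To illustrate the point above, here is a minimal sketch (not from the original page; the dummy `input_ids` and `labels` tensors are made up for the example) that profiles a single forward/backward pass of the model we loaded earlier with `torch.profiler`. On most GPUs, the batched matrix multiplications from the tensor contractions dominate the table.

```py
>>> from torch.profiler import profile, ProfilerActivity

>>> input_ids = torch.randint(100, 30000, (4, 512), device="cuda")
>>> labels = torch.randint(0, 2, (4,), device="cuda")

>>> with profile(activities=[ProfilerActivity.CUDA]) as prof:
...     loss = model(input_ids=input_ids, labels=labels).loss
...     loss.backward()

>>> print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```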

## Anatomy of Model's Memory

We've seen that training the model uses much more memory than just putting the model on the GPU. This is because there are many components during training that use GPU memory. The components that occupy GPU memory are the following:

1. model weights
2. optimizer states
3. gradients
4. forward activations saved for gradient computation
5. temporary buffers
6. functionality-specific memory

A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory. For inference there are no optimizer states and gradients, so we can subtract those. Thus we end up with 6 bytes per model parameter for mixed precision inference, plus activation memory.

Let's look at the details.

**Model Weights:**

- 4 bytes * number of parameters for fp32 training
- 6 bytes * number of parameters for mixed precision training (maintains a model in fp32 and one in fp16 in memory)

**Optimizer States:**

- 8 bytes * number of parameters for normal AdamW (maintains 2 states)
- 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state)

**Gradients:**

- 4 bytes * number of parameters for either fp32 or mixed precision training (gradients are always kept in fp32)
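
As a back-of-the-envelope check (added as an illustration, reusing the `model` loaded earlier), these per-parameter costs add up to the 18 bytes per parameter quoted above for mixed precision training with AdamW:

```py
>>> num_params = sum(p.numel() for p in model.parameters())
>>> bytes_per_param = 6 + 8 + 4  # mixed precision weights + AdamW states + fp32 gradients
>>> print(f"~{num_params * bytes_per_param / 1024**3:.1f} GB for weights, optimizer states and gradients")
```

For the roughly 336M parameters of `bert-large-uncased` this comes to about 5.6 GB; the remainder of the 14949 MB we measured during training is mostly forward activations, temporary buffers, and the previously loaded CUDA kernels.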

**Forward Activations:**

- size depends on many factors, the key ones being sequence length, hidden size and batch size.

There are the inputs and outputs that are being passed and returned by the forward and the backward functions, as well as the forward activations saved for gradient computation.

**Temporary Memory:**

Additionally, there are all kinds of temporary variables which get released once the calculation is done, but in the moment they can require additional memory and push you to OOM. Therefore, when coding it's crucial to think strategically about such temporary variables and sometimes to explicitly free those as soon as they are no longer needed.
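
For example, here is a hypothetical sketch (the tensor names and shapes are made up for illustration) of freeing a large intermediate as soon as only a small reduction of it is still needed:

```py
>>> batch = torch.randint(100, 30000, (64, 512), device="cuda")
>>> with torch.no_grad():
...     hidden = model.bert(input_ids=batch).last_hidden_state  # 64 x 512 x 1024 fp32 values, ~128 MB
>>> pooled = hidden.mean(dim=1)  # keep only the small reduction we actually need
>>> del hidden  # drop the reference so the allocator can reuse that memory
>>> torch.cuda.empty_cache()  # optionally release cached blocks as well
```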

**Functionality-specific memory:**

Then, your software could have special memory needs. For example, when generating text using beam search, the software needs to maintain multiple copies of inputs and outputs.
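
As a rough illustration (this sketch is not from the original page, uses `gpt2` purely as a small example model, and is best run in a fresh session given the memory already in use above), you can watch the peak memory reported by PyTorch grow with the number of beams:

```py
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> lm = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
>>> prompt = tokenizer("Beam search keeps several candidate sequences alive at once", return_tensors="pt").to("cuda")

>>> for num_beams in (1, 4, 8):
...     torch.cuda.reset_peak_memory_stats()
...     _ = lm.generate(**prompt, num_beams=num_beams, max_new_tokens=64)
...     print(f"num_beams={num_beams}: peak {torch.cuda.max_memory_allocated() // 1024**2} MB")
```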

**`forward` vs `backward` Execution Speed:**

For convolutions and linear layers there are 2x flops in the backward compared to the forward, which generally translates into roughly 2x slower execution (sometimes more, because sizes in the backward tend to be more awkward). Activations are usually bandwidth-limited, and it's typical for an activation to have to read more data in the backward than in the forward (e.g. the activation forward reads once and writes once, while the activation backward reads twice, gradOutput and the output of the forward, and writes once, gradInput).
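
A small timing sketch (an illustration added here, not from the original text; absolute numbers depend on the GPU) makes the forward/backward asymmetry visible for a single large linear layer:

```py
>>> layer = torch.nn.Linear(4096, 4096).to("cuda")
>>> x = torch.randn(4096, 4096, device="cuda", requires_grad=True)
>>> out = layer(x)
>>> grad_out = torch.ones_like(out)


>>> def time_cuda(fn, iters=10):
...     torch.cuda.synchronize()
...     start = torch.cuda.Event(enable_timing=True)
...     end = torch.cuda.Event(enable_timing=True)
...     start.record()
...     for _ in range(iters):
...         fn()
...     end.record()
...     torch.cuda.synchronize()
...     return start.elapsed_time(end) / iters


>>> print(f"forward:  {time_cuda(lambda: layer(x)):.2f} ms")
>>> print(f"backward: {time_cuda(lambda: out.backward(grad_out, retain_graph=True)):.2f} ms")
```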

As you can see, there are potentially a few places where we could save GPU memory or speed up operations. Now that you understand what affects GPU utilization and computation speed, refer to the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one) documentation page to learn about performance optimization techniques.
