<!---
Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Model training anatomy

To understand the performance optimization techniques that one can apply to improve the speed and memory utilization
of model training, it's helpful to get familiar with how the GPU is utilized during training and how compute
intensity varies depending on the operation performed.

Let's start by exploring a motivating example of GPU utilization and the training run of a model. For the demonstration,
we'll need to install a few libraries:

```bash
pip install transformers datasets accelerate nvidia-ml-py3
```

The `nvidia-ml-py3` library allows us to monitor the memory usage of the models from within Python. You might be familiar
with the `nvidia-smi` command in the terminal - this library allows us to access the same information directly in Python.

Then, we create some dummy data: random token IDs between 100 and 30000 and binary labels for a classifier.
In total, we get 512 sequences each with length 512 and store them in a [`~datasets.Dataset`] with PyTorch format.

```py
>>> import numpy as np
>>> from datasets import Dataset


>>> seq_len, dataset_size = 512, 512
>>> dummy_data = {
...     "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
...     "labels": np.random.randint(0, 2, (dataset_size,)),
... }
>>> ds = Dataset.from_dict(dummy_data)
>>> ds.set_format("pt")
```

To print summary statistics for the GPU utilization and the training run with the [`Trainer`], we define two helper functions:

```py
>>> from pynvml import *


>>> def print_gpu_utilization():
...     nvmlInit()
...     handle = nvmlDeviceGetHandleByIndex(0)
...     info = nvmlDeviceGetMemoryInfo(handle)
...     print(f"GPU memory occupied: {info.used//1024**2} MB.")


>>> def print_summary(result):
...     print(f"Time: {result.metrics['train_runtime']:.2f}")
...     print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
...     print_gpu_utilization()
```

Let's verify that we start with free GPU memory:

```py
>>> print_gpu_utilization()
GPU memory occupied: 0 MB.
```

That looks good: the GPU memory is not occupied, as we would expect before loading any models. If that's not the case on
your machine, make sure to stop all processes that are using GPU memory. However, not all free GPU memory can be used by
the user. When a model is loaded to the GPU, the kernels are also loaded, which can take up 1-2GB of memory. To see how
much that is, we load a tiny tensor into the GPU, which triggers the kernels to be loaded as well.

```py
>>> import torch


>>> torch.ones((1, 1)).to("cuda")
>>> print_gpu_utilization()
GPU memory occupied: 1343 MB.
```

We see that the kernels alone take up 1.3GB of GPU memory. Now let's see how much space the model uses.

## Load Model

First, we load the `bert-large-uncased` model. We load the model weights directly to the GPU so that we can check
how much space just the weights use.


```py
>>> from transformers import AutoModelForSequenceClassification


>>> model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to("cuda")
>>> print_gpu_utilization()
GPU memory occupied: 2631 MB.
```

We can see that the model weights alone take up 1.3 GB of GPU memory. The exact number depends on the specific
GPU you are using. Note that on newer GPUs a model can sometimes take up more space since the weights are loaded in an
optimized fashion that speeds up the usage of the model.
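
As a quick cross-check (a small sketch, not part of the measurements above), the expected fp32 footprint can be derived from the parameter count and should come out to roughly the same 1.3 GB:

```py
# 4 bytes per parameter for fp32 weights
print(f"Expected fp32 weights: {model.num_parameters() * 4 / 1024**2:.0f} MB")
```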

Now we can also quickly check whether we get the same result as with the `nvidia-smi` CLI:

```bash
nvidia-smi
```

```bash
Tue Jan 11 08:58:05 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    39W / 300W |   2631MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3721      C   ...nvs/codeparrot/bin/python     2629MiB |
+-----------------------------------------------------------------------------+
```

We get the same number as before, and you can also see that we are using a V100 GPU with 16GB of memory. So now we can
start training the model and see how the GPU memory consumption changes. First, we set up a few standard training
arguments:

```py
default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}
```

<Tip>

If you plan to run multiple experiments, restart the Python kernel between experiments to properly clear the memory.

</Tip>
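
If a full restart is not convenient, a rough in-process fallback (a sketch only, and generally less reliable than restarting the kernel) is to delete the large objects and ask PyTorch to release its cached GPU memory:

```py
import gc

import torch

# After `del`-ing the model, trainer, and any other large CUDA objects:
gc.collect()
torch.cuda.empty_cache()  # return PyTorch's cached memory blocks to the GPU driver
```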

## Memory utilization at vanilla training

Let's use the [`Trainer`] and train the model without using any GPU performance optimization techniques and with a batch size of 4:

```py
>>> from transformers import TrainingArguments, Trainer, logging

>>> logging.set_verbosity_error()


>>> training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
>>> trainer = Trainer(model=model, args=training_args, train_dataset=ds)
>>> result = trainer.train()
>>> print_summary(result)
```

```
Time: 57.82
Samples/second: 8.86
GPU memory occupied: 14949 MB.
```

We see that already a relatively small batch size almost fills up our GPU's entire memory. However, a larger batch size
can often result in faster model convergence or better end performance. So ideally we want to tune the batch size to our
model's needs and not to the GPU limitations. What's interesting is that we use much more memory than the size of the model.
To understand a bit better why this is the case, let's have a look at a model's operations and memory needs.

## Anatomy of Model's Operations

The Transformer architecture includes 3 main groups of operations, grouped below by compute intensity.

1. **Tensor Contractions**

    Linear layers and components of Multi-Head Attention all do batched **matrix-matrix multiplications**. These operations are the most compute-intensive part of training a transformer.

2. **Statistical Normalizations**

    Softmax and layer normalization are less compute-intensive than tensor contractions, and involve one or more **reduction operations**, the result of which is then applied via a map.

3. **Element-wise Operators**

    These are the remaining operators: **biases, dropout, activations, and residual connections**. These are the least compute-intensive operations.

This knowledge can be helpful when analyzing performance bottlenecks.

This summary is derived from [Data Movement Is All You Need: A Case Study on Optimizing Transformers 2020](https://arxiv.org/abs/2007.00072).
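
To make these three groups concrete, here is a minimal PyTorch sketch (a toy feed-forward step with made-up dimensions, not code from the library) with each line labeled by its group:

```py
import torch
from torch import nn

hidden_states = torch.randn(4, 512, 1024)  # (batch_size, seq_len, hidden_size) - arbitrary example sizes
linear = nn.Linear(1024, 1024)
layer_norm = nn.LayerNorm(1024)
dropout = nn.Dropout(0.1)

projected = linear(hidden_states)                          # 1. tensor contraction: a batched matrix-matrix multiplication
normalized = layer_norm(projected)                         # 2. statistical normalization: reductions (mean/variance) followed by a map
output = dropout(torch.relu(normalized)) + hidden_states   # 3. element-wise: activation, dropout, and the residual connection
```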


## Anatomy of Model's Memory

We've seen that training the model uses much more memory than just putting the model on the GPU. This is because there
are many components during training that use GPU memory. The components in GPU memory are the following:

1. model weights
2. optimizer states
3. gradients
4. forward activations saved for gradient computation
5. temporary buffers
6. functionality-specific memory

A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory. For
inference there are no optimizer states and gradients, so we can subtract those. Thus we end up with 6 bytes per
model parameter for mixed precision inference, plus activation memory.

Let's look at the details.

**Model Weights:**

- 4 bytes * number of parameters for fp32 training
- 6 bytes * number of parameters for mixed precision training (maintains a model in fp32 and one in fp16 in memory)

**Optimizer States:**

- 8 bytes * number of parameters for normal AdamW (maintains 2 states)
- 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state)

**Gradients**

- 4 bytes * number of parameters for either fp32 or mixed precision training (gradients are always kept in fp32)

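Putting the weight, optimizer state, and gradient numbers together recovers the 18 bytes per parameter quoted above. As a rough sanity check against our earlier measurements (a back-of-the-envelope sketch that assumes roughly 340 million parameters for `bert-large-uncased`):

```py
# Rough estimate only; the parameter count is approximate.
params = 340_000_000                # ~bert-large-uncased
bytes_per_param = 6 + 8 + 4         # mixed precision weights + AdamW states + fp32 gradients = 18
print(f"{params * bytes_per_param / 1024**3:.1f} GB")  # about 5.7 GB, before activations and buffers
```

The gap between this figure and the ~15 GB we measured during training is largely made up of the CUDA kernels, the forward activations, and the temporary buffers described below.
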
**Forward Activations**

- size depends on many factors, the key ones being sequence length, hidden size and batch size.

There are the inputs and outputs being passed and returned by the forward and the backward functions, as well as the
forward activations saved for gradient computation.

**Temporary Memory**

Additionally, there are all kinds of temporary variables which get released once the calculation is done, but in the
moment these could require additional memory and could push you to OOM. Therefore, when coding it's crucial to think
strategically about such temporary variables and sometimes to explicitly free them as soon as they are no longer needed.

**Functionality-specific memory**

Then, your software could have special memory needs. For example, when generating text using beam search, the software
needs to maintain multiple copies of inputs and outputs.

**`forward` vs `backward` Execution Speed**

For convolutions and linear layers there are 2x flops in the backward compared to the forward, which generally translates
into a ~2x slower backward pass (sometimes more, because sizes in the backward tend to be more awkward). Activations are usually
bandwidth-limited, and it's typical for an activation to have to read more data in the backward than in the forward
(e.g. the activation forward reads once and writes once, while the activation backward reads twice, gradOutput and the output
of the forward, and writes once, gradInput).
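
As a rough illustration, here is a small, hypothetical micro-benchmark (not part of the walkthrough above) that compares forward and backward time for a single linear layer. It needs a CUDA GPU and skips warm-up iterations, so treat the numbers as indicative only:

```py
import torch

linear = torch.nn.Linear(4096, 4096).to("cuda")
x = torch.randn(64, 4096, device="cuda", requires_grad=True)


def timed(fn):
    """Run fn once and return (result, elapsed milliseconds) using CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end)


loss, forward_ms = timed(lambda: linear(x).sum())
_, backward_ms = timed(lambda: loss.backward())
print(f"forward: {forward_ms:.2f} ms, backward: {backward_ms:.2f} ms")
```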

As you can see, there are potentially a few places where we could save GPU memory or speed up operations.
Now that you understand what affects GPU utilization and computation speed, refer to
the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one) documentation page to learn about
performance optimization techniques.