docs/source/en/deepspeed.md (+6, -6)
@@ -16,11 +16,11 @@ rendered properly in your Markdown viewer.
# DeepSpeed

-[DeepSpeed](https://www.deepspeed.ai/) is a PyTorch optimization library that makes distributed training memory-efficient and fast. At it's core is the [Zero Redundancy Optimizer (ZeRO)](https://hf.co/papers/1910.02054) which enables training large models at scale. ZeRO works in several stages:
+[DeepSpeed](https://www.deepspeed.ai/) is a PyTorch optimization library that makes distributed training memory-efficient and fast. At its core is the [Zero Redundancy Optimizer (ZeRO)](https://hf.co/papers/1910.02054) which enables training large models at scale. ZeRO works in several stages:

-* ZeRO-1, optimizer state partioning across GPUs
+* ZeRO-1, optimizer state partitioning across GPUs
* ZeRO-2, gradient partitioning across GPUs
-* ZeRO-3, parameteter partitioning across GPUs
+* ZeRO-3, parameter partitioning across GPUs

In GPU-limited environments, ZeRO also enables offloading optimizer memory and computation from the GPU to the CPU to fit and train really large models on a single GPU. DeepSpeed is integrated with the Transformers [`Trainer`] class for all ZeRO stages and offloading. All you need to do is provide a config file or you can use a provided template. For inference, Transformers support ZeRO-3 and offloading since it allows loading huge models.
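The "provide a config file" workflow mentioned in the paragraph above can be pictured with a minimal sketch; the config path and values below are placeholders, not taken from this diff:

```python
# Minimal sketch: point TrainingArguments at a DeepSpeed config file so Trainer
# handles the ZeRO setup. "ds_config_zero3.json" is an assumed, user-provided path.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    deepspeed="ds_config_zero3.json",  # DeepSpeed config file (or a Python dict)
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
)
# training_args is then passed to Trainer(model=..., args=training_args, ...) as usual.
```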
@@ -159,7 +159,7 @@ There are three types of configuration parameters:
You could also modify the DeepSpeed configuration and edit [`TrainingArguments`] from it:

-1. Create or load a DeepSpeed configuration to used as the main configuration
+1. Create or load a DeepSpeed configuration to use as the main configuration
2. Create a [`TrainingArguments`] object based on these DeepSpeed configuration values

Some values, such as `scheduler.params.total_num_steps` are calculated by the [`Trainer`] during training.
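The two steps above can be sketched roughly as follows; the file name `ds_config.json` and the assumption that `train_micro_batch_size_per_gpu` holds a concrete number (rather than `"auto"`) are illustrative:

```python
# Sketch: load the DeepSpeed config as the main configuration (step 1), then build
# TrainingArguments so the shared values stay in sync (step 2).
import json

from transformers import TrainingArguments

with open("ds_config.json") as f:
    ds_config = json.load(f)

training_args = TrainingArguments(
    output_dir="output",
    deepspeed=ds_config,  # pass the loaded dict as the main configuration
    per_device_train_batch_size=ds_config["train_micro_batch_size_per_gpu"],
)
```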
@@ -191,7 +191,7 @@ ZeRO-1 shards the optimizer states across GPUs, and you can expect a tiny speed
</hfoption>
<hfoption id="ZeRO-2">

-ZeRO-2 shards the optimizer and gradients across GPUs. This stage is primarily used for training since it's features are not relevant to inference. Some important parameters to configure for better performance include:
+ZeRO-2 shards the optimizer and gradients across GPUs. This stage is primarily used for training since its features are not relevant to inference. Some important parameters to configure for better performance include:

* `offload_optimizer` should be enabled to reduce GPU memory usage.
* `overlap_comm` when set to `true` trades off increased GPU memory usage to lower allreduce latency. This feature uses 4.5x the `allgather_bucket_size` and `reduce_bucket_size` values. In this example, they're set to `5e8` which means it requires 9GB of GPU memory. If your GPU memory is 8GB or less, you should reduce `overlap_comm` to lower the memory requirements and prevent an out-of-memory (OOM) error.
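The ZeRO-2 options discussed in this hunk can be collected into a config fragment; the sketch below (a Python dict standing in for the JSON file) reuses the `5e8` bucket sizes quoted above, and the remaining values are common starting points rather than an official template:

```python
# Illustrative ZeRO-2 fragment; only the bucket sizes come from the text above.
zero2_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "allgather_bucket_size": 5e8,
        "reduce_bucket_size": 5e8,
    }
}
```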
@@ -226,7 +226,7 @@ ZeRO-3 shards the optimizer, gradient, and parameters across GPUs. Unlike ZeRO-2
* `pin_memory: true` can improve throughput, but less memory becomes available for other processes because the pinned memory is reserved for the specific process that requested it and it's typically accessed much faster than normal CPU memory.
* `stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given time. Reduce this value if you encounter an OOM error.
* `stage3_max_reuse_distance` is a value for determining when a parameter is used again in the future, and it helps decide whether to throw the parameter away or to keep it. If the parameter is going to be reused (if the value is less than `stage3_max_reuse_distance`), then it is kept to reduce communication overhead. This is super helpful when activation checkpointing is enabled and you want to keep the parameter in the forward recompute until the backward pass. But reduce this value if you encounter an OOM error.
-* `stage3_gather_16bit_weights_on_model_save` consolidates fp16 weights when a model is saved. For large models and multiple GPUs, this is an expensive in terms of memory and speed. You should enable it if you're planning on resuming training.
+* `stage3_gather_16bit_weights_on_model_save` consolidates fp16 weights when a model is saved. For large models and multiple GPUs, this is expensive in terms of memory and speed. You should enable it if you're planning on resuming training.
* `sub_group_size` controls which parameters are updated during the optimizer step. Parameters are grouped into buckets of `sub_group_size` and each bucket is updated one at a time. When used with NVMe offload, `sub_group_size` determines when model states are moved in and out of CPU memory during the optimization step. This prevents running out of CPU memory for extremely large models. `sub_group_size` can be left to its default value if you aren't using NVMe offload, but you may want to change it if you:

1. Run into an OOM error during the optimizer step. In this case, reduce `sub_group_size` to reduce memory usage of the temporary buffers.
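Similarly, the ZeRO-3 knobs covered in this hunk map onto config keys roughly as follows; the numeric values are illustrative defaults, not settings taken from the diff:

```python
# Illustrative ZeRO-3 fragment collecting the parameters discussed above.
zero3_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_max_live_parameters": 1e9,   # lower this if you hit OOM errors
        "stage3_max_reuse_distance": 1e9,    # lower this if you hit OOM errors
        "stage3_gather_16bit_weights_on_model_save": True,
        "sub_group_size": 1e9,               # mainly matters with NVMe offload
    }
}
```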
docs/source/en/llm_tutorial_optimization.md (+2, -2)
@@ -147,7 +147,7 @@ Let's call it now for the next experiment.
```python
flush()
```
-In the recent version of the accelerate library, you can also use an utility method called `release_memory()`
+In the recent version of the accelerate library, you can also use a utility method called `release_memory()`

```python
from accelerate.utils import release_memory
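# Rough usage sketch, not part of the file being diffed: release_memory drops the
# references passed to it, runs garbage collection, and empties the CUDA cache,
# replacing manual del/empty_cache calls. The tensors below are placeholders for
# large models or activations, and release_memory is the utility imported above.
import torch

a = torch.ones(1024, 1024)   # stand-ins for objects you are done with
b = torch.ones(1024, 1024)
a, b = release_memory(a, b)  # references dropped and the CUDA cache emptied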
@@ -683,7 +683,7 @@ Assistant: Germany has ca. 81 million inhabitants
In this chat, the LLM runs auto-regressive decoding twice:
1. The first time, the key-value cache is empty and the input prompt is `"User: How many people live in France?"` and the model auto-regressively generates the text `"Roughly 75 million people live in France"` while increasing the key-value cache at every decoding step.
-2. The second time the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of `"User: And how many in Germany?"`. While processing the shortened input prompt, it's computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`.
+2. The second time the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of `"User: And how many in Germany?"`. While processing the shortened input prompt, its computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`.

Two things should be noted here:
1. Keeping all the context is crucial for LLMs deployed in chat so that the LLM understands all the previous context of the conversation. E.g. for the example above the LLM needs to understand that the user refers to the population when asking `"And how many are in Germany"`.
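A purely conceptual sketch of the cache reuse described in step 2 above, using plain tensors rather than a real model (all shapes are made up):

```python
# Conceptual sketch: key/value projections from earlier turns stay in the cache and
# only the new tokens' projections are appended, so old tokens are never re-encoded.
import torch

head_dim = 8
past_keys = torch.randn(10, head_dim)    # 10 tokens cached from the first turn
past_values = torch.randn(10, head_dim)

new_keys = torch.randn(4, head_dim)      # 4 new tokens: "And how many in Germany?"
new_values = torch.randn(4, head_dim)

keys = torch.cat([past_keys, new_keys])        # cache now covers all 14 tokens
values = torch.cat([past_values, new_values])
print(keys.shape)                              # torch.Size([14, 8])
```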
docs/source/en/perf_hardware.md (+1, -1)
@@ -116,7 +116,7 @@ Each new generation provides a faster bandwidth, e.g. here is a quote from [Nvid
So the higher `X` you get in the report of `NVX` in the output of `nvidia-smi topo -m` the better. The generation will depend on your GPU architecture.

-Let's compare the execution of a openai-community/gpt2 language model training over a small sample of wikitext.
+Let's compare the execution of an openai-community/gpt2 language model training over a small sample of wikitext.
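The `nvidia-smi topo -m` check mentioned in this hunk can also be scripted; a small sketch, assuming `nvidia-smi` is on the PATH:

```python
# Print the GPU interconnect matrix; NV1/NV2/... entries indicate NVLink,
# while PHB/NODE/SYS indicate slower PCIe or cross-node paths.
import subprocess

result = subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True)
print(result.stdout)
```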