
Commit e5ca9b0

Fix typos (huggingface#31819)
* fix typo
* fix typo
* fix typos
* fix typo
* fix typos
1 parent f471184 commit e5ca9b0

File tree: 5 files changed, +12 -12 lines changed

  docs/source/en/deepspeed.md
  docs/source/en/glossary.md
  docs/source/en/llm_tutorial_optimization.md
  docs/source/en/perf_hardware.md
  utils/diff_model_converter.py

docs/source/en/deepspeed.md (+6 -6)

@@ -16,11 +16,11 @@ rendered properly in your Markdown viewer.
 
 # DeepSpeed
 
-[DeepSpeed](https://www.deepspeed.ai/) is a PyTorch optimization library that makes distributed training memory-efficient and fast. At it's core is the [Zero Redundancy Optimizer (ZeRO)](https://hf.co/papers/1910.02054) which enables training large models at scale. ZeRO works in several stages:
+[DeepSpeed](https://www.deepspeed.ai/) is a PyTorch optimization library that makes distributed training memory-efficient and fast. At its core is the [Zero Redundancy Optimizer (ZeRO)](https://hf.co/papers/1910.02054) which enables training large models at scale. ZeRO works in several stages:
 
-* ZeRO-1, optimizer state partioning across GPUs
+* ZeRO-1, optimizer state partitioning across GPUs
 * ZeRO-2, gradient partitioning across GPUs
-* ZeRO-3, parameteter partitioning across GPUs
+* ZeRO-3, parameter partitioning across GPUs
 
 In GPU-limited environments, ZeRO also enables offloading optimizer memory and computation from the GPU to the CPU to fit and train really large models on a single GPU. DeepSpeed is integrated with the Transformers [`Trainer`] class for all ZeRO stages and offloading. All you need to do is provide a config file or you can use a provided template. For inference, Transformers support ZeRO-3 and offloading since it allows loading huge models.
 
@@ -159,7 +159,7 @@ There are three types of configuration parameters:
 
 You could also modify the DeepSpeed configuration and edit [`TrainingArguments`] from it:
 
-1. Create or load a DeepSpeed configuration to used as the main configuration
+1. Create or load a DeepSpeed configuration to use as the main configuration
 2. Create a [`TrainingArguments`] object based on these DeepSpeed configuration values
 
 Some values, such as `scheduler.params.total_num_steps` are calculated by the [`Trainer`] during training.
@@ -191,7 +191,7 @@ ZeRO-1 shards the optimizer states across GPUs, and you can expect a tiny speed
 </hfoption>
 <hfoption id="ZeRO-2">
 
-ZeRO-2 shards the optimizer and gradients across GPUs. This stage is primarily used for training since it's features are not relevant to inference. Some important parameters to configure for better performance include:
+ZeRO-2 shards the optimizer and gradients across GPUs. This stage is primarily used for training since its features are not relevant to inference. Some important parameters to configure for better performance include:
 
 * `offload_optimizer` should be enabled to reduce GPU memory usage.
 * `overlap_comm` when set to `true` trades off increased GPU memory usage to lower allreduce latency. This feature uses 4.5x the `allgather_bucket_size` and `reduce_bucket_size` values. In this example, they're set to `5e8` which means it requires 9GB of GPU memory. If your GPU memory is 8GB or less, you should reduce `overlap_comm` to lower the memory requirements and prevent an out-of-memory (OOM) error.
@@ -226,7 +226,7 @@ ZeRO-3 shards the optimizer, gradient, and parameters across GPUs. Unlike ZeRO-2
 * `pin_memory: true` can improve throughput, but less memory becomes available for other processes because the pinned memory is reserved for the specific process that requested it and it's typically accessed much faster than normal CPU memory.
 * `stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given time. Reduce this value if you encounter an OOM error.
 * `stage3_max_reuse_distance` is a value for determining when a parameter is used again in the future, and it helps decide whether to throw the parameter away or to keep it. If the parameter is going to be reused (if the value is less than `stage3_max_reuse_distance`), then it is kept to reduce communication overhead. This is super helpful when activation checkpointing is enabled and you want to keep the parameter in the forward recompute until the backward pass. But reduce this value if you encounter an OOM error.
-* `stage3_gather_16bit_weights_on_model_save` consolidates fp16 weights when a model is saved. For large models and multiple GPUs, this is an expensive in terms of memory and speed. You should enable it if you're planning on resuming training.
+* `stage3_gather_16bit_weights_on_model_save` consolidates fp16 weights when a model is saved. For large models and multiple GPUs, this is expensive in terms of memory and speed. You should enable it if you're planning on resuming training.
 * `sub_group_size` controls which parameters are updated during the optimizer step. Parameters are grouped into buckets of `sub_group_size` and each bucket is updated one at a time. When used with NVMe offload, `sub_group_size` determines when model states are moved in and out of CPU memory from during the optimization step. This prevents running out of CPU memory for extremely large models. `sub_group_size` can be left to its default value if you aren't using NVMe offload, but you may want to change it if you:
 
 1. Run into an OOM error during the optimizer step. In this case, reduce `sub_group_size` to reduce memory usage of the temporary buffers.
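
As context for the deepspeed.md changes above, here is a minimal sketch of the workflow the page describes: passing a ZeRO-2 configuration to [`TrainingArguments`] as a Python dict (a path to a JSON file also works). The bucket sizes and batch settings below are illustrative assumptions, not the documentation's recommended values.

```python
from transformers import TrainingArguments

# Hypothetical ZeRO-2 config; offload and bucket settings are example values.
# Running this requires the deepspeed and accelerate packages to be installed.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "allgather_bucket_size": 5e8,
        "reduce_bucket_size": 5e8,
    },
    # "auto" lets the Trainer fill these in from the TrainingArguments values.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=8,
    deepspeed=ds_config,  # or deepspeed="path/to/ds_config.json"
)
```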

docs/source/en/glossary.md (+2 -2)

@@ -139,7 +139,7 @@ reading the whole sentence with a mask to hide future tokens at a certain timest
 
 ### deep learning (DL)
 
-Machine learning algorithms which uses neural networks with several layers.
+Machine learning algorithms which use neural networks with several layers.
 
 ## E
 
@@ -519,4 +519,4 @@ A form of model training in which data provided to the model is not labeled. Uns
 Parallelism technique which performs sharding of the tensors somewhat similar to [TensorParallel](#tensor-parallelism-tp),
 except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need
 to be modified. This method also supports various offloading techniques to compensate for limited GPU memory.
-Learn more about ZeRO [here](perf_train_gpu_many#zero-data-parallelism).
+Learn more about ZeRO [here](perf_train_gpu_many#zero-data-parallelism).

docs/source/en/llm_tutorial_optimization.md (+2 -2)

@@ -147,7 +147,7 @@ Let's call it now for the next experiment.
 ```python
 flush()
 ```
-In the recent version of the accelerate library, you can also use an utility method called `release_memory()`
+In the recent version of the accelerate library, you can also use a utility method called `release_memory()`
 
 ```python
 from accelerate.utils import release_memory
@@ -683,7 +683,7 @@ Assistant: Germany has ca. 81 million inhabitants
 
 In this chat, the LLM runs auto-regressive decoding twice:
 1. The first time, the key-value cache is empty and the input prompt is `"User: How many people live in France?"` and the model auto-regressively generates the text `"Roughly 75 million people live in France"` while increasing the key-value cache at every decoding step.
-2. The second time the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of `"User: And how many in Germany?"`. While processing the shortened input prompt, it's computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`.
+2. The second time the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of `"User: And how many in Germany?"`. While processing the shortened input prompt, its computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`.
 
 Two things should be noted here:
 1. Keeping all the context is crucial for LLMs deployed in chat so that the LLM understands all the previous context of the conversation. E.g. for the example above the LLM needs to understand that the user refers to the population when asking `"And how many are in Germany"`.
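
As context for the llm_tutorial_optimization.md hunks above, a minimal sketch (not taken from the tutorial) of generating with the key-value cache enabled and then freeing GPU memory with accelerate's `release_memory()`; the checkpoint and prompt are placeholders.

```python
import torch
from accelerate.utils import release_memory
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "openai-community/gpt2"  # placeholder; the tutorial uses much larger models
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16).to("cuda")

prompt = "User: How many people live in France?\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# use_cache=True (the default) keeps the key-value vectors of tokens already processed,
# so each decoding step only computes attention for the newest token.
generated = model.generate(**inputs, max_new_tokens=30, use_cache=True)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

# Free memory before the next experiment.
release_memory(model)  # sets its arguments to None internally and empties the CUDA cache
del model              # drop the local reference as well so the weights can be collected
```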

docs/source/en/perf_hardware.md (+1 -1)

@@ -116,7 +116,7 @@ Each new generation provides a faster bandwidth, e.g. here is a quote from [Nvid
 
 So the higher `X` you get in the report of `NVX` in the output of `nvidia-smi topo -m` the better. The generation will depend on your GPU architecture.
 
-Let's compare the execution of a openai-community/gpt2 language model training over a small sample of wikitext.
+Let's compare the execution of an openai-community/gpt2 language model training over a small sample of wikitext.
 
 The results are:
 
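
As context for the perf_hardware.md hunk above, a minimal sketch of timing a short openai-community/gpt2 training run on a small wikitext sample with [`Trainer`]. This is an assumed setup, not the benchmark script behind the numbers on that page.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# Small sample: a slice of wikitext-2 with empty lines removed.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:2000]")
raw = raw.filter(lambda example: len(example["text"].strip()) > 0)
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="gpt2-wikitext-bench",
    per_device_train_batch_size=4,
    max_steps=100,
    report_to="none",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
result = trainer.train()
print(result.metrics)  # train_runtime and train_samples_per_second show the throughput difference
```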

utils/diff_model_converter.py (+1 -1)

@@ -497,7 +497,7 @@ def leave_ClassDef(self, original_node, updated_node):
                     start_insert_idx -= 1
                     self.new_body[dependency] = {"insert_idx": start_insert_idx, "node": node}
                 elif dependency not in self.inserted_deps:
-                    # make sure the node is written after it's dependencies
+                    # make sure the node is written after its dependencies
                     start_insert_idx = self.new_body[dependency]["insert_idx"] - 1
                     self.inserted_deps.append(dependency)
             if len(list_dependencies) > 0:
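
As context for the fixed comment in utils/diff_model_converter.py, a toy illustration of the rule it states: a definition is only written after its dependencies. The names and data structures here are hypothetical, not the converter's, and the dependency graph is assumed to be acyclic.

```python
def order_nodes(dependencies: dict[str, list[str]]) -> list[str]:
    """Return definition names so that each one appears after everything it depends on."""
    written: set[str] = set()
    order: list[str] = []

    def visit(name: str) -> None:
        # Emit dependencies first, then the definition itself.
        for dep in dependencies.get(name, []):
            if dep not in written:
                visit(dep)
        if name not in written:
            written.add(name)
            order.append(name)

    for name in dependencies:
        visit(name)
    return order

# Example: the model class depends on its config and attention classes.
print(order_nodes({"Model": ["Config", "Attention"], "Attention": ["Config"], "Config": []}))
# -> ['Config', 'Attention', 'Model']
```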
