merge changes from deepspeed master #24

Merged
merged 31 commits on Apr 6, 2021
Changes from 1 commit
Commits
31 commits
18a26f3
[WarmupDecayLR] fix log(0) & 1/log(1) bugs (#772)
stas00 Mar 12, 2021
35fd7cc
bump to v0.3.12
jeffra Mar 12, 2021
458ff02
Bug fix: Remove client optimizer param_group list item that does not …
cli99 Mar 12, 2021
73d762c
[doc] pipeline doc typos/improvements (#659)
stas00 Mar 14, 2021
4601885
Samyamr/inference hook fix (#851)
samyam Mar 15, 2021
a75d971
ZeRO Stage 2: Clear reduced gradients (#856)
tjruwase Mar 15, 2021
24335d4
[runner/launch] propagate the error (#854)
stas00 Mar 16, 2021
547d1c5
docs: minor spelling tweaks (#858)
brettkoonce Mar 16, 2021
871f304
Allow args to be optional in deepspeed.initialize (#825)
jeffra Mar 16, 2021
fa87a73
Fix ZeRO3 save_checkpoint (#857)
tjruwase Mar 16, 2021
7bcd72a
Make config objects json serializable (#862)
tjruwase Mar 16, 2021
12a53b4
bump version 0.3.13
jeffra Mar 16, 2021
68c8481
1-bit Adam v2 (#817)
conglongli Mar 16, 2021
10c0bea
consistent checkpoint filenaming (#865)
stas00 Mar 18, 2021
9e9f8cb
[doc] launcher (#868)
stas00 Mar 18, 2021
22d5a1f
[doc] pipeline (#888)
stas00 Mar 24, 2021
7f03282
[debug utils] see_memory_usage fixes (#890)
stas00 Mar 25, 2021
7531c6b
full fp32 weights reconstruction for zero 2+3 (#892)
stas00 Mar 26, 2021
39013dd
save_fp16_model consolidated for zero3 (#893)
stas00 Mar 27, 2021
7fcc891
Fix zero stage2 cpu_offload when some model trainable parameters skip…
ghosthamlet Mar 27, 2021
af2d8fc
update kramdown (#901)
jeffra Mar 30, 2021
23ff6cb
update backward api doc (#903)
jeffra Mar 30, 2021
c042264
Bump kramdown from 2.3.0 to 2.3.1 in /docs (#905)
dependabot[bot] Mar 30, 2021
8c9e16e
We're hiring! + integration posts
jeffra Mar 31, 2021
c6b497d
[website] We're hiring! + integration posts
jeffra Mar 31, 2021
c814abd
[website] we're hiring!
jeffra Mar 31, 2021
5d721e0
zero.Init() clarification (#880)
stas00 Apr 1, 2021
8db4fdf
disable pipe test (#915)
jeffra Apr 2, 2021
ab5534f
Add link to AML examples. (#916)
awan-10 Apr 2, 2021
c574788
Merge branch 'master' of https://github.com/microsoft/DeepSpeed into …
Apr 6, 2021
b58a8fa
Merge branch 'microsoft-master' into stella
Apr 6, 2021
docs: minor spelling tweaks (microsoft#858)
brettkoonce authored Mar 16, 2021
commit 547d1c5f8f3f5a3a01bb13b286e7686c5364f9b8
2 changes: 1 addition & 1 deletion deepspeed/profiling/flops_profiler/profiler.py
@@ -265,7 +265,7 @@ def del_extra_repr(module):
"Each module profile is listed after its name in the following order: \nnumber of parameters, percentage of total parameters, number of multiply-accumulate operations (MACs), percentage of total MACs, latency, percentage of total latency, number of floating point operations per second (FLOPS, computed as 2 * MACs / latency)."
)
print(
"Note: \n1. A module can have torch.nn.functional (e.g. to compute logits) along with submodules, thus making the difference between the parent's MACs(or latency) and the sum of its submodules'.\n2. Number of floating point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throught.\n"
"Note: \n1. A module can have torch.nn.functional (e.g. to compute logits) along with submodules, thus making the difference between the parent's MACs(or latency) and the sum of its submodules'.\n2. Number of floating point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.\n"
)
print(self.model)

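As a quick check of the "FLOPS = 2 * MACs / latency" definition in the string above, the LeNet-5 figures quoted later in the flops-profiler tutorial (439.56 MMACs, 25.7 ms, 34.2 GFLOPS) are consistent; a minimal Python sketch:

```python
# Sanity check of the formula "FLOPS = 2 * MACs / latency" using the LeNet-5
# figures from the flops-profiler tutorial (439.56 MMACs, 25.7 ms, 34.2 GFLOPS).
macs = 439.56e6       # multiply-accumulate operations
latency = 25.7e-3     # seconds
flops = 2 * macs / latency
print(f"{flops / 1e9:.1f} GFLOPS")  # prints: 34.2 GFLOPS
```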
6 changes: 3 additions & 3 deletions docs/_pages/features.md
@@ -37,7 +37,7 @@ and communication-efficient training. DeepSpeed supports a hybrid
combination of data, model, and pipeline parallelism and has scaled to over
[one trillion parameters using 3D parallelism]({{ site.press_release_v3 }}).
Pipeline parallelism can also improve communication efficiency and has
- accelerated training by up to 7x on low-banwdith clusters.
+ accelerated training by up to 7x on low-bandwidth clusters.


## Model Parallelism
@@ -256,9 +256,9 @@ This can be enabled by setting the following in the `deepspeed_config` file.

```

- ### Timing Activiation Checkpoint Functions
+ ### Timing Activation Checkpoint Functions

- When activiation checkpoingint is enabled, profiling the forward and backward time of each checkpoint function can be enabled in the `deepspeed_config` file.
+ When activation checkpointing is enabled, profiling the forward and backward time of each checkpoint function can be enabled in the `deepspeed_config` file.

```json
{
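The JSON block above is truncated in this view; as a rough sketch only, the timing option it refers to might look like the following Python dict (the `activation_checkpointing` and `profile` key names are assumptions based on DeepSpeed's config schema, not shown in this diff):

```python
# Rough sketch only: assumed key names for timing activation checkpoint
# functions in a deepspeed_config; the actual keys are not shown in this diff.
ds_config = {
    "activation_checkpointing": {
        "profile": True,  # assumed flag: time forward/backward of each checkpoint function
    }
}
```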
18 changes: 9 additions & 9 deletions docs/_tutorials/flops-profiler.md
@@ -37,11 +37,11 @@ Top 3 modules in params at depth 2 are {'Conv2d': '50.69 k', 'Linear': '11.01 k'
Top 3 modules in latency at depth 2 are {'Conv2d': '11.37 ms', 'Linear': '5.27 ms', 'AvgPool2d': '5.02 ms'}

------------------------------ Detailed Profile ------------------------------
- Each module profile is listed after its name in the follwing order:
+ Each module profile is listed after its name in the following order:
number of parameters, percentage of total parameters, number of multiply-accumulate operations (MACs), percentage of total MACs, latency, percentage of total latency, number of floating point operations per second (FLOPS, computed as 2 * MACs / latency).
Note:
1. A module can have torch.nn.functional (e.g. to compute logits) along with submodules, thus making the difference between the parent's MACs(or latency) and the sum of its submodules'.
- 2. Number of floating point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throught.
+ 2. Number of floating point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

LeNet5(
61.71 k, 100.00% Params, 439.56 MMACs, 100.00% MACs, 25.7 ms, 100.00% latency, 34.2 GFLOPS,
@@ -92,7 +92,7 @@ The DeepSpeed flops profiler can be used with the DeepSpeed runtime or as a stan

### Usage With the DeepSpeed Runtime

- When using DeepSpeed for model training, the flops profiler can be configured in the `deepspeed_config` file. No explict API calls are needed to use the profiler. Refer to [flops profiler](https://www.deepspeed.ai/docs/config-json/#flops-profiler) for details.
+ When using DeepSpeed for model training, the flops profiler can be configured in the `deepspeed_config` file. No explicit API calls are needed to use the profiler. Refer to [flops profiler](https://www.deepspeed.ai/docs/config-json/#flops-profiler) for details.
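For illustration, a minimal sketch of such a `flops_profiler` section, written here as a Python dict; the key names `profile_step`, `module_depth`, `top_modules`, and `detailed` are assumptions about the profiler's options, not taken from this diff:

```python
# Rough sketch of a "flops_profiler" section in a deepspeed_config,
# written as a Python dict; key names are assumptions, not from this diff.
ds_config = {
    "flops_profiler": {
        "enabled": True,
        "profile_step": 1,    # which training step to profile
        "module_depth": -1,   # profile modules at all depths
        "top_modules": 3,     # how many top modules to report per metric
        "detailed": True,     # also print the per-module detailed profile
    }
}
```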


#### Example: Megatron-LM
@@ -131,11 +131,11 @@ Top 3 modules in params at depth 8 are {'ColumnParallelLinear': '7.35 M', 'RowPa
Top 3 modules in latency at depth 8 are {'ColumnParallelLinear': '659.23 us', 'RowParallelLinear': '587.94 us', 'FusedScaleMaskSoftmax': '370.98 us'}

------------------------------ Detailed Profile ------------------------------
- Each module profile is listed after its name in the follwing order:
+ Each module profile is listed after its name in the following order:
number of parameters, percentage of total parameters, number of multiply-accumulate operations (MACs), percentage of total MACs, latency, percentage of total latency, number of floating point operations per second (FLOPS, computed as 2 * MACs / latency).
Note:
1. A module can have torch.nn.functional (e.g. to compute logits) along with submodules, thus making the difference between the parent's MACs(or latency) and the sum of its submodules'.
- 2. Number of floating point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throught.
+ 2. Number of floating point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

DistributedDataParallel(
38.89 M, 100.00% Params, 314.61 GMACs, 100.00% MACs, 33.81 ms, 100.00% latency, 18.61 TFLOPS,
@@ -235,11 +235,11 @@ Top 3 modules in params at depth 2 are {'Linear': '58.63 M', 'Conv2d': '2.47 M',
Top 3 modules in latency at depth 2 are {'Conv2d': '13.96 ms', 'Linear': '6.23 ms', 'ReLU': '730.75 us'}

------------------------------ Detailed Profile ------------------------------
- Each module profile is listed after its name in the follwing order:
+ Each module profile is listed after its name in the following order:
number of parameters, percentage of total parameters, number of multiply-accumulate operations (MACs), percentage of total MACs, latency, percentage of total latency, number of floating point operations per second (FLOPS, computed as 2 * MACs / latency).
Note:
1. A module can have torch.nn.functional (e.g. to compute logits) along with submodules, thus making the difference between the parent's MACs(or latency) and the sum of its submodules'.
- 2. Number of floating point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throught.
+ 2. Number of floating point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

AlexNet(
61.1 M, 100.00% Params, 183.18 GMACs, 100.00% MACs, 22.13 ms, 100.00% latency, 16.56 TFLOPS,
@@ -335,11 +335,11 @@ Top 3 modules in params at depth 7 are {'Linear': '28.35 M', 'LayerNorm': '18.43
Top 3 modules in latency at depth 7 are {'Linear': '153.7 ms', 'LayerNorm': '4.74 ms', 'Dropout': '597.95 us'}

------------------------------ Detailed Profile ------------------------------
- Each module profile is listed after its name in the follwing order:
+ Each module profile is listed after its name in the following order:
number of parameters, percentage of total parameters, number of multiply-accumulate operations (MACs), percentage of total MACs, latency, percentage of total latency, number of floating point operations per second (FLOPS, computed as 2 * MACs / latency).
Note:
1. A module can have torch.nn.functional (e.g. to compute logits) along with submodules, thus making the difference between the parent's MACs(or latency) and the sum of its submodules'.
- 2. Number of floating point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throught.
+ 2. Number of floating point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.

BertForSequenceClassification(
109.48 M, 100.00% Params, 43.5 GMACs, 100.00% MACs, 393.7 ms, 100.00% latency, 220.97 GFLOPS,
2 changes: 1 addition & 1 deletion docs/_tutorials/pipeline.md
@@ -230,7 +230,7 @@ pipeline. Each worker should load micro-batches of size
a total of `engine.gradient_accumulation_steps()` times per `train_batch()`.

**Watch out!**
- The pipeline engine *pulls* data from an iteratior instead of iterating over
+ The pipeline engine *pulls* data from an iterator instead of iterating over
it. It's critical that the data stream does not empty in the middle of a
training batch. Each invocation of `train_batch()` will pull
a total of `engine.gradient_accumulation_steps()` micro-batches of data from
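To make the warning above concrete, a minimal sketch of a data stream that cannot empty mid-batch (a generic pattern; `train_loader`, `engine`, and `num_steps` are placeholder names):

```python
# Generic sketch: wrap a finite DataLoader so the pipeline engine can keep
# pulling micro-batches without the stream emptying mid train_batch().
def endless(dataloader):
    while True:
        for batch in dataloader:
            yield batch

# Hypothetical usage (train_loader, engine, num_steps are placeholders):
# train_iter = endless(train_loader)
# for step in range(num_steps):
#     loss = engine.train_batch(data_iter=train_iter)
```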
2 changes: 1 addition & 1 deletion docs/_tutorials/sparse-attention.md
@@ -154,7 +154,7 @@ This module, is the parent class for all sparsity structures and contains the sh
* `block`: an integer determining the block size. Current implementation of sparse self-attention is based on blocked sparse matrices. In which this parameter defines size of such square blocks; `Block X Block`.
* `different_layout_per_head`: a boolean determining if each head should be assigned a different sparsity layout; default is false and this will be satisfied based on availability.

- * **Fixed** (FixedSparistyConfig):
+ * **Fixed** (FixedSparsityConfig):
This structure is based on [Generative Modeling with Sparse Transformers](https://arxiv.org/abs/1904.10509) from OpenAI, in which local and global attention is fixed by the given parameters:
* `num_local_blocks`: an integer determining the number of blocks in local attention window. As it is illustrated in the below figure (adapted from original paper), tokens in a local window, attend to all tokens local to them. In the case of autoregressive model, as in the figure, tokens attend to tokens appearing before them in the local window. And in the case of Masked model such as BERT, attention is bidirectional.
* `num_global_blocks`: an integer determining how many consecutive blocks in a local window is used as the representative of the window for global attention; illustrated in the figure below as well.
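For reference, a hedged sketch of constructing the fixed sparsity structure described above; the keyword arguments mirror the parameters documented in the bullet list, while the import path and `num_heads` value are assumptions:

```python
# Hedged sketch (assumed import path and num_heads value); the remaining
# arguments follow the parameters documented in the bullet list above.
from deepspeed.ops.sparse_attention import FixedSparsityConfig

sparsity_config = FixedSparsityConfig(
    num_heads=16,                     # assumed number of attention heads
    block=16,                         # 16x16 blocks in the blocked sparse matrices
    different_layout_per_head=False,  # share one sparsity layout across heads
    num_local_blocks=4,               # blocks per local attention window
    num_global_blocks=1,              # blocks per window used for global attention
)
```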
4 changes: 2 additions & 2 deletions docs/_tutorials/zero.md
@@ -3,7 +3,7 @@ title: "Zero Redundancy Optimizer (ZeRO)"
---
If you have not done so already, we advise that you read the DeepSpeed tutorials on [Getting Started](/getting-started/) and [Megatron-LM GPT-2](/tutorials/megatron/) before stepping through this tutorial.

- In this tutorial, we will apply the ZeRO optimizer to the [Megatron-LM GPT-2](https://github.com/NVIDIA/Megatron-LM) model. ZeRO is a powerful set of memory optimization techniques that enable effective FP16 training of large models with trillons of parameters, such as [GPT-2](https://openai.com/blog/better-language-models/) and [Turing-NLG 17B](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/). Compared to the alternative model parallelism approaches for training large models, a key appeal of ZeRO is that no model code modifications are required. As this tutorial will demonstrate, *using ZeRO in a DeepSpeed model is quick and easy because all you need is to change a few configurations in the DeepSpeed configuration JSON*. No code changes are needed.
+ In this tutorial, we will apply the ZeRO optimizer to the [Megatron-LM GPT-2](https://github.com/NVIDIA/Megatron-LM) model. ZeRO is a powerful set of memory optimization techniques that enable effective FP16 training of large models with trillions of parameters, such as [GPT-2](https://openai.com/blog/better-language-models/) and [Turing-NLG 17B](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/). Compared to the alternative model parallelism approaches for training large models, a key appeal of ZeRO is that no model code modifications are required. As this tutorial will demonstrate, *using ZeRO in a DeepSpeed model is quick and easy because all you need is to change a few configurations in the DeepSpeed configuration JSON*. No code changes are needed.
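As one concrete illustration of "change a few configurations in the DeepSpeed configuration JSON", a rough sketch of a ZeRO stage 2 section, written as a Python dict; the exact keys and values are assumptions, not part of this diff:

```python
# Rough sketch of a ZeRO stage 2 configuration, as a Python dict;
# keys and values are assumptions, not taken from this diff.
ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                   # partition optimizer states and gradients
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8,
    },
}
```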

## ZeRO Overview
ZeRO leverages the aggregate computation and memory resources of data parallelism to reduce the memory and compute requirements of each device (GPU) used for model training. ZeRO reduces the memory consumption of each GPU by partitioning the various model training states (weights, gradients, and optimizer states) across the available devices (GPUs and CPUs) in the distributed training hardware. Concretely, ZeRO is being implemented as incremental stages of optimizations, where optimizations in earlier stages are available in the later stages. To deep dive into ZeRO, please see our [paper](https://arxiv.org/abs/1910.02054v3).
@@ -226,7 +226,7 @@ class ParallelTransformerLayer(MegatronModule):

#### Allocating Massive Megatron-LM Models

- We make two further changes to model initalization in order to support models
+ We make two further changes to model initialization in order to support models
that exceed *local* system memory, but not *total* system memory.

1. Allocate the model in a memory-scalable fashion. The model parameters will
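A minimal sketch of the memory-scalable allocation referred to in item 1, using `deepspeed.zero.Init()` (see also the "zero.Init() clarification (#880)" commit included in this PR); `MyModel` here is a placeholder module:

```python
# Minimal sketch: parameters are partitioned across data-parallel ranks as
# they are created, so no single process needs to hold the full model.
# MyModel is a placeholder for the user's torch.nn.Module.
import torch
import deepspeed

class MyModel(torch.nn.Module):  # placeholder module
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1024, 1024)

with deepspeed.zero.Init():
    model = MyModel()
```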