
Commit ccbc918

Merge pull request #4 from hpcaitech/hotfix/doc
Reorder parallelization methods in the parallelization documentation
2 parents 3c7604b + 50982c0 commit ccbc918


docs/parallelization.md

Lines changed: 49 additions & 49 deletions
@@ -29,6 +29,54 @@ not have to explicitly set them in your configurations. When data parallel size
adds the distributed data sampler to the dataloader to shard the dataset.


+## 1D, 2D, 2.5D and 3D Parallel
+To enable hybrid parallelism, we provide an array of tensor parallelism methods. The papers corresponding to each
+tensor parallel method are listed below. These parallel modes need to work with the distributed layers provided by Colossal-AI.
+- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
+
+- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
+2D parallelism relies on the SUMMA matrix multiplication algorithm and splits the input data,
+model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$
+devices, where $N$ is the number of tensor chunks in a single dimension.
+
+- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
+Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallelism introduces a novel tensor parallelism that further
+parallelizes 2D tensor parallelism. $P = N^2 \cdot d$ processors are arranged into $d$ layers,
+where each layer independently performs matrix multiplication operations with dimension $N$.
+
+- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
+We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves
+the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed
+through optimized load balancing of parameters as well as activations.
+
+```python
+# 1D parallel
+parallel = dict(
+    pipeline=dict(size=1),  # number of pipeline stages
+    tensor=dict(size=4, mode='1d')
+)
+
+# 2D parallel
+parallel = dict(
+    pipeline=dict(size=1),  # number of pipeline stages
+    tensor=dict(size=4, mode='2d')
+)
+
+# 2.5D parallel
+parallel = dict(
+    pipeline=dict(size=1),  # number of pipeline stages
+    tensor=dict(size=8, mode='2.5d', depth=2)
+)
+
+# 3D parallel
+parallel = dict(
+    pipeline=dict(size=1),  # number of pipeline stages
+    tensor=dict(size=8, mode='3d')
+)
+```
+
## Pipeline Parallel (experimental)

Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple
@@ -160,57 +208,9 @@ schedule = dict(
)
```

-## 1D, 2D, 2.5D and 3D Parallel
-To enable hybrid parallelism, we provide an array of tensor parallelism methods. The papers corresponding to each
-tensor parallel method are listed below. These parallel modes need to work with the distributed layers provided by Colossal-AI.
-- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
-
-- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
-2D parallelism relies on the SUMMA matrix multiplication algorithm and splits the input data,
-model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$
-devices, where $N$ is the number of tensor chunks in a single dimension.
-
-- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
-Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallelism introduces a novel tensor parallelism that further
-parallelizes 2D tensor parallelism. $P = N^2 \cdot d$ processors are arranged into $d$ layers,
-where each layer independently performs matrix multiplication operations with dimension $N$.
-
-- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
-We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves
-the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed
-through optimized load balancing of parameters as well as activations.
-
-```python
-# 1D parallel
-parallel = dict(
-    pipeline=dict(size=1),  # number of pipeline stages
-    tensor=dict(size=4, mode='1d')
-)
-
-# 2D parallel
-parallel = dict(
-    pipeline=dict(size=1),  # number of pipeline stages
-    tensor=dict(size=4, mode='2d')
-)
-
-# 2.5D parallel
-parallel = dict(
-    pipeline=dict(size=1),  # number of pipeline stages
-    tensor=dict(size=8, mode='2.5d', depth=2)
-)
-
-# 3D parallel
-parallel = dict(
-    pipeline=dict(size=1),  # number of pipeline stages
-    tensor=dict(size=8, mode='3d')
-)
-```
-

## Sequence Parallel (experimental)

Sequence parallelism supports long-sequence modelling such as document-level text understanding and medical imaging.
This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
-This feature is still in development and is only experimental for now.
+This feature is still in development and is only experimental for now.
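For reference, the mode-specific device counts described in the moved section can be checked with plain arithmetic: 1D accepts any tensor parallel size, 2D needs $P = N^2$, 2.5D needs $P = N^2 \cdot d$, and 3D needs a processor cube of $P = N^3$ devices. The sketch below is an illustrative addition (not part of this commit and not a Colossal-AI API); the `required_gpus` helper and its `data_parallel_size` argument are hypothetical, and it only does arithmetic on `parallel` dicts of the form shown in `docs/parallelization.md`.

```python
import math

def required_gpus(parallel, data_parallel_size=1):
    """Hypothetical helper: total devices = data parallel size
    * pipeline stages * tensor parallel size, with a sanity check
    on the tensor size for each mode described in the doc."""
    pipeline = parallel['pipeline']['size']
    tensor = parallel['tensor']['size']
    mode = parallel['tensor']['mode']

    if mode == '2d':
        # 2D mesh of P = N^2 devices
        n = math.isqrt(tensor)
        assert n * n == tensor, "2d mode expects size = N^2"
    elif mode == '2.5d':
        # P = N^2 * d processors arranged into d layers
        d = parallel['tensor']['depth']
        n = math.isqrt(tensor // d)
        assert n * n * d == tensor, "2.5d mode expects size = N^2 * depth"
    elif mode == '3d':
        # 3D processor cube of P = N^3 devices
        n = round(tensor ** (1 / 3))
        assert n ** 3 == tensor, "3d mode expects size = N^3"

    return data_parallel_size * pipeline * tensor

# 2.5D config from the diff: size=8, depth=2 -> N=2, so P = 2^2 * 2 = 8 devices
print(required_gpus(dict(pipeline=dict(size=1),
                         tensor=dict(size=8, mode='2.5d', depth=2))))  # 8
```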
